Formatting Text with Perl
By Gerry Patterson
But I Just Love My Text Editor
People get awfully fond of their favourite text editor and/or word processor. In fact some people get more fond of their word processor than their dogs. And that's a big wrap from a dog owner. Many people use their editor or word processor as the principal means of interacting with a computer and with other people. So it's not surprising that most people prefer to be familiar with such an important tool. It's no fun having to re-learn all those basic key-strokes which become second nature after continual use of a word processor.
I have been using vi for so long I don't have to think about the common navigation commands. Most users who try using this quirky little editor will gain the impression that it comes from another era. Something like vi could only evolve in an environment that utilised qwerty keyboards as the primary means of input. GUI editors generally have a bland personality, because many of them evolved in the last two decades of the twentieth century, which were very competitive environments for those types of software. At this time the PC market was expanding rapidly and the emphasis was on establishing a market niche. To a certain extent, this resulted in a dumbed-down interface. The emphasis was on ease-of-use and a look and feel that was idiomatic for the age of the mouse. It is unlikely that an editor like vi, with its terse command syntax, would have evolved in this environment. Vi has been built with many assumptions. One example is the inclusion of the command set used by its predecessor, ed (or more correctly ex, as the extended version of ed is called). This assumes that the user is familiar with ed. It is also obvious that the designers of ed and vi assumed that a user would be familiar with regular expressions.
Many years ago, I was engaged in a conversation/debate with a younger programmer who had recently completed a course on Unix. He was bad-mouthing my favourite editor. The conversation turned to macros and programmability and he praised emacs. Today's generation of programmers probably find it rather strange that anyone could get excited about which is the better editor, vi or emacs. It probably seems like arguing about the number of angels that can dance on a pin. Still, we eventually got down to specifics. Here I am going to have to invent something, because I have actually forgotten what the specific example was. But let's suppose he said something like "There is no command in vi to read the rest of this file and print the second last word on the lines that end with a semicolon, whereas I could write a macro in emacs to do this and that and blah blah yada yada yada".
I couldn't allow this to go without a response. "Oh but vi can do it too!", I replied, "you can just enter a command like this ..." And I typed the following:
!Gawk '/;$/{print $(NF-1)}'
He was amazed that a single command could transform the file exactly as requested. He shook his head and muttered "They never showed me anything like that on the Unix course!". I was feeling so smug about striking another blow in the great editor war that I missed the opportunity to tell him that vi can also handle macros. Ok, I cheated. Since I learned touch typing I can type commands quickly, and he may not have noticed that the line I typed looked more like a cat had walked on the keyboard than a command. And strictly speaking it was not really a vi command at all, although it is true that vi has an exceptionally rich command set. Still, if you don't want to use one of the visual commands, there are the cryptic but very versatile ex commands. And one of the most useful commands in visual mode is the '!'. This gives you access to a full suite of shell commands. It means that, with a little imagination, you can transform huge chunks of text with instructions like the awk one-liner above, while remaining in the same edit session. Now before you label me as a vi bigot, I should also boast that I have used emacs. Ok, it was really micro-emacs, and I only used it because it was all that was available with the Mark Williams C suite for Atari computers. But I have used it ... (sort of)
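For readers more at home in Perl than in awk, the filter above can be sketched as a small Perl function. This is only an illustration: the name second_last_words is made up for the example, and unlike awk (which prints the whole line when it holds a single word) this sketch simply skips lines with fewer than two words.

```perl
use strict;
use warnings;

# Return the second-last whitespace-separated word of every line
# that ends with a semicolon (hypothetical helper, for illustration).
sub second_last_words {
    my @lines = @_;
    my @found;
    for my $line (@lines) {
        chomp(my $text = $line);
        next unless $text =~ /;$/;      # same test as awk's /;$/
        my @words = split ' ', $text;   # awk-style field splitting
        next if @words < 2;             # assumption: skip one-word lines
        push @found, $words[-2];        # counterpart of awk's $(NF-1)
    }
    return @found;
}

# Small demonstration on a few literal lines.
print "$_\n" for second_last_words(
    "int count = 0;\n",
    "while (count < 10) {\n",
    "    count++;\n",
    "return count;\n",
);
```

From inside vi the same idea fits on one line, typed after !G in the same way as the awk version: perl -lane 'print $F[-2] if /;$/ && @F > 1'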
Doing it with !nroff
Anyway, in the mid-nineties, while the rest of the world was rushing to embrace the latest GUI Mail User Agent (MUA), I stuck to my preferred MUA (elm in Unix). At the time there was a macro virus which attacked Microsoft Word documents, which I considered to be a harbinger. For this reason I was reluctant to rush out and adopt the latest GUI MUA. My caution proved well-founded. Since then Microsoft Outlook has become probably the most common means of transmitting malicious macros and scripts. I can state emphatically that it is impossible to attack a computer via elm. I feel almost as confident with descendants of elm like mutt (so called because it is a mongrel of an MUA). There is one drawback about elm (and mutt) if you are using vi. Because vi is a programmer's editor it does not handle word wrap. Now before any remaining emacs aficionados get all excited and start waving their hands in the air, let's not forget about the amazing '!'. There are some native Unix commands that can cope with word wrap. One that immediately springs to mind is nroff. So let's assume that I have written some text. In the process of writing text, I often press carriage return when I approach the end of the line. Sometimes I don't. Then I might change my mind about what I have written. Add some text, delete some text. Maybe chop it up and move some sentences around. The whole thing ends up looking an awful mess. But let's suppose that I have an nroff macro like tmac.f. This will actually fix up my ugly paragraph. When I get to the end of the paragraph, I can switch to visual mode and enter the following command:
!{nroff -mf
And the text is perfectly word wrapped. So that's how I wrote text back in the nineties. On one particular system I even mapped function keys on the keyboard to run nroff commands at a single key press. And if I really had to write a document with a mixture of fonts, because I wished to impress someone with a letter sent by ordinary mail, then I would just use Microsoft Word. E-mails, however, are easier to just type and send. And back in the mid-nineties there was no point trying to tart them up. By the time the message had passed through numerous mail gateways it could be horribly mangled if it did not start out as plain text word-wrapped at column 72. I was confident that my e-mails would arrive in a predictable, readable state no matter what MUA the intended recipient was using. If you look closer at the tmac.f macro you will see that it also handles a primitive hanging indent paragraph.
Doing it with !perl
There are a few annoying features about the tmac.f macro, however. It tends to always add an extra line at the end of the text that it formats. And sometimes it splits hyphenated words across a line feed, but if it later re-joins them it puts an extra space after the hyphen. By and large I was willing to put up with these minor inconveniences. Now that I have relented and started running my own web-site, I am writing less plain text and more HTML. However, I still use vi. This is because:
- I like to understand how things work. And the best way to gain understanding is through practical experience. If I relied on an HTML editor to do the editing, I would not be learning about how HTML actually works.
- I have better control over the HTML. Once I understand how it works I can make minor adjustments to fine tune it. When using some HTML editors you may make what you think is a minor change. And the editor makes gross changes that you don't find out about until you look at the finished product with a text editor.
- The HTML I cut by hand is faster, more compact and more readable. By this I mean that the actual HTML code is more readable.
A lot of this comes down to personal preference. And if you are someone who prefers to use an HTML editor, then you probably didn't get as far as this sentence because you stopped reading this document several paragraphs earlier. On the other hand if, like me, you are a late adopter of the technology, you might want to familiarise yourself with the way it works. There are some good HTML tutorials on the web. I liked Writing HTML by Alan Levine.
Still, HTML poses some special considerations for text formatting. Eventually I decided to write my own formatting program. There was no point in trying to use nroff. No doubt an nroff guru could manage it, but I required something flexible and powerful that was easy to program. Needless to say I chose perl as the programming language. At first I thought I would call the program fmt. But someone has already written a program with this name which performs a similar function (it is bundled in the GNU textutils distribution). The textutils fmt is very basic however, so I still wrote my little text formatter and called it fm. The script has some special features:
- The default word-wrap column is 72, unless an HTML tag is discovered. In which case the default will be 78.
- The default can be over-ridden with the -r option.
- It uses a type of smart indenting, except that tabs are converted to 4 spaces.
- A -l option can be used to specify the left margin.
- It can be used to produce ordered lists and unordered lists. This would be applicable only to non-HTML plain text.
- Anything in between <pre> ... </pre> tags is left alone. You may also want to do the same for the <tt> ... </tt> tags.
- There is a -p switch which allows tags to be embedded in pre-formatted text without being counted as part of the line length. Strictly speaking I should do the same for the & character entities.
- If a <li> tag is encountered at the start of line, it throws a newline.
- Lines that begin and end with HTML tags are left alone.
- Anything in between <!-- nofm --> ... <!-- fm --> tags is left alone. This allows me to isolate a section of HTML from the action of the text formatter.
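A few of the rules above can be sketched in Perl as follows. This is not the fm script itself, only a rough illustration under stated assumptions: the names fill_paragraph and format_lines are invented, a line counts as HTML if it contains anything that looks like an opening or closing tag, and paragraphs are separated by blank lines. The margin options, smart indenting, tab expansion, list generation, the -p switch and the <li> rule are all left out.

```perl
use strict;
use warnings;

# Greedy word wrap: start a new line before any word that would push
# the current line past $width columns.
sub fill_paragraph {
    my ($width, @words) = @_;
    my @out;
    my $line = '';
    for my $word (@words) {
        if ($line eq '') {
            $line = $word;
        }
        elsif (length($line) + 1 + length($word) <= $width) {
            $line .= " $word";
        }
        else {
            push @out, $line;
            $line = $word;
        }
    }
    push @out, $line if $line ne '';
    return @out;
}

# Hypothetical stand-in for fm: takes input lines, returns output lines.
sub format_lines {
    my @lines = @_;
    chomp @lines;
    # Assumption: anything resembling a tag switches the default to 78.
    my $width = (grep { /<\/?[A-Za-z]/ } @lines) ? 78 : 72;
    my (@out, @words);
    my $until;    # regex that ends the current protected region, if any
    my $flush = sub {
        push @out, fill_paragraph($width, @words);
        @words = ();
    };
    for my $line (@lines) {
        if ($until) {                           # inside <pre> or nofm
            push @out, $line;
            undef $until if $line =~ $until;
        }
        elsif ($line =~ /<pre\b/i) {            # start of <pre> ... </pre>
            $flush->();
            push @out, $line;
            $until = qr{</pre>}i unless $line =~ m{</pre>}i;
        }
        elsif ($line =~ /<!--\s*nofm\s*-->/i) { # start of nofm ... fm
            $flush->();
            push @out, $line;
            $until = qr{<!--\s*fm\s*-->}i;
        }
        elsif ($line =~ /^\s*$/) {              # blank line ends a paragraph
            $flush->();
            push @out, '';
        }
        elsif ($line =~ /^\s*<.*>\s*$/) {       # begins and ends with a tag
            $flush->();
            push @out, $line;
        }
        else {
            push @words, split ' ', $line;      # collect words to refill
        }
    }
    $flush->();
    return @out;
}

# Small demonstration: the text is refilled, the <pre> block is untouched.
print "$_\n" for format_lines(
    "Some text that was\n",
    "chopped up   and moved around.\n",
    "<pre>\n",
    "left    alone\n",
    "</pre>\n",
);
```

Tracking the closing pattern in $until keeps the two kinds of protected region separate, so a stray <!-- fm --> marker inside a <pre> block does not end it early. To use a function like this as a filter from vi, the demonstration at the end could be swapped for a loop that reads standard input and prints the result.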
Summary
In summary, this utility allows me to cut HTML code with vi and make it more readable when viewed with a text editor. In fact I used it to format this HTML you are reading. It can also format plain text, and I use it to write e-mail in mutt. Usually I only call it from vi. If you should type fm -h at the command line the script should print:
usage: /usr/local/bin/fm [-r R] [-l L] [-o] [-u]
where
    -l L = left margin
    -r R = right margin
    -t expand tabs to 8 chars (rather than 4)
    -o produce ordered list
    -u unordered list
    -h print this screen
If you find this script useful you are welcome to take a copy of it. If you have anything to say about it, positive or negative, please send an e-mail to feedback. Next month I will look at moving HTML code into production from a test-bed environment, and searching Apache log files to extract statistics about activity on your web site.