Friday, November 9, 2007

Parsing Text

One of the most powerful benefits of UNIX-related operating systems (as opposed to Windows) is their built-in text-parsing commands and interfaces. We rely on these to combine, sort, and extract text from all sorts of sources. Here are a few of my favorites:

Grep: most useful for extracting lines out of a command's output or a file
Examples:

cat /etc/passwd | grep John
This method reads from "standard in" (here, the output of cat) instead of a named file

grep John /etc/passwd
This method uses a real file for processing

Frequently Used Options:
-i: case insensitive
-v: "reverse" grep: print everything except a matching line
-n: print line numbers
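
Here is a rough sketch of those three options in action. The sample file and its contents are made up for illustration (they are not real /etc/passwd entries), and the variable names are just placeholders:

```shell
# Build a small sample file; the entries are invented for this example.
sample=$(mktemp)
printf '%s\n' 'john:x:1001:John Smith' 'mary:x:1002:Mary Jones' 'JOHNNY:x:1003:J. Doe' > "$sample"

# -i: match "john" regardless of case
insensitive=$(grep -i john "$sample")

# -v: print every line that does NOT contain lowercase "john"
inverted=$(grep -v john "$sample")

# -n: prefix each matching line with its line number
numbered=$(grep -n Mary "$sample")

echo "$insensitive"
echo "$inverted"
echo "$numbered"
rm -f "$sample"
```

The options can also be combined, for example grep -in john to get case-insensitive matches with line numbers.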

EGrep: "Extended" Grep

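cat /etc/passwd | egrep '[0-9]{3}-[0-9]{3}-[0-9]{4}'
Easy syntax for using extended regular expressions
The above example extracts lines that contain a phone number in the form 555-123-4567. Quote the pattern so the shell does not try to expand the brackets itself.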

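Question: Why use EGrep when Grep can also handle regular expressions?
Answer: Depending on the flavor of UNIX you are using, Grep may behave differently. Plain Grep defaults to basic regular expressions, where an interval such as {3} has to be written \{3\}, and the extensions it accepts beyond that vary by platform. EGrep should always work the same. For example: while Linux's Grep allows for GNU regular expressions, AIX's Grep does not.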
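
To make the syntax difference concrete, here is a minimal sketch that runs the phone-number pattern both ways. It uses grep -E, the POSIX spelling of egrep, and a made-up two-line sample file; the names and number are assumptions for illustration only:

```shell
# Sample data invented for this example.
sample=$(mktemp)
printf '%s\n' 'alice:x:1001:Alice,555-123-4567' 'bob:x:1002:Bob,no phone' > "$sample"

# Extended syntax (egrep / grep -E): braces are interval operators as written
extended=$(grep -E '[0-9]{3}-[0-9]{3}-[0-9]{4}' "$sample")

# Basic syntax (plain grep): the same intervals need backslashes
basic=$(grep '[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}' "$sample")

echo "$extended"
echo "$basic"
rm -f "$sample"
```

Both commands should pick out only the line that carries a phone number.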

AWK: "Aho, Weinberger, & Kernighan" for trivia buffs
Very powerful, and mostly beyond the scope of this discussion; I like to use it to extract certain words out of a line:

Example: cat /etc/passwd | awk -F: '{print $1}'
This prints out the first column in a colon-delimited file (such as /etc/passwd)
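
A couple of small variations on the same idea, sketched against a made-up passwd-style file (the accounts and the 1000 cutoff are assumptions for illustration):

```shell
# Sample data in /etc/passwd layout, invented for this example.
sample=$(mktemp)
printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' 'alice:x:1001:1001:Alice:/home/alice:/bin/bash' > "$sample"

# Print the login name (field 1) and the login shell (field 7)
name_shell=$(awk -F: '{print $1, $7}' "$sample")

# Print only the names of accounts whose UID (field 3) is 1000 or higher
regular=$(awk -F: '$3 >= 1000 {print $1}' "$sample")

echo "$name_shell"
echo "$regular"
rm -f "$sample"
```

The part before the braces is a condition, so AWK can filter lines and pick columns in a single pass.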

SED: Stream Editor
This is a great command for advanced text parsing and manipulation. I'd recommend keeping a book or web tutorial handy for this one; it's way too dense for this discussion.
See http://www.grymoire.com/Unix/Sed.html for some additional background on this tool.
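
Just to give a taste, here are three of the simplest SED idioms, sketched against a made-up passwd-style file (the accounts are invented for illustration):

```shell
# Sample data in /etc/passwd layout, invented for this example.
sample=$(mktemp)
printf '%s\n' 'alice:x:1001:1001:Alice:/home/alice:/bin/bash' 'bob:x:1002:1002:Bob:/home/bob:/bin/sh' > "$sample"

# s#old#new#: rewrite a login shell at the end of the line
# (using # as the delimiter avoids escaping the slashes)
replaced=$(sed 's#/bin/bash$#/bin/sh#' "$sample")

# -n plus p: print only the lines that match, much like grep
matched=$(sed -n '/^bob:/p' "$sample")

# Strip everything from the first colon onward, leaving the first column
first_col=$(sed 's/:.*//' "$sample")

echo "$replaced"
echo "$matched"
echo "$first_col"
rm -f "$sample"
```

None of these change the file itself; SED writes the edited text to standard out.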

Perl: You can use this extremely powerful programming language to parse text. Complete scripts can be used, as well as the "one-liner" variety:

Example: cat /etc/passwd | perl -ne '@f = split(":"); print "$f[0]\n";'
This prints out the first column in a colon-delimited file (same as the AWK example)
