Although this is first and foremost a book about text processing with Ruby, it’s also about an approach to text processing that fits in with existing tools. If you’re not particularly experienced with the Unix command line and the utilities it offers, this list might help you find the right tool for the job.
There are only 16 commands here, but together they form a considerable arsenal and—along with Ruby—will provide you with the tools you need for virtually all text processing tasks.
Most of these commands are part of GNU’s coreutils project (grep and column are packaged separately), and all of them are almost invariably packaged with Linux distributions. Mac OS X ships with virtually all of them, and those that it doesn’t can be installed using Homebrew:[17]
| $ brew install coreutils |
Windows users should install Cygwin[18] to get them.
The rest of the chapter gives a summary of each of these 16 commands.
cat | Outputs the content of the filenames passed to it. Its name comes from the word concatenate, since it concatenates the files one after another. If no files are given, it outputs standard input.
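A minimal sketch of both behaviors, using two throwaway files (a.txt and b.txt are hypothetical names, created in a temporary directory):

```shell
# Hypothetical sample files in a throwaway directory.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'three\n' > "$dir/b.txt"

# Concatenate the two files in the order given.
both=$(cat "$dir/a.txt" "$dir/b.txt")
echo "$both"

# With no filenames, cat copies standard input to standard output.
piped=$(printf 'from stdin\n' | cat)
echo "$piped"

rm -r "$dir"
```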
tac | Exactly like cat, but outputs the lines of the files in reverse—that is, starting from the last line of the first file and working backward, then the last line of the second, and so on.
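As a rough illustration, assuming GNU tac is installed and using two hypothetical files, each file is reversed in turn:

```shell
# Hypothetical sample files in a throwaway directory.
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/a.txt"
printf 'b1\nb2\n' > "$dir/b.txt"

# Each file is output last line first, in the order the files are named.
reversed=$(tac "$dir/a.txt" "$dir/b.txt")
echo "$reversed"

rm -r "$dir"
```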
shuf | Randomizes the order of lines in standard input or in the specified file. If the -n option is specified, will output at most that many lines; otherwise, all of the lines in the input will be shuffled. Display five random dictionary words:
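One way this might look, assuming GNU shuf is available. The word list below is a small stand-in; on many systems a dictionary file such as /usr/share/dict/words could be passed to shuf instead, but its presence is an assumption:

```shell
# A small stand-in for a dictionary file.
words=$(printf '%s\n' apple banana cherry damson elder fig grape hazel)

# Output at most five of the lines, chosen at random.
sample=$(printf '%s\n' "$words" | shuf -n 5)
echo "$sample"

# Without -n, every line is output, in random order.
shuffled=$(printf '%s\n' "$words" | shuf)
echo "$shuffled"
```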
head | Outputs the first n lines of a file, where n is 10 by default but can be changed with the -n option. Output the first ten lines of foo.txt:
Output the first two lines of foo.txt:
Output the first five lines of foo.txt:
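The three cases above could be sketched as follows, with foo.txt generated here as a hypothetical 20-line file:

```shell
# Hypothetical foo.txt containing the numbers 1 to 20, one per line.
dir=$(mktemp -d)
seq 1 20 > "$dir/foo.txt"

first_ten=$(head "$dir/foo.txt")        # default: first 10 lines
first_two=$(head -n 2 "$dir/foo.txt")   # first 2 lines
first_five=$(head -n 5 "$dir/foo.txt")  # first 5 lines
echo "$first_two"

rm -r "$dir"
```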
tail | Similar to head, but with the last n lines rather than the first. Output the last ten lines of foo.txt:
Output the last two lines of foo.txt:
Output the last five lines of foo.txt:
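A sketch of the same three cases for tail, again with a hypothetical 20-line foo.txt:

```shell
# Hypothetical foo.txt containing the numbers 1 to 20, one per line.
dir=$(mktemp -d)
seq 1 20 > "$dir/foo.txt"

last_ten=$(tail "$dir/foo.txt")        # default: last 10 lines
last_two=$(tail -n 2 "$dir/foo.txt")   # last 2 lines
last_five=$(tail -n 5 "$dir/foo.txt")  # last 5 lines
echo "$last_two"

rm -r "$dir"
```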
split | Takes a file and splits it into new files every n lines, where n is by default 1,000. Split a file every 1,000 lines:
Split a file every 500 lines:
Split a file every 100 bytes:
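These might be written as below. The input file and the output prefixes (default_, lines_, bytes_) are arbitrary names chosen for this sketch; split appends aa, ab, and so on to whatever prefix it is given:

```shell
# Hypothetical 2,500-line input file.
dir=$(mktemp -d)
seq 1 2500 > "$dir/foo.txt"

split "$dir/foo.txt" "$dir/default_"       # every 1,000 lines (the default)
split -l 500 "$dir/foo.txt" "$dir/lines_"  # every 500 lines
split -b 100 "$dir/foo.txt" "$dir/bytes_"  # every 100 bytes

# Count the pieces and measure the first byte-sized chunk.
default_count=$(ls "$dir"/default_* | wc -l)
lines_count=$(ls "$dir"/lines_* | wc -l)
first_chunk_bytes=$(wc -c < "$dir/bytes_aa")
echo "$default_count $lines_count $first_chunk_bytes"

rm -r "$dir"
```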
grep | Outputs only those lines that match a given pattern. Accepts regular expressions, allowing complex patterns to be matched, and supports recursing through the filesystem itself—allowing you to search across whole directories of files. Show lines in standard input that match a given pattern:
Show lines that don’t match a given pattern:
Match against a regular expression:
Case-insensitive matching:
Search through all files in the current directory and below:
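A sketch of the five cases above, run against two small hypothetical files; the pattern foo is only an example:

```shell
# Hypothetical sample files, one of them in a subdirectory.
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'Foo bar\nbaz\nfoo 42\n' > "$dir/a.txt"
printf 'nothing here\nfoo again\n' > "$dir/sub/b.txt"

matching=$(grep 'foo' < "$dir/a.txt")         # lines that match
not_matching=$(grep -v 'foo' < "$dir/a.txt")  # lines that don't match
regex=$(grep -E '[0-9]+$' < "$dir/a.txt")     # extended regular expression
insensitive=$(grep -i 'foo' < "$dir/a.txt")   # ignore case
recursive=$(grep -r 'foo' "$dir")             # this directory and below
echo "$recursive"

rm -r "$dir"
```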
cut | Splits lines into fields, allowing you to process delimited data and only output particular columns. cut splits on tab by default but can be configured to separate fields by any character using the -d option. Split fields on the space character and output the fourth field:
Output multiple fields, illustrating both ranges and comma-separated lists of fields:
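For instance, with a made-up line of six space-separated words:

```shell
line='alpha beta gamma delta epsilon zeta'

# -d sets the delimiter, -f selects fields (numbered from 1).
fourth=$(printf '%s\n' "$line" | cut -d ' ' -f 4)
echo "$fourth"

# A range (1-2) combined with a single extra field (5).
several=$(printf '%s\n' "$line" | cut -d ' ' -f 1-2,5)
echo "$several"
```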
tr | Performs a substitution on the input, allowing you to replace certain characters with others. tr is flexible about how characters are specified, allowing you to define ranges and use character classes. Convert uppercase input to lowercase:
The same, but accounting for non-ASCII characters:
Delete numbers from the input:
Delete anything that isn’t a letter from the input (the -c stands for complement):
Compress multiple-space characters into one:
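The five cases above might look like this. How well the character classes cope with non-ASCII input depends on the locale and on the tr implementation, so the sample input here sticks to ASCII:

```shell
# Ranges: uppercase ASCII letters to lowercase.
lower=$(printf 'Hello World\n' | tr 'A-Z' 'a-z')

# Character classes: the locale decides what counts as upper and lower.
classes=$(printf 'Hello World\n' | tr '[:upper:]' '[:lower:]')

# -d deletes every character in the set.
no_digits=$(printf 'a1b22c333\n' | tr -d '0-9')

# -c complements the set, so this deletes everything except letters
# (including the trailing newline).
letters=$(printf 'a1, b2!\n' | tr -cd '[:alpha:]')

# -s squeezes runs of the same character into one.
squeezed=$(printf 'a   b  c\n' | tr -s ' ')

echo "$lower / $no_digits / $letters / $squeezed"
```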
wc | Counts the number of bytes (with the -c option), characters (with the -m option), words (with the -w option), or lines (with the -l option) in a given file or in standard input. If no options are specified, will output lines, words, and bytes. Show statistics for a file. The first column shows lines, the second words, and the third bytes (the same as characters for plain ASCII text):
Display the number of lines in a file:
Display the number of characters in standard input:
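A short sketch with a hypothetical two-line foo.txt. The character count uses -m here; -c would count bytes, which is the same number for ASCII input like this:

```shell
# Hypothetical foo.txt: 2 lines, 3 words, 14 bytes.
dir=$(mktemp -d)
printf 'one two\nthree\n' > "$dir/foo.txt"

stats=$(wc "$dir/foo.txt")       # lines, words, bytes, then the filename
lines=$(wc -l < "$dir/foo.txt")  # number of lines only
chars=$(printf 'hello\n' | wc -m)  # characters in standard input
echo "$stats"

rm -r "$dir"
```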
sort | Sorts input or the content of files. Takes options to treat sort data as numeric, to sort insensitively to case, and to ignore leading whitespace, among other things. Sort a file alphabetically:
Sort input numerically:
Sort input in reverse order:
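Sketched with made-up data (fruit.txt is a hypothetical file name):

```shell
# Hypothetical unsorted file.
dir=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$dir/fruit.txt"

alpha=$(sort "$dir/fruit.txt")                # alphabetical order
numeric=$(printf '10\n9\n100\n' | sort -n)    # compare as numbers, not text
reverse=$(sort -r "$dir/fruit.txt")           # reverse order
echo "$numeric"

rm -r "$dir"
```

Without -n, the numeric example would be ordered as text, putting 10 and 100 ahead of 9.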
column | Converts data into columnar format. Very useful for performing alignment that would otherwise take painstaking manual adjustment. Display a delimited file as an aligned table, with a header row:
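As a sketch, assuming column is installed (it ships separately from coreutils) and using a hypothetical comma-delimited file whose first line is the header:

```shell
# Hypothetical comma-delimited file with a header row.
dir=$(mktemp -d)
printf 'name,qty\napple,3\nfig,12\n' > "$dir/stock.csv"

# -t builds a table; -s names the delimiter used in the input.
table=$(column -t -s ',' "$dir/stock.csv")
echo "$table"

rm -r "$dir"
```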
uniq | Outputs its input, but for all consecutively identical lines, outputs those lines only once. So if a line containing foo was followed by three identical lines all containing foo, these four lines would be compressed to one in uniq’s output. Compress consecutively identical lines:
Display each distinct line once across the whole file, by using sort to ensure identical lines always appear together:
Display each distinct line across the whole file, along with a count of how many times it occurred:
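A sketch with made-up input in which foo appears both consecutively and again later:

```shell
input=$(printf 'foo\nfoo\nbar\nfoo\n')

# Only adjacent duplicates collapse, so the final foo survives.
collapsed=$(printf '%s\n' "$input" | uniq)

# Sorting first brings identical lines together.
distinct=$(printf '%s\n' "$input" | sort | uniq)

# -c prefixes each line with the number of times it occurred.
counted=$(printf '%s\n' "$input" | sort | uniq -c)
echo "$counted"
```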
paste | Nothing to do with the clipboard—as its name might suggest to modern ears, at least. paste joins together two files so that line one from file two is placed on the same line as line one from file one, joined by a tab. This effectively creates tabular data. Join two files horizontally:
Join two files vertically:
Separate fields with spaces, rather than tabs:
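One possible reading of these three cases, using two hypothetical three-line files. The "vertical" case is assumed here to mean the -s (serial) option, which turns each file into a single row:

```shell
# Hypothetical sample files.
dir=$(mktemp -d)
printf '1\n2\n3\n' > "$dir/a.txt"
printf 'a\nb\nc\n' > "$dir/b.txt"

side_by_side=$(paste "$dir/a.txt" "$dir/b.txt")     # line n of each file, tab-separated
serial=$(paste -s "$dir/a.txt" "$dir/b.txt")        # one row per file
with_spaces=$(paste -d ' ' "$dir/a.txt" "$dir/b.txt")  # space instead of tab
echo "$with_spaces"

rm -r "$dir"
```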
join | Joins two files together. Unlike paste, which does this based on the position of the lines in each file, join functions much more like a join in a relational database: it looks for fields with the same values and joins based on that equality. Join two files based on the equality of values in the first column:
Output only certain fields:
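A small sketch with two hypothetical files that share an ID in their first column. join expects both inputs to be sorted on the join field:

```shell
# Hypothetical files, both sorted on their first column.
dir=$(mktemp -d)
printf '1 apple\n2 banana\n3 cherry\n' > "$dir/names.txt"
printf '1 red\n3 dark\n4 blue\n' > "$dir/colors.txt"

# Only IDs present in both files (1 and 3) appear in the output.
joined=$(join "$dir/names.txt" "$dir/colors.txt")

# -o picks fields as FILE.FIELD: here field 2 of each file.
fields=$(join -o 1.2,2.2 "$dir/names.txt" "$dir/colors.txt")
echo "$joined"

rm -r "$dir"
```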
comm | Given two sorted files, displays the lines that occur only in file one, the lines that occur only in file two, and the lines that occur in both. Can be configured to suppress any of these three columns (for example, to show just the lines that are in both files, or just the lines that are unique to one file or the other). Display lines that occur only in file one, that occur only in file two, and that occur in both:
Display only the lines that are in both files:
Display only the lines that occur only in file one:
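Sketched with two hypothetical sorted files. The -1, -2, and -3 options each suppress the corresponding column:

```shell
# Hypothetical files, already sorted.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/one.txt"
printf 'banana\ncherry\ndamson\n' > "$dir/two.txt"

# Three tab-indented columns: only in one, only in two, in both.
all=$(comm "$dir/one.txt" "$dir/two.txt")

both=$(comm -12 "$dir/one.txt" "$dir/two.txt")       # hide columns 1 and 2
only_first=$(comm -23 "$dir/one.txt" "$dir/two.txt") # hide columns 2 and 3
echo "$all"

rm -r "$dir"
```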
By composing these commands in different ways, you’ll be able to perform many text processing tasks without having to reach for anything else. When you find them limiting, you can reach for Ruby to fill in the gaps.
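As one illustration of this kind of composition, a word-frequency count can be sketched from five of the commands in this list. The sample sentence is made up for the example:

```shell
text='the cat and the dog and the bird'

# One word per line, group identical words, count them,
# order by count (highest first), and keep the top two.
top=$(printf '%s\n' "$text" |
  tr -s ' ' '\n' |
  sort |
  uniq -c |
  sort -rn |
  head -n 2)
echo "$top"
```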