Appendix 2
Useful Shell Commands

Although this is first and foremost a book about text processing with Ruby, it’s also about an approach to text processing that fits in with existing tools. If you’re not particularly experienced with the Unix command line and the utilities it offers, this list might help you find the right tool for the job.

There are only 16 commands here, but together they form a considerable arsenal and—along with Ruby—will provide you with the tools you need for virtually all text processing tasks.

These commands are all part of GNU’s coreutils project and are invariably packaged with Linux distributions. Mac OS X ships with virtually all of them, and those that it doesn’t can be installed using Homebrew:[17]

 
$ brew install coreutils

Windows users should install Cygwin[18] to get them.

The rest of this appendix gives a summary of each of these 16 commands.

cat

Outputs the content of the filenames passed to it. Its name comes from the word concatenate, since it concatenates the files one after another. If no files are given, it outputs standard input.

 
$ cat foo.txt
foo
$ cat bar.txt
bar
$ cat foo.txt bar.txt
foo
bar
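cat can also number the lines it outputs, via the -n option, which is useful when you need to refer to positions in a file:

```shell
# -n prefixes each output line with its line number
printf 'foo\nbar\n' | cat -n
```
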

tac

Exactly like cat, but outputs the lines of the files in reverse—that is, starting from the last line of the first file and working backward, then the last line of the second, and so on.

 
$ cat foo.txt
foo
bar
$ tac foo.txt
bar
foo
$ cat baz.txt
baz
$ tac foo.txt baz.txt
bar
foo
baz

shuf

Randomizes the order of lines in standard input or in the specified file. If the -n option is specified, will output at most that many lines; otherwise, all of the lines in the input will be shuffled.

Display five random dictionary words:

 
$ shuf -n 5 /usr/share/dict/words
gasification
merrymake
thingum
chiliastic
zygose

head

Outputs the first n lines of a file, where n is by default 10 but can be altered by passing a number to head.

Output the first ten lines of foo.txt:

 
$ head foo.txt

Output the first two lines of foo.txt:

 
$ head -2 foo.txt

Output the first five lines of foo.txt:

 
$ head -5 foo.txt
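The bare numeric form shown above is a historical shorthand; the portable spelling, and the one guaranteed by POSIX, is -n followed by a count:

```shell
# equivalent to head -5, using the standard -n option
head -n 2 /etc/passwd
```
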

tail

Similar to head, but with the last n lines rather than the first.

Output the last ten lines of foo.txt:

 
$ tail foo.txt

Output the last two lines of foo.txt:

 
$ tail -2 foo.txt

Output the last five lines of foo.txt:

 
$ tail -5 foo.txt
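tail can also count from the beginning rather than the end: -n +N outputs everything from line N onward, which is handy for skipping a header row:

```shell
# print from line 2 to the end, i.e. skip the first line
printf 'header\nrow1\nrow2\n' | tail -n +2
```
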

split

Takes a file and splits it into new files every n lines, where n is by default 1,000.

Split a file every 1,000 lines:

 
$ split big.txt

Split a file every 500 lines:

 
$ split -l 500 big.txt

Split a file every 100 bytes:

 
$ split -b 100 big.txt
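The pieces are written to the current directory as files named xaa, xab, xac, and so on. For example, splitting a three-line file into one-line pieces:

```shell
# work in a scratch directory so the x* files don't clutter anything
cd "$(mktemp -d)"
printf 'a\nb\nc\n' > demo.txt
split -l 1 demo.txt
# each piece now holds one line of the original
cat xab
```
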

grep

Outputs only those lines that match a given pattern. Accepts regular expressions, allowing complex patterns to be matched, and supports recursing through the filesystem itself—allowing you to search across whole directories of files.

Show lines in standard input that match a given pattern:

 
$ printf 'foo\nbar\n' | grep foo
foo

Show lines that don’t match a given pattern:

 
$ printf 'foo\nbar\n' | grep -v foo
bar

Match against a regular expression:

 
$ printf 'foo\nbar\n' | grep '^f'
foo

Case-insensitive matching:

 
$ printf 'foo\nbar\n' | grep -i FOO
foo

Search through all files in the current directory and below:

 
$ grep -r foo .
./foo.txt:foo
./bar.txt:foo
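When all you need is how many lines match, the -c option prints a count instead of the lines themselves:

```shell
# count the matching lines rather than printing them
printf 'foo\nbar\nfoo\n' | grep -c foo
# outputs: 2
```
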

cut

Splits lines into fields, allowing you to process delimited data and only output particular columns. cut splits on tab by default but can be configured to separate fields by any character using the -d option.

Split fields on the space character and output the fourth field:

 
$ date
Tue 30 Jun 2015 11:37:52 BST

$ date | cut -d ' ' -f 4
2015

Output multiple fields, illustrating both ranges and comma-separated lists of fields:

 
$ date | cut -d ' ' -f 1-3,5
Tue 30 Jun 11:37:52
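cut can also slice by character position rather than by field, using the -c option:

```shell
# extract characters two through four of each line
echo 'hello' | cut -c 2-4
# outputs: ell
```
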

tr

Performs a substitution on the input, allowing you to replace certain characters with others. tr is flexible about how characters are specified, allowing you to define ranges and use character classes.

Convert uppercase input to lowercase:

 
$ echo 'HELLO WORLD' | tr A-Z a-z
hello world

The same, but accounting for non-ASCII characters:

 
$ echo 'HËLLØ WÔRLD' | tr '[:upper:]' '[:lower:]'
hëllø wôrld

Delete numbers from the input:

 
$ echo 'HELLO 123 WORLD' | tr -d 0-9
HELLO  WORLD

Delete anything that isn’t a letter from the input (the -c stands for complement):

 
$ echo 'HELLO %^!@$()' | tr -cd a-zA-Z
HELLO

Compress multiple-space characters into one:

 
$ echo 'Hello    world' | tr -s ' '
Hello world
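Ranges and translation sets can be combined in more elaborate ways; a classic party trick is ROT13, which rotates every letter thirteen places through the alphabet:

```shell
# ROT13: map A-Z onto N-ZA-M (and likewise for lowercase)
echo 'Hello' | tr 'A-Za-z' 'N-ZA-Mn-za-m'
# outputs: Uryyb
```

Running the same command on its own output gives you back the original text.
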

wc

Counts the number of bytes (with the -c option), words (with the -w option), or lines (with the -l option) in a given file or in standard input. (Strictly speaking, -c counts bytes; -m counts characters, a distinction that only matters for multibyte encodings.) If no options are specified, it will output all three counts.

Show statistics for a file. The first column shows lines, the second words, and the third bytes:

 
$ wc foo.txt
103 392 3944 foo.txt

Display the number of lines in a file:

 
$ wc -l foo.txt
103 foo.txt

Display the number of bytes in standard input:

 
$ echo "Hello world" | wc -c
12

sort

Sorts input or the content of files. Takes options to treat sort data as numeric, to sort insensitively to case, and to ignore leading whitespace, among other things.

Sort a file alphabetically:

 
$ cat foo.txt
foo
bar

$ sort foo.txt
bar
foo

Sort input numerically:

 
$ printf '12\n111\n1\n' | sort
1
111
12

$ printf '12\n111\n1\n' | sort -n
1
12
111

Sort input in reverse order:

 
$ printf 'ant\nmole\nzebra\n' | sort -r
zebra
mole
ant
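sort can also key on a particular column with the -k option, which combines naturally with numeric sorting. A sketch, using whitespace-separated animal counts:

```shell
# sort by the second field, treated numerically
printf 'mole 3\nant 10\nzebra 2\n' | sort -k2,2n
# outputs:
# zebra 2
# mole 3
# ant 10
```
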

column

Converts data into columnar format. Very useful for performing alignment that would otherwise take painstaking manual adjustment.

Display a delimited file as an aligned table, with a header row:

 
$ cat people.txt
Samantha 57 Pianist
Alice 31 Biochemist
Terence 90 Retired
Alex 20 Student

$ ( echo "NAME AGE JOB"; cat people.txt ) | column -t
NAME      AGE  JOB
Samantha  57   Pianist
Alice     31   Biochemist
Terence   90   Retired
Alex      20   Student

uniq

Outputs its input, but for all consecutively identical lines, outputs those lines only once. So if a line containing foo was followed by three identical lines all containing foo, these four lines would be compressed to one in uniq’s output.

Compress consecutively identical lines:

 
$ cat foo.txt
foo
foo
bar
foo

$ uniq foo.txt
foo
bar
foo

Display each distinct line in the file only once, by using sort to ensure identical lines always appear together:

 
$ sort foo.txt | uniq
bar
foo

Display each distinct line in the file, along with a count of how many times it occurred:

 
$ sort foo.txt | uniq -c
      1 bar
      3 foo
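Conversely, the -d option outputs only the lines that are repeated, once each:

```shell
# show only the lines that occur more than once
printf 'foo\nfoo\nbar\nfoo\n' | sort | uniq -d
# outputs: foo
```
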

paste

Nothing to do with the clipboard—as its name might suggest to modern ears, at least. paste joins together two files so that line one from file two is placed on the same line as line one from file one, joined by a tab. This effectively creates tabular data.

Join two files horizontally:

 
$ cat first-names
Avdi
Katrina
David

$ cat last-names
Grimm
Owen
Brady

$ paste first-names last-names
Avdi	Grimm
Katrina	Owen
David	Brady

Join two files vertically:

 
$ paste -s first-names last-names
Avdi	Katrina	David
Grimm	Owen	Brady

Separate fields with spaces, rather than tabs:

 
$ paste -d ' ' first-names last-names
Avdi Grimm
Katrina Owen
David Brady

join

Joins two files together. Unlike paste, which does this based on the position of the lines in each file, join functions much more like a join in a relational database: it looks for fields with the same values and joins based on that equality.

Join two files based on the equality of values in the first column:

 
$ cat users
alice@example.com Alice Jones
bob@example.com Bob Smith

$ cat orders
alice@example.com Beer $3.50
bob@example.com Chips $1.95

$ join users orders
alice@example.com Alice Jones Beer $3.50
bob@example.com Bob Smith Chips $1.95

Output only certain fields:

 
$ join -o 1.2,2.2 users orders
Alice Beer
Bob Chips

comm

Given two sorted files, displays the lines that occur only in file one, the lines that occur only in file two, and the lines that occur in both. Can be configured to show any combination of these three columns (for example, only the lines common to both files, or only the lines unique to one of them).

Display lines that occur only in file one, that occur only in file two, and that occur in both:

 
$ cat 1.txt
bar
foo

$ cat 2.txt
bar
baz

$ comm 1.txt 2.txt
		bar
	baz
foo

Display only the lines that are in both files:

 
$ comm -1 -2 1.txt 2.txt
bar

Display only the lines that occur only in file one:

 
$ comm -2 -3 1.txt 2.txt
foo
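The suppression options compose: -3 hides the shared column entirely, leaving only the lines unique to either file:

```shell
# work in a scratch directory for the demonstration files
cd "$(mktemp -d)"
printf 'bar\nfoo\n' > 1.txt
printf 'bar\nbaz\n' > 2.txt
# columns one and two only: lines unique to 1.txt or 2.txt
comm -3 1.txt 2.txt
```
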

By composing together these various commands in different ways, you’ll be able to perform many text processing tasks without having to reach for anything else. When you find them limiting, you can reach for Ruby to fill in the gaps.
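As a small taste of that composition, here is a classic word-frequency pipeline built entirely from commands in this appendix: tr splits the input into one word per line, sort and uniq -c count the occurrences, and sort -rn ranks them:

```shell
# rank the words in a sentence by how often they occur
echo 'the cat sat on the mat' | tr ' ' '\n' | sort | uniq -c | sort -rn
```

The most frequent word ("the", with a count of 2) appears first in the output.
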
