Compressing Whitespace

Problem

You have runs of whitespace in a file (perhaps it is fixed length, space padded) and you need to compress the spaces down to a single character or delimiter.

Solution

Use tr or awk as appropriate.

Discussion

If you are trying to compress runs of whitespace down to a single character, you can use tr, but be aware that you may damage the file if it is not well formed. For example, if fields are delimited by multiple whitespace characters but also contain internal spaces, compressing every run down to one character removes that distinction. Imagine that the _ characters in the following example were spaces instead. Note that → denotes a literal tab character in the output.

$ cat data_file
Header1             Header2              Header3
Rec1_Field1         Rec1_Field2          Rec1_Field3
Rec2_Field1         Rec2_Field2          Rec2_Field3
Rec3_Field1         Rec3_Field2          Rec3_Field3

$ cat data_file | tr -s ' ' '\t'
Header1 → Header2 → Header3
Rec1_Field1 → Rec1_Field2 → Rec1_Field3
Rec2_Field1 → Rec2_Field2 → Rec2_Field3
Rec3_Field1 → Rec3_Field2 → Rec3_Field3
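
To make the warning about damaged files concrete, here is a minimal sketch that runs the same tr command over a line whose fields really do contain spaces. The sample line and the variable names are made up for illustration.

```shell
# Hypothetical sample line: two fields separated by a run of spaces,
# each field containing one internal space.
line='Rec1 Field1         Rec1 Field2'

# tr maps every space to a tab, then -s squeezes each run of tabs to one.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' '\t')

# Count the tab-separated fields that result.
field_count=$(printf '%s\n' "$squeezed" | awk -F '\t' '{ print NF }')

printf 'fields after tr: %s\n' "$field_count"    # prints "fields after tr: 4"
```

The line went in with two fields and came out with four, because tr cannot tell an internal space from a delimiter.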

If your field delimiter is more than a single character, tr won’t work, since it translates single characters from its first set into the matching single character in the second set. You can use awk to combine or convert field separators. awk’s internal field separator FS accepts regular expressions, so you can separate on pretty much anything. There is a handy trick to this as well. An assignment to any field causes awk to reassemble the record using the output field separator OFS. So assigning field one to itself and then printing the record has the effect of translating FS to OFS without you having to worry about how many fields there are in each record.
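
As a small illustration of the $1 = $1 trick, the sketch below uses a regular expression for FS and a single character for OFS. The input lines and the choice of | as the output separator are invented for this example.

```shell
# Hypothetical input: fields separated by a comma followed by
# zero or more spaces; the two records have different field counts.
converted=$(printf 'a,  b, c\nd,e\n' |
    awk 'BEGIN { FS = ", *"; OFS = "|" } { $1 = $1; print }')

printf '%s\n' "$converted"
# prints:
# a|b|c
# d|e
```

The assignment to $1 changes nothing in the field itself. It only forces awk to rebuild the record with OFS between the fields, however many fields each record has.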

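In the data file below, multiple spaces delimit fields, but fields also have internal spaces, so the simpler case of awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' data_file1 won’t work: awk’s default FS splits on any run of whitespace, so the internal spaces would be treated as separators too. Here is the data file: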

$ cat data_file1
Header1             Header2              Header3
Rec1 Field1         Rec1 Field2          Rec1 Field3
Rec2 Field1         Rec2 Field2          Rec2 Field3
Rec3 Field1         Rec3 Field2          Rec3 Field

In the next example, we assign two spaces to FS and a tab to OFS. We then make an assignment ($1 = $1) so that awk rebuilds the record, but because every pair of spaces counts as a separator, that leaves a string of tabs where each run of spaces used to be (plus a single leftover space when the run has an odd length). So we use gsub to squash each string of tabs and any leftover space into one tab, and then we print. Note that → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is 09 while ASCII space is 20.

$ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' \
    data_file1
Header1 → Header2 → Header3
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field


$ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' \
    data_file1 | hexdump -C
00000000 48 65 61 64 65 72 31 09  48 65 61 64 65 72 32 09 |Header1.Header2.|
00000010 48 65 61 64 65 72 33 0a  52 65 63 31 20 46 69 65 |Header3.Rec1 Fie|
00000020 6c 64 31 09 52 65 63 31  20 46 69 65 6c 64 32 09 |ld1.Rec1 Field2.|
00000030 52 65 63 31 20 46 69 65  6c 64 33 0a 52 65 63 32 |Rec1 Field3.Rec2|
00000040 20 46 69 65 6c 64 31 09  52 65 63 32 20 46 69 65 | Field1.Rec2 Fie|
00000050 6c 64 32 09 52 65 63 32  20 46 69 65 6c 64 33 0a |ld2.Rec2 Field3.|
00000060 52 65 63 33 20 46 69 65  6c 64 31 09 52 65 63 33 |Rec3 Field1.Rec3|
00000070 20 46 69 65 6c 64 32 09  52 65 63 33 20 46 69 65 | Field2.Rec3 Fie|
00000080 6c 64 0a                                         |ld.|
00000083
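
Since FS accepts regular expressions, a variation on the command above is to make FS match two or more spaces in one go, which should make the gsub step unnecessary. This is an alternative sketch rather than the approach shown above, and the two-line sample is a cut-down stand-in for data_file1.

```shell
# Cut-down sample shaped like data_file1 (an assumption for illustration).
sample=$(printf '%s\n' \
    'Header1             Header2              Header3' \
    'Rec1 Field1         Rec1 Field2          Rec1 Field3')

# FS = "  +" is a regex: one space followed by one or more spaces,
# so a whole run of padding is a single separator and a lone
# internal space is left alone.
result=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = "  +"; OFS = "\t" } { $1 = $1; print }')

printf '%s\n' "$result"
```

Be aware that trailing padding at the end of a line would still match FS and produce an empty final field, so fixed-length records may need their trailing spaces trimmed first.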

You can also use awk to trim leading and trailing whitespace in the same way, but as noted previously, this will replace your field separators unless they are already spaces:

# Remove leading and trailing whitespace,
# but also replace TAB field separators with spaces
$ awk '{ $1 = $1; print }' white_space
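
Here is a quick way to see both effects at once, using a made-up input line in place of the white_space file:

```shell
# Hypothetical input: leading spaces, a TAB between the two fields,
# and trailing spaces.
trimmed=$(printf '   left\tright   \n' | awk '{ $1 = $1; print }')

# Brackets make the edges of the result visible.
printf '[%s]\n' "$trimmed"    # prints "[left right]"
```

The leading and trailing whitespace is gone, but the tab between the fields has become a single space, because the record was rebuilt with the default OFS.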

See Also
