You have runs of whitespace in a file (perhaps it is fixed-length and space-padded) and you need to compress each run down to a single character or delimiter.
If you are trying to compress runs of whitespace down to a single character, you can use tr, but be aware that you may damage the file if it is not well formed. For example, if fields are delimited by multiple whitespace characters but also contain single spaces internally, compressing multiple spaces down to one space will remove that distinction. Imagine that the _ characters in the following example were spaces instead: every field would be split in two. Note that → denotes a literal tab character in the output.
    $ cat data_file
    Header1        Header2        Header3
    Rec1_Field1    Rec1_Field2    Rec1_Field3
    Rec2_Field1    Rec2_Field2    Rec2_Field3
    Rec3_Field1    Rec3_Field2    Rec3_Field3

    $ cat data_file | tr -s ' ' '\t'
    Header1→Header2→Header3
    Rec1_Field1→Rec1_Field2→Rec1_Field3
    Rec2_Field1→Rec2_Field2→Rec2_Field3
    Rec3_Field1→Rec3_Field2→Rec3_Field3
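To see the damage concretely, here is a minimal sketch that feeds one hypothetical record through the same tr command and counts the resulting tab-separated fields. The sample line is made up for illustration and is not taken from data_file.

```shell
# Hypothetical record: two fields separated by four spaces,
# each field containing one internal space.
line='Rec1 Field1    Rec1 Field2'

# Translate spaces to tabs, squeezing each run into a single tab.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' '\t')

# Count the tab-separated fields: the internal spaces became
# delimiters too, so this prints 4 rather than 2.
field_count=$(printf '%s\n' "$squeezed" | awk -F'\t' '{ print NF }')
echo "$field_count"
```

The single space inside each field is indistinguishable from a delimiter once every run has been squeezed, which is why the underscore-free version of the file cannot be handled by tr alone.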
If your field delimiter is more than a single character, tr won’t work since it translates single characters from its first set into the matching single character in the second set. You can use awk to combine or convert field separators. awk’s internal field separator FS accepts regular expressions, so you can separate on pretty much anything. There is a handy trick to this as well. An assignment to any field causes awk to reassemble the record using the output field separator OFS. So assigning field one to itself and then printing the record has the effect of translating FS to OFS without you having to worry about how many fields there are in each record.
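The reassembly trick can be sketched on a small, hypothetical colon-delimited input (the data and the `-` output separator are chosen only for illustration). The first command prints records untouched; the second adds the self-assignment and so rewrites the separators.

```shell
# Hypothetical colon-delimited input with differing field counts.
input='a:b:c
d:e'

# Without an assignment, awk prints each record unchanged,
# so OFS is never used.
untouched=$(printf '%s\n' "$input" |
    awk 'BEGIN { FS = ":"; OFS = "-" } { print }')

# Assigning $1 to itself forces awk to rebuild the record with OFS.
rebuilt=$(printf '%s\n' "$input" |
    awk 'BEGIN { FS = ":"; OFS = "-" } { $1 = $1; print }')

printf '%s\n' "$untouched"   # a:b:c and d:e, unchanged
printf '%s\n' "$rebuilt"     # a-b-c and d-e
```

The same command handles the three-field and the two-field line, since awk rebuilds whatever fields it found.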
In this example, multiple spaces delimit fields, but fields also have internal spaces, so the simpler case of awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' data_file1 won’t work. Here is a data file:
    $ cat data_file1
    Header1         Header2         Header3
    Rec1 Field1     Rec1 Field2     Rec1 Field3
    Rec2 Field1     Rec2 Field2     Rec2 Field3
    Rec3 Field1     Rec3 Field2     Rec3 Field
In the next example, we assign two spaces to FS and tab to OFS. We then make an assignment ($1 = $1) so awk rebuilds the record, but that results in strings of tabs replacing the double spaces, so we use gsub to squash each string of tabs, along with any single leftover space, into one tab, then we print. Note that → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is hex 09 while ASCII space is hex 20.
    $ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' data_file1
    Header1→Header2→Header3
    Rec1 Field1→Rec1 Field2→Rec1 Field3
    Rec2 Field1→Rec2 Field2→Rec2 Field3
    Rec3 Field1→Rec3 Field2→Rec3 Field

    $ awk 'BEGIN { FS = "  "; OFS = "\t" } { $1 = $1; gsub(/\t+ ?/, "\t"); print }' data_file1 | hexdump -C
    00000000  48 65 61 64 65 72 31 09  48 65 61 64 65 72 32 09  |Header1.Header2.|
    00000010  48 65 61 64 65 72 33 0a  52 65 63 31 20 46 69 65  |Header3.Rec1 Fie|
    00000020  6c 64 31 09 52 65 63 31  20 46 69 65 6c 64 32 09  |ld1.Rec1 Field2.|
    00000030  52 65 63 31 20 46 69 65  6c 64 33 0a 52 65 63 32  |Rec1 Field3.Rec2|
    00000040  20 46 69 65 6c 64 31 09  52 65 63 32 20 46 69 65  | Field1.Rec2 Fie|
    00000050  6c 64 32 09 52 65 63 32  20 46 69 65 6c 64 33 0a  |ld2.Rec2 Field3.|
    00000060  52 65 63 33 20 46 69 65  6c 64 31 09 52 65 63 33  |Rec3 Field1.Rec3|
    00000070  20 46 69 65 6c 64 32 09  52 65 63 33 20 46 69 65  | Field2.Rec3 Fie|
    00000080  6c 64 0a                                          |ld.|
    00000083
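Because FS can be a regular expression, one possible alternative (a sketch, not the command used above) is to let FS match any run of two or more spaces, which avoids both the empty fields and the gsub cleanup. The sample record here is hypothetical, and a record that begins with spaces would still yield an empty first field.

```shell
# Hypothetical record: fields separated by runs of spaces,
# with single spaces inside each field.
line='Rec1 Field1     Rec1 Field2     Rec1 Field3'

# FS = "  +" is a regular expression matching two or more spaces,
# so each run is consumed as one separator.
tabbed=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = "  +"; OFS = "\t" } { $1 = $1; print }')

# Show the result with tabs made visible as \t (POSIX sed "l" command).
printf '%s\n' "$tabbed" | sed -n 'l'
```

Single spaces never match the pattern, so the spaces inside each field survive untouched.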
You can also use awk to trim leading and trailing whitespace in the same way, but as noted previously, this will replace your field separators unless they are already spaces:
    # Remove leading and trailing whitespace,
    # but also replace TAB field separators with spaces
    $ awk '{ $1 = $1; print }' white_space
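A quick way to check this behavior is to run the same awk program on a single hypothetical line (made up here, not the contents of white_space) that has leading spaces, trailing spaces, and a tab between two of its words.

```shell
# Hypothetical line with leading and trailing spaces, a run of
# spaces between the first two words, and a tab before the third.
messy=$(printf '   alpha   beta\tgamma   ')

# With the default FS, awk splits on runs of spaces and tabs and
# ignores leading and trailing whitespace; rebuilding the record
# joins the fields with the default OFS, a single space.
trimmed=$(printf '%s\n' "$messy" | awk '{ $1 = $1; print }')

printf '[%s]\n' "$trimmed"   # prints [alpha beta gamma]
```

The tab between beta and gamma is replaced by a space, which is the side effect the comment in the command above warns about.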
Effective awk Programming by Arnold Robbins (O’Reilly)
sed & awk by Arnold Robbins and Dale Dougherty (O’Reilly)
“tr Escape Sequences” in Appendix A
“Table of ASCII Values” in Appendix A