Processing Files with No Line Breaks

Problem

You have a large file with no line breaks, and you need to process it.

Solution

Pre-process the file and add line breaks in appropriate places. For example, OpenOffice.org’s Open Document Format (ODF) files are basically zipped XML files. It is possible to unzip them and grep the XML, which we did a lot while writing this book. See Comparing Two Documents for a more comprehensive treatment of ODF files. In this example, we insert a newline after every closing angle bracket (>). That makes it much easier to process the file using grep or other textutils. Note that we must type a backslash (\) followed immediately by the Enter key to embed an escaped newline in the sed script:

$ wc -l content.xml
1 content.xml


$ sed -e 's/>/>\
/g' content.xml | wc -l
1687
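
To see the effect on a small input, here is a minimal sketch. The file names sample.xml and sample-split.xml and the XML content are made up for illustration; only the sed command comes from the recipe above.

```shell
# Build a tiny one-line XML file (sample.xml is a hypothetical name).
printf '%s\n' '<doc><p>Hello</p><p>World</p></doc>' > sample.xml

# Insert a newline after every '>' using the backslash-newline form.
sed -e 's/>/>\
/g' sample.xml > sample-split.xml

# grep now sees one tag per line, so counting matching lines is meaningful.
grep -c 'p>' sample-split.xml
```

Before the split, grep could only report that the single line matched; after it, each tag can be matched or counted on its own line.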

If you have fixed-length records with no newlines, do this instead, where 48 is the length of each record:

$ cat fixed-length
Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_2_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_3_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_4_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_5_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_6_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_7_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_8_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_9_ _
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_10_
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_11_
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_12_
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ


$ wc -l fixed-length
  1 fixed-length

$ sed 's/.\{48\}/&\
/g;' fixed-length
Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_2_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_3_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_4_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_5_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_6_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_7_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_8_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_9_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ


$ perl -pe 's/(.{48})/$1\n/g;' fixed-length
Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_2_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_3_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_4_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_5_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_6_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_7_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_8_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_9_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
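
As an alternative that the recipe itself does not use, the standard fold utility can also break a long line at a fixed width. The sketch below assumes plain ASCII data; the file name records.dat and its 8-character records are invented for the example.

```shell
# Build a one-line file of three 8-character records (records.dat is a hypothetical name).
printf '%s\n' 'REC1aaaaREC2bbbbREC3cccc' > records.dat

# fold breaks each input line every 8 columns; -w sets the width.
fold -w 8 records.dat
```

fold counts display columns by default, so for data containing tabs or multi-byte characters the -b option (count bytes) may be the safer choice.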

Discussion

This happens often when people create output programmatically, especially using canned modules and especially with HTML or XML output.

Note the sed substitutions have an odd construct that allows an embedded newline. In sed, a literal ampersand (&) on the righthand side (RHS) of a substitution is replaced by the entire expression matched on the lefthand side (LHS), and the trailing backslash (\) on the first line escapes the newline so the shell accepts it, but it’s still in the sed RHS substitution. This is necessary because sed does not portably recognize \n as a metacharacter on the RHS of s///.
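
The behavior of the ampersand can be checked in isolation with a short sketch. The input string and the bracket decoration are arbitrary choices for illustration.

```shell
# '&' on the RHS stands for whatever the LHS matched (here, each pair of characters).
printf '%s\n' 'abcdef' | sed 's/.\{2\}/[&]/g'
# prints: [ab][cd][ef]

# Written as '\&', it is a literal ampersand instead.
printf '%s\n' 'abcdef' | sed 's/.\{2\}/\&/g'
# prints: &&&
```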

See Also
