Pre-process the file and add line breaks in appropriate places. For example, Open- Office.org’s Open Document Format (ODF) files are basically zipped XML files. It is possible to unzip them and grep the XML, which we did a lot while writing this book. See Comparing Two Documents for a more comprehensive treatment of ODF files. In this example, we insert a newline after every closing angle bracket (>). That makes it much easier to process the file using grep or other textutils. Note that we must enter a backslash followed immediately by the Enter key to embed an escaped newline in the sed script:
$ wc -l content.xml 1 content.xml $ sed -e 's/>/> /g' content.xml | wc -l 1687
If you have fixed-length records with no newlines, do this instead, where 48 is the length of the record.
$ cat fixed-length Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_2_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_3_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_4_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_5_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_6_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_7_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_8_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_9_ _ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_10_ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_11_ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZLine_12_ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ $ wc -l fixed-length 1 fixed-length $ sed 's/.{48}/& /g;' fixed-length Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_2_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_3_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_4_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_5_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_6_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_7_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_8_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_9_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ $ perl -pe 's/(.{48})/$1 /g;' fixed-length Line_1_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_2_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_3_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_4_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_5_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_6_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_7_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_8_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_9_ _aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_10_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_11_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ Line_12_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZZZ
This happens often when people create output programatically, especially using canned modules and especially with HTML or XML output.
Note the sed substitutions have an odd construct that allows an embedded newline. In
sed, a literal ampersand (&) on the righthand
side (RHS) of a substitution is replaced by the entire expression
matched on the lefthand side (LHS), and the trailing on the first line escapes the newline
so the shell accepts it, but it’s still in the sed
RHS substitution. This is because sed doesn’t
recognize
as a metacharacter on
the RHS of s///
.
Effective awk Programming by Arnold Robbins (O’Reilly)
sed & awk by Arnold Robbins and Dale Dougherty (O’Reilly)
3.147.89.85