Processing Fixed-Length Records

Problem

You need to read and process data that is in a fixed-length (also called fixed-width) form.

Solution

Use Perl or gawk 2.13 or greater. Given a file like:

$ cat fixed-length_file
Header1-----------Header2-------------------------Header3---------
Rec1 Field1       Rec1 Field2                     Rec1 Field3
Rec2 Field1       Rec2 Field2                     Rec2 Field3
Rec3 Field1       Rec3 Field2                     Rec3 Field3

You can process it using GNU’s gawk, by setting FIELDWIDTHS to the correct field lengths, setting OFS as desired, and making an assignment so gawk rebuilds the record (see the awk trick in Trimming Whitespace). However, gawk does not remove the spaces used in padding the original record, so we use two gsubs to do that, one for all the internal fields and the other for the last field in each record. Finally, we just print. Note the → denotes a literal tab character in the output. The output is a little hard to read, so there is a hex dump as well. Recall that ASCII tab is 09 while ASCII space is 20.

$ gawk ' BEGIN { FIELDWIDTHS = "18 32 16"; OFS = "	" } { $1 = $1; gsub(/ +	/, "
t"); gsub(/ +$/, ""); print }' fixed-length_file
Header1----------- → Header2------------------------- → Header3---------
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3

$ gawk ' BEGIN { FIELDWIDTHS = "18 32 16"; OFS = "	" } { $1 = $1; gsub(/ +	/, "
t"); gsub(/ +$/, ""); print }' fixed-length_file | hexdump -C
00000000 48 65 61 64 65 72 31 2d 2d 2d 2d 2d 2d 2d 2d 2d |Header1---------|
00000010 2d 2d 09 48 65 61 64 65 72 32 2d 2d 2d 2d 2d 2d |--.Header2------|
00000020 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |----------------|
00000030 2d 2d 2d 09 48 65 61 64 65 72 33 2d 2d 2d 2d 2d |---.Header3-----|
00000040 2d 2d 2d 2d 0a 52 65 63 31 20 46 69 65 6c 64 31 |----.Rec1 Field1|
00000050 09 52 65 63 31 20 46 69 65 6c 64 32 09 52 65 63 |.Rec1 Field2.Rec|
00000060 31 20 46 69 65 6c 64 33 0a 52 65 63 32 20 46 69 |1 Field3.Rec2 Fi|
00000070 65 6c 64 31 09 52 65 63 32 20 46 69 65 6c 64 32 |eld1.Rec2 Field2|
00000080 09 52 65 63 32 20 46 69 65 6c 64 33 0a 52 65 63 |.Rec2 Field3.Rec|
00000090 33 20 46 69 65 6c 64 31 09 52 65 63 33 20 46 69 |3 Field1.Rec3 Fi|
000000a0 65 6c 64 32 09 52 65 63 33 20 46 69 65 6c 64 33 |eld2.Rec3 Field3|
000000b0 0a                                              |.|
000000b1

If you don’t have gawk, you can use Perl, which is more straightforward anyway. We use a non-printing while input loop (-n), unpack each record ($_) as it’s read, and turn the resulting list back into a scalar by joining the elements with a tab. We then print each record, adding a newline at the end:

$ perl -ne 'print join("	", unpack("A18 A32 A16", $_) ) . "
";' fixed-length_file
Header1----------- → Header2------------------------- → Header3---------
Rec1 Field1 → Rec1 Field2 → Rec1 Field3
Rec2 Field1 → Rec2 Field2 → Rec2 Field3
Rec3 Field1 → Rec3 Field2 → Rec3 Field3
$ perl -ne 'print join("	", unpack("A18 A32 A16", $_) ) . "
";' fixed-length_file |
hexdump -C
00000000 48 65 61 64 65 72 31 2d 2d 2d 2d 2d 2d 2d 2d 2d |Header1---------|
00000010 2d 2d 09 48 65 61 64 65 72 32 2d 2d 2d 2d 2d 2d |--.Header2------|
00000020 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |----------------|
00000030 2d 2d 2d 09 48 65 61 64 65 72 33 2d 2d 2d 2d 2d |---.Header3-----|
00000040 2d 2d 2d 2d 0a 52 65 63 31 20 46 69 65 6c 64 31 |----.Rec1 Field1|
00000050 09 52 65 63 31 20 46 69 65 6c 64 32 09 52 65 63 |.Rec1 Field2.Rec|
00000060 31 20 46 69 65 6c 64 33 0a 52 65 63 32 20 46 69 |1 Field3.Rec2 Fi|
00000070 65 6c 64 31 09 52 65 63 32 20 46 69 65 6c 64 32 |eld1.Rec2 Field2|
00000080 09 52 65 63 32 20 46 69 65 6c 64 33 0a 52 65 63 |.Rec2 Field3.Rec|
00000090 33 20 46 69 65 6c 64 31 09 52 65 63 33 20 46 69 |3 Field1.Rec3 Fi|
000000a0 65 6c 64 32 09 52 65 63 33 20 46 69 65 6c 64 33 |eld2.Rec3 Field3|
000000b0 0a                                              |.|
000000b1

See the Perl documentation for the pack and unpack template formats.

Discussion

Anyone with any Unix background will automatically use some kind of delimiter in output, since the textutils toolchain is never far from mind, so fixed-length (also called fixed-width) records are rare in the Unix world. They are very common in the mainframe world however, so they will occasionally crop up in large applications that originated on big iron, such as some applications from SAP. As we’ve just seen, it’s no problem to handle.

One caveat to this recipe is that it requires each record to end in a newline. Many old mainframe record formats don’t, in which case you can use Processing Files with No Line Breaks to add newlines to the end of each record before processing.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.227.46.227