cvt_file.pl serves as a tool that converts entire files from one format to another. Another common datafile operation is to manipulate its columns. This is necessary, for example, when importing a file into a program that doesn’t understand how to extract or rearrange input columns for itself. To work around this problem, you can rearrange the datafile instead.
Recall that this chapter began with a description of a scenario involving a 12-column CSV file somedata.csv from which only columns 2, 11, 5, and 9 were needed. You can convert the file to tab-delimited format like this:
%cvt_file.pl --iformat=csv somedata.csv > somedata.txt
But then what? If you just want to knock out a short script to extract those specific four columns, that’s fairly easy: write a loop that reads input lines and writes only the columns you want in the proper order. But that would be a special-purpose script, useful only within a highly limited context. With just a little more effort, it’s possible to write a more general utility yank_col.pl that enables you to extract any set of columns. With such a tool, you’d specify the column list on the command line like this:
%yank_col.pl --columns=2,11,5,9 somedata.txt > tmp.txt
Because the script doesn’t use a hardcoded column list, it can
be used to pull out an arbitrary set of columns in any order. Columns
can be specified as a comma-separated list of column numbers or column
ranges. (For example, --columns=1,10,4-7
means
columns 1, 10, 4, 5, 6, and 7.) yank_col.pl looks like this:
#!/usr/bin/perl # yank_col.pl - extract columns from input # Example: yank_col.pl --columns=2,11,5,9 filename # Assumes tab-delimited, linefeed-terminated input lines. use strict; use warnings; use Getopt::Long; $Getopt::Long::ignorecase = 0; # options are case sensitive my $prog = "yank_col.pl"; my $usage = <<EOF; Usage: $prog [options] [data_file] Options: --help Print this message --columns=column-list Specify columns to extract, as a comma-separated list of column positions EOF my $help; my $columns; GetOptions ( "help" => $help, # print help message "columns=s" => $columns # specify column list ) or die "$usage "; die "$usage " if defined $help; my @col_list = split (/,/, $columns) if defined ($columns); @col_list or die "$usage "; # nonempty column list is required # make sure column specifiers are positive integers, and convert from # 1-based to 0-based values my @tmp; for (my $i = 0; $i < @col_list; $i++) { if ($col_list[$i] =~ /^d+$/) # single column number { die "Column specifier $col_list[$i] is not a positive integer " unless $col_list[$i] > 0; push (@tmp, $col_list[$i] - 1); } elsif ($col_list[$i] =~ /^(d+)-(d+)$/) # column range m-n { my ($begin, $end) = ($1, $2); die "$col_list[$i] is not a valid column specifier " unless $begin > 0 && $end > 0 && $begin <= $end; while ($begin <= $end) { push (@tmp, $begin - 1); ++$begin; } } else { die "$col_list[$i] is not a valid column specifier "; } } @col_list = @tmp; while (<>) # read input { chomp; my @val = split (/ /, $_, 10000); # split, preserving all fields # extract desired columns, mapping undef to empty string (can # occur if an index exceeds number of columns present in line) @val = map { defined ($_) ? $_ : "" } @val[@col_list]; print join (" ", @val) . " "; }
The input processing loop converts each line to an array of
values, and then pulls out from the array the values corresponding to
the requested columns. To avoid looping through the array, it uses
Perl’s notation that allows a list of subscripts to be specified all
at once to request multiple array elements. For example, if @col_list
contains the values 2
, 6
, and
3
, these two expressions are
equivalent:
($val[2] , $val[6], $val[3]) @val[@col_list]
What if you want to extract columns from a file that’s not in tab-delimited format, or produce output in another format? In that case, combine yank_col.pl with the cvt_file.pl script discussed in Converting Datafiles from One Format to Another. Suppose that you want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to postprocess the extracted columns into CSV format:
%cvt_file.pl --idelim=":" /etc/passwd
| yank_col.pl --columns=1,3-7
| cvt_file.pl --oformat=csv > passwd.csv
If you don’t want to type all of that as one long command, use temporary files for the intermediate steps:
%cvt_file.pl --idelim=":" /etc/passwd > tmp1.txt
%yank_col.pl --columns=1,3-7 tmp1.txt > tmp2.txt
%cvt_file.pl --oformat=csv tmp2.txt > passwd.csv
%rm tmp1.txt tmp2.txt
3.15.226.120