Extracting and Rearranging Datafile Columns

Problem

You want to pull out columns from a datafile or rearrange them into a different order.

Solution

Use a utility that can produce columns from a file on demand.

Discussion

cvt_file.pl serves as a tool that converts entire files from one format to another. Another common datafile operation is to manipulate its columns. This is necessary, for example, when importing a file into a program that doesn’t understand how to extract or rearrange input columns for itself. To work around this problem, you can rearrange the datafile instead.

Recall that this chapter began with a description of a scenario involving a 12-column CSV file somedata.csv from which only columns 2, 11, 5, and 9 were needed. You can convert the file to tab-delimited format like this:

%cvt_file.pl --iformat=csv somedata.csv > somedata.txt

But then what? If you just want to knock out a short script to extract those specific four columns, that’s fairly easy: write a loop that reads input lines and writes only the columns you want in the proper order. But that would be a special-purpose script, useful only within a highly limited context. With just a little more effort, it’s possible to write a more general utility yank_col.pl that enables you to extract any set of columns. With such a tool, you’d specify the column list on the command line like this:

%yank_col.pl --columns=2,11,5,9 somedata.txt > tmp.txt

Because the script doesn’t use a hardcoded column list, it can be used to pull out an arbitrary set of columns in any order. Columns can be specified as a comma-separated list of column numbers or column ranges. (For example, --columns=1,10,4-7 means columns 1, 10, 4, 5, 6, and 7.) yank_col.pl looks like this:

#!/usr/bin/perl
# yank_col.pl - extract columns from input

# Example: yank_col.pl --columns=2,11,5,9 filename

# Assumes tab-delimited, linefeed-terminated input lines.

use strict;
use warnings;
use Getopt::Long;
$Getopt::Long::ignorecase = 0; # options are case sensitive

my $prog = "yank_col.pl";
my $usage = <<EOF;
Usage: $prog [options] [data_file]

Options:
--help
    Print this message
--columns=column-list
    Specify columns to extract, as a comma-separated list of column positions
EOF

my $help;
my $columns;

GetOptions (
  "help"      => $help,      # print help message
  "columns=s" => $columns    # specify column list
) or die "$usage
";

die  "$usage
" if defined $help;

my @col_list = split (/,/, $columns) if defined ($columns);
@col_list or die "$usage
";      # nonempty column list is required

# make sure column specifiers are positive integers, and convert from
# 1-based to 0-based values

my @tmp;
for (my $i = 0; $i < @col_list; $i++)
{
  if ($col_list[$i] =~ /^d+$/)       # single column number
  {
    die "Column specifier $col_list[$i] is not a positive integer
"
        unless $col_list[$i] > 0;
    push (@tmp, $col_list[$i] - 1);
  }
  elsif ($col_list[$i] =~ /^(d+)-(d+)$/)  # column range m-n
  {
    my ($begin, $end) = ($1, $2);
    die "$col_list[$i] is not a valid column specifier
"
        unless $begin > 0 && $end > 0 && $begin <= $end;
    while ($begin <= $end)
    {
      push (@tmp, $begin - 1);
      ++$begin;
    }
  }
  else
  {
    die "$col_list[$i] is not a valid column specifier
";
  }
}
@col_list = @tmp;

while (<>)            # read input
{
  chomp;
  my @val = split (/	/, $_, 10000);  # split, preserving all fields
  # extract desired columns, mapping undef to empty string (can
  # occur if an index exceeds number of columns present in line)
  @val = map { defined ($_) ? $_ : "" } @val[@col_list];
  print join ("	", @val) . "
";
}

The input processing loop converts each line to an array of values, and then pulls out from the array the values corresponding to the requested columns. To avoid looping through the array, it uses Perl’s notation that allows a list of subscripts to be specified all at once to request multiple array elements. For example, if @col_list contains the values 2, 6, and 3, these two expressions are equivalent:

($val[2] , $val[6], $val[3])
@val[@col_list]

What if you want to extract columns from a file that’s not in tab-delimited format, or produce output in another format? In that case, combine yank_col.pl with the cvt_file.pl script discussed in Converting Datafiles from One Format to Another. Suppose that you want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to postprocess the extracted columns into CSV format:

%cvt_file.pl --idelim=":" /etc/passwd 
               | yank_col.pl --columns=1,3-7 
               | cvt_file.pl --oformat=csv > passwd.csv

If you don’t want to type all of that as one long command, use temporary files for the intermediate steps:

%cvt_file.pl --idelim=":" /etc/passwd > tmp1.txt
% yank_col.pl --columns=1,3-7 tmp1.txt > tmp2.txt
% cvt_file.pl --oformat=csv tmp2.txt > passwd.csv
% rm tmp1.txt tmp2.txt
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.137.58