Chapter 22. Flexible Command Line Tokenizer

 

Always design a thing by considering it in its next larger context—a chair in a room, a room in a house, a house in an environment, an environment in a city plan.

 
 --Eliel Saarinen—“Time,” July 2, 1956

Command line utilities have always been a favorite among tools developers, largely because they are quick to build. They need no graphical user interface, which dramatically reduces development time. They can also keep complex configuration options out of sight unless the user explicitly specifies them, making the tool easier to learn and operate. The one disadvantage is that a command line utility must parse its command line parameters and act on them accordingly. This can be quite a nuisance, especially when the only input validation is whatever the user does before the parameters reach the utility. Parsing a parameter string correctly, while tolerating data input errors, can be difficult.

A tokenizer is code that extracts tokens (substrings) from a given string. The tokens in the string can be separated by one or more character delimiters. This chapter discusses a reusable and flexible command line tokenizer that can break an arbitrary parameter string into name-value pairs.
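As a point of reference, the simplest possible tokenizer just splits a string on its delimiters. The following is a minimal sketch of that idea; the class name SplitDemo and the method SplitTokens are hypothetical names chosen for illustration and are not part of the chapter's code.

```csharp
using System;

public static class SplitDemo
{
    // Splits the input on spaces and tabs. Runs of consecutive
    // delimiters do not produce empty tokens.
    public static string[] SplitTokens(string input)
    {
        return input.Split(new[] { ' ', '\t' },
                           StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Note that a plain split breaks a quoted value such as "Graham Wihlidal" into two tokens, which is one case a command line tokenizer has to handle differently.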

Formatting Styles

When parsing command line parameters, developers generally come up with their own ways to express parameter syntax. This has led to inconsistency, and a number of formatting styles have emerged from the UNIX and Windows worlds.

In order to build a tokenizer that favors a variety of standards, a number of formatting styles have been merged into a common syntax for parsing.

The tokenizer syntax supports three styles of prefixes to signify a parameter. A parameter can be prefixed with a forward slash (/), a hyphen (-), or a double hyphen (--).

Some examples include:

/name

-value

--screenMode

Parameters typically have values associated with them, but if they do not then true is used as a default value just to show that a particular parameter was specified. Parameter values come after the parameter token and can be prefixed with a space ( ), an equals sign (=), or a colon (:).

Some examples include:

/name Graham

-value=54

--screenMode:normal

Parameter values can also be surrounded by either single or double quotes to preserve white space.

/name "Graham Wihlidal"

--screenMode = 'normal'

Visualizing a generic syntax expression for the above styles results in the following:

{-,/,--}param{ ,=,:}((",')value(",'))

Using the above syntax expression will allow us to parse a variety of formatting styles.
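The syntax expression can be tried out with a single regular expression applied to one parameter at a time. This is a minimal sketch under stated assumptions: the class name SyntaxDemo, the method TryParse, and the pattern itself are hypothetical and written for illustration. They are not the implementation presented later in this chapter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SyntaxDemo
{
    // Prefix (--, - or /), a name, an optional separator
    // (white space, = or :), and an optional value.
    private static readonly Regex ParameterPattern = new Regex(
        @"^(--|-|/)(?<name>\w+)(\s*[=:]\s*|\s+)?(?<value>.*)$");

    // Returns false when the text does not start with a parameter prefix.
    public static bool TryParse(string text, out string name, out string value)
    {
        name = null;
        value = null;

        Match match = ParameterPattern.Match(text);
        if (!match.Success)
        {
            return false;
        }

        name = match.Groups["name"].Value;

        // Strip surrounding white space, then surrounding quotes.
        string raw = match.Groups["value"].Value.Trim().Trim('"', '\'');

        // A parameter without a value is treated as a flag.
        value = raw.Length == 0 ? "true" : raw;
        return true;
    }
}
```

With this sketch, /name Graham, -value=54, and --screenMode:normal yield the pairs (name, Graham), (value, 54), and (screenMode, normal), while a bare /verbose yields (verbose, true).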

Implementation

The real magic behind this tokenizer comes from the regular expression capabilities of .NET. There were a couple of versions of this source code before regular expressions were used, and this version is by far the shortest and the most maintainable.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace ConsoleTokenizerLibrary
{
    public sealed class ConsoleTokenizer
    {
        private readonly Dictionary<string, string> _parameters
                                              = new Dictionary<string, string>();

        private readonly List<string> _files = new List<string>();

        public Dictionary<string, string> Parameters
        {
            get { return _parameters; }
        }

        public List<string> Files
        {
            get { return _files; }
        }

A C# indexer has been provided to pull tokens from the parameter list. This is merely a shorter way of obtaining these tokens; it returns null when the requested parameter was not specified. Files must still be accessed through the Files property.

         public string this[string token]
         {
             get
             {
                 string value;
                 return _parameters.TryGetValue(token, out value) ? value : null;
             }
         }

This constructor takes a single string and breaks it into an array of arguments using a regular expression. The arguments array is then passed into the Tokenize() method.

         public ConsoleTokenizer(string arguments)
         {
             Regex tokenizer = new Regex(@"(['""][^""]+['""])\s*|([^\s]+)\s*",
                                         RegexOptions.IgnoreCase |
                                         RegexOptions.Compiled);

             MatchCollection matches = tokenizer.Matches(arguments);

             List<string> tokenizedList = new List<string>();

             // Collect every match, trimming the trailing white space
             // that each match includes.
             for (int matchIndex = 0;
                      matchIndex < matches.Count;
                      matchIndex++)
             {
                 tokenizedList.Add(matches[matchIndex].Value.Trim());
             }
             Tokenize(tokenizedList.ToArray());
         }

This constructor simply calls the Tokenize method with an array of arguments.

         public ConsoleTokenizer(string[] arguments)
         {
             Tokenize(arguments);
         }

The following method is the heart of the tokenizer. It uses a regular expression to break up a group of arguments into name-value pairs based on the formatting styles described earlier.

          private void Tokenize(string[] arguments)
          {
              string pattern = @"^([/-]|--){1}(?<name>\w+)([:=])?(?<value>.+)?$";
              Regex tokenizer = new Regex(pattern,
                                          RegexOptions.IgnoreCase |
                                          RegexOptions.Compiled);

              char[] trimCharacters = { '"', '\'' };

              string currentToken = null;

              foreach (string argument in arguments)
              {
                  Match match = tokenizer.Match(argument);
                  if (!match.Success)
                  {

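If a parameter name is pending, treat the current argument as its value. A bare separator, as in --screenMode = 'normal', is skipped so that the argument after it supplies the value.

                    if (currentToken != null)
                    {
                        if (argument != "=" && argument != ":")
                        {
                            _parameters[currentToken] =
                                argument.Trim(trimCharacters);

                            // The value is consumed, so later bare
                            // arguments are treated as files.
                            currentToken = null;
                        }
                    }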

If an argument was specified that is not in the form of a parameter, then it is most likely a file to process, so here we add the argument to the files collection.

                    else
                    {
                        _files.Add(argument);
                    }
                }
                else
                {
                    currentToken = match.Groups["name"].Value;

                    string tokenValue =
                    match.Groups["value"].Value.Trim(trimCharacters);

If no value was found, specify true as the default parameter value. Having a default value of true basically means that a flag or switch was specified (on or off value).

                    if (tokenValue.Length == 0)
                    {
                        _parameters[currentToken] = "true";
                    }

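If a value was found, store it under the parameter name. The parameter is then complete, so clear the current token and let any bare arguments that follow be treated as files.

                    else
                    {
                        _parameters[currentToken] = tokenValue;
                        currentToken = null;
                    }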
                }
            }
        }
    }
}
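The splitting step performed by the string constructor can also be tried on its own. The following is a minimal sketch rather than the chapter's exact pattern: the class name ArgumentSplitter, the method SplitArguments, and the simplified expression are assumptions made for illustration. Quotes are left on each token, on the assumption that a later step trims them as Tokenize() does.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArgumentSplitter
{
    // A double-quoted run, a single-quoted run, or a run of
    // characters that are not white space.
    private static readonly Regex ArgumentPattern =
        new Regex(@"""[^""]*""|'[^']*'|\S+");

    // Breaks a raw parameter string into individual arguments,
    // keeping quoted text together as one argument.
    public static string[] SplitArguments(string commandLine)
    {
        List<string> arguments = new List<string>();

        foreach (Match match in ArgumentPattern.Matches(commandLine))
        {
            arguments.Add(match.Value);
        }

        return arguments.ToArray();
    }
}
```

Passing the string /name "Graham Wihlidal" --screenMode = 'normal' out.txt to this sketch produces six arguments, with the two quoted values each kept whole.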

Sample Usage

Using the command line tokenizer is very simple. A console application receives a string array at its main entry point, and this array contains the command line parameters specified at the command prompt. Create an instance of the ConsoleTokenizer class and pass this string array into it. At this point everything has been parsed, and you can access either the Parameters or the Files property of the tokenizer instance. Parameters is a dictionary that maps each parameter name to its associated value. Here is an example of how to get the value of a parameter named mode.

static void Main(string[] args)
{
    ConsoleTokenizer tokenizer = new ConsoleTokenizer(args);

    string mode = tokenizer.Parameters["mode"];
}

Alternatively, the indexer operator has been overloaded to reference the Parameters dictionary as well, making your code even cleaner.

static void Main(string[] args)
{
    ConsoleTokenizer tokenizer = new ConsoleTokenizer(args);
    string mode = tokenizer["mode"];
}
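Every value comes back as a string, including the "true" default used for flags, so numeric or Boolean parameters need a conversion step. The helpers below are a hedged sketch of one way to do that; ParameterConversions, IsSet, and GetInt32 are hypothetical names and are not part of ConsoleTokenizer. They accept any string dictionary, so the Parameters property can be passed in directly.

```csharp
using System;
using System.Collections.Generic;

public static class ParameterConversions
{
    // True only when the parameter is present and its value parses as true.
    public static bool IsSet(IDictionary<string, string> parameters,
                             string name)
    {
        string text;
        bool flag;
        return parameters.TryGetValue(name, out text)
            && bool.TryParse(text, out flag)
            && flag;
    }

    // Returns the parsed integer, or the fallback when the parameter
    // is missing or is not a valid integer.
    public static int GetInt32(IDictionary<string, string> parameters,
                               string name,
                               int fallback)
    {
        string text;
        int number;
        if (parameters.TryGetValue(name, out text)
            && int.TryParse(text, out number))
        {
            return number;
        }
        return fallback;
    }
}
```

For example, after parsing -value=54 a call such as ParameterConversions.GetInt32(tokenizer.Parameters, "value", 0) would return 54.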

There may be some optional parameters that you want to use only if they are present. Parameters is a Dictionary<string, string>, so indexing it directly with a key that does not exist throws a KeyNotFoundException rather than returning null. Test for the key first with ContainsKey or TryGetValue; the same test is how you would enforce required parameters.

static void Main(string[] args)
{
    ConsoleTokenizer tokenizer = new ConsoleTokenizer(args);

    string mode;
    if (!tokenizer.Parameters.TryGetValue("mode", out mode))
    {
        mode = string.Empty;
    }
}
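The presence test above can be wrapped in small helpers so that optional and required parameters each take a single call. This is a sketch under assumptions: ParameterHelper, GetOptional, and GetRequired are hypothetical names, and throwing an ArgumentException for a missing required parameter is one possible policy rather than something the chapter prescribes.

```csharp
using System;
using System.Collections.Generic;

public static class ParameterHelper
{
    // Returns the parameter value, or the fallback when it was not specified.
    public static string GetOptional(IDictionary<string, string> parameters,
                                     string name,
                                     string fallback)
    {
        string value;
        return parameters.TryGetValue(name, out value) ? value : fallback;
    }

    // Returns the parameter value, or throws when it was not specified.
    public static string GetRequired(IDictionary<string, string> parameters,
                                     string name)
    {
        string value;
        if (!parameters.TryGetValue(name, out value))
        {
            throw new ArgumentException(
                "Missing required parameter: " + name);
        }
        return value;
    }
}
```

A utility could call ParameterHelper.GetRequired(tokenizer.Parameters, "mode") at startup and print its usage text when the exception is caught.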

The following code shows a complete console application example that uses the ConsoleTokenizer to parse command line arguments, and then dumps the values to the console window.

using System;
using System.Collections.Generic;
using System.Text;

using ConsoleTokenizerLibrary;

namespace ConsoleTokenizerDemo
{
     class Program
     {
         static void Main(string[] args)
         {
             ConsoleTokenizer tokenizer = new ConsoleTokenizer(args);

             Console.WriteLine("");
             Console.WriteLine("Console Tokenizer Demo Application");
             Console.WriteLine("Pass a parameter string to tokenize it");
             Console.WriteLine("");

             if (tokenizer.Files.Count > 0)
             {
                 Console.WriteLine("Files");
                 Console.WriteLine("*****************************");

                 foreach (string file in tokenizer.Files)
                 {
                     Console.WriteLine(String.Format("File: {0}", file));
                 }

                 Console.WriteLine("");
                 Console.WriteLine("");
             }

             if (tokenizer.Parameters.Keys.Count > 0)
             {
                 Console.WriteLine("Parameters");
                 Console.WriteLine("*****************************");
                 foreach (string key in tokenizer.Parameters.Keys)
                 {
                     Console.WriteLine(String.Format("Name: {0}\tValue: {1}",
                                                     key,
                                                     tokenizer[key]));
                 }
             }
         }
     }
}

Conclusion

This chapter discussed common formatting styles of command line arguments, and went on to build a tokenizer using .NET regular expressions. Command line utilities are extremely popular among tools developers, so having a flexible and reusable tokenizer is very important. Having one means that even less time is spent developing these tools, which are already fast to develop.
