Puzzler 20

Irregular Expressions

You may have heard the old joke: "A developer has a problem and decides to solve it with a regular expression. Now they have two problems." Regular expressions can indeed get complicated quickly. But they are also very powerful, and can be extremely useful if used judiciously.

Scala's scala.util.matching.Regex class provides utility functions for regular expressions. Its findAllIn method returns a MatchIterator that iterates over all the occurrences of the regular expression in a string:

  scala> for (reMatch <- "l".r.findAllIn("I love Scala")) 
           println(reMatch)
  l
  l

In the next example, a MatchIterator is queried twice for the index of the first regular expression match in a string, using the start method. In one case, the call is also traced. What is the result of executing this code?

  def traceIt[T <: Iterator[_]](it: T) = {
    println(s"TRACE: using iterator '${it}'")
    it
  }
  
  val msg = "I love Scala"
  println("First match index: " +
    traceIt("a".r.findAllIn(msg)).start)
  println("First match index: " +
    "a".r.findAllIn(msg).start)

Possibilities

  1. Prints:
      TRACE: using iterator 'non-empty iterator'
      First match index: 9
      First match index: 9
    
  2. Both statements throw a runtime exception.
  3. The first statement prints:
      TRACE: using iterator 'non-empty iterator'
      First match index: 9
    

    and the second throws a runtime exception.

  4. The first statement throws a runtime exception, and the second prints:
      First match index: 9
    

Explanation

Even though the regular expression, "a", certainly looks as though it should match the string, "I love Scala", you may wonder whether something could be happening that causes no matches to be found. Or you may suspect that, somehow, simply printing the iterator in the trace method causes things to go wrong. Otherwise, surely both statements should print that the first occurrence of "a" in "I love Scala" is indeed at index 9?

Not so. In fact, the correct answer is number 3:

  scala> println("First match index: " + 
           traceIt("a".r.findAllIn(msg)).start)
  TRACE: using iterator 'non-empty iterator'
  First match index: 9
  
  scala> println("First match index: " + 
           "a".r.findAllIn(msg).start)
  java.lang.IllegalStateException: No match available
    at java.util.regex.Matcher.end(Matcher.java:389)
    at scala.util.matching.Regex$MatchIterator.end(
        Regex.scala:667)
    ...

Huh? The debugging method doesn't "stay out of the way," it actually enables the statement to succeed? What does it do beyond simply printing the iterator's string representation?

Well, nothing! But printing, or more accurately generating, the string representation of a MatchIterator has a hidden side effect. This effect allows the statement with the debugging call to succeed, while the "untraced" version fails.

The key point is that, as the stack trace of the failed statement indicates, Scala's regex support builds on Java's java.util.regex package. In particular, Scala's MatchIterator is backed by a java.util.regex.Matcher, which has the following characteristic:[1]

The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.

Before you can call start, end, or any other method on MatchIterator that delegates to the underlying Matcher, you therefore first need to initialize that matcher! Invoking MatchIterator's toString method fortuitously achieves this since Iterator's toString implementation, which MatchIterator inherits, calls hasNext. This in turn attempts to find a match and initializes the matcher.
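The matcher characteristic quoted above can be checked in isolation. The following is a minimal sketch that uses java.util.regex directly rather than MatchIterator, since MatchIterator's own behavior is an implementation detail that may differ between Scala versions. The object and method names (MatcherStateDemo, startBeforeFind, startAfterFind) are hypothetical and chosen only for illustration:

```scala
import java.util.regex.Pattern
import scala.util.Try

// Hypothetical helper object, for illustration only.
object MatcherStateDemo {
  private def newMatcher(text: String) =
    Pattern.compile("a").matcher(text)

  // A fresh matcher has not attempted any match yet, so start()
  // throws IllegalStateException; Try captures it as a Failure.
  def startBeforeFind(text: String): Try[Int] = {
    val m = newMatcher(text)
    Try(m.start())
  }

  // Calling find() first positions the matcher on a match (if any),
  // after which start() is safe to call.
  def startAfterFind(text: String): Option[Int] = {
    val m = newMatcher(text)
    if (m.find()) Some(m.start()) else None
  }
}
```

For the string "I love Scala", startBeforeFind returns a Failure wrapping an IllegalStateException, while startAfterFind returns Some(9).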

Discussion

Obviously, relying on an implementation detail of the Iterator.toString method is not a particularly smart strategy. In any case, you are not usually interested in the string representation when using a MatchIterator. Calling hasNext—or, if you know the iterator will be non-empty, next()—is a more reliable approach:

  scala> { // without a block the REPL will call toString
           val mi = "a".r.findAllIn(msg)
           mi.hasNext // initialize the matcher
           println("First match index: " + mi.start)
         }
  First match index: 9

If you are planning to iterate over the matches using a for expression or by invoking foreach, map, etc., you don't need to do anything. These method invocations will initialize the matcher for you.
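As a rough sketch of that point, the loop below queries start inside a for expression without any explicit initialization. The object and method names (LoopStartDemo, matchesWithStart) are hypothetical, introduced only for this illustration:

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper object, for illustration only.
object LoopStartDemo {
  // Pairs each matched substring with the index at which it starts.
  def matchesWithStart(pattern: String, text: String): List[(String, Int)] = {
    val mi = pattern.r.findAllIn(text)
    val found = ListBuffer.empty[(String, Int)]
    // The for expression calls hasNext and next() on mi, so the
    // matcher is positioned on the current match when the body runs.
    for (m <- mi) found += ((m, mi.start))
    found.toList
  }
}
```

Here matchesWithStart("a", "I love Scala") yields List(("a", 9), ("a", 11)).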

You can avoid the problem entirely by using Regex's findAllMatchIn method instead. This returns an Iterator[Match] and is equivalent to calling matchData on the MatchIterator returned by findAllIn. In both cases, you simply cannot call "dangerous" methods such as start without first iterating to a match, which initializes the underlying matcher:

  scala> {
           val mi = "a".r.findAllMatchIn(msg)
           println("First match index: " + mi.next().start)
         }
  First match index: 9
  
  scala> {
           val mi = "a".r.findAllIn(msg).matchData
           println("First match index: " + mi.next().start)
         }
  First match index: 9

Prefer Regex's findAllMatchIn method to findAllIn, or convert the MatchIterator returned by findAllIn to an Iterator[Match] by calling the MatchIterator.matchData method.
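One way to apply this recommendation is sketched below: every start index is collected from the Iterator[Match], so no mutable iterator state is ever queried directly. The object and method names (MatchIndices, allStarts, firstStart) are hypothetical. The second method uses Regex's findFirstMatchIn, which the text above does not discuss; it returns an Option[Match] and is included here on the assumption that only the first match is wanted and that the input might contain no match at all:

```scala
// Hypothetical helper object, for illustration only.
object MatchIndices {
  // Start index of every match, in order of occurrence.
  def allStarts(pattern: String, text: String): List[Int] =
    pattern.r.findAllMatchIn(text).map(_.start).toList

  // Start index of the first match, or None when nothing matches.
  // Unlike calling next() on an empty iterator, this cannot throw.
  def firstStart(pattern: String, text: String): Option[Int] =
    pattern.r.findFirstMatchIn(text).map(_.start)
}
```

For "I love Scala", allStarts("a", ...) gives List(9, 11) and firstStart("a", ...) gives Some(9); a pattern with no match gives Nil and None respectively.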

Footnotes for Chapter 20:

[1] See the Javadoc for java.util.regex.Matcher. [Ora]
