Regular expressions

Regular expressions are excellent for simpler parsing tasks, replacements, and splits. We will give a short introduction to them and show some examples to give you a better idea of how they work. At the end of this section, we will suggest further reading.

Basic syntax

Usually, when you write plain text as a pattern, that text is matched literally; for example, the patterns apple and pear match the lowercase words apple and pear in the following sentence: "Apple stores do not sell apple or pear."

Patterns are case sensitive by default, so the pattern apple does not match the first word of the sentence (the company name, Apple).

There are special characters that need to be escaped when you want to match them literally: ., [, ], (, ), {, }, -, ^, $, and \ (some of these only in certain positions). To escape them, prefix them with \, which results in the following patterns: \., \[, \], \(, \), \{, \}, \-, \^, \$, and \\.
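The effect of escaping can be sketched in Java as follows. This is a minimal illustration, not code from the original text: the class name EscapeDemo and the helper found are made up for this example, and the backslash is doubled because it is written inside a Java string literal. Pattern.quote is a standard library method that escapes a whole literal at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Illustrative helper: true if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // An unescaped dot matches any character, so 3.14 also matches 3x14.
        System.out.println(found("3.14", "3x14"));    // true
        // An escaped dot matches only a literal dot.
        System.out.println(found("3\\.14", "3x14"));  // false
        System.out.println(found("3\\.14", "3.14"));  // true
        // Unescaped, a+b means "one or more a, then b", not the text a+b.
        System.out.println(found("a+b", "a+b=c"));    // false
        // Pattern.quote escapes every special character in a literal.
        System.out.println(found(Pattern.quote("a+b"), "a+b=c")); // true
    }
}
```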

When you do not want an exact match of characters, you can put the possible options within brackets, [characters]; for example, [abc] matches either a, b, or c, but not bc (not a single character) or d (not among the options). You can specify a range of characters using the - character within brackets, such as [a-z], which matches any lowercase letter of the English alphabet. You can have multiple ranges and values within brackets, such as [a-zA-Z,], which matches either a lowercase letter, an uppercase letter, or a comma (equivalent to [[a-z][A-Z][,]], but not to [a-z][A-Z][,], because the latter matches three characters, not one).

To negate a character class, use the ^ character as the first character within the brackets; for example, the [^0-9] pattern matches any single character that is not a digit.
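The character class examples above can be checked with a short Java sketch. The class name CharClassDemo and the helper whole are illustrative assumptions; String#matches is the standard method that tests whether the whole text matches the pattern.

```java
class CharClassDemo {
    // Illustrative helper: true only if the whole text matches the regex.
    static boolean whole(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("[abc]", "b"));            // true
        System.out.println(whole("[abc]", "bc"));           // false: two characters
        System.out.println(whole("[abc]", "d"));            // false: not an option
        System.out.println(whole("[a-zA-Z,]", "Q"));        // true
        System.out.println(whole("[a-zA-Z,]", ","));        // true
        System.out.println(whole("[a-z][A-Z][,]", "aB,"));  // true: three characters
        System.out.println(whole("[^0-9]", "x"));           // true
        System.out.println(whole("[^0-9]", "7"));           // false
    }
}
```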

It might be tedious and error prone to always spell out certain groups of characters, so there are predefined special sets/classes. Here is a non-exhaustive list of the most important ones:

  • \d: It identifies the decimal digits ([0-9])
  • \s: It identifies the whitespace characters
  • \n: It identifies a new line character (by default, only single lines are handled so new lines cannot be matched in that mode, but you can specify a multiline match too)
  • \w: It identifies the English alphabet (identifier) characters and decimal digits ([a-zA-Z_0-9])

You can also use these classes within brackets, for example to complement them: [^\d\s] matches a character that is neither a whitespace nor a digit.
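As a small sketch of the predefined classes in Java (the class and method names below are invented for illustration, and each backslash is doubled because the patterns are Java string literals):

```java
class PredefinedClassDemo {
    // Replaces every decimal digit with '#'.
    static String maskDigits(String text) {
        return text.replaceAll("\\d", "#");
    }

    // Replaces every character that is neither a digit nor whitespace with '_'.
    static String maskOthers(String text) {
        return text.replaceAll("[^\\d\\s]", "_");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("a1 b2"));           // a# b#
        System.out.println(maskOthers("a1 b2"));           // _1 _2
        // \w accepts letters, digits, and the underscore, but not '-'.
        System.out.println("x_9".matches("\\w\\w\\w"));    // true
        System.out.println("x-9".matches("\\w\\w\\w"));    // false
    }
}
```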

These match a fixed number of characters, so they are enough when you know in advance how long the parts you want to match are, although usually this is not the case. You can specify a range for the number of times a pattern should be matched using the {n,m} syntax, where n and m are nonnegative integers; for example, [ab]{1,3} matches a, aa, aaa, and bab, but not baba or the empty string.

When you omit m from this syntax ({n,}), the number of repetitions has no upper bound. When you omit the comma too ({n}), the preceding pattern has to appear exactly n times to get a match.

There are shorter versions for the most common cases: ? for {0,1}, * for {0,}, and + for {1,}.
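The quantifier forms can be tried out with whole-text matches in Java. This is only a sketch; the class name QuantifierDemo and the helper whole are assumptions made for the example.

```java
class QuantifierDemo {
    // Illustrative helper: true only if the whole text matches the regex.
    static boolean whole(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("[ab]{1,3}", "bab"));   // true
        System.out.println(whole("[ab]{1,3}", "baba"));  // false: four characters
        System.out.println(whole("[ab]{1,3}", ""));      // false: at least one needed
        System.out.println(whole("a{2}", "aa"));         // true: exactly two
        System.out.println(whole("a{2}", "aaa"));        // false
        System.out.println(whole("a{2,}", "aaaaa"));     // true: two or more
        System.out.println(whole("colou?r", "color"));   // true: ? is {0,1}
        System.out.println(whole("a*", ""));             // true: * is {0,}
        System.out.println(whole("a+", ""));             // false: + is {1,}
    }
}
```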

Without a suffix, these numeric or symbolic quantifiers are greedy; if you append ?, they become reluctant; if you append a + sign, they become possessive. A greedy quantifier takes as much as possible and gives characters back if the rest of the pattern would otherwise fail; a reluctant one takes as little as possible; a possessive one takes as much as possible and never gives anything back. Here are some examples: [ab]+b, [ab]+?b, and [ab]++b. The details are important and are best shown by example. The following table lists, for each text and pattern, the parts of the text that are matched when searching for partial matches (multiple matches are separated with |):

Text / pattern   [ab]+b      [ab]+?b        [ab]++b      [ab]+?             [ab]++
abababbb         abababbb    ab|ab|ab|bb    (no match)   a|b|a|b|a|b|b|b    abababbb
ababa            abab        ab|ab          (no match)   a|b|a|b|a          ababa
abb              abb         ab             (no match)   a|b|b              abb

In the last column, every example is a whole-text match, as are the first and third rows of the first column; all other cells are only partial matches (or no match at all).
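The difference between greedy, reluctant, and possessive quantifiers can be reproduced with a small Java sketch that collects every partial match. The class name QuantifierModes and the helper findAll are illustrative assumptions; an empty result means that the pattern found no match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Collects all partial matches and joins them with '|'.
    static String findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            parts.add(m.group());
        }
        return String.join("|", parts);
    }

    public static void main(String[] args) {
        String text = "abababbb";
        System.out.println(findAll("[ab]+b", text));   // abababbb
        System.out.println(findAll("[ab]+?b", text));  // ab|ab|ab|bb
        // Possessive: [ab]++ swallows everything and never gives back the last b.
        System.out.println(findAll("[ab]++b", text));  // (empty line: no match)
        System.out.println(findAll("[ab]+?", text));   // a|b|a|b|a|b|b|b
        System.out.println(findAll("[ab]++", text));   // abababbb
    }
}
```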

You might want to create more complex conditions, but for those you need to group certain patterns. There are capturing groups and non-capturing groups. Capturing groups can be referred to by their number (there is always an implicit capturing group for the whole match, the 0 group), whereas non-capturing groups are not available for further reference or processing, although they can be very useful when you want to separate unwanted parts. The syntax for capturing groups is (subpattern), and for non-capturing groups it is (?:subpattern).

When you want to refer back to a previous group, use the \n notation, where n is the index of that group (groups are numbered by the position of their opening parenthesis in the pattern, from left to right).

There is also an option to create named groups using the (?<name>subpattern) syntax. (This feature is available since Java 7, so it will not work on Mac OS X until you can use KNIME with Java 7 or a later version.) You can refer back to a named group with the \k<name> syntax.
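Numbered groups, back references, and named groups might be used from Java roughly as follows. This is a sketch under stated assumptions: the class GroupDemo, its methods, and the sample key=value text are invented for illustration, and named groups require Java 7 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Finds the first doubled word character using the back reference \1.
    static String firstDoubled(String text) {
        Matcher m = Pattern.compile("(\\w)\\1").matcher(text);
        return m.find() ? m.group() : null;
    }

    // Splits "key=value" into its parts using named groups.
    static String[] keyValue(String text) {
        Matcher m = Pattern.compile("(?<key>\\w+)=(?<value>\\w+)").matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("key"), m.group("value") };
    }

    public static void main(String[] args) {
        System.out.println(firstDoubled("hello"));          // ll
        String[] kv = keyValue("color=red");
        System.out.println(kv[0] + " -> " + kv[1]);         // color -> red

        // A non-capturing group is not counted among the numbered groups.
        Matcher m = Pattern.compile("(?:ab)+(c)").matcher("ababc");
        if (m.matches()) {
            System.out.println(m.groupCount());             // 1
            System.out.println(m.group(0));                 // ababc (the whole match)
            System.out.println(m.group(1));                 // c
        }
        // A named back reference: the same character twice.
        System.out.println("aa".matches("(?<c>[ab])\\k<c>")); // true
        System.out.println("ab".matches("(?<c>[ab])\\k<c>")); // false
    }
}
```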

With these groups, you can express not just more kinds of quantification, but also alternatives using the | (or) construct; for example, (ab)?((?:[cd]+)|(?:xzy)) means that there is optionally a group of ab characters, followed by either a non-empty sequence of c or d characters or the text xzy. As whole matches, the following will match: abxzy, abdcdccd, xzy, c, and cd, but xzyc or cxzy will not.
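The alternation example can be verified with a few whole-text matches in Java. The class name AlternationDemo and its helpers are assumptions for this sketch; the pattern itself is the one from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    static final Pattern P = Pattern.compile("(ab)?((?:[cd]+)|(?:xzy))");

    // True only if the whole text matches the pattern.
    static boolean whole(String text) {
        return P.matcher(text).matches();
    }

    // Returns the optional ab prefix (group 1), or null if it is absent
    // or the text does not match at all.
    static String prefix(String text) {
        Matcher m = P.matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abxzy", "abdcdccd", "xzy", "c", "cd" }) {
            System.out.println(s + ": " + whole(s));   // true for each
        }
        System.out.println(whole("xzyc"));             // false
        System.out.println(whole("cxzy"));             // false
        System.out.println(prefix("abxzy"));           // ab
        System.out.println(prefix("xzy"));             // null: optional group did not take part
    }
}
```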

Positionally, you do not have many options; you can specify that the match should start at the beginning of the line (^), that it should end at the end of the line ($), or neither (no sign).
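A brief Java sketch of the positional anchors follows; the class name AnchorDemo and the helper found are illustrative. Note one Java-specific detail that is an addition to the text above: by default, ^ and $ refer to the start and end of the whole input, and the standard Pattern.MULTILINE flag makes them apply to each line.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Illustrative helper: true if the regex matches somewhere in the text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^apple", 0, "apple pie"));    // true
        System.out.println(found("^apple", 0, "green apple"));  // false
        System.out.println(found("apple$", 0, "green apple"));  // true
        System.out.println(found("apple$", 0, "apple pie"));    // false
        // Without MULTILINE, ^ matches only at the start of the input.
        System.out.println(found("^apple", 0, "pear\napple"));                  // false
        System.out.println(found("^apple", Pattern.MULTILINE, "pear\napple"));  // true
    }
}
```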

The lookahead and lookbehind options can be handy in certain situations too, but we will not cover them at this time.

Note

Beware. For certain patterns, the matching might take exponentially long; see http://en.wikipedia.org/wiki/ReDoS for examples. Take this as a warning not to accept arbitrary regular expressions as user input in your workflows.

Partial versus whole match

A pattern can be matched in two ways: you can test whether the whole text matches the pattern, or you can try to find the matching parts within the text (possibly multiple times). Usually, the partial match is used, but the whole match also has some use cases; for example, when you want to be sure that no remaining parts are present in the input.

Usage from Java

If you want to use regular expressions from Java, you have basically two options:

  • Use java.lang.String methods
  • Use java.util.regex.Pattern and related classes

In the first case, you do not have much control over the details; for example, a Pattern object is created on each call of the facade methods that delegate to the Pattern class (methods such as split, matches, replaceAll, or replaceFirst). Using Pattern and Matcher directly allows you to write efficient (compile once with Pattern#compile, then reuse) and complex conditions and transformations. However, in both cases you have to be careful, because the escaping rules of Java and the syntax of regular expressions do not combine easily. When you use \ in a regular expression inside a Java string literal, you have to double it within the quotes, so you should write "\\d" instead of \d, and "\\\\" instead of \\ to match a single \.
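The two options and the doubled backslashes can be compared in a short Java sketch. The class StringVsPattern, its methods, and the sample inputs are made up for illustration; the String and Pattern methods used are standard.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class StringVsPattern {
    // Compiled once and reused, instead of compiling on every call.
    static final Pattern SEPARATORS = Pattern.compile("[,;]");

    static String[] splitFields(String line) {
        return SEPARATORS.split(line);
    }

    // The regex \\ (one literal backslash) is written "\\\\" in Java source.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        // Option 1: String facade methods (the pattern is given on each call).
        System.out.println(Arrays.toString("a,b;c".split("[,;]")));  // [a, b, c]
        System.out.println("id 42".replaceFirst("\\d+", "N"));       // id N

        // Option 2: a precompiled Pattern.
        System.out.println(Arrays.toString(splitFields("x;y,z")));   // [x, y, z]

        // "C:\\temp\\data" is the text C:\temp\data with single backslashes.
        System.out.println(toForwardSlashes("C:\\temp\\data"));      // C:/temp/data
    }
}
```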

Tip

Automate the escaping

The QuickREx tool (see References and tools) can do the escaping. You create the pattern, test it, navigate to File | New... | Untitled Text File, and select the Copy RE to Java action from the menu or the QuickREx toolbar. (Now you can copy the pattern to the clipboard, insert it anywhere you want, and close the text editor.)

On the Pattern object, you can call the matcher method with the text as an argument and get a Matcher object. On the Matcher object, you can invoke either the find (for partial matches) or the matches (for whole matches) methods. As we described previously, you might have different results.
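The difference between find and matches can be sketched as follows. The class name FindVsMatches and its helper methods are illustrative assumptions; the Pattern and Matcher calls are the standard ones described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Partial match: is there a number somewhere in the text?
    static boolean partial(String text) {
        return NUMBER.matcher(text).find();
    }

    // Whole match: is the entire text a number?
    static boolean whole(String text) {
        return NUMBER.matcher(text).matches();
    }

    // Counts the partial matches by calling find repeatedly.
    static int count(String text) {
        Matcher m = NUMBER.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(partial("abc 123"));  // true
        System.out.println(whole("abc 123"));    // false
        System.out.println(whole("123"));        // true
        System.out.println(count("a1b22c333"));  // 3
    }
}
```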

References and tools

Tip

There is a Reg. Exp. Library view that is also included in QuickREx.

Alternative pattern description

In KNIME, there is an alternative, simpler form of pattern description named wildcard patterns. These are similar to the DOS/Windows or UNIX shell wildcard syntax. The * character matches zero or more characters (greedy match), whereas the ? character matches exactly one character. Because the star and question mark always have this special meaning, a wildcard pattern cannot match a literal * or ? character.
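To make the relationship between wildcard patterns and regular expressions concrete, here is a minimal Java sketch that translates a wildcard pattern into a regular expression. It is not KNIME's implementation: the class WildcardDemo and the helper wildcardToRegex are hypothetical, and the sketch assumes that a wildcard pattern has to match the whole text.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // Hypothetical helper: translates a wildcard pattern into a Java regex.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // zero or more characters (greedy)
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                // Every other character is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Assumption of this sketch: the wildcard must match the whole text.
    static boolean matches(String wildcard, String text) {
        return wildcardToRegex(wildcard).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("data_*.csv", "data_2024.csv"));  // true
        System.out.println(matches("data_*.csv", "data_2024xcsv"));  // false: the dot is literal
        System.out.println(matches("file?.txt", "file1.txt"));       // true
        System.out.println(matches("file?.txt", "file12.txt"));      // false: ? is one character
    }
}
```

As written, the regular expression dot does not match line terminators, so this sketch only suits single-line texts.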
