Regular expressions are excellent for simpler parsing tasks, replacements, and splits. We will give a short introduction to them and show some examples to give you a better idea of how they work. At the end of this section, we will suggest further reading.
Usually, when you write plain text as a pattern, that exact text will be matched; for example, the pattern apple or pear will match the final words of the following sentence: "Apple stores do not sell apple or pear."
Patterns are case sensitive by default, so if the pattern were simply apple, it would not match the first word of the sentence (the company name).
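The following minimal sketch replays the sentence above in Java. The class and the findAll helper are hypothetical names used only for illustration, and the Pattern.CASE_INSENSITIVE flag is an addition that the text above does not discuss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatchDemo {
    // Collects every non-overlapping match of the regex in the text.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Apple stores do not sell apple or pear.";
        // Plain text matches itself.
        System.out.println(findAll("apple or pear", 0, text)); // [apple or pear]
        // Case sensitive by default: the leading "Apple" is skipped.
        System.out.println(findAll("apple", 0, text)); // [apple]
        // With the case-insensitive flag, both occurrences are found.
        System.out.println(findAll("apple", Pattern.CASE_INSENSITIVE, text)); // [Apple, apple]
    }
}
```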
Some special characters need to be escaped when you want to match them literally: ., [, ], (, ), {, }, -, ^, $, and \ (well, some of these only in certain positions). To escape them, prefix them with \, which results in the following patterns: \., \[, \], \(, \), \{, \}, \-, \^, \$, and \\.
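As a small illustration of why escaping matters, the sketch below compares an unescaped and an escaped dot. The class and helper names are hypothetical; Pattern.quote is a standard Java method that escapes a whole literal at once. Note that inside a Java string literal each backslash has to be written twice.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True only if the whole text matches the regex.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // An unescaped dot is a special character, so it also accepts 'x'.
        System.out.println(wholeMatch("3.14", "3x14"));   // true
        // The escaped dot (regex 3\.14) accepts only a literal dot.
        System.out.println(wholeMatch("3\\.14", "3x14")); // false
        System.out.println(wholeMatch("3\\.14", "3.14")); // true
        // Pattern.quote treats the whole argument as literal text.
        System.out.println(wholeMatch(Pattern.quote("a+b"), "a+b")); // true
    }
}
```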
When you do not want an exact match of characters, you can put the possible options within [ and ] brackets, such as [abc], which will match either a, b, or c, but not bc (not a single character) or d (not among the options). You can specify a range of characters using the - character within brackets, such as [a-z], which will match any lowercase letter of the English alphabet. You can have multiple ranges and values within brackets, such as [a-zA-Z,], which will match either a lowercase or an uppercase letter or a comma (equivalent to [[a-z][A-Z][,]], but not to [a-z][A-Z][,], because the latter would match three characters, not one).
To negate a character class, use the ^ character as the first character within the brackets; for example, the [^0-9] pattern will match any single character that is not a digit (in Java, this includes line separators).
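The character class examples above can be checked with a few whole-text matches. This is only a sketch; the class and helper names are made up for the illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True only if the whole text matches the regex.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[abc]", "b"));           // true
        System.out.println(wholeMatch("[abc]", "bc"));          // false: two characters
        System.out.println(wholeMatch("[abc]", "d"));           // false: not among the options
        System.out.println(wholeMatch("[a-zA-Z,]", ","));       // true
        System.out.println(wholeMatch("[[a-z][A-Z][,]]", "Q")); // true: a union, still one character
        System.out.println(wholeMatch("[a-z][A-Z][,]", "aB,")); // true: three characters in a row
        System.out.println(wholeMatch("[^0-9]", "x"));          // true
        System.out.println(wholeMatch("[^0-9]", "7"));          // false
    }
}
```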
It can be tedious and error prone to always spell out certain groups of characters, so there are predefined sets/classes. Here is a non-exhaustive list of the most important ones:
\d: It identifies the decimal digits ([0-9])
\s: It identifies the whitespace characters
\n: It identifies a new line character (by default, only single lines are handled so new lines cannot be matched in that mode, but you can specify a multiline match too)
\w: It identifies the English alphabet (identifier) characters and decimal digits ([a-zA-Z_0-9])
You can also use these classes within brackets and negate them; for example, [^\d\s] (a character that is neither a whitespace nor a digit).
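A short sketch of the predefined classes in Java follows. The sample text, the class name, and the findAll helper are assumptions made for the illustration; remember that each backslash is doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Collects every non-overlapping match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "id_42 = 7";
        System.out.println(findAll("\\w+", text)); // [id_42, 7]
        System.out.println(findAll("\\d", text));  // [4, 2, 7]
        // Collapse runs of whitespace into a single space.
        System.out.println("a \t b".replaceAll("\\s+", " ")); // a b
        // Neither a digit nor a whitespace character.
        System.out.println(Pattern.matches("[^\\d\\s]", "x")); // true
        System.out.println(Pattern.matches("[^\\d\\s]", "5")); // false
    }
}
```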
On their own, these constructs match a fixed number of characters, so they are enough only when you know in advance how long the part you want to match is; usually, this is not the case. You can specify a range for the number of times a pattern should match using the {n,m} syntax, where n and m are nonnegative integers; for example, [ab]{1,3} will match a, aa, aaa, and bab, but not baba or the empty string.
When you do not specify m in this syntax ({n,}), the number of repetitions has no upper bound. When you omit the comma too ({n}), the preceding pattern has to appear exactly n times to get a match.
There are shorter versions for the most common cases: ? for {0,1}, * for {0,}, and + for {1,}.
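The quantifier forms described above can be tried out with whole-text matches. The sketch below uses hypothetical class and helper names.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only if the whole text matches the regex.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // {n,m}: between one and three characters from the class.
        System.out.println(wholeMatch("[ab]{1,3}", "bab"));  // true
        System.out.println(wholeMatch("[ab]{1,3}", "baba")); // false: four characters
        System.out.println(wholeMatch("[ab]{1,3}", ""));     // false: at least one is required
        // {n,}: no upper bound; {n}: exactly n times.
        System.out.println(wholeMatch("[ab]{2,}", "ababab")); // true
        System.out.println(wholeMatch("[ab]{2}", "aba"));     // false
        // Short forms: ? is {0,1}, * is {0,}, + is {1,}.
        System.out.println(wholeMatch("ab?", "a"));    // true
        System.out.println(wholeMatch("ab*", "abbb")); // true
        System.out.println(wholeMatch("ab+", "a"));    // false
    }
}
```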
When there is no suffix after these numeric or symbolic quantifiers, you are using a greedy match; if you append ?, the match is reluctant; and if you append a + sign, it is possessive. Here are some examples: [ab]+b, [ab]+?b, and [ab]++b. The details are important and are best shown by example. The following table lists the parts of each text that are matched by each pattern (multiple matches are separated with commas):
Text | [ab]+b | [ab]+?b | [ab]++b | [ab]+? | [ab]++ |
---|---|---|---|---|---|
abababbb | abababbb | ab, ab, ab, bb | (no match) | a, b, a, b, a, b, b, b | abababbb |
ababa | abab | ab, ab | (no match) | a, b, a, b, a | ababa |
abb | abb | ab | (no match) | a, b, b | abb |
In the last column, every example is a whole-text match, and so are the first and third rows of the first pattern column ([ab]+b); all the other examples are only partial matches or no match at all.
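To see the difference between greedy, reluctant, and possessive quantifiers yourself, you could run something like the following sketch, which lists the matches of each pattern on each example text. The class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedinessDemo {
    // Collects every non-overlapping match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] patterns = {"[ab]+b", "[ab]+?b", "[ab]++b", "[ab]+?", "[ab]++"};
        String[] texts = {"abababbb", "ababa", "abb"};
        for (String text : texts) {
            for (String pattern : patterns) {
                // An empty list means that the pattern found no match.
                System.out.println(text + "  " + pattern + "  " + findAll(pattern, text));
            }
        }
    }
}
```

The possessive [ab]++b never matches these texts: [ab]++ consumes every remaining character and does not give any of them back, so there is nothing left for the final b.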
You might want to create more complex conditions, but for those you need to group certain patterns. There are capturing groups and non-capturing groups. Capturing groups can be referred to by their number (there is always an implicit capturing group for the whole match; that is, group 0), whereas non-capturing groups are not available for further reference or processing, although they can be very useful when you want to separate unwanted parts. The syntax for capturing groups is (subpattern) and for non-capturing groups is (?:subpattern).
When you want to refer back to a previous group, use the \n notation, where n is the index of that group (groups are numbered in the order of their opening parentheses in the pattern).
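The sketch below illustrates group numbering and a back reference in Java. The date-like sample text and the names used are assumptions made for the illustration only.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group 0 (the whole match) and all capturing groups,
    // or null if the whole text does not match.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // The middle group is non-capturing, so it gets no number.
        String regex = "(\\d{4})-(?:\\d{2})-(\\d{2})";
        System.out.println(Arrays.toString(groups(regex, "2024-05-17")));
        // prints [2024-05-17, 2024, 17]

        // \1 must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("([ab])x\\1", "axa")); // true
        System.out.println(Pattern.matches("([ab])x\\1", "axb")); // false
    }
}
```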
There is also an option to create named groups using the (?<name>subpattern) syntax. (This feature is available since Java 7, so it will not work on Mac OS X until you can use KNIME with Java 7 or a later version.) You can refer back to named groups with the \k<name> syntax.
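A minimal sketch of named groups in Java (7 or later) might look like this. The group names year, month, and ch, as well as the helper method, are hypothetical choices for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Returns the text captured by the named group,
    // or null if the whole text does not match.
    static String namedGroup(String regex, String text, String name) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.matches() ? m.group(name) : null;
    }

    public static void main(String[] args) {
        String regex = "(?<year>\\d{4})-(?<month>\\d{2})";
        System.out.println(namedGroup(regex, "2024-05", "year"));  // 2024
        System.out.println(namedGroup(regex, "2024-05", "month")); // 05

        // \k<ch> must repeat exactly what the group named ch captured.
        System.out.println(Pattern.matches("(?<ch>[ab])x\\k<ch>", "bxb")); // true
        System.out.println(Pattern.matches("(?<ch>[ab])x\\k<ch>", "bxa")); // false
    }
}
```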
With these groups, you can express not just more kinds of quantification, but also alternatives using the | (or) construct; for example, (ab)?((?:[cd]+)|(?:xzy)) means an optional ab group followed by either a sequence of c or d characters or the text xzy. As whole-text matches, abxzy, abdcdccd, xzy, c, and cd will match, but xzyc and cxzy will not.
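The alternation example can be verified with whole-text matches, as in this sketch (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Compiled once; an optional "ab", then either c/d characters or "xzy".
    static final Pattern PATTERN = Pattern.compile("(ab)?((?:[cd]+)|(?:xzy))");

    // True only if the whole text matches the pattern.
    static boolean wholeMatch(String text) {
        return PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String text : new String[] {"abxzy", "abdcdccd", "xzy", "c", "cd", "xzyc", "cxzy"}) {
            // The first five texts print true, the last two print false.
            System.out.println(text + " -> " + wholeMatch(text));
        }
    }
}
```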
Positionally, you do not have many options; you can specify whether the match should start at the beginning of the line (^), or it should match till the end of the line ($), or you do not care (no sign).
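The positional markers can be sketched in Java as follows; the found helper is a hypothetical name. Note that in Java, unless the Pattern.MULTILINE flag is set, ^ and $ refer to the beginning and the end of the whole input rather than of each line.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True if the regex matches somewhere within the text (partial match).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("ab", "cabc"));  // true: no positional constraint
        System.out.println(found("^ab", "abc"));  // true: match starts at the beginning
        System.out.println(found("^ab", "cab"));  // false
        System.out.println(found("ab$", "cab"));  // true: match reaches the end
        System.out.println(found("ab$", "abc"));  // false
    }
}
```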
The lookahead and lookbehind options can be handy in certain situations too, but we will not cover them at this time.
Beware: for certain patterns, the matching might take exponentially long; see http://en.wikipedia.org/wiki/ReDoS for examples. Take this as a warning not to accept arbitrary regular expressions as user input in your workflows.
A pattern can be matched in two ways: you can test whether the whole text matches the pattern, or you can try to find the matching parts within the text (possibly multiple times). Usually, the partial match is used, but the whole match also has some use cases; for example, when you want to be sure that no remaining parts are present in the input.
If you want to use regular expressions from Java, you basically have two options: the convenience methods of the String class, or the Pattern and Matcher classes.
In the first case, you do not have much control over the details; for example, a Pattern object will be created for each call of the facade methods delegating to the Pattern class (methods such as split, matches, replaceAll, or replaceFirst). Using Pattern and Matcher directly allows you to write efficient (using Pattern#compile) and complex conditions and transformations. However, in both cases, you have to be careful, because the escaping rules of Java and the syntax of regular expressions do not make them an easy match. When you use \ in a regular expression within a string literal, you have to double it within the quotes, so you should write \\d instead of \d, and \\\\ instead of \\ to match a single \.
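The two options and the doubled backslashes can be compared in a short sketch. The sample text and the names DIGITS, splitOnDigits, and maskDigits are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class JavaEscapingDemo {
    // Compiled once and reused, instead of being recompiled on every call.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    static String[] splitOnDigits(String text) {
        return DIGITS.split(text);
    }

    static String maskDigits(String text) {
        return DIGITS.matcher(text).replaceAll("#");
    }

    public static void main(String[] args) {
        String text = "a1b22c";
        // Option 1: String convenience methods.
        System.out.println(Arrays.toString(text.split("\\d+"))); // [a, b, c]
        System.out.println(text.replaceAll("\\d", "#"));          // a#b##c
        // Option 2: a precompiled Pattern.
        System.out.println(Arrays.toString(splitOnDigits(text))); // [a, b, c]
        System.out.println(maskDigits(text));                     // a#b#c
        // The regex \\ (written "\\\\" in Java source) matches one backslash.
        System.out.println(Pattern.matches("\\\\", "\\"));        // true
    }
}
```

When the same pattern is applied to many values, the precompiled form avoids repeating the compilation work on each call.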
Automate the escaping
The QuickREx tool (see References, tools) can do the escaping. You create the pattern, test it, navigate to File | New... | Untitled Text File, and select the Copy RE to Java action from the menu or the QuickREx toolbar. (Now you can copy the pattern to the clipboard, insert it anywhere you want, and close the text editor.)
On the Pattern object, you can call the matcher method with the text as an argument to get a Matcher object. On the Matcher object, you can invoke either the find (for partial matches) or the matches (for whole matches) method. As we described previously, the two might give different results.
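The difference between find and matches can be sketched like this, reusing the [ab]+b pattern from the earlier examples; the helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    static final Pattern PATTERN = Pattern.compile("[ab]+b");

    // Whole match: the entire text has to fit the pattern.
    static boolean wholeMatch(String text) {
        return PATTERN.matcher(text).matches();
    }

    // Partial match: returns the first matching part, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("ababa")); // false: the trailing a does not fit
        System.out.println(firstMatch("ababa")); // abab
        System.out.println(wholeMatch("abb"));   // true
        System.out.println(firstMatch("aaa"));   // null
    }
}
```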
The documentation of the Pattern class is a good summary; you can refer to it at: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
In KNIME, there is an alternative, simpler form of pattern description named wildcard patterns. These are similar to the DOS/Windows or UNIX shell wildcard syntax. The * character matches zero or more characters (greedy match), whereas the ? character matches exactly one character. Consequently, these patterns cannot match a literal star or question mark character.
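To relate wildcard patterns to regular expressions, the following sketch translates a wildcard pattern into an equivalent regular expression. It is only an illustration under the rules stated above, not KNIME's actual implementation; the wildcardToRegex and wildcardMatches helpers are hypothetical.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // Hypothetical helper: translates a wildcard pattern to a regular expression.
    static String wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*"); // zero or more characters, greedy
            } else if (c == '?') {
                regex.append('.');  // exactly one character
            } else {
                // Every other character is taken literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return regex.toString();
    }

    // True only if the whole text matches the wildcard pattern.
    static boolean wildcardMatches(String wildcard, String text) {
        return Pattern.matches(wildcardToRegex(wildcard), text);
    }

    public static void main(String[] args) {
        System.out.println(wildcardMatches("data_*.csv", "data_2024.csv")); // true
        System.out.println(wildcardMatches("data_*.csv", "data_.csv"));     // true: * may match nothing
        System.out.println(wildcardMatches("file?.txt", "file1.txt"));      // true
        System.out.println(wildcardMatches("file?.txt", "file12.txt"));     // false
    }
}
```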