Regular expressions

To search for and match patterns in text and other data, regular expressions are an indispensable tool for the data scientist. Julia adheres to the Perl syntax of regular expressions. For a complete reference, refer to http://www.regular-expressions.info/reference.html. Regular expressions are represented in Julia as a double (or triple) quoted string preceded by r, such as r"..." (optionally, followed by one or more of the i, s, m, or x flags), and they are of type Regex. The regexp.jl script shows some examples.
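As a brief aside on the flags mentioned above, the following sketch shows what each of them changes. The sample strings are hypothetical and chosen only for illustration; they are not taken from regexp.jl.

```julia
# Hypothetical sample text, used only for this illustration
text = "Julia\nJULIA\njulia"

println(typeof(r"abc"))                # a regex literal is of type Regex

# i: case-insensitive matching
println(occursin(r"JULIA", "julia"))   #> false
println(occursin(r"JULIA"i, "julia"))  #> true

# m: ^ and $ match at the start and end of every line
println(length(collect(eachmatch(r"^julia$"im, text))))  #> 3

# s: the dot also matches a newline
println(occursin(r"Julia.JULIA", text))   #> false
println(occursin(r"Julia.JULIA"s, text))  #> true

# x: unescaped whitespace inside the pattern is ignored
println(occursin(r"\d{4} - \d{2}", "2019-01"))   #> false
println(occursin(r"\d{4} - \d{2}"x, "2019-01"))  #> true
```

Flags can be combined, as in r"^julia$"im, which is both case-insensitive and multiline.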

In the first example, we will match the email addresses (#> shows the result):

email_pattern = r".+@.+" 
input = "john.doe@mit.edu" 
println(occursin(email_pattern, input)) #> true

The regular expression pattern .+ matches any (non-empty) group of characters: the dot stands for any character, and + means one or more occurrences. Thus, this pattern matches any string that contains @ somewhere in the middle.

In the second example, we will try to determine whether a credit card number is valid or not:

visa = r"^(?:4[0-9]{12}(?:[0-9]{3})?)$"  # the pattern 
input = "4457418557635128" 
occursin(visa, input)  #> true 
if occursin(visa, input) 
    println("credit card found") 
    m = match(visa, input) 
    println(m.match) #> 4457418557635128 
    println(m.offset) #> 1 
    println(m.offsets) #> Int64[] (empty, because the pattern has no capturing groups) 
end

The occursin(regex, string) function returns true or false, depending on whether the given regex matches the string, so we can use it in an if expression. If you want the detailed information of the pattern matching, use match instead of occursin. This either returns nothing when there is no match, or an object of type RegexMatch when the pattern is found (nothing is, in fact, a value to indicate that nothing is returned or printed, and it has a type of Nothing).
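Because match can return nothing, it is safest to test the result before accessing its fields. The following is a minimal sketch of that check; describe is a hypothetical helper name introduced only for this illustration.

```julia
visa = r"^(?:4[0-9]{12}(?:[0-9]{3})?)$"

# describe is a hypothetical helper, not part of Julia or of regexp.jl
function describe(input)
    m = match(visa, input)
    if m === nothing
        return "no match"
    else
        return "matched $(m.match) at offset $(m.offset)"
    end
end

println(describe("4457418557635128"))  #> matched 4457418557635128 at offset 1
println(describe("1234"))              #> no match
println(typeof(match(visa, "1234")))   #> Nothing
```

Accessing m.match directly on a failed match would raise an error, since nothing has no such field.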

The RegexMatch object has the following properties:

match contains the entire substring that matches (in this example, it contains the complete number)
offset states at what position the matching begins (here, it is 1)
offsets gives the same information as the preceding line, but for each of the captured substrings
captures contains the captured substrings as an array (refer to the following example)

Besides checking whether a string matches a particular pattern, regular expressions can also be used to capture parts of the string. We can do this by enclosing parts of the pattern in parentheses ( ). For instance, to capture the username and hostname in the email address pattern used earlier, we modify the pattern as follows:

email_pattern = r"(.+)@(.+)"

Notice how the characters before and after @ are each enclosed in parentheses. This tells the regular expression engine that we want to capture these two specific sets of characters. To see how this works, consider the following example:

email_pattern = r"(.+)@(.+)" 
input = "john.doe@mit.edu" 
m = match(email_pattern, input) 
println(m.captures) #> Union{Nothing, SubString{String}}["john.doe", "mit.edu"]
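The captured substrings can also be picked out individually. The sketch below assumes the input address is john.doe@mit.edu, which is consistent with the captures printed above; it shows destructuring the captures array and indexing the RegexMatch object directly.

```julia
email_pattern = r"(.+)@(.+)"
m = match(email_pattern, "john.doe@mit.edu")  # assumed input address

# Destructure the captures array into two variables
user, host = m.captures
println(user)       #> john.doe
println(host)       #> mit.edu

# A RegexMatch can also be indexed by capture group number
println(m[1])       #> john.doe
println(m[2])       #> mit.edu

# offsets holds the starting position of each captured substring
println(m.offsets)  #> [1, 10]
```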

Here is another example:

m = match(r"(ju|l)(i)?(a)", "Julia") 
println(m.match) #> "lia" 
println(m.captures) #> l - i - a 
println(m.offset) #> 3 
println(m.offsets) #> 3 - 4 - 5
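The element type Union{Nothing, SubString{String}} of the captures array exists because an optional group, such as (i)? here, may not take part in the match. A small sketch with a hypothetical input, "Jula", in which the optional group finds nothing:

```julia
# "Jula" is a hypothetical input chosen so that the optional group (i)? does not match
m = match(r"(ju|l)(i)?(a)", "Jula")
println(m.match)                    #> la
println(m.captures[1])              #> l
println(m.captures[2] === nothing)  #> true
println(m.captures[3])              #> a
println(m.offsets)                  #> [3, 0, 4]
```

The unmatched group is reported as nothing in captures and with an offset of 0 in offsets.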

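The replace function also takes regular expressions as arguments, as does findfirst, which supersedes the older search function as of Julia 1.0. For example, replace("Julia", r"u[\w]*l" => "red") returns "Jredia". If you want to work with all the matches, eachmatch comes in handy (the older matchall function was removed in Julia 1.0):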

str = "The sky is blue"
reg = r"[\w]{3,}" # matches words of 3 chars or more 
r = collect((m.match for m = eachmatch(reg, str)))
show(r) #> SubString{String}["The", "sky", "blue"]

iter = eachmatch(reg, str) 
for i in iter 
    println("\"$(i.match)\" ") 
end

The collect function returns an array containing the matched substring (m.match) of each RegexMatch. eachmatch returns an iterator, iter, over all the matches, which we can loop through with a simple for loop. The screen output is "The", "sky", and "blue", printed on consecutive lines.
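Returning to searching and replacing with regular expressions, here is a minimal sketch for Julia 1.0 and later, where findfirst accepts a regex and returns the index range of the first match. The string "doe john" and the variable name swapped are hypothetical, used only to illustrate referring to captured groups in the replacement through a substitution string (s"...").

```julia
# Replace the first stretch from "u" up to an "l"
println(replace("Julia", r"u[\w]*l" => "red"))  #> Jredia

# findfirst returns the range of indices covered by the first match
println(findfirst(r"u[\w]*l", "Julia"))         #> 2:3

# Captured groups can be referenced in the replacement as \1, \2, ...
swapped = replace("doe john", r"(\w+) (\w+)" => s"\2 \1")
println(swapped)                                #> john doe
```

When there is no match, findfirst returns nothing, just as match does, and replace returns the string unchanged.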
