Sometimes, you will need to sanitize input. This could be to guard against SQL injection or to ensure that an entered URL is valid. In this recipe, we will look at replacing the bad words in a string with asterisks. There are more elegant and efficient ways of writing sanitization logic with regex (especially when the blacklist of words is large), but we want to illustrate a concept here.
Ensure that you have imported the correct namespace in your class file. At the top of your code file, add the following using directive if you haven't done so already:
using System.Text.RegularExpressions;
Create a new method in your Recipes.cs class called SanitizeInput() and let it accept a string parameter:
public string SanitizeInput(string input)
{
}
Add a List<string> to the method that contains the bad words we want to remove from the input:
List<string> lstBad = new List<string>(new string[] { "BadWord1", "BadWord2", "BadWord3" });
Start building the regex that will look for the blacklisted words. Concatenate the words with the | (OR) metacharacter so that the regex will match any of the words. When the list is complete, you can append the \b expression to either side of the regex. This denotes a word boundary and, therefore, will only match whole words:
string pattern = "";
foreach (string badWord in lstBad)
    pattern += pattern.Length == 0 ? $"{badWord}" : $"|{badWord}";
pattern = $@"\b({pattern})\b";
Finally, add the Regex.Replace() method, which takes the input and looks for occurrences of the words defined in the pattern, ignoring case and replacing the bad words with *****:
return Regex.Replace(input, pattern, "*****", RegexOptions.IgnoreCase);
When you have completed this, your SanitizeInput() method will look like this:
public string SanitizeInput(string input)
{
    List<string> lstBad = new List<string>(new string[] { "BadWord1", "BadWord2", "BadWord3" });
    string pattern = "";
    foreach (string badWord in lstBad)
        pattern += pattern.Length == 0 ? $"{badWord}" : $"|{badWord}";
    pattern = $@"\b({pattern})\b";
    return Regex.Replace(input, pattern, "*****", RegexOptions.IgnoreCase);
}
In the console application, add the following code to call the SanitizeInput() method and run your application:
string textToSanitize = "This is a string that contains a badword1, another Badword2 and a third badWord3";
Chapter9.Recipes oRecipe = new Chapter9.Recipes();
textToSanitize = oRecipe.SanitizeInput(textToSanitize);
WriteLine(textToSanitize);
Read();
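The introduction mentions that the pattern can be built more compactly. The following is only a sketch of one such variation, not the recipe's own code: it joins the words with string.Join instead of a loop and passes each word through Regex.Escape, which is an addition that the recipe does not use. The class name SanitizerSketch and the BuildPattern helper are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical class name, used only for this sketch.
public static class SanitizerSketch
{
    // Same placeholder words as in the recipe.
    private static readonly List<string> BadWords =
        new List<string> { "BadWord1", "BadWord2", "BadWord3" };

    // Hypothetical helper: builds \b(word1|word2|...)\b from a list of words.
    public static string BuildPattern(IEnumerable<string> words)
    {
        // Regex.Escape is an addition to the recipe: it keeps characters such
        // as '.' or '+' inside a word from being read as regex metacharacters.
        string alternatives = string.Join("|", words.Select(w => Regex.Escape(w)));

        // Verbatim interpolated string, so \b reaches the regex engine
        // as a word boundary.
        return $@"\b({alternatives})\b";
    }

    public static string SanitizeInput(string input)
    {
        return Regex.Replace(input, BuildPattern(BadWords), "*****", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(BuildPattern(BadWords));
        // prints \b(BadWord1|BadWord2|BadWord3)\b

        Console.WriteLine(SanitizeInput("a badword1, another Badword2 and a third badWord3"));
        // prints a *****, another ***** and a third *****
    }
}
```

For the three placeholder words, Regex.Escape changes nothing, so the resulting pattern is the same one the loop in the recipe produces.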
Let's take a closer look at the regular expression generated.
Let's step through the code to understand what is happening. We need to get a regex that looks like this: \b(wordToMatch1|wordToMatch2|wordToMatch3)\b.
What this basically says is "find me any of these words, and only as whole words", which is what the \b on either side denotes. The list we created contains the words we want to remove from the input string.
We then created a simple loop that builds the list of words to match using the OR metacharacter. We ended up with the BadWord1|BadWord2|BadWord3 pattern after the foreach loop completed. However, this is not yet the regex we need, because it would also match these words when they appear inside longer words.
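To see why the word boundaries matter, the following sketch compares a pattern without \b to one with it. The word "cat", the sample sentence, and the class and method names are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical class and method names, used only for this comparison.
public static class BoundaryDemo
{
    public static string ReplaceAnywhere(string input)
    {
        // No word boundaries: "cat" is matched wherever it occurs,
        // including inside longer words.
        return Regex.Replace(input, "(cat)", "***", RegexOptions.IgnoreCase);
    }

    public static string ReplaceWholeWords(string input)
    {
        // \b on both sides: "cat" is matched only as a whole word.
        return Regex.Replace(input, @"\b(cat)\b", "***", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "Concatenate the cat";
        Console.WriteLine(ReplaceAnywhere(input));   // prints Con***enate the ***
        Console.WriteLine(ReplaceWholeWords(input)); // prints Concatenate the ***
    }
}
```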
To complete the pattern, we need to wrap it in parentheses and add the \b expression on either side to tell the regex engine to only match whole words. As you can see, we are using string interpolation. String interpolation is covered in detail in Chapter 1, New Features in C#6.
It is here, however, that we need to be very careful. Start off by writing the code to complete the pattern without the @ sign, as follows:
pattern = $"\b({pattern})\b";
If you run your console application, you will see that the bad words are not matched and filtered out. This is because we have not escaped the \ character before b. The compiler, therefore, interprets \b in this line of code as the escape sequence for a backspace character (U+0008). The generated expression, which has a literal backspace character on either side of (BadWord1|BadWord2|BadWord3), does not match the words in the input and will therefore not sanitize the input string.
To correct this, we need to add the @ symbol before the string to tell the compiler to treat it as a verbatim string literal. This means escape sequences such as \b are not processed by the compiler. The correctly formatted line of code looks like this:
pattern = $@"\b({pattern})\b";
Once you do this, the string for the pattern is interpreted literally by the compiler, and the correct regex pattern is generated: \b(BadWord1|BadWord2|BadWord3)\b
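The difference between the two string forms can be checked directly. The following is a minimal sketch; the class name EscapeDemo and the sample input are hypothetical and not part of the recipe.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical class name, used only for this sketch.
public static class EscapeDemo
{
    public static void Main()
    {
        string regular = "\b";    // C# escape sequence: one backspace character (U+0008)
        string verbatim = @"\b";  // two characters: a backslash followed by 'b'

        Console.WriteLine(regular.Length);   // prints 1
        Console.WriteLine((int)regular[0]);  // prints 8
        Console.WriteLine(verbatim.Length);  // prints 2

        string input = "a badword1 here";

        // The regex engine receives literal backspace characters, which do not occur in the input.
        Console.WriteLine(Regex.IsMatch(input, "\b(badword1)\b"));   // prints False

        // The regex engine receives \b and treats it as a word boundary.
        Console.WriteLine(Regex.IsMatch(input, @"\b(badword1)\b"));  // prints True
    }
}
```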
With our correct regex pattern, we called the Regex.Replace() method. It takes the input to check, the regex to match, the text to replace the matched words with, and optionally a RegexOptions value, which we used to ignore case.
When the string returns to the calling code in the console application, it will be sanitized properly, with each bad word replaced by *****.
Regex can become quite complex and can be used to perform a multitude of tasks to format and validate input and other text.