Sanitizing input

Sometimes you will need to sanitize input, for example to guard against SQL injection or to ensure that an entered URL is valid. In this recipe, we will look at replacing the bad words in a string with asterisks. There are more elegant and efficient ways of writing sanitization logic using regex (especially when the blacklist is large), but we want to illustrate a concept here.

Getting ready

Ensure that you have imported the correct namespaces into your class file. At the top of your code file, add the following using directives if you haven't done so already (the second one is needed for List<string>):

using System.Text.RegularExpressions;
using System.Collections.Generic;

How to do it…

  1. Create a new method in your Recipes.cs class called SanitizeInput() and let it accept a string parameter:
    public string SanitizeInput(string input)
    {
                    
    }
  2. Add a list of type List<string> to the method that contains the bad words we want to remove from the input:
    List<string> lstBad = new List<string>(new string[] { "BadWord1", "BadWord2", "BadWord3" });

    Note

    In reality, you might make use of a database call to read the blacklisted words from a table in the database. You would usually not hardcode them in a list like this.

  3. Start constructing the regex that we will use to look for the blacklisted words. Concatenate the words with the | (OR) metacharacter so that the regex will match any of the words. When the list is complete, append the \b expression to either side of the regex. This denotes a word boundary and, therefore, will only match whole words:
    string pattern = "";
    foreach (string badWord in lstBad)
        pattern += pattern.Length == 0 ? $"{badWord}" : $"|{badWord}";
    
    pattern = $@"\b({pattern})\b";
  4. Finally, we will add the Regex.Replace() method that takes the input and looks for the occurrence of the words defined in the pattern, while ignoring case and replacing the bad words with *****:
    return Regex.Replace(input, pattern, "*****", RegexOptions.IgnoreCase);
  5. When you have completed this, your SanitizeInput() method will look like this:
    public string SanitizeInput(string input)
    {
        List<string> lstBad = new List<string>(new string[] { "BadWord1", "BadWord2", "BadWord3" });
        string pattern = "";
        foreach (string badWord in lstBad)
            pattern += pattern.Length == 0 ? $"{badWord}" : $"|{badWord}";
    
        pattern = $@"\b({pattern})\b";
    
        return Regex.Replace(input, pattern, "*****", RegexOptions.IgnoreCase);            
    }
  6. In the console application, add the following code to call the SanitizeInput() method and run your application:
    string textToSanitize = "This is a string that contains a badword1, another Badword2 and a third badWord3";
    Chapter9.Recipes oRecipe = new Chapter9.Recipes();
    textToSanitize = oRecipe.SanitizeInput(textToSanitize);
    WriteLine(textToSanitize);
    Read();
  7. When you run your application, you will see the following in the console window:
    This is a string that contains a *****, another ***** and a third *****

Let's take a closer look at the regular expression generated.

How it works…

Let's step through the code to understand what is happening. We need to get a regex that looks like this: \b(wordToMatch1|wordToMatch2|wordToMatch3)\b.

What this says is: find any of these words, and only as whole words, which is what the \b on either side denotes. The list we created, lstBad, holds the words we want to remove from the input string: BadWord1, BadWord2, and BadWord3.

We then created a simple loop that builds the list of words to match using the OR metacharacter. After the foreach loop completed, we ended up with the pattern BadWord1|BadWord2|BadWord3. However, this is not yet the complete regex, because it would also match the bad words when they appear inside longer words.
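The loop above can also be written more compactly. The following is a minimal sketch, not the recipe's own code: PatternBuilder and BuildAlternation are hypothetical names introduced for illustration. It joins the words with string.Join() and, as an extra precaution the recipe does not take, passes each word through Regex.Escape() so that a word containing regex metacharacters is matched literally.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Hypothetical helper: builds the alternation part of the pattern,
    // for example "BadWord1|BadWord2|BadWord3".
    public static string BuildAlternation(IEnumerable<string> words)
    {
        // Regex.Escape() makes metacharacters such as + or . literal.
        return string.Join("|", words.Select(w => Regex.Escape(w)));
    }
}
```

For the three words in lstBad, this should yield the same BadWord1|BadWord2|BadWord3 string that the foreach loop produces, since none of those words contain metacharacters.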

To complete the pattern, we need to add the \b expression on either side of it to tell the regex engine to only match whole words. As you can see, we are using string interpolation. String interpolation is covered in detail in Chapter 1, New Features in C# 6.

It is here, however, that we need to be very careful. Start off by writing the code to complete the pattern without the @ sign, as follows:

pattern = $"\b({pattern})\b";

If you run your console application, you will see that the bad words are not matched and filtered out. This is because we have not escaped the \ character before b. In a regular string, the compiler interprets \b as the escape sequence for the backspace character (U+0008), so the generated pattern has a backspace character on either side of (BadWord1|BadWord2|BadWord3) instead of a word boundary.

That expression only matches when the words are surrounded by literal backspace characters and will therefore not sanitize the input string.

To correct this, we need to add the @ symbol before the string to make it a verbatim string literal. In a verbatim string, backslash escape sequences are not processed, so \b reaches the regex engine unchanged. The correctly formatted line of code looks like this:

pattern = $@"\b({pattern})\b";

Once you do this, the compiler takes the pattern string verbatim, and the correct regex pattern is generated: \b(BadWord1|BadWord2|BadWord3)\b.
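The difference between the two string forms can be checked directly. The following is a small sketch under the assumption that you want to compare them side by side; EscapeDemo, Regular, and Verbatim are hypothetical names used only for this illustration.

```csharp
public static class EscapeDemo
{
    // Regular interpolated string: the compiler turns \b into a single
    // backspace character (U+0008).
    public static string Regular(string words) => $"\b({words})\b";

    // Verbatim interpolated string: \b stays as the two characters
    // backslash and b, which the regex engine reads as a word boundary.
    public static string Verbatim(string words) => $@"\b({words})\b";
}
```

Calling EscapeDemo.Regular("A|B") returns a seven-character string that starts with a backspace, while EscapeDemo.Verbatim("A|B") returns the nine-character string \b(A|B)\b.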

With our correct regex pattern, we called the Regex.Replace() method. It takes the input to check, the regex to match, and the text to replace the matched words with. The optional RegexOptions.IgnoreCase argument makes the match case-insensitive.

When the string returns to the calling code in the console application, it has been sanitized properly, with each bad word replaced by *****.
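To see what the word boundaries contribute, it can help to run the same replacement with and without them. This is only an illustrative sketch: WholeWordDemo and Mask are hypothetical names, and the input strings are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Hypothetical helper: masks every match of the given pattern,
    // ignoring case, in the same way as the recipe's Regex.Replace() call.
    public static string Mask(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "*****", RegexOptions.IgnoreCase);
    }
}
```

With the pattern @"\b(BadWord1)\b", the input "badword1 and BadWord1s" becomes "***** and BadWord1s", because the second occurrence is part of a longer word. With the pattern "(BadWord1)", both occurrences are masked and the result is "***** and *****s".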

Regex can become quite complex and can be used to perform a multitude of tasks to format and validate input and other text.
