When dealing with textual data, you may need to find or replace words that follow a particular pattern. For instance, you may wish to find words that end with “al” when carrying out data wrangling. Regular expressions are a convenient way to go about this in Natural Language Processing. They are a powerful tool for finding, splitting, or replacing text according to some pattern. Regular expressions can help you extract key information from dirty data during data analysis. You can quickly pull out dates, prices, customer email addresses, or telephone numbers.
You can also go beyond pattern matching with regular expressions. You may want to preprocess the format or markup of texts in a document. You may want to ensure that the first word in a sentence begins with a capital letter, or that sentences in the form of questions end with a question mark. During web scraping, you may want to extract texts with a particular tag. You can, for instance, extract the texts in the <abrev></abrev> tag and create a list of abbreviations with the extracted texts.
Regular expressions have become very popular over the years. At the moment, many programming languages such as Java, Python, C, Perl, and many more support regular expressions. In this tutorial, you will learn how to use regular expressions in Python. We’ll go further by looking at their use cases and working through some examples. Without further ado, let’s jump into it.
To use regular expressions in Python, start by importing the re module:
import re
Regular Expression Building Blocks
- The Wildcard
The “.” symbol is referred to as the wildcard because it matches any single character. If we create the regular expression “dr.nk”, for instance, it would match the words drink, drank, and drunk. Note that the “.” matches exactly one character. This implies that where we want to match two or more characters, the “.” should be repeated once for each character. For example, “..ng” matches all four-letter words that end in “ng”.
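Here is a small sketch of the wildcard in action. The word list is made up for illustration, and re.fullmatch() (which requires the whole word to fit the pattern) is used here for convenience even though it is not one of the functions discussed in this tutorial.

```python
import re

# Hypothetical word list, chosen only to illustrate the wildcard.
words = ["drink", "drank", "drunk", "dunk", "sing", "king", "string"]

# "." stands for exactly one character, so "dr.nk" needs five letters.
wildcard_hits = [w for w in words if re.fullmatch(r"dr.nk", w)]
print(wildcard_hits)  # ['drink', 'drank', 'drunk']

# Two dots followed by "ng": only four-letter words qualify.
four_letter_ng = [w for w in words if re.fullmatch(r"..ng", w)]
print(four_letter_ng)  # ['sing', 'king']
```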
- Repeatability
The “+” sign indicates that the immediately preceding character must appear at least once and may be repeated any number of times. The expression “brus+h” matches words such as brush, brussh, brusssh, and so on. The + symbol particularly shines when used alongside the “.” symbol. Applied to a single word, the expression “b.+h” matches any word that starts with the letter b, ends with the letter h, and has at least one character in between. The expression “.+ing” matches any word that ends with the suffix -ing.
The “*” indicates that the immediately preceding character is optional and repeatable, that is, it may appear zero or more times. The expression “.*fit.*” matches all words that contain “fit”, including “fit” itself.
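The difference between “+” and “*” is easiest to see side by side. The following is a minimal sketch with made-up word lists; re.fullmatch() is again an assumption of this example, used so that each whole word is tested against the pattern.

```python
import re

# Hypothetical candidates for illustration.
candidates = ["bruh", "brush", "brussh", "brusssh"]

# "s+" needs at least one "s"; "s*" also accepts none at all.
plus_hits = [w for w in candidates if re.fullmatch(r"brus+h", w)]
star_hits = [w for w in candidates if re.fullmatch(r"brus*h", w)]
print(plus_hits)  # ['brush', 'brussh', 'brusssh']
print(star_hits)  # ['bruh', 'brush', 'brussh', 'brusssh']

# ".*fit.*" accepts any word containing "fit", including "fit" itself.
fit_words = ["fit", "outfit", "fitness", "fat"]
fit_hits = [w for w in fit_words if re.fullmatch(r".*fit.*", w)]
print(fit_hits)  # ['fit', 'outfit', 'fitness']
```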
- Optionality
The “?” symbol indicates that the immediately preceding character is optional: it may appear once or not at all. The expression “odou?r” matches both “odor” and “odour”. The symbol can also be used after punctuation marks such as a hyphen. The expression “e-?mail” matches both “email” and “e-mail”.
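As a quick illustration, both spellings can be picked out of the same sentence. The sentence is invented for this sketch, and re.findall() is covered in more detail later in this tutorial.

```python
import re

# Hypothetical sentence containing both spelling variants.
text = "Send an email or an e-mail about the odor (or odour) in the lab."

# "u?" makes the "u" optional.
spellings = re.findall(r"odou?r", text)
print(spellings)  # ['odor', 'odour']

# "-?" makes the hyphen optional.
mail_forms = re.findall(r"e-?mail", text)
print(mail_forms)  # ['email', 'e-mail']
```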
- Choices
While the wildcard allows you to select any character, there are situations where you may want to limit the character choices to a few options. The “[]” notation is used for this purpose. The expression “f[aeiou]n” matches words like fan, fen, fin, and fun. You may add a little flexibility with the + symbol. Placed after the brackets, + allows one or more characters from the set, and each repetition may pick a different character. The expression “p[aeiou]+t” matches words like pout, poet, and peat.
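The two bracket patterns above can be tried out as follows. This is only a sketch: the word list is hypothetical, and re.fullmatch() is used so that each whole word has to fit the pattern.

```python
import re

# Hypothetical word list for illustration.
words = ["fan", "fen", "fin", "fun", "fern", "pout", "poet", "peat", "pt", "part"]

# Exactly one vowel between "f" and "n".
one_vowel = [w for w in words if re.fullmatch(r"f[aeiou]n", w)]
print(one_vowel)  # ['fan', 'fen', 'fin', 'fun']

# One or more vowels between "p" and "t"; the vowels need not be the same.
many_vowels = [w for w in words if re.fullmatch(r"p[aeiou]+t", w)]
print(many_vowels)  # ['pout', 'poet', 'peat']
```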
- Ranges
When using the [] notation, you have to list all the characters to choose from individually. But if these characters are within a range, you can use the “-” between the first and last characters. The expression [a-z], for instance, captures all lowercase letters.
When you combine ranges with other symbols, you can do even more powerful things. The expression [A-Z]+ matches words written entirely in capital letters, such as acronyms or abbreviations. [a-zA-Z] matches any single lowercase or uppercase letter.
There are other important metacharacters such as $, ^, \w, \d, etc. The table below shows the metacharacters and their applications.
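A short sketch of ranges follows, using an invented sentence. It also shows one pitfall worth knowing: a range follows character-code order, so a range written as [A-z] covers more than letters. The \b word-boundary marker used here is listed in the table further down.

```python
import re

# Hypothetical sentence for illustration.
text = "NASA and the UN met in Geneva."

# One or more uppercase letters between word boundaries: all-caps words only.
acronyms = re.findall(r"\b[A-Z]+\b", text)
print(acronyms)  # ['NASA', 'UN']

# [A-z] spans ASCII codes 65 to 122, which also includes [ \ ] ^ _ and `.
loose = re.findall(r"[A-z]", "[_]")
strict = re.findall(r"[a-zA-Z]", "[_]")
print(loose)   # ['[', '_', ']']
print(strict)  # []
```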
Notation | Characteristics |
. | Used to match any single character (except a newline, by default) |
* | Used to match zero or more of the preceding item |
+ | Used to match one or more of the preceding items |
? | Used to match zero or one of the preceding items |
^xyz | Used to match the pattern xyz at the beginning of a string |
xyz$ | Used to match the pattern xyz at the end of a string |
[xyz] | Used to match a character selection |
[^xyz] | Used to match any single character that is not in the square brackets |
[A-Z0-9] | Used to match a single uppercase letter or digit |
{n} | Used to match exactly n repeats. Note that n must be a non-negative integer. |
{n,} | Used to match at least n repeats |
{,n} | Used to match not more than n repeats |
{m,n} | Used to match at least m but not more than n repeats |
\. | Used to match a literal period. A backslash before any metacharacter matches it literally |
\s | Used to match whitespace character such as space, newline, tab, etc |
\S | Used to match a non-whitespace |
\w | Used to match a word character, i.e. a letter, a digit, or an underscore |
\W | Used to match a non-word character |
\d | Used to specifically match a digit i.e. [0-9] |
\D | Used to match a non-digit |
\b | Used to match a word boundary |
() | Used to group part of a regular expression and capture the text it matches |
[^\W\d_] | Used to match letters alone |
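Several of the notations in the table can be combined to pull dates, prices, and phone numbers out of messy text. The record below and its formats (an ISO-style date, a dollar price, a seven-digit phone number) are assumptions made for this sketch, and re.findall() and re.search() are explained later in this tutorial.

```python
import re

# Hypothetical record; the date, price, and phone formats are assumptions.
record = "Order 4821 placed on 2021-03-15 for $49.99; call 555-0134 with questions."

# {n} repeats: four digits, a hyphen, two digits, a hyphen, two digits.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", record)
print(dates)  # ['2021-03-15']

# \$ is a literal dollar sign and \. is a literal period.
prices = re.findall(r"\$\d+\.\d{2}", record)
print(prices)  # ['$49.99']

# Parentheses capture only the part of the match inside them.
phone_suffix = re.findall(r"\b\d{3}-(\d{4})\b", record)
print(phone_suffix)  # ['0134']

# ^ anchors a pattern to the start of the string, $ to the end.
starts_with_order = re.search(r"^Order", record) is not None
ends_with_period = re.search(r"\.$", record) is not None
print(starts_with_order, ends_with_period)  # True True

# [^\W\d_] keeps word characters that are neither digits nor underscores.
letters_only = re.findall(r"[^\W\d_]+", "abc_123 xyz")
print(letters_only)  # ['abc', 'xyz']
```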
Regular Expression Functions
The re module has several functions that serve different purposes. To get a rounded understanding of how to apply the module effectively, let’s discuss some of the most useful ones.
- re.split(pattern, string, maxsplit=0): This function splits a string wherever the defined pattern matches and returns the pieces as a list. Let’s see an example.
#import the regular expression library
import re
#splits the string 'Artificial Intelligence' by 'i'
text = re.split(r'i', 'Artificial Intelligence')
#prints the result
print(text)
Output:
['Art', 'f', 'c', 'al Intell', 'gence']
As seen in the result, ‘Artificial Intelligence’ was split by ‘i’. There is a third argument that can be defined when using the split method: maxsplit. It indicates the maximum number of splits to perform and is set to zero by default, which means there is no limit. In cases where the pattern appears more than once and you only need the first few pieces, it is good practice to define maxsplit. Let’s see an example with maxsplit=2.
#import the regular expression library
import re
#splits the string 'Artificial Intelligence' by 'i', at most twice
text = re.split(r'i', 'Artificial Intelligence', maxsplit=2)
#prints the result
print(text)
Output:
['Art', 'f', 'cial Intelligence']
As seen, the text was not split again after the second ‘i’.
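Splitting really pays off when the separator is a pattern and not a fixed character. The sketch below uses a made-up string whose items are separated inconsistently; the separator pattern is just one possible choice.

```python
import re

# Hypothetical input with inconsistent separators.
messy = "apples, oranges;bananas   grapes"

# Split on a comma or semicolon plus optional spaces, or on a run of whitespace.
separator = r"[,;]\s*|\s+"
items = re.split(separator, messy)
print(items)  # ['apples', 'oranges', 'bananas', 'grapes']

# With maxsplit=2 the remainder of the string is left untouched.
first_two = re.split(separator, messy, maxsplit=2)
print(first_two)  # ['apples', 'oranges', 'bananas   grapes']
```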
- re.match(pattern, string): This method checks for a match in a string. It matches only if the defined pattern occurs at the beginning of the string. Trying to match ‘Artificial’ in ‘Artificial Intelligence’ will succeed. Let’s see an example.
#import the regular expression library
import re
#checks if there is a match at the start of the string
text = re.match(r'Artificial', 'Artificial Intelligence')
#prints the result
print(text)
Output:
<re.Match object; span=(0, 10), match='Artificial'>
The result indicates that there is a match spanning index 0 up to, but not including, index 10. If, however, we attempt to match ‘Intelligence’ in ‘Artificial Intelligence’, the program would return None, indicating that there is no match.
- re.search(pattern, string): This method works similarly to the match() method but does not restrict its search to the beginning of the string. It looks for the pattern anywhere in the string but returns only the first occurrence. Let’s see an example.
#import the regular expression library
import re
#checks whether there is a match anywhere in the string
text = re.search(r'Intelligence', 'Artificial Intelligence Intelligence')
#prints the result
print(text)
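Since a failed match gives back None, it is worth checking the result before using it. The sketch below shows both outcomes, along with the group() and span() methods of the match object.

```python
import re

phrase = "Artificial Intelligence"

hit = re.match(r"Artificial", phrase)
miss = re.match(r"Intelligence", phrase)

print(hit.group())  # Artificial
print(hit.span())   # (0, 10)
print(miss)         # None

# Calling miss.group() would raise an AttributeError, so test for None first.
if miss is None:
    print("no match at the start of the string")
```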
Output:
<re.Match object; span=(11, 23), match='Intelligence'>
The result shows that the match starts at index 11 and ends just before index 23. Observe that even though the word appears a second time, the search() method does not pick it up.
- re.findall(pattern, string): This method returns every non-overlapping match of the pattern as a list. Unlike match(), it is not restricted to the beginning of the string, and unlike search(), it does not stop at the first occurrence. This makes findall() a common choice when you need all the matches at once. Let’s see an example where the findall() method is used.
#import the regular expression library
import re
#finds every occurrence of 'Intelligence' in the string
text = re.findall(r'Intelligence', 'Artificial Intelligence Intelligence')
#prints the result
print(text)
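The span reported by search() can be used directly to slice the original string, and the same pattern shows how search() and match() differ. A minimal sketch:

```python
import re

phrase = "Artificial Intelligence Intelligence"

found = re.search(r"Intelligence", phrase)
print(found.start(), found.end())  # 11 23

# Slicing with the span gives back the matched text.
sliced = phrase[found.start():found.end()]
print(sliced)  # Intelligence

# match() only looks at the start of the string, so the same pattern fails there.
at_start = re.match(r"Intelligence", phrase)
print(at_start)  # None
```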
Output:
['Intelligence', 'Intelligence']
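findall() becomes more useful once the pattern is more than a fixed word. The sketch below pulls out the text inside the <abrev></abrev> tags mentioned at the start of this tutorial. The HTML snippet is made up for illustration, and the non-greedy “.*?” and re.finditer() are additions that are not covered above. For real-world HTML, a dedicated parser is usually the safer choice.

```python
import re

# Hypothetical markup for illustration.
html = "<p><abrev>NLP</abrev> and <abrev>AI</abrev> are related fields.</p>"

# (.*?) is a non-greedy group: it stops at the first closing tag it meets.
# When the pattern has one group, findall() returns only the group's text.
abbreviations = re.findall(r"<abrev>(.*?)</abrev>", html)
print(abbreviations)  # ['NLP', 'AI']

# finditer() yields match objects, so the positions are available as well.
positions = [m.span(1) for m in re.finditer(r"<abrev>(.*?)</abrev>", html)]
first_start, first_end = positions[0]
print(html[first_start:first_end])  # NLP
```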
- re.sub(pattern, repl, string): This method is used to find and replace a pattern with a new string. Let’s take an example.
#import the regular expression library
import re
#replaces the word 'Artificial' with 'Emotional'
text = re.sub(r'Artificial', 'Emotional', 'Artificial Intelligence')
#prints the result
print(text)
Output:
Emotional Intelligence
In cases where the pattern is not found, the returned string remains the same.
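Here are a few more things sub() can do, sketched with made-up strings: collapsing whitespace, limiting the number of replacements with the count argument, reusing captured groups in the replacement, and passing a function as the replacement to capitalize the first letter of a sentence.

```python
import re

# Hypothetical input with irregular spacing.
raw = "regular   expressions are    handy"

# Collapse any run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", raw)
print(tidy)  # regular expressions are handy

# count limits how many replacements are made.
once = re.sub(r"Intelligence", "Learning", "Intelligence Intelligence", count=1)
print(once)  # Learning Intelligence

# \1 and \2 in the replacement refer back to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Intelligence Artificial")
print(swapped)  # Artificial Intelligence

# The replacement can also be a function that receives the match object.
capitalized = re.sub(r"^[a-z]", lambda m: m.group().upper(), tidy)
print(capitalized)  # Regular expressions are handy
```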
Tokenizing Sentences with NLTK’s RegexpTokenizer
In earlier tutorials, we used nltk.word_tokenize() to carry out tokenization on a piece of text. Regular expressions can be used for tokenization as well. This is done using the RegexpTokenizer class or the regexp_tokenize() helper function. This approach gives you more control over how the text is tokenized. Let’s take some examples.
#import the RegexpTokenizer class
from nltk.tokenize import RegexpTokenizer
#instantiate the tokenizer class with the regular expression rule as an argument
tokenizer = RegexpTokenizer(r"[\w']+")
#define a text
text = "I won't stop learning about Artificial Intelligence"
#tokenize the text and print the result
print(tokenizer.tokenize(text))
Output:
['I', "won't", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']
We can go ahead and do more interesting things with the RegexpTokenizer class. Say, for instance, we want to extract the domain name of an email address. All that changes in the code is the regular expression pattern.
#import the RegexpTokenizer class
from nltk.tokenize import RegexpTokenizer
#instantiate the tokenizer class with the regular expression pattern as an argument
#the period is escaped so that it matches a literal "."
tokenizer = RegexpTokenizer(r"@\w+\.\w+")
#define an email ('user' is a placeholder local part; only the domain matters here)
email = 'user@h2kinfosys.com'
#tokenize the text and print the result
print(tokenizer.tokenize(email))
Output:
['@h2kinfosys.com']
If you do not wish to instantiate the RegexpTokenizer class, there is also a helper function, regexp_tokenize(), that can be used directly. It takes two compulsory parameters: the text to be tokenized and the pattern to work with. Let’s see this example.
#import the regexp_tokenize function
from nltk.tokenize import regexp_tokenize
#define a text
text = "I won't stop learning about Artificial Intelligence"
#tokenize the text
tokenized_text = regexp_tokenize(text, r"[\w']+")
#print the result
print(tokenized_text)
Output:
['I', "won't", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']
As seen, the result is the same as in the earlier example, with shorter code this time.
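When the pattern describes the tokens themselves, as in the examples above, the tokenizer's output should coincide with what re.findall() returns for the same pattern. The following sketch uses only the standard library and assumes the tokenizer's default settings; the email address is a hypothetical one.

```python
import re

text = "I won't stop learning about Artificial Intelligence"

# Every maximal run of word characters or apostrophes becomes one token.
tokens = re.findall(r"[\w']+", text)
print(tokens)
# ['I', "won't", 'stop', 'learning', 'about', 'Artificial', 'Intelligence']

# Hypothetical address; the escaped period matches a literal ".".
domain = re.findall(r"@\w+\.\w+", "someone@example.com")
print(domain)  # ['@example.com']
```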