Stemming and lemmatization are both text normalization techniques in Natural Language Processing. Text normalization transforms the words in a sentence into a standard form, making the text distribution more compact. In other words, it reduces the many surface forms of a word to a single representative form. For example, the words ‘smile’, ‘smiling’ and ‘smiled’ are three different words, but they have the same meaning; the variation is only due to context. A human can easily see this, but a machine cannot. In Natural Language Processing, it is therefore good practice to let your model interpret those words as one word, with one meaning.
Text normalization is a vital preprocessing step when dealing with textual data, and by extension, so are stemming and lemmatization. Let’s understand what these are.
What is Stemming?
Stemming is a text normalization technique that cuts down the affixes of words to extract their base form, also called the root word or stem. Stemming is a crude process, and the stem it produces may not have any grammatical meaning. In fact, some other NLP libraries, such as spaCy, do not include a stemmer at all.
There are various stemming programs used to carry out stemming. These programs are called stemmers or stemming algorithms. NLTK provides the Porter Stemmer, the Lancaster Stemmer, the Regular Expression Stemmer, and the Snowball Stemmer. The most common is the Porter stemming algorithm.
Porter Stemming Algorithm
The Porter Stemming Algorithm is arguably the most popular stemming algorithm in Natural Language Processing. In NLTK, it can be instantiated using the PorterStemmer class. The algorithm takes tokenized words as input and outputs their stems. Let’s take a simple code example using the PorterStemmer class.
#import the PorterStemmer class from the nltk.stem library
from nltk.stem import PorterStemmer

#instantiate the PorterStemmer class
stemmer = PorterStemmer()

#create a list of tokens
tokens = ['smiling', 'smile', 'smiled', 'smiles']

#create an empty list to take in the stemmed words
stemmed_words = []

#loop over each token in the list
for each_word in tokens:
    #stem each word in the list
    stemmed_word = stemmer.stem(each_word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)

#print the stemmed words list
print(stemmed_words)
Output:
['smile', 'smile', 'smile', 'smile']
As seen, all variations of the word have been stemmed to the root word, ‘smile’. As mentioned earlier, some words may not be stemmed into meaningful root words. If we attempt to stem the words ‘cry’, ‘crying’, ‘cries’ and ‘cried’, the stemmer outputs the word ‘cri’, which does not have any grammatical meaning.
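You can verify this quickly with the same PorterStemmer instance; a minimal sketch (assuming NLTK is installed, as in the example above):

#stem the variations of 'cry' with the PorterStemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(word) for word in ['cry', 'crying', 'cries', 'cried']])
#per the discussion above, every variation reduces to 'cri'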
Let’s take another example where we pass sentences as input.
#import the PorterStemmer class from the nltk.stem library
from nltk.stem import PorterStemmer
#import the word_tokenize function from the nltk library
from nltk import word_tokenize

#instantiate the PorterStemmer class
stemmer = PorterStemmer()

#define some sentences
sentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'

#tokenize the sentences into words
each_sentence = word_tokenize(sentences)

#create an empty list to take in the stemmed words
stemmed_words = []

#loop over each token in the list
for word in each_sentence:
    #stem each word in the list
    stemmed_word = stemmer.stem(word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)

#print the stemmed words list
print(stemmed_words)
Output:
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'I', 'will', 'not', 'stop', 'learn']
Again, some of the stemmed words do not have a dictionary meaning. Mind you, there are other stemming algorithms; the only tweak to the code is importing the new stemming algorithm and instantiating it in the same way (a short sketch follows the list below).
- Using the Lancaster stemmer, via the LancasterStemmer class, outputs
['nltk', 'is', 'a', 'very', 'interest', 'subject', '.', 'i', 'hav', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'i', 'wil', 'not', 'stop', 'learn']
- The regular expression stemmer takes a regular expression and removes any part of the word that matches the defined expression. Using the regular expression stemmer with the RegexpStemmer class, and defining ‘ing’ as the regular expression, outputs
['NLTK', 'is', 'a', 'very', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'website', '.', 'I', 'will', 'not', 'stop', 'learn']
- The Snowball stemmer allows for stemming in 15 other languages, including Arabic, French, German, Italian, Portuguese, Russian, Spanish and Swedish. When using the Snowball stemmer, the language has to be defined. Using the Snowball stemmer with the SnowballStemmer class and defining the language as ‘english’ outputs
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'i', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'websit', '.', 'i', 'will', 'not', 'stop', 'learn']
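As a rough sketch of how those outputs were produced (exact results can vary slightly with the NLTK version), swapping stemmers only changes the import and the instantiation line; the RegexpStemmer additionally needs the regular expression, and the SnowballStemmer needs the language:

#the same pipeline as before, only the stemmer object changes
from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer
from nltk import word_tokenize

sentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'
tokens = word_tokenize(sentences)

#Lancaster stemmer: no arguments needed
lancaster = LancasterStemmer()
print([lancaster.stem(word) for word in tokens])

#regular expression stemmer: removes any substring matching the pattern ('ing' here)
regexp = RegexpStemmer('ing')
print([regexp.stem(word) for word in tokens])

#Snowball stemmer: the language must be given
snowball = SnowballStemmer('english')
print([snowball.stem(word) for word in tokens])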
Let’s now take a look at lemmatization.
What is Lemmatization?
Lemmatization is similar to stemming in that it reduces a word to its base form; the difference is that in lemmatization the root word, called the ‘lemma’, is always a word with a dictionary meaning. Lemmatization performs a morphological analysis of words to remove inflectional endings and output base words that have a dictionary meaning. Lemmatization with the NLTK library is done using the WordNetLemmatizer class, with almost the same methodology as using PorterStemmer. Let’s take a simple coding example.
#import the WordNetLemmatizer class from the nltk.stem library
from nltk.stem import WordNetLemmatizer

#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

#create a list of tokens
tokens = ['crying', 'cry', 'cried']

#create an empty list to take in the lemmatized words
lemmatized_words = []

#loop over each token in the list
for each_word in tokens:
    #lemmatize each word in the list
    lemmatized_word = lemmatizer.lemmatize(each_word)
    #add the lemmatized word to the lemmatized words list
    lemmatized_words.append(lemmatized_word)

#print the lemmatized words list
print(lemmatized_words)
Output:
['cry', 'cry', 'cried']
The WordNetLemmatizer outputs meaningful words, cry and cried, as opposed to the cri that the PorterStemmer returned.
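Note that ‘cried’ was left unchanged because WordNetLemmatizer treats every word as a noun by default. If you pass the pos argument (for example 'v' for verb), ‘cried’ should also reduce to ‘cry’. A small sketch, assuming the WordNet corpus has been downloaded:

#lemmatize with an explicit part-of-speech tag
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
#default pos is 'n' (noun); pos='v' tells the lemmatizer to treat the word as a verb
print(lemmatizer.lemmatize('cried'))           #treated as a noun, stays 'cried'
print(lemmatizer.lemmatize('cried', pos='v'))  #treated as a verb, should give 'cry'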
Use Case Scenarios
Let’s say we have a text we want a machine learning model to understand. We need to preprocess the text using stemming or lemmatization. Let’s take a code example for each of them starting with stemming.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions,
conjunctions, and interjections. POS tagging in simple terms means allocating every word in a sentence to a part of speech.
NLTK has a method called pos_tag that performs POS tagging on a sentence. The methods apply supervised learning approaches
that utilize features such as context, the capitalization of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''

#instantiate the PorterStemmer class
stemmer = PorterStemmer()

#tokenize the text into a list of sentences
sentence = sent_tokenize(text)

#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the stemmer if the word is not a stopword
    words = [stemmer.stem(word) for word in tokenized_words
             if word not in set(stopwords.words('english'))]
    #replace the sentence with the stemmed words joined back together
    sentence[index] = ' '.join(words)

#print the preprocessed sentences
print(sentence)
Output:
Observe that words such as ‘of’, ‘in’, ‘the’, etc. are completely taken out. They are called stopwords. Stopwords do not add significant meaning to a sentence, so it is good practice to remove them.
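If you want to see which words NLTK treats as English stopwords (assuming the stopwords corpus has been downloaded), you can inspect the list directly:

#inspect NLTK's built-in list of English stopwords
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
print(len(english_stopwords))   #number of stopwords in the list
print(english_stopwords[:10])   #the first few entries, such as 'i', 'me', 'my', ...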
Now, let’s carry out lemmatization on the same text and see the result.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions,
conjunctions, and interjections. POS tagging in simple terms means allocating every word in a sentence to a part of speech.
NLTK has a method called pos_tag that performs POS tagging on a sentence. The methods apply supervised learning approaches
that utilize features such as context, the capitalization of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''

#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

#tokenize the text into a list of sentences
sentence = sent_tokenize(text)

#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the lemmatizer if the word is not a stopword
    words = [lemmatizer.lemmatize(word) for word in tokenized_words
             if word not in set(stopwords.words('english'))]
    #replace the sentence with the lemmatized words joined back together
    sentence[index] = ' '.join(words)

#print the preprocessed sentences
print(sentence)
Output:
So rounding off…
Stemming or Lemmatization: Which should you go for?
No doubt, lemmatization is better than stemming, but there are tradeoffs. Lemmatization relies on morphological analysis and dictionary lookups, which makes it more computationally intensive. If speed is what you require, you should consider stemming. Likewise, if you are building something like a sentiment analyzer or an email classifier, the base word is usually sufficient for your model, so stemming is a reasonable choice there as well.
If, however, your model will actively interact with humans, say a chatbot or a language translation system, lemmatization would be the better option.