Tokenization in NLP Tutorial

Tokenization is the process of splitting a chunk of text, such as a phrase or sentence, into smaller units called tokens. The tokens can be individual words or terms. Tokenization is a pivotal step in extracting information from textual data.

To build NLP-driven systems, such as sentiment analysis, chatbots, language translation, or voice assistants, patterns need to be learned from text, such as a conversation. Tokens are used to learn those patterns. They are also used for other NLP operations such as stemming and lemmatization. Do not be perturbed if those terms are unfamiliar. We shall treat stemming and lemmatization in detail in a later tutorial. Suffice it to say here that stemming and lemmatization are fundamental steps for cleaning textual data in NLP.

Tokenization operations are performed using the tokenize module of the NLTK library. This module has functions for various tasks, including word_tokenize() and sent_tokenize(). We shall take a look at each of them in this tutorial.

Tokenization of Words

The word_tokenize function splits a corpus into individual words and punctuation marks. The list of tokens can be converted into a dataframe to allow for further data cleaning before it is fed into a machine learning algorithm for model building.

Since machine learning algorithms require numeric data to learn and make predictions, it becomes critical to apply a vectorizer, such as scikit-learn's TfidfVectorizer or CountVectorizer, to the tokens. This converts the string tokens into a matrix of numbers. You may want to read about vectorizers to get a better understanding.
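To make the idea of turning tokens into a matrix of numbers concrete, here is a minimal sketch of count-based vectorization using only the Python standard library. It is not how any particular vectorizer is implemented; the function name count_matrix and the two sample documents are made up for illustration.

```python
from collections import Counter

def count_matrix(tokenized_docs):
    """Build a vocabulary and a document-by-term count matrix.

    Hypothetical helper for illustration only; library vectorizers
    do considerably more (normalization, sparse storage, weighting).
    """
    # Vocabulary: every distinct token, sorted so columns are stable.
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # One row per document, one column per vocabulary entry.
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Two tiny, already tokenized documents (assumed sample data).
docs = [["i", "love", "nlp"], ["i", "love", "love", "tutorials"]]
vocab, matrix = count_matrix(docs)
print(vocab)   # ['i', 'love', 'nlp', 'tutorials']
print(matrix)  # [[1, 1, 1, 0], [1, 2, 0, 1]]
```

Each row is one document and each column is one token from the vocabulary, so the second row records that "love" appears twice in the second document.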

Let’s see a coding example

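# Requires NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)
from nltk.tokenize import word_tokenize
text = "I love artificial intelligence. So I am reading this tutorial. I love it!"
# Split the text into word and punctuation tokens
print(word_tokenize(text))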

Output: ['I', 'love', 'artificial', 'intelligence', '.', 'So', 'I', 'am', 'reading', 'this', 'tutorial', '.', 'I', 'love', 'it', '!']

Explaining each line of code

We started by importing the word_tokenize function from the tokenize module of NLTK. Next, a variable holding the textual data was defined. Applying the function with the variable as its argument split the sentences into words and punctuation marks, as can be seen in the output.
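Because the token list contains punctuation marks as well as words, one common cleaning step is to drop the punctuation and lowercase what remains. The following is a minimal sketch of that step, assuming the token list is exactly the output printed above; the variable names are chosen for illustration.

```python
import string

# Token list copied from the word_tokenize output above.
tokens = ['I', 'love', 'artificial', 'intelligence', '.', 'So', 'I', 'am',
          'reading', 'this', 'tutorial', '.', 'I', 'love', 'it', '!']

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark.
    return all(ch in string.punctuation for ch in token)

# Keep only the word tokens and lowercase them.
words = [tok.lower() for tok in tokens if not is_punctuation(tok)]
print(words)
```

Whether to lowercase or remove punctuation depends on the task; for example, sentiment analysis may benefit from keeping exclamation marks.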

Tokenization of Sentences 

The sent_tokenize function is used to split a corpus into sentences. This comes in handy when you want to calculate, say, the average number of words in a sentence. You would need both word_tokenize and sent_tokenize for this computation.

Let’s take a code example

from nltk.tokenize import sent_tokenize
corpus = "I love artificial intelligence. So I am reading this tutorial. I love it!"
# Split the corpus into a list of sentences
print(sent_tokenize(corpus))

Output: ['I love artificial intelligence.', 'So I am reading this tutorial.', 'I love it!']

Explaining each line of code

Here also, we imported the required function, sent_tokenize, and passed the corpus variable as an argument. From the output, we see that the code splits the text into three sentences.
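The average number of words per sentence mentioned earlier can be sketched by combining a sentence tokenizer with a word tokenizer. To keep the sketch runnable without any downloads, it uses two naive regex-based stand-ins, simple_sent_tokenize and simple_word_tokenize. These are hypothetical helpers, not NLTK functions, and they are far less robust than NLTK's tokenizers.

```python
import re

def simple_sent_tokenize(text):
    # Naive stand-in: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # Naive stand-in: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def average_words_per_sentence(text,
                               sent_tok=simple_sent_tokenize,
                               word_tok=simple_word_tokenize):
    """Average count of word tokens per sentence (punctuation excluded)."""
    sentences = sent_tok(text)
    if not sentences:
        return 0.0
    total_words = 0
    for sentence in sentences:
        # Count only tokens containing a letter or digit, so that
        # punctuation tokens such as '.' and '!' are ignored.
        total_words += sum(
            1 for tok in word_tok(sentence) if any(ch.isalnum() for ch in tok)
        )
    return total_words / len(sentences)

corpus = "I love artificial intelligence. So I am reading this tutorial. I love it!"
average = average_words_per_sentence(corpus)
print(round(average, 2))  # 4.33
```

To use NLTK instead, pass its functions in place of the stand-ins: average_words_per_sentence(corpus, sent_tokenize, word_tokenize).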

Armed with this information, you have a robust understanding of how tokenization works and what it is used for. In the next tutorial, you will learn about stemming and lemmatization.
