Tokenization is the process of splitting a chunk of text, a phrase, or a sentence into smaller units called tokens. These smaller units can be individual words or terms. Tokenization is a pivotal step for extracting information from textual data.
To build NLP-driven systems, such as sentiment analysis, chatbots, language translation, or voice assistants, patterns need to be learned from conversations. Tokens are the units from which these patterns are learned from a chunk of text. They are also used for other NLP operations such as stemming and lemmatization. Do not be perturbed if those terms are unfamiliar; we shall treat stemming and lemmatization in detail in a later tutorial. Suffice it to say here that stemming and lemmatization are fundamental steps for cleaning textual data in NLP.
Tokenization operations are performed using the tokenize module of the NLTK library. This module provides functions for various tasks, including word_tokenize() and sent_tokenize(). We shall take a look at each of them in this tutorial.
Tokenization of Words
The word_tokenize function is used to split a corpus into individual words. The resulting list of words can be converted into a dataframe to allow for further data cleaning before it is fed into a machine learning algorithm for model building.
Since machine learning algorithms require numeric data to learn from and make predictions, it is critical to apply a vectorizer, such as scikit-learn's TfidfVectorizer or CountVectorizer, to the tokens. This converts the tokens from strings into a matrix of numbers; a short sketch of both the dataframe and the vectorizer steps follows the word_tokenize example below. You may want to read about vectorizers to get a better understanding.
Let’s see a coding example
from nltk.tokenize import word_tokenize

# If the tokenizer data is missing, run once: import nltk; nltk.download('punkt')
text = "I love artificial intelligence. So I am reading this tutorial. I love it!"
print(word_tokenize(text))

Output:

['I', 'love', 'artificial', 'intelligence', '.', 'So', 'I', 'am', 'reading', 'this', 'tutorial', '.', 'I', 'love', 'it', '!']
Explaining each line of code
We started by importing the word_tokenize function from NLTK's tokenize module. Next, a variable holding the textual data was defined. Applying the function with this variable as its argument split the text into words and punctuation marks, as can be seen in the output.
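As mentioned earlier, the token list can be loaded into a dataframe and then vectorized. Below is a minimal sketch of both steps; it assumes that pandas and scikit-learn are installed, and the exact vectorizer settings (including the get_feature_names_out method) may differ slightly between scikit-learn versions.

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

text = "I love artificial intelligence. So I am reading this tutorial. I love it!"
tokens = word_tokenize(text)

# Token list -> one-column dataframe, ready for cleaning steps
# such as lower-casing or dropping punctuation.
df = pd.DataFrame({"token": tokens})
print(df.head())

# CountVectorizer converts text into a matrix of token counts; here it
# reuses word_tokenize so its vocabulary matches the tokens shown above.
vectorizer = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)
matrix = vectorizer.fit_transform([text])
print(vectorizer.get_feature_names_out())
print(matrix.toarray())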
Tokenization of Sentences
The sent_tokenize function is used to split a corpus into sentences. This can come in handy when, say, you want to calculate the average number of words in a sentence. You would need both word_tokenize and sent_tokenize for this computation; a sketch of it appears at the end of this section.
Let’s take a code example
from nltk.tokenize import sent_tokenize

corpus = "I love artificial intelligence. So I am reading this tutorial. I love it!"
print(sent_tokenize(corpus))

Output:

['I love artificial intelligence.', 'So I am reading this tutorial.', 'I love it!']
Explaining each line of code
Here again, we imported the required function, sent_tokenize, and passed the corpus variable as its argument. From the output, we see that the code splits the text into three sentences.
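To round things off, here is a small sketch of the computation mentioned earlier: combining sent_tokenize and word_tokenize to estimate the average number of words per sentence. Note that punctuation marks are counted as tokens here.

from nltk.tokenize import sent_tokenize, word_tokenize

corpus = "I love artificial intelligence. So I am reading this tutorial. I love it!"

sentences = sent_tokenize(corpus)                                  # split into sentences
tokens_per_sentence = [len(word_tokenize(s)) for s in sentences]   # count tokens in each

average = sum(tokens_per_sentence) / len(sentences)
print(average)  # total tokens divided by the number of sentences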
Armed with this information, you now have a solid understanding of how tokenization works and what it is used for. In the next tutorial, you will learn about stemming and lemmatization.