Word embedding is a form of word representation in machine learning that lets words with similar meanings have similar representations. It works by mapping words to real-valued vectors of a pre-defined dimension, using deep learning, dimensionality reduction, or a probabilistic model built on the word co-occurrence matrix. Each word is mapped to a corresponding vector, and the values of that vector are learned, typically by a neural network. Several word embedding models exist: examples include word2vec, developed at Google, GloVe, developed at Stanford, and FastText, developed at Facebook.
In the vector space, the word distribution is based on the usage of the words in the training dataset. For example, words like ‘King’ and ‘Queen’ are closely related and are more likely to appear close to each other in a text than words like ‘King’ and ‘cosmology’. Hence, ‘King’ and ‘Queen’ would be closer together in the vector space than ‘King’ and ‘cosmology’.
The use of word embeddings has turned out to be one of the major breakthroughs in the performance of deep learning models on NLP problems. It is a clear improvement over bag-of-words encoding techniques such as word counts and word frequencies in a document. It is therefore important to understand what word embedding is and how to build word embedding models.
By the end of this tutorial, here’s what you will learn:
- Where can word embedding be applied?
- The various word embedding algorithms (embedding layer, GloVe, and word2vec).
- How to use word embeddings.
- What word2vec does.
- The Architecture of word2vec (CBOW and skip-gram).
- The Bag of Words methods (CountVectorizer and TfidfVectorizer).
- How word2vec relates to NLTK.
- Knowing when to use word2vec.
- What activators (activation functions) are.
- The Gensim Python Library.
- Developing a Word Embedding model using Gensim.
Let’s jump into it.
Where can Word Embedding be applied?
Word embeddings can be used for many natural language processing tasks, including text classification, feature generation, and document clustering. Let’s list some of the prominent applications.
- Grouping related words: This is perhaps the most obvious application. Word embeddings allow words that have similar characteristics to be grouped together, while dissimilar words are spread far apart in the vector space.
- Finding similar words: Because the words are vectorized such that similar words end up close together, word embeddings can be used to predict the words most similar to a given word. In the same vein, they can be used to find dissimilar words and the words that appear most often in a document.
- Text classification features: When building a text-based classifier, or any machine learning model for that matter, the algorithm cannot work with raw strings of text; the text must first be converted to numbers. Word embedding maps each string to a vector, and these vectors can then be used as training data for the model to make predictions. In addition, word embeddings capture semantics, which is useful in text-based classification.
- Document clustering: Word embeddings can be used to cluster documents, since they can distill frequently used words (keywords) in a text as well as similar and dissimilar words. This is a widely used application.
In general, the word embedding technique shines in most feature extraction processes such as in POS tagging, text-based sentiment analysis, and of course, syntactic analysis.
As earlier mentioned, there are various word embedding models developed by researchers in the field. Let’s look at some of them.
Word Embedding Models
Word embeddings can be learned either jointly with a neural network model on some task, or through an unsupervised process such as analysing document statistics. In this tutorial, we will look at three approaches for learning word embeddings from textual data.
- Embedding layer: An embedding layer is a word embedding technique where the embedding is learned together with a neural network model on a particular NLP task, such as document classification. When using an embedding layer, the textual data has to be cleaned and preprocessed such that each word is represented by a one-hot encoding.
The vectors are given a fixed number of dimensions, chosen up front; common choices are 20, 50, 100, or 200 dimensions. The embedding layer usually sits at the front end of the neural network, so the embedding is learned in a supervised way using the backpropagation algorithm.
The learning process of the embedding layer requires a lot of training data and can therefore be extremely slow, but this approach learns an embedding targeted both to the textual data and to the NLP task.
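To make the idea concrete, here is a minimal sketch (not production code, and not tied to any particular framework) of what an embedding layer boils down to: a lookup into a weight matrix whose rows are the word vectors, with the values of the matrix learned by backpropagation. The vocabulary size, dimension, and variable names below are made up for illustration; in a real model you would typically use a ready-made layer such as Keras’s Embedding.

import numpy as np

vocab_size = 10      # hypothetical vocabulary of 10 words
embedding_dim = 4    # each word will be represented by a 4-dimensional vector

# the embedding layer is just a trainable weight matrix of shape (vocab_size, embedding_dim)
embedding_matrix = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# a one-hot encoded word is simply an index into the vocabulary
word_index = 3
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# looking up the embedding is equivalent to multiplying the one-hot vector by the matrix
vector = one_hot @ embedding_matrix   # same as embedding_matrix[word_index]
print(vector)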
- GloVe: The GloVe (short for Global Vectors) algorithm is an unsupervised machine learning algorithm for learning word vectors efficiently. It was created by Pennington et al. at Stanford in 2014 and has since gained wide popularity.
Before then, matrix factorization techniques such as Latent Semantic Analysis (LSA) were used to create vector space representations of the words in a corpus. LSA relied on corpus-wide text statistics; it did a good job of capturing global statistics, but not as good a job as local context-based learning methods such as word2vec.
GloVe combines the global text statistics of approaches such as LSA with local context-based learning as in word2vec, producing much better word embeddings.
- Word2vec: Word2vec is a statistical approach for learning a word embedding for each word in a text corpus. It was developed in 2013 by Tomas Mikolov et al. at Google in a bid to make neural network-based training on textual data more efficient, and it has since become a benchmark for developing pre-trained word embeddings.
The work by Mikolov et al. showed that simple vector arithmetic captures relationships between words in the vector space. For instance, the analogy below holds true.
King – Man + Woman = Queen
In other words, if we remove the masculinity of a king and add femininity, we should get a queen. This is a way of expressing the comparison ‘man is to king as woman is to queen’.
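If you load a set of pre-trained vectors with the Gensim library (covered later in this tutorial), this analogy can be reproduced with the most_similar method. This is only a sketch: it assumes internet access, and it downloads a small pre-trained GloVe model on the first run.

import gensim.downloader as api

# load a small set of pre-trained GloVe vectors (downloads on first use)
vectors = api.load('glove-wiki-gigaword-50')

# king - man + woman = ?
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected to return something close to 'queen'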
Using Word Embeddings
When you want to use word embeddings in your Natural Language Processing project, there are different ways to go about it.
- When learning an embedding
You can decide to learn the word embedding for your project. To do this, however, you will need a large amount of data to capture each word and learn its semantic relationship with other words. If you want to train the word embedding yourself, be sure to have a corpus with millions or even billions of words for an accurate result. When creating your word embedding, the learning can either be standalone or joint.
- Standalone learning: This is when your model is trained to learn an embedding using a particular dataset. This embedding is then saved and reused as a part of a different model for another task. This would be a fantastic approach if you wish to use one embedding for different models.
- Joint learning: In this case, the embedding is learned as part of a larger model trained for the task at hand. If you only need the embedding for that one task, this is the approach to use.
- When reusing an embedding
Rather than training your embeddings from scratch, you can reuse pre-trained word embeddings produced by researchers in the field. These embeddings are made available to the public, mostly under a permissive license. This is a fast approach for your projects, whether academic or commercial. Word2vec and GloVe are typical examples of pre-trained word embeddings that are freely available to the public.
The remaining part of the tutorial will be focused on the word2vec model. Let’s start by discussing in more detail what word2vec does.
What Word2vec Does
First, understand that neural networks and machine learning algorithms cannot take raw text as input; they only understand numeric data. The textual data therefore needs to be converted to numbers before it can be fed into the neural network. Word2vec provides a way of performing this text-to-vector transformation.
As mentioned earlier, word2vec converts words into a vector space representation. This representation is built such that similar words are placed close to each other while dissimilar words are far apart. Technically, word2vec exploits the semantic relationships between words to build the vector representation.
Word2vec also uses the linguistic context of words in a sentence. By context, we mean the words that surround a particular word in a sentence. When communicating as humans, we use context to understand what the other party is saying.
For instance, if you read the statement “The man was dozing at work”, you might quickly conclude that he must be a lazy man to be dozing at work. But add some context, so that it reads: “The man stayed up all night to finish his presentation slide. When he got to work the following morning, he was dozing at work”. The extra sentence provides context that completely changes our perspective of the man: now you won’t see him as lazy, but as a human who needs some rest. That’s how powerful context is.
Word2vec Architecture
The word2vec model can be developed using two different learning methods for the word embedding:
- Continuous Bag of Words (CBOW) method
- Continuous skip-gram model
The continuous bag of words model learns the embedding by predicting the current word based on its surrounding context. The continuous skip-gram model, on the other hand, learns by predicting the surrounding context words based solely on the current word.
CBOW is faster than continuous skip-gram and represents common words better, whereas skip-gram works well with smaller amounts of data and is better at representing rare words or phrases.
In both methods, however, the learning is based on the usage of surrounding words. This gives better word embeddings at a low memory and computational cost, which makes it possible to learn embeddings from very large texts, to the tune of billions of words.
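In Gensim, the library we use later in this tutorial, switching between the two architectures is just a constructor flag. A minimal sketch with a made-up toy corpus:

from gensim.models import Word2Vec

# toy data, purely for illustration
sentences = [['king', 'queen', 'royal'], ['man', 'woman', 'child']]

cbow_model = Word2Vec(sentences, sg=0, min_count=1)       # sg=0 selects CBOW (the default)
skipgram_model = Word2Vec(sentences, sg=1, min_count=1)   # sg=1 selects skip-gram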
Why Word2vec?
Word2vec gives a much better vector representation of words because it takes into account the relationships with surrounding words. To appreciate why word2vec is important, let us discuss what was in place before it.
Before word2vec, an approach called bag of words (BOW) was used to represent text as numbers. Bag of words creates a numeric representation of each word by building a matrix whose dimension is equal to the size of the vocabulary. There are two common ways to carry out BOW in Python: CountVectorizer and TfidfVectorizer.
Using CountVectorizer
Here, the matrix is populated by counting how many times each word in the vocabulary appears in the document; a word that is not present simply gets a count of zero.
Say we have a corpus with two sentences.
“Mike is a good boy. The boy Mike loves to be a boy”.
Let’s apply the count vectorizer technique on these sentences with Python.
#import the count vectorizer class
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the class
vectorizer = CountVectorizer()
#define the corpus
corpus = ['Mike is a good boy. The boy Mike loves to be a boy']
#fit the vectorizer on the corpus
vectorizer.fit(corpus)
#transform the fitted corpus
transformed_corpus = vectorizer.transform(corpus)
#print the transformed data in matrix form
print(transformed_corpus.toarray())
Output:
[[1 3 1 1 1 2 1 1]]
To get the feature names for each count, we can convert the array to a dataframe using the pandas library, with the lines of code below.
#import the pandas library
import pandas as pd
#define the column names (in scikit-learn 1.0+, use get_feature_names_out() instead)
columns=vectorizer.get_feature_names()
#convert the array to a dataframe
pd.DataFrame(transformed_corpus.toarray(), columns=columns)
Output:

   be  boy  good  is  loves  mike  the  to
0   1    3     1   1      1     2    1   1
As seen, the CountVectorizer simply counts the number of times each word occurs in the corpus, and these counts are what get fed into the machine learning algorithm as features. But there is a big problem with this approach in general: the semantics of the words are not taken into consideration. In other words, every word is represented only by its presence and count, irrespective of its importance, whereas in reality some words carry more weight in a sentence than others. ‘Good’ in this sentence is an important word; changing or removing it changes the message in no small way.
The second approach, TfidfVectorizer, tweaks how the vector is populated a little. Let’s see how it does that.
Using TfidfVectorizer
Term frequency-inverse document frequency, or TF-IDF for short, counts the frequency of a word in a sentence and weighs it against how many sentences in the document contain that word. Mathematically, it can be calculated using the formulas below.
Tf = (number of times a particular word appears in a sentence) / (number of words in the sentence)

Idf = log((number of sentences) / (number of sentences containing the particular word))
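As a quick worked example of these formulas (using a base-10 logarithm and a toy corpus that is not the one used below), take the two sentences ‘the cat sat’ and ‘the dog barked’:

- Tf(‘cat’ in sentence 1) = 1/3 ≈ 0.33, and Idf(‘cat’) = log(2/1) ≈ 0.30, so its Tf-Idf weight is about 0.33 × 0.30 ≈ 0.10.
- Tf(‘the’ in sentence 1) = 1/3, but Idf(‘the’) = log(2/2) = 0, so its Tf-Idf weight is 0: a word that appears in every sentence carries no discriminating information.

Note that scikit-learn’s TfidfVectorizer, used below, applies a smoothed version of the Idf formula and normalizes each output vector, so its numbers will not match a hand calculation exactly, although the intuition is the same.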
To compute this in Python, we import the TfidfVectorizer class from scikit-learn and instantiate it. Take a look at the code below.
#import the TF-IDF vectorizer class
from sklearn.feature_extraction.text import TfidfVectorizer
# instantiate the class
vectorizer = TfidfVectorizer()
#define the corpus
corpus = ['Mike is a good boy. The boy Mike loves to be a boy']
#fit the vectorizer on the corpus
vectorizer.fit(corpus)
#transform the fitted corpus
X = vectorizer.transform(corpus)
#print the transformed data in matrix form
print(X.toarray())
Output:
[[0.22941573 0.6882472 0.22941573 0.22941573 0.22941573 0.45883147 0.22941573 0.22941573]]
We can as well convert it into a pandas dataframe.
#import the pandas library
import pandas as pd
#define the column names
columns=vectorizer.get_feature_names()
#convert the array to a dataframe
pd.DataFrame(X.toarray(), columns=columns)
Output:

        be       boy      good        is     loves      mike       the        to
0  0.229416  0.688247  0.229416  0.229416  0.229416  0.458831  0.229416  0.229416
Notice that ‘boy’ gets the largest weight (0.688) because it appears three times, ‘mike’ (0.459) appears twice, and the words that appear only once each get 0.229. Because our corpus here contains a single document, the Idf part is the same for every word, so these weights simply scale with term frequency (and are then length-normalized by scikit-learn). Across a corpus with many documents, however, the TF-IDF vectorizer gives more importance to words that are rare in the collection.
Generally speaking, bag of words has a couple of shortcomings which demand the development of a better vector representation.
Problems associated with the Bag of Words method
- The semantic analysis of the sentence is not taken into consideration
- The context of the words is overlooked and we already saw how important context is.
- The word arrangement is discarded: the order of the words in the sentence does not matter in either bag of words technique. For example, the sentence “Red means stop” is represented exactly the same way as “Stop means red”, which of course changes the meaning entirely (see the short demonstration after this list).
- With bag of words, there is a higher chance of overfitting.
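Here is a quick demonstration of the word-order problem using scikit-learn’s CountVectorizer (the two-sentence corpus is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["red means stop", "stop means red"])

print(vectorizer.get_feature_names_out())  # ['means' 'red' 'stop']
print(X.toarray())                         # [[1 1 1]
                                           #  [1 1 1]] -- both sentences get the same vector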
Consequently, word embedding methods including word2vec were developed to tackle these challenges.
How Does Word2vec relate to NLTK?
NLTK, which stands for Natural Language Toolkit, is a popular Python library for preprocessing textual data. It helps with important tasks such as tokenization, POS tagging, stemming, lemmatization, removal of stop words, finding unique words, and so on. NLTK cleans the data so that the machine learning pipeline can prepare features from the words.
Word2vec, on the other hand, helps with the semantic and syntactic analysis of words. In other words, word2vec considers the surrounding words when learning the embedding, and it maintains the sequence/arrangement of the words in the text. Thanks to this, word2vec supports quite advanced tasks such as finding similar or dissimilar words and dimensionality reduction. In particular, word2vec converts high-dimensional representations of text into vectors of much lower dimension, and it lets you specify exactly the vector dimension you wish to work with.
How to Know When to Use NLTK or Word2vec
If you wish to do common tasks such as tokenization, POS tagging, or stop word removal, NLTK does that very quickly and should be used. Whereas if you are going to build models for slightly more advanced tasks, like document similarity, predicting a word from its context, or predicting the topic of a text, then word2vec is one of your best bets.
Let’s touch on how neural networks produce output using activation functions.
Activators and Word2vec
The activation function of a neural network defines the output of the network given a set of inputs. The whole idea of building neural networks is inspired by the workings of the human brain: the brain activates different neurons (output) based on the stimuli it receives from its surroundings (input).
Artificial neural networks mimic this by building models that learn from data through connected neurons. Neurons connected together in different layers form a neural network.
In a typical diagram of a single neuron, x1, x2, …, xm are the inputs received, w1, w2, …, wm are the weights, and b is the bias. The neuron sums up the effect of the inputs, weights, and bias and passes the result through an activation function f to produce the output, i.e. output = f(w1·x1 + w2·x2 + … + wm·xm + b).
Why do we Need an Activation Function?
Most real-life systems have non-linear relationships. An activation function allows us to map an input to an output in a non-linear way; in other words, activation functions let you turn linear functions into non-linear ones, making it possible to build complex machine learning models. There are several activation functions to choose from, depending on the kind of model you are trying to create. Some examples include the ReLU activation function, the sigmoid activation function, the softmax activation function, and the leaky ReLU activation function.
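As a small illustration, not tied to any particular library, here is how a few of these activation functions can be written with NumPy:

import numpy as np

def relu(x):
    # returns x for positive inputs, 0 otherwise
    return np.maximum(0, x)

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # turns a vector of scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(relu(scores), sigmoid(scores), softmax(scores))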
How is the Activation Function Computed in Word2vec?
The softmax activation function, also called the normalized exponential function, is usually used as the output layer in word2vec and other word embedding models. In some cases, hierarchical softmax can be used as the output layer instead. Hierarchical softmax offers reduced complexity: it is computed in O(log2 V) time, whereas plain softmax takes O(V), where V is the vocabulary size.
This means that if we want to calculate the probability of predicting a word in a vocabulary of size 16, hierarchical softmax requires log2(16) = 4 computations, whereas softmax requires 16.
Is there any Alternative to Softmax?
Yes. When building word2vec models, negative sampling can be used as an alternative when training the model. Negative sampling is a simplification of noise contrastive estimation: rather than updating the weights for every word in the vocabulary, the model updates the weights for the observed (positive) word together with a small number of randomly sampled ‘negative’ words. This makes training considerably faster while still pushing the vectors of words that appear in similar contexts closer together.
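In Gensim’s Word2Vec implementation, which we use below, the choice between hierarchical softmax and negative sampling is controlled by the hs and negative parameters. A minimal sketch with a made-up toy corpus:

from gensim.models import Word2Vec

# toy data, purely for illustration
sentences = [['the', 'man', 'was', 'dozing'], ['the', 'man', 'stayed', 'up']]

# hs=0 together with negative > 0 means negative sampling is used;
# 'negative' is the number of noise words sampled per positive example
model = Word2Vec(sentences, hs=0, negative=5, min_count=1)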
Thus far, we’ve spoken extensively about the workings of word2vec. Let’s now learn how to implement word2vec in Python. As mentioned earlier, word2vec is an open-source model and can be accessed using the Gensim library.
What is Gensim?
Gensim, short for ‘Generate Similar’, is an open-source NLP library used to build powerful NLP applications. Gensim has been used and cited by many researchers and programmers in the field for myriad applications. It also supports popular open-source models such as fastText, word2vec, LSA, LDA, and many more. In this tutorial, we will focus on using the Word2Vec class in Gensim to build a model that can understand a chunk of text.
Developing Word2vec embedding using Gensim
To build a word2vec embedding, we need to import the Gensim library. The Word2Vec class is imported from the gensim.models package. Also, since we will be working with textual data, nltk and the regular expression module re are important libraries to import as well.
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
import re

#if the tokenizers or stop word lists are missing, you may first need to run
#nltk.download('punkt') and nltk.download('stopwords')
If you get a “Module Not Found Error” when importing gensim, it means the library is not installed on your machine. You can install the library using pip with the statement below.
pip install gensim
Then wait for pip to download and install the library and its requirements.
Next, we would need to define the training data for our word embedding. To build a robust system, the training data should be a large dataset with millions of words. But for the sake of this tutorial, let’s take an excerpt from the discussion on AI we had some time ago.
John McCarthy and Marvin Minsky defined AI in 1959 as the ability of any program or machine to carry out a task, such that a human would need to apply intelligence to do that same task. This followed a theory by Alan Turing in 1950, that determines whether a machine can be tagged artificially intelligent. The Turing Test, as it is called, says that if it becomes difficult to set a machine apart from a human, from its behavior, then the machine can be called intelligent.
That said, AI systems should possess attributes related to speech, reasoning, and vision. Artificial intelligence is changing the world at a very fast pace. AI is now the buzzword of the last decade and this. The biggest companies are funding a lot of research to come up with AI-driven solutions to world problems.
AI has penetrated virtually every field of work. In the automotive industry, cars are becoming self-driven. In production and manufacturing, robots are now used to complete tasks in a faster and more efficient manner. Automated image diagnosis as well as virtual nursing assistants has revolutionized the healthcare sector in no small way. It is machine learning that determines the best ads to pop up in your social media feeds, classify emails as spam or ham, suggest an appropriate reply to a message, translate languages in real-time, recommend the videos to watch next on YouTube or the kind of songs you’d most likely love when streaming online.
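For the preprocessing code below to run, the excerpt above needs to be stored in a Python string; the variable name text used here is simply the one the rest of the code assumes:

#store the excerpt above in a single string called 'text'
text = """John McCarthy and Marvin Minsky defined AI in 1959 as the ability of any program or machine
to carry out a task, such that a human would need to apply intelligence to do that same task.
... (paste the rest of the excerpt here) ...
the kind of songs you'd most likely love when streaming online."""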
The next step is to preprocess the data. We will remove extra characters and spaces and convert all the letters to lowercase with the help of regular expressions.
#remove bracketed reference numbers such as [1]
text = re.sub(r"\[[0-9]*\]", " ", text)
#remove the extra spaces between words
text = re.sub(r"\s+", " ", text)
#convert all letters to lowercase
text = text.lower()
If the above lines of code seem unclear to you, please visit our tutorial on regular expressions. Going forward, the Word2Vec() constructor takes the tokenized words as its argument, so we need to tokenize the text into sentences and then into words.
#tokenize the text to list of sentences
tokenized_sentence = nltk.sent_tokenize(text)
#tokenize the list of sentences to list of words
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]
#define the english stopwords
stop_words = stopwords.words('english')
#remove the stop words from the text
for i, _ in enumerate(tokenized_words):
    tokenized_words[i] = [word for word in tokenized_words[i] if word not in stop_words]
We can now invoke the Word2Vec constructor. The constructor has many parameters that tweak how the Word2Vec class behaves. While the default values can work just fine, let’s discuss some of the important parameters (a combined example follows the list).
- Size: The default is 100. This defines the number of dimensions of the embedding, i.e. the length of the vector that represents each word. (In Gensim 4.0 and later, this parameter is named vector_size.)
- Window: The default is 5. This defines the maximum distance between a given word and the words around it that are treated as its context.
- Min_count: The default is 5. This defines the minimum number of times a word must appear in the text. Any word with a count less than the set value will be ignored.
- Workers: This defines the number of threads to use during training.
- Sg: The default is 0, i.e. CBOW. This selects the training algorithm: 0 stands for CBOW and 1 stands for skip-gram.
- Hs: The default is 0, i.e. negative sampling. This selects how the output layer is trained: if it is 1, hierarchical softmax is used; if it is 0, negative sampling is used.
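Putting these parameters together, a fully spelled-out call might look like the sketch below. The values are purely illustrative, not recommendations, and note again that in Gensim 4.0+ the size parameter is called vector_size.

#illustrative only -- in Gensim 4.0+ pass vector_size=100 instead of size=100
model = Word2Vec(
    tokenized_words,   # the tokenized training data prepared above
    size=100,          # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore words appearing fewer than 5 times
    workers=4,         # number of worker threads
    sg=0,              # 0 = CBOW, 1 = skip-gram
    hs=0,              # 0 = negative sampling, 1 = hierarchical softmax
)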
For our small corpus, let’s now invoke the Word2Vec constructor with mostly default values.
#invoke Word2Vec with the tokenized words as argument
model = Word2Vec(tokenized_words, min_count=1)
The min_count was set to 1 because this is a small text and we want every word to count. After the model is trained, we can access the learned word vectors through the ‘wv’ attribute of Word2Vec. To see which words the model has learned, you can use the vocab attribute (in Gensim 4.0+, use model.wv.key_to_index instead).
#return the list of words learned
learned_words = list(model.wv.vocab)
#print the learned words
print(learned_words)
Output:
['john', 'mccarthy', 'marvin', 'minsky', 'defined', 'ai', '1959', 'ability', 'program', 'machine', 'carry', 'task', ',', 'human', 'would', 'need', 'apply', 'intelligence', '.', 'followed', 'theory', 'alan', 'turing', '1950', 'determines', 'whether', 'tagged', 'artificially', 'intelligent', 'test', 'called', 'says', 'becomes', 'difficult', 'set', 'apart', 'behavior', 'said', 'systems', 'possess', 'attributes', 'related', 'speech', 'reasoning', 'vision', 'artificial', 'changing', 'world', 'fast', 'pace', 'buzzword', 'last', 'decade', 'biggest', 'companies', 'funding', 'lot', 'research', 'come', 'ai-driven', 'solutions', 'problems', 'penetrated', 'virtually', 'every', 'field', 'work', 'automotive', 'industry', 'cars', 'becoming', 'self-driven', 'production', 'manufacturing', 'robots', 'used', 'complete', 'tasks', 'faster', 'efficient', 'manner', 'automated', 'image', 'diagnosis', 'well', 'virtual', 'nursing', 'assistants', 'revolutionized', 'healthcare', 'sector', 'small', 'way', 'learning', 'best', 'ads', 'pop', 'social', 'media', 'feeds', 'classify', 'emails', 'spam', 'ham', 'suggest', 'appropriate', 'reply', 'message', 'translate', 'languages', 'real-time', 'recommend', 'videos', 'watch', 'next', 'youtube', 'kind', 'songs', '’', 'likely', 'love', 'streaming', 'online']
All these learned words are now represented as vectors of 100 dimensions (the default size). If you don’t trust me, you can check the vector for one of the words using the following statement.
#check the vector representation for the word 'ai'
model.wv['ai']
Output:
array([-0.00231438, -0.00061445, -0.00383337, 0.00136599, -0.00473457, -0.00431065, 0.0039705 , 0.00036382, 0.00066529, -0.0002722 , -0.00121445, -0.00286695, 0.00025976, 0.00331246, -0.00045034, -0.00157754, -0.00288696, 0.00442494, 0.00251541, 0.00361015, -0.00450731, -0.00366106, 0.00183947, 0.00244478, -0.00311585, 0.0036825 , -0.00423879, 0.00035439, 0.00214975, 0.00193875, 0.00283081, 0.0038534 , -0.00305016, 0.0034705 , -0.00400357, -0.00093232, 0.00301382, 0.00090915, -0.00425109, -0.00412315, 0.00242115, 0.00427755, 0.00020219, -0.00438496, -0.00386935, 0.00327547, 0.00172136, -0.00011458, 0.00449564, -0.00280074, -0.00228639, -0.00313017, 0.00193294, 0.0028305 , 0.00410372, -0.00141312, -0.00106672, 0.00418554, -0.00478295, -0.00181764, 0.00418185, 0.00355082, -0.00438244, -0.00466042, -0.00295377, 0.00182738, 0.0026409 , 0.00179612, 0.00189488, 0.00373415, -0.00030857, 0.00342671, -0.00010824, -0.00254777, -0.00200502, 0.00337977, 0.00232408, -0.00111729, -0.00082494, -0.0038951 , -0.00369731, 0.00263628, 0.00155697, -0.0045178 , 0.00296585, 0.00103964, -0.00102372, 0.00100404, -0.0026283 , -0.00085761, -0.00183032, 0.00269664, -0.00047447, 0.00021483, -0.00237412, -0.00298153, 0.00295124, 0.00255864, -0.0043104 , -0.00202522], dtype=float32)
If we want to check similar words, we can use the statement below.
#list the similar words to AI
model.wv.most_similar('ai')
Output:
[('turing', 0.20339138805866241), ...]
As we can see, words like ‘turing’, ‘automated’, ‘self-driven’, etc. are listed as similar words to AI. This is substantially correct even with a small dataset, and the accuracy of the results improves with a larger dataset.
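We can also query the similarity score between any two specific words directly; the score is the cosine similarity between their vectors. For example, using two words that appear in our small corpus:

#check how similar two particular words are
print(model.wv.similarity('ai', 'machine'))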
Finally, we can save the model using the save() method.
#save the model
model.save('model.bin')
We can load the model again with the load() function.
#load the model
new_model = Word2Vec.load('model.bin')
Visualizing the Word Embedding
We can visualize how the vectors for each word are placed using the matplotlib library. First, we need to reduce the 100-dimensional vectors to 2 dimensions using the PCA class from scikit-learn, then create a scatter plot of the 2-dimensional vectors.
#import the libraries needed for the plot
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

#retrieve the vectors from the model
vectors = model[model.wv.vocab]
#instantiate the PCA class with 2 dimensions
pca = PCA(n_components=2)
#fit PCA and reduce the vectors to 2 dimensions
result = pca.fit_transform(vectors)
#plot a scatter plot
plt.scatter(x=result[:, 0], y=result[:, 1])
#add annotations of the words to the data points
words = list(model.wv.vocab)
for i, word in enumerate(words):
    plt.annotate(word, size=5, xy=(result[i, 0], result[i, 1]))
plt.show()
Output:

(A scatter plot of the word vectors projected onto two principal components, with each point annotated with its word.)
We can see words like ‘production’ and ‘automated’ close to each other, and ‘message’ and ‘called’ as well. While it may be difficult to spot many clusters of similar words in the plot because of the small dataset, with bigger data the plot makes a lot more sense.
Conclusion
Let’s wrap up with some key learnings from this tutorial.
- We said that word embedding converts words into a vector representation that machine learning models can understand.
- Word2Vec is better than the conventional bag of words method since it considers the semantic and syntactic relationship between words.
- The word2vec architecture can be either skip-gram or continuous bag of words (CBOW).
- When NLTK is used in synchrony with word2vec, it can be a very powerful tool for NLP applications.
- Word2Vec can be used for complex NLP tasks such as text classification, feature generation, and document clustering.
- Gensim is a toolkit that allows us to access the word2vec model in Python.