Country prediction using Word Embedding
Natural Language Processing (NLP) is a branch of AI that helps machines understand and interpret human language, bridging the gap between human and machine language.
We use the concept of analogies between words to predict a country, given the name of a capital city.
Word Embedding:
Machine learning and deep learning algorithms generally deal with numeric data, so text must first be converted into numbers. The Bag of Words technique extracts numeric features from text by counting how many times each word appears, a process also known as vectorization. It converts a document into features by building a matrix of word occurrences across the corpus, so each document is represented as a word-count vector whose length equals the size of the vocabulary. However, this model produces very sparse matrices and fails to capture meaningful relationships between words.
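To make the word-count idea concrete, here is a minimal Bag of Words sketch using scikit-learn's CountVectorizer; the library and the toy documents are only illustrative and are not part of the model built later in this post.

from sklearn.feature_extraction.text import CountVectorizer

# Two tiny documents to illustrate the word-count representation
docs = ["Cairo is the capital of Egypt", "Athens is the capital of Greece"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the corpus vocabulary
print(counts.toarray())                    # one word-count vector per document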
Word embedding is a technique that solves both of these problems. Each word in a language is represented as a real-valued vector in a lower-dimensional space, such that semantically similar words are placed close to each other.
Creating such meaningful vector spaces gives algorithms the opportunity to identify patterns and detect analogies in the given task.
There are many dimensionality reduction techniques that can capture the important information in a high-dimensional space and project it onto a smaller-dimensional space. PCA is one of the most common dimensionality reduction techniques used to create word embeddings. PCA takes the bag-of-words vectors as input, finds the most correlated features, and combines them so that maximum information is retained when projecting onto the smaller space. Semantically similar items end up close to each other in this vector space and can be retrieved with a nearest-neighbor search.
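As an illustration, PCA from scikit-learn can project 300-dimensional word vectors down to 2 dimensions for plotting. This is only a sketch: it assumes word_embeddings is the dictionary of word vectors built in the loading step below, the word list is arbitrary, and scikit-learn is not otherwise used in this post.

import numpy as np
from sklearn.decomposition import PCA

# Stack a few 300-dimensional word vectors into a matrix (words chosen only for illustration)
words = ['Athens', 'Greece', 'Cairo', 'Egypt']
X = np.array([word_embeddings[w] for w in words])

# Project onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

for word, (x, y) in zip(words, X_2d):
    print(word, x, y)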
Word2vec is an algorithm by Google that trains word embeddings based on the distributional hypothesis, so that semantically similar words are mapped to geometrically close embedding vectors. gensim.models provides the KeyedVectors class to directly load word vectors pre-trained with the Word2vec model. You can download the dataset from here
import pickle
import nltk
from gensim.models import KeyedVectors

# Load the pre-trained 300-dimensional Word2vec vectors trained on Google News
embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

# Collect the set of words (capitals and countries) we actually need
f = open('capitals.txt', 'r').read()
set_words = set(nltk.word_tokenize(f))

# get_word_embeddings keeps only the embeddings for those words (a minimal version of the helper)
def get_word_embeddings(embeddings):
    return {word: embeddings[word] for word in set_words if word in embeddings}

word_embeddings = get_word_embeddings(embeddings)
print(len(word_embeddings))

# Save the reduced dictionary so the full Word2vec file is not needed again
pickle.dump(word_embeddings, open("word_embeddings_subset.p", "wb"))
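In a later session the small pickled subset can be loaded back directly, instead of reloading the full Word2vec file; this is just the standard pickle round-trip.

import pickle

# Reload the saved subset of word vectors
word_embeddings = pickle.load(open("word_embeddings_subset.p", "rb"))
print(len(word_embeddings))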
Finding the similarity between vectors in the model:
Euclidean distance:
Euclidean distance measures the similarity between two vectors as the length of the line segment connecting their end points. The more similar two words are, the closer their Euclidean distance is to 0.
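A quick sketch with NumPy; the two words here are only an example pair from the embedding dictionary built above.

import numpy as np

def euclidean_distance(u, v):
    # Length of the line segment between the tips of u and v
    return np.linalg.norm(u - v)

# Example with two of the loaded word vectors
print(euclidean_distance(word_embeddings['Athens'], word_embeddings['Greece']))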
Cosine similarity:
Euclidean distance can be misleading when documents of different sizes are compared. When working with text data, the magnitude of the vectors should not dominate the comparison, since there is a high chance of comparing documents of uneven length. Another drawback of Euclidean distance is that it does not work well in high-dimensional spaces. Cosine similarity corrects for this: it is the cosine of the angle between the two vectors and quantifies how similar two documents are. If we think of the direction of a vector as its meaning, cosine similarity captures semantic similarity better, and the angle between vectors is more immune to external factors such as raw word counts.
import numpy as np

def cos_similarity(u, v):
    # cos(θ) = (u · v) / (‖u‖ ‖v‖)
    dot = np.dot(u, v)
    det = np.linalg.norm(u) * np.linalg.norm(v)
    cos = dot / det
    return cos
Using basic trigonometric facts:
cos(0°) = 1 (if the angle is 0°, the vectors lie on the same line and point in the same direction, so they are highly similar)
cos(90°) = 0 (if the angle is 90°, the vectors are orthogonal, so they are not similar)
cos(180°) = −1 (if the angle is 180°, the vectors point in opposite directions, so they are entirely dissimilar)
So when the angle θ between the documents is between 0° and 90° (0 <= cos(θ) <= 1), the documents are similar; otherwise they are dissimilar.
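A quick sanity check of the function on simple vectors; the printed values follow directly from the definition above.

import numpy as np

# Parallel vectors: angle 0°, cosine similarity 1.0
print(cos_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))

# Orthogonal vectors: angle 90°, cosine similarity 0.0
print(cos_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))

# Opposite vectors: angle 180°, cosine similarity -1.0
print(cos_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))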
Finding the country of each capital:
Given the word embedding dictionary, an example Capital → Country pair and a new capital city, the function returns the most likely country holding the same relationship, along with its similarity score.
We use the equation King − man + woman ≈ queen, one of the most famous examples of Word2vec arithmetic, which reflects the hidden algebraic structure of the word vectors. To get the country of a capital city, we use a similar equation, Country2 = Country1 − Capital1 + Capital2, and implement it using the word embeddings and the similarity function.
def get_country(city1, country1, city2, embeddings):
    # Input words should not be returned as the answer
    group = set((city1, country1, city2))
    # Country2 ≈ Country1 - Capital1 + Capital2
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]
    similarity = -1
    country = ''
    for word in embeddings.keys():
        if word not in group:
            word_emb = embeddings[word]
            cur_similarity = cos_similarity(vec, word_emb)
            if cur_similarity > similarity:
                similarity = cur_similarity
                country = (word, similarity)
    return country

get_country('Athens', 'Greece', 'Cairo', word_embeddings)
The above function call returns:
('Egypt', 0.7626821)
An accuracy of 0.92 is achieved in predicting the countries using this model!
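For reference, here is a minimal sketch of how such an accuracy could be computed. It assumes each line of capitals.txt holds four space-separated tokens, city1 country1 city2 country2; that file format is an assumption for illustration, not something shown in this post.

def get_accuracy(word_embeddings, filename='capitals.txt'):
    correct, total = 0, 0
    for line in open(filename):
        parts = line.split()
        # Assumed format: city1 country1 city2 country2; skip anything else
        if len(parts) != 4 or not all(w in word_embeddings for w in parts):
            continue
        city1, country1, city2, country2 = parts
        predicted, _ = get_country(city1, country1, city2, word_embeddings)
        correct += (predicted == country2)
        total += 1
    return correct / total

print(get_accuracy(word_embeddings))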
Note: This blog is based on the NLP specialization at https://www.deeplearning.ai/