You may have noticed that services like Netflix, Amazon Prime, and YouTube recommend videos similar to the ones you have watched earlier. How does this happen? It is the work of a machine learning based recommendation engine, which estimates how similar a video is to other things you like and serves up recommendations on that basis. In short, a recommendation engine is a system that predicts user preferences.
Some basic concepts you should know before jumping into the Word2vec model are as follows.
Collaborative filtering and content-based filtering are the two main approaches used by recommendation systems.
Collaborative filtering works on the principle of correlation: it considers the interests shared by two or more people. For example, if Alice likes items A, B, and C and Bob likes items B, C, and D, then there is a good chance that Alice will like D and Bob will like A.
As the figure above shows, the prediction of a movie for Dushyant is calculated as a weighted sum of the ratings given by other users, such as the rating Sachin gave to Casino. The weights come from the similarity between users, so to make a prediction we need the similarity between two users; users with a higher correlation tend to be more similar.
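The weighted-sum idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the ratings are invented, the user Priya and the movie Heat are hypothetical additions, and cosine similarity over co-rated movies is used as a simple stand-in for the correlation measure.

```python
import math

# Hypothetical ratings on a 1-5 scale, invented for illustration only.
ratings = {
    "Dushyant": {"Goodfellas": 5, "Heat": 4, "Toy Story": 1},
    "Sachin":   {"Goodfellas": 5, "Heat": 5, "Toy Story": 1, "Casino": 5},
    "Priya":    {"Goodfellas": 1, "Heat": 2, "Toy Story": 5, "Casino": 2},
}

def similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted sum of other users' ratings, normalised by the weights."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = similarity(user, other)
        num += weight * their_ratings[movie]
        den += weight
    return num / den if den else None

print(round(predict("Dushyant", "Casino"), 2))
```

Because Dushyant's ratings are much closer to Sachin's than to Priya's, Sachin's rating of Casino dominates the prediction.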
The main advantage of collaborative filtering is that it does not depend on a machine analyzing the content; that is, it does not require any understanding of the item itself.
For a new user, no history is available or the data is insufficient, so the accuracy of collaborative filtering suffers.
Content-based filtering works on the principle that a user who liked an item in the past will like similar items in the future; that is, it recommends products based on the user’s previous preferences. For example, if Alice and Bob like the movie Goodfellas, the system will recommend movies that fall under the same genre.
Content-based filtering methods are based on a description of the product and information about the user. They can also include opinion-based recommender systems. In some cases, users are allowed to leave text reviews or feedback on items. These user-generated texts are useful data for the recommender system because they are a potentially rich source of both the features/aspects of the item and the users’ evaluation of, or sentiment toward, the item.
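As a rough sketch of the genre example, the snippet below ranks movies by how much their genres overlap with a movie the user liked. The genre sets are illustrative assumptions, and Jaccard overlap is just one simple way to compare item descriptions.

```python
# Genre sets are illustrative assumptions, not taken from a dataset.
movies = {
    "Goodfellas": {"Crime", "Drama"},
    "Casino": {"Crime", "Drama"},
    "Heat": {"Action", "Crime", "Thriller"},
    "Toy Story": {"Animation", "Children", "Comedy"},
}

def jaccard(a, b):
    """Share of genres two movies have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def recommend(liked, n=2):
    """Return the n movies whose genres overlap most with the liked movie."""
    scores = {
        title: jaccard(movies[liked], genres)
        for title, genres in movies.items()
        if title != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("Goodfellas"))
```

Note that no other user's ratings are involved here; only the description of the items matters.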
What is Word2vec?
As we know, a machine can only learn from numerical data and cannot deal with raw text. Textual data must therefore be converted into vectors before it is given to a machine learning algorithm. The Word2vec model takes a word as input, converts it into a vector, and performs operations on it. It is a three-layer neural network with a single hidden layer that uses a linear activation function. It was introduced in 2013 and changed the way people look at natural language processing.
A Word2vec model can learn from corpora of millions or even billions of words to produce high-quality word vectors.
In simpler representations such as bag-of-words, the semantic information of the words is not stored. Even in the TF-IDF model we only give more importance to uncommon words. There is also a chance of overfitting the model. Overfitting is a scenario in which the model performs very well on your dataset but fails miserably when applied to any new dataset, because it has learned the noise and outliers in the training data.
In Word2vec, each word is represented as a vector of 32 or more dimensions instead of a single number, and the relations between different words are preserved.
Many different models had previously been introduced in natural language processing, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). However, researchers at Google found that they are not as effective as neural networks.
Let’s see some neural network models.
CBOW, the continuous bag-of-words model, is similar to the feed-forward Neural Net Language Model (NNLM), except that the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix), as shown in the figure below; thus, all words get projected into the same position (their vectors are averaged). CBOW tries to predict a word from its context. It is called a bag-of-words model because the order of the words in the history does not influence the projection. Unlike the standard bag-of-words model, however, it uses a continuous distributed representation of the context.
The second architecture, Skip-gram, is similar to CBOW, but instead of predicting the current word from its context, it tries to maximize the classification of a word based on another word in the same sentence. We will discuss this in the implementation section. Word2vec comes in these two variants: CBOW and Skip-gram.
Here, I have used the Skip-gram model.
Step 1 – Data Preparation
Let’s try to prepare data using following sentence.
“I am going to watch the season premiere”
In each step below, the input word is shown on the left of the bar and one of its context (output) words on the right. As the samples show, the context window covers up to two words on either side of the input word.
1. Input word “I” in “I am going to watch the season premiere”
So, the training samples for this input word are as follows:
- I | am
- I | going
2. Input word “am” in “I am going to watch the season premiere”
So, the training samples for this input word are as follows, appended to the previous ones:
- I | am
- I | going
- am | I
- am | going
- am | to
This continues until the last word of the sentence. In this way we can extract a large amount of training data from just one sentence, so you can imagine how much data we can get from 1 million sentences.
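The pair extraction described above can be sketched as a small function. The function name `skipgram_pairs` and the `window` parameter are my own choices for illustration; a window of 2 is assumed because it reproduces the samples listed above.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # first index inside the window
        hi = min(len(tokens), i + window + 1)   # one past the last index inside it
        for j in range(lo, hi):
            if j != i:                          # skip the input word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I am going to watch the season premiere".split()
pairs = skipgram_pairs(sentence, window=2)

# Show the first few training samples in the "input | output" form used above.
for center, context in pairs[:5]:
    print(f"{center} | {context}")
```

A wider window produces more pairs per sentence at the cost of pairing words that are less closely related.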
Step 2 – Model Building
The Skip-gram model consists of three layers: an input layer, one hidden layer, and an output layer. The hidden-layer neurons pass the weighted sum of the inputs on to the output layer, and the output neurons use softmax.
Now, suppose we have extracted the data points from a million sentences.
Suppose we have 20,000 unique words (V) in our data and we want to create a word vector of size 100 (N) for each word.
- V = 20,000
- N = 100
Each word in the input layer is a one-hot encoded vector, and the output layer gives, for every word in the vocabulary, the probability of it being a nearby word.
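A minimal NumPy sketch of this forward pass, using the V and N values above, might look like the following. The weights are random stand-ins for trained ones, and the names `W_in`, `W_out`, and `word_index` are assumptions made for illustration.

```python
import numpy as np

V, N = 20_000, 100                 # vocabulary size and vector size from the text
rng = np.random.default_rng(0)

# Randomly initialised weights stand in for trained ones (assumption).
W_in = rng.normal(scale=0.01, size=(V, N))    # input -> hidden, one row per word
W_out = rng.normal(scale=0.01, size=(N, V))   # hidden -> output

word_index = 42                    # hypothetical position of the input word
x = np.zeros(V)
x[word_index] = 1.0                # one-hot encoded input vector

hidden = x @ W_in                  # linear hidden layer: picks out row 42 of W_in
scores = hidden @ W_out            # one raw score per vocabulary word
scores -= scores.max()             # shift for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

print(hidden.shape, probs.shape)
```

Because the input is one-hot, the hidden layer simply selects one row of the input weight matrix, which is why the rows of that matrix can serve as the word vectors.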
Once the model is trained, we can easily extract the learned weight matrix W (of size V × N) and use it to look up the word vectors:
From the figure above we can see that fixed-size vectors are obtained. Similar words in this dataset have similar vectors, i.e. vectors pointing in the same direction. For example, the terms “King” and “Queen” would have similar vectors, as shown below.
In this way we have built our model.
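How similar two vectors are is commonly measured with cosine similarity, which compares their directions. The sketch below uses invented 3-dimensional vectors purely for illustration; real Word2vec vectors are learned and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, invented for illustration only.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.15]
apple = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen) > cosine_similarity(king, apple))
```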
Step 3 – Implementation
I know some of you may be wondering how we can apply this knowledge to recommend movies, since movie names are distinct items and not sentences. But what if we treat a sequence of related movie names as a sentence and predict the next movie?
The strategy is to represent each movie name as a vector and find similar movies using that vector; as I already mentioned, similar words have similar vectors. Just take the watching history of a customer as a sentence and the movie names as its words:
In this way we take each word from the sentence -> convert it into a vector -> predict the next movie.
I am going to use the Python programming language for the implementation. Let’s see….
Step 1 – Import python libraries
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
import random
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Step 2 – Get Data
We are using the MovieLens 20M Dataset curated by the MovieLens research team. It contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. For more details, you can visit the official website. You can download the dataset via this link.
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')
Step 3 – Data Merging
Data merging means combining two datasets so that their rows are aligned on common attributes or columns. Here, we will merge the movie and rating datasets to get the movie ID, user ID, and movie title in one data frame.
# merges on the column the two frames share (movieId)
df = pd.merge(df_movies, df_ratings)
So, here we have everything in a single frame. Note that each user ID identifies exactly one user.
Since we have sufficient data, we will drop all the rows with missing values.
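The merge and the removal of rows with missing values can be tried out on tiny stand-in frames. The values below are made up, and `dropna()` is one straightforward way to drop such rows; the original code for this step is not shown.

```python
import pandas as pd

# Tiny stand-ins for movies.csv and ratings.csv (values are made up).
df_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", None],
})
df_ratings = pd.DataFrame({
    "userId": [10, 10, 11, 11],
    "movieId": [1, 2, 2, 3],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Inner join on the shared movieId column: one row per rating.
merged = pd.merge(df_movies, df_ratings, on="movieId")

# Drop every row that has a missing value in any column.
df = merged.dropna()

print(len(merged), len(df))
```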
Step 4 – Data Pre-processing
First, we will convert the movie ID into a string and then check the number of unique users.
df['movieId'] = df['movieId'].astype(str)
users = df["userId"].unique().tolist()
len(users)
So, there are a total of 162,541 users in our dataset. For each of these users, we will extract their watching history. In other words, we will have 162,541 sequences of watched movies.
Step 5 – Data Splitting
To check the performance of our model, we have to split our data into a training set and a testing set. Here, I have taken 90% of the users for training and the remaining 10% for testing.
random.shuffle(users)

# extract 90% of the user IDs
users_train = [users[i] for i in range(round(0.9*len(users)))]

# split data into train and validation set
train_df = df[df['userId'].isin(users_train)]
validation_df = df[~df['userId'].isin(users_train)]
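Note that the split is made on user IDs rather than on individual rows, so each user's complete watch history ends up in exactly one of the two sets. The sketch below shows the same idea on hypothetical user IDs, with a seeded generator (an addition of mine) so that the split can be reproduced.

```python
import random

users = list(range(1, 21))        # hypothetical user IDs

rng = random.Random(14)           # seeded so the split is reproducible
shuffled = users[:]               # copy, so the original order is kept
rng.shuffle(shuffled)

cut = round(0.9 * len(shuffled))  # 90% of the users go to training
users_train = shuffled[:cut]
users_valid = shuffled[cut:]

print(len(users_train), len(users_valid))
```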
Step 6 – Strategy
As I mentioned earlier, I am going to use the watch history of each user to recommend movies to him/her. For that, we first create an empty list and then, for each user ID, append the list of movie IDs that user has watched; from this we can tell which movies each user watched and what type of movies they like.
# list to capture the watch history of the users
watch_train = []

# populate the list with the movie IDs watched by each user
for i in tqdm(users_train):
    temp = train_df[train_df["userId"] == i]["movieId"].tolist()
    watch_train.append(temp)
Note – It took me 3 hours to capture the watch history of every user. It may take longer on your PC, depending on its configuration.
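The loop filters the whole data frame once per user, which is a likely reason it is so slow. A possible alternative, sketched here on a made-up frame, is to group the rows by user in a single pass. The sequences then come out ordered by user ID rather than in shuffled order, which should not matter for training.

```python
import pandas as pd

# Made-up stand-in for train_df.
train_df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": ["10", "20", "10", "30", "40", "20"],
})

# One list of movie IDs per user, built in a single pass over the frame.
watch_train = train_df.groupby("userId")["movieId"].apply(list).tolist()

print(watch_train)
```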
Step 7 – Training the Model
To train the model, we are going to use the gensim module. This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming and Pythonic interfaces. I recommend reading the original documentation of the gensim module.
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)

model.build_vocab(watch_train, progress_per=200)

model.train(watch_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
Now, let’s print our model.
# gensim 3.x; in gensim 4.x use model.wv[model.wv.index_to_key] instead
X = model.wv[model.wv.vocab]
X.shape
Our model has a vocabulary of 31673 unique words, each with a vector of size 100. Next, we will extract the vectors of all the words in our vocabulary and store them in one place for easy access.
Step 8 – Result
Let’s create a dictionary of movie IDs and movie titles to easily map a movie’s ID to its title.
watch = train_df[["movieId", "title"]]

# remove duplicates
watch = watch.drop_duplicates(subset='movieId', keep="last")

# create movie ID and title dictionary
watch_dict = watch.groupby('movieId')['title'].apply(list).to_dict()
I have defined the function below. It takes a movie’s vector as input and returns the top 6 most similar movies:
def similar_watch(v, n=6):
    # extract the most similar movies for the input vector;
    # the first hit is dropped because it is the input movie itself
    ms = model.wv.similar_by_vector(v, topn=n+1)[1:]

    # extract name and similarity score of the similar movies
    new_ms = []
    for j in ms:
        pair = (watch_dict[j[0]][0], j[1])
        new_ms.append(pair)

    return new_ms
Let’s recommend now!
From the above, we can see that the user gives a movie ID as input, its name is printed, and our model returns the top 6 recommendations along with their similarity scores.
There is another method to do this.