How cool would it be if you could gauge a movie's positive and negative reception from YouTube comments, or from any social media? Sentiment analysis is used to infer a person's opinion about something from text. Businesses use it to understand customer opinions, movie reviews, and product ratings on Amazon and Flipkart. Twitter uses it to check whether anyone is commenting abusively about leaders or celebrities.
When you comment on this project, I can analyse your mood about it using sentiment analysis.
We only have the YouTube comments; we don't have any labels for the algorithm to learn from. So we will use the VADER sentiment library to find the sentiment of each comment, and use its output as the target for the dataset. Then we will train our own model to classify the sentiments. We can save that model and use it later to find the sentiment of any sentence. For now, we will see how the VADER sentiment library works and how to use a machine learning model to predict sentiment.
I will be attaching a dataset of public opinion about the Trump vs Clinton presidential debate. Let's dive into the coding. I have attached the code and dataset below.
I will explain the code section by section.
Assigning the target:
1. As I said, we only have raw data without any targets for the model to learn from. In this section we will get the sentiment of each comment using the VADER sentiment library.
2. Initially we had 306 samples. After running VADER we are left with 238, because some comments were neutral; those (306 − 238 = 68) neutral comments are dropped. This is how the dataset looks after assigning the target, before cleaning the text.
3. The dataset contains many symbols, unwanted formatting, numbers, etc. We need to clean the text before we can feed it to the model. The first step is to remove all symbols, punctuation, and numbers from the dataset; we will use a regex to keep only alphabetic characters.
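The regex step needs nothing beyond the standard library; here is a small sketch on a made-up comment:

```python
import re

raw = "Trump's 2016 debate!!! #winning <b>100%</b> :-)"

# replace every non-alphabetic character with a space,
# then collapse repeated whitespace and lower-case the result
cleaned = re.sub(r"[^a-zA-Z]", " ", raw)
cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
print(cleaned)  # -> "trump s debate winning b b"
```

Notice the leftover fragments like "s" and "b": they come from the apostrophe and the HTML tags, and the stop-word step below is where such noise gets filtered out.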
4. We will then remove common, uninformative words like 'the', 'duh', 'oh', 'a', 'he', 'she', etc. using stop words. We can also add our own words to the stop-word list.
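A tiny sketch of the filtering; the stop-word set below is a hand-picked illustration, whereas in practice you would start from a full list such as `nltk.corpus.stopwords.words("english")` and extend it with your own words:

```python
# illustrative stop-word set, not a complete list
stop_words = {"the", "a", "an", "he", "she", "is", "was", "oh", "duh"}

tokens = "oh the debate was a mess".split()
# keep only the words that carry meaning
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # -> ['debate', 'mess']
```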
5. Then we will reduce each word in the text to its root form using a word lemmatizer. E.g., the root word of 'swimming' is 'swim', 'ran' becomes 'run', etc.
6. After text preprocessing, we will have a cleaned dataset that is ready for model building. We will create a new dataframe for the cleaned text and copy the sentiment labels over from the previous dataframe. This is how it looks after text cleaning,
7. Let's look at how the words are distributed by finding the most frequently occurring words.
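One simple way to get the frequency counts, using only the standard library (the cleaned comments below are made up for illustration):

```python
from collections import Counter

cleaned_comments = [
    "trump won debate",
    "clinton won debate",
    "debate was great",
]

# count word frequencies across all cleaned comments
word_counts = Counter(
    word for comment in cleaned_comments for word in comment.split()
)
print(word_counts.most_common(2))  # -> [('debate', 3), ('won', 2)]
```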
8. You can use any classification algorithm, but I will be using the Naïve Bayes algorithm.
9. Algorithms can't take text as input, so we have to convert the words to numbers. The model takes vectors of numbers as input, therefore we need to convert each document into a fixed-length vector of numbers.
10. We can do this with the bag-of-words model. It assigns each word a unique index, so each row is zero everywhere except at the positions of the words that occur in that sentence, where it holds the number of times the word occurred. The bag-of-words model is based purely on the frequency of each word.
11. We will split the dataset into a training set and a testing set. The algorithm learns from the training set, and then we evaluate it on the test set. A confusion matrix is used to evaluate the model: the diagonal elements show the numbers of true positives and true negatives.
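Putting the split, the Naïve Bayes model, and the confusion matrix together looks roughly like this, assuming scikit-learn; the eight tiny labelled comments are a stand-in for the real cleaned dataset and its VADER-derived targets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# toy stand-in for the cleaned comments and their sentiment labels
texts = ["great debate", "awful debate", "loved it", "hated it",
         "great performance", "awful performance",
         "loved the speech", "hated the speech"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(texts)

# hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

# rows are true labels, columns are predictions;
# the diagonal holds the correctly classified samples
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
```

`random_state` just makes the split reproducible; on a dataset this small the accuracy will vary wildly with the split, which is exactly the limitation noted at the end of the post.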
12. That's it; given a sentence, we can now classify it as positive or negative.
13. The dataset is really very small, and the accuracy is also low. But you can use a larger dataset and try other classification algorithms to get a more robust model. Just try it and comment your output.