To predict whether tweets are correlated with a given state’s happiness ranking, the raw text of each tweet needed to be analyzed to determine the general opinion of its content. We did this by leveraging existing algorithms in Python 3 to determine each Twitter user’s attitude, and the strength of that sentiment, in a given tweet.
Approach
Using the Python 3 package VADER, a lexicon of positive and negative words was created. The frequency of these words was then counted in each tweet and recorded. As described in Cleaning and Preparing the Twitter Data, each tweet was scored on four measures (Positive, Negative, Neutral, and Compound) and then binned into five categories: Very Negative, Little Negative, Neutral, Little Positive, and Very Positive. For more details, refer to the Binning the Twitter Data section.
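As a rough illustration of the binning step, the sketch below maps a VADER compound score (which ranges from -1 to 1) to one of the five categories. The cut-off values and the name `bin_compound` are assumptions made for illustration only; the thresholds actually used are the ones given in the Binning the Twitter Data section.

```python
# Hypothetical cut-offs; the report's real thresholds are defined elsewhere.
VERY_CUTOFF = 0.5      # assumed boundary between "Little" and "Very"
NEUTRAL_CUTOFF = 0.05  # assumed half-width of the Neutral band


def bin_compound(compound: float) -> str:
    """Map a compound score in [-1, 1] to one of five sentiment bins."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound <= -VERY_CUTOFF:
        return "Very Negative"
    if compound <= -NEUTRAL_CUTOFF:
        return "Little Negative"
    if compound < NEUTRAL_CUTOFF:
        return "Neutral"
    if compound < VERY_CUTOFF:
        return "Little Positive"
    return "Very Positive"


# In practice the compound score would come from VADER, e.g.
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"].
# The scores below are made up for the example.
for score in (-0.8, -0.2, 0.0, 0.3, 0.9):
    print(score, bin_compound(score))
```

Keeping the cut-offs as named constants makes it easy to swap in the thresholds the analysis actually used.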
Sentiment Exploratory Analysis
To confirm that the sentiment was correctly assigned by the Python 3 package VADER, a random sample of 50 tweets was gathered and hand-labeled for sentiment. Additional details on this analysis are outlined in the table below.
Based on the sample of tweets from the Twitter data, we found that only 8%, or 4 tweets, received a sentiment assignment very far from the hand-assigned one. In all 4 of these discrepancies, the tweet was hand-labeled ‘Negative’ but labeled ‘Very Positive’ by the VADER algorithm, suggesting that VADER may have a slight bias toward the positive. This could potentially be explained by the difficulty of detecting sarcasm, which remains a problem both for hand assignment and for algorithmic classification. Given the small percentage of discrepancies in the sample, it was determined that the VADER algorithm determined the sentiment of the raw tweet text sufficiently well.
Sentiment Prediction Question – Can the sentiment label of a given tweet be predicted from its count of friends, count of followers, count of favorites, count of retweets, and result type?
Hypothesis Description: Researchers looked to determine if applying Random Forests, KNN, SVM, and Decision Trees could identify patterns in the Twitter data.
Null Hypothesis: The average compound scores by new (predicted) label are equal.
Alternative Hypothesis: The average compound scores by new (predicted) label are not equal.
Purpose: If the null hypothesis is rejected, researchers can conclude that the predictive models can identify a pattern in {followers_count, friends_count, fav_count, retweet_count} to predict sentiment labels.
If random forests, KNN, and decision trees can identify patterns in the attributes {followers_count, friends_count, fav_count, retweet_count}, the average compound scores for the new labels will differ noticeably and will track the average compound scores of the original labels. However, if the null hypothesis is true and a pattern cannot be identified, then the predictions will be wrong, and a new label will have an equal chance of being any label. For example, if the original label of a tweet is {Very Negative} and the predictive model cannot identify a pattern, then its new label is equally likely to be any of {Very Negative, Little Negative, Neutral, Little Positive, Very Positive}. As a result, the average compound scores of the new labels will be the same, because the model’s inability to identify a pattern effectively permutes the new labels at random.
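The reasoning above can be made concrete with a small sketch of the one-way ANOVA F statistic, computed over compound scores grouped by predicted label. The function name `one_way_anova_f` and the toy scores are illustrative assumptions, not the report's data or code; a library routine such as `scipy.stats.f_oneway` would additionally return the p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    groups = [list(g) for g in groups]
    k = len(groups)                    # number of groups (labels)
    n = sum(len(g) for g in groups)    # total number of observations
    if k < 2 or any(len(g) == 0 for g in groups) or n <= k:
        raise ValueError("need at least two non-empty groups and n > k")

    group_means = [mean(g) for g in groups]
    grand_mean = mean([x for g in groups for x in g])

    # Between-group sum of squares: how far group means sit from the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of scores around their own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    if ssw == 0:
        raise ValueError("within-group variance is zero; F is undefined")

    return (ssb / (k - 1)) / (ssw / (n - k))


# Made-up compound scores grouped by predicted label.
scores_by_new_label = [
    [1, 2, 3],   # e.g. tweets predicted as one label
    [2, 3, 4],   # a second predicted label
    [3, 4, 5],   # a third predicted label
]
print(one_way_anova_f(scores_by_new_label))
```

A small F value means the group means are close relative to the spread within groups, which is what random label assignment would tend to produce.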
Hypothesis Tests: (Note: ROC curves are not included because this is a multiclass classification)
1. Random Forest: A random forest is a machine learning technique that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
Cross Validation Score: 0.26
One-Way ANOVA Test Description and Results:
- It tests the null hypothesis that two or more groups have the same population mean.
- Computed F-Value = 0.79
- P-value = 0.543
Conclusion: With a p-value of 0.543, there is not enough evidence to reject the null hypothesis. Given a p-value this large, it is very unlikely that a random forest can find a pattern in the data to predict sentiment.
2. Decision Tree: The decision tree method uses a flowchart-like tree structure to predict and classify data.
Cross Validation Score: 0.248
One-Way ANOVA Test Results:
– Computed F Value = 1.05
– P-Value = 0.37
Conclusion: With a p-value of 0.37, there is not enough evidence to reject the null hypothesis. Given a p-value this large, it is unlikely that a decision tree can find a pattern in the data to predict sentiment.
3. KNN: K-nearest neighbors (KNN) predicts the label of a data point by taking a majority vote over the labels of its k closest points. KNN is a lazy learner and does not build a model explicitly.
Cross-Validation Score: 0.272
One-Way ANOVA Test Results:
– Computed F Value = 2.57
– P-Value = 0.0356
Conclusion: With a p-value less than 0.05, there is sufficient evidence to reject the null hypothesis: the average compound scores of the new labels are not the same. However, does this mean KNN can identify a pattern in the data? Looking into the confusion matrix, we noted that the accuracy is lower than that of the decision tree’s predictions and that most of the predictions are {neutral}. This is probably the reason why there is a lot of variance in the data. As such, it cannot be concluded that KNN can find a pattern in the data.
4. Support Vector Machine (Linear SVC): An SVM predicts by finding a hyperplane that best separates the data points by label. It works very well with high-dimensional data.
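The majority-vote rule that KNN relies on can be sketched in a few lines. This is a minimal illustration, not the classifier used in the analysis: the function name `knn_predict`, the choice of Euclidean distance on unscaled features, and the toy tweets are all assumptions.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    if len(train_points) != len(train_labels):
        raise ValueError("points and labels must have the same length")
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")

    # Rank training points by Euclidean distance to the query.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )[:k]

    # Count the labels of the k nearest points; on a tie, the label that
    # appears first among the nearest neighbours wins.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up rows of (followers_count, friends_count, fav_count, retweet_count).
train_points = [
    (10, 20, 0, 0),
    (12, 18, 1, 0),
    (500, 300, 5, 2),
    (520, 310, 4, 3),
    (11, 25, 0, 1),
]
train_labels = ["Neutral", "Neutral", "Very Positive", "Very Positive", "Little Negative"]

print(knn_predict(train_points, train_labels, (15, 22, 0, 0), k=3))
```

Because there is no training step, all the work happens at prediction time, which is what makes KNN a lazy learner. When one label dominates the neighbourhood of most points, the vote returns that label almost everywhere.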
Cross Validation Score: 0.3
A One-Way ANOVA cannot be conducted on these samples because there are no tweets in {Little Negative} after the prediction. However, the confusion matrix shows that SVM is not a very good predictive model for the data: it predicts neutral sentiment for a very high percentage (nearly 100%) of the Twitter data, which is incorrect.
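The kind of confusion-matrix check described here can be sketched as follows. The helper names `confusion_matrix` and `predicted_share` and the toy labels are assumptions for illustration; they are not the report's data, and a library implementation (for example in scikit-learn) would normally be used instead.

```python
from collections import Counter

LABELS = ["Very Negative", "Little Negative", "Neutral", "Little Positive", "Very Positive"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Return counts with rows = actual label and columns = predicted label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def predicted_share(predicted, label):
    """Fraction of predictions that equal `label`."""
    return sum(1 for p in predicted if p == label) / len(predicted)


# Made-up example: two tweets per true label, almost all predicted Neutral.
actual = LABELS * 2
predicted = ["Neutral"] * 9 + ["Very Positive"]

matrix = confusion_matrix(actual, predicted)
for label, row in zip(LABELS, matrix):
    print(f"{label:>15}: {row}")
print("share predicted Neutral:", predicted_share(predicted, "Neutral"))
```

In this toy output the {Little Negative} column is all zeros, so no tweets fall in that predicted group, and a single column holds nearly every prediction. That is the pattern the section describes for the SVM.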