Predictive Analysis – Sooeun’s blog

Prediction Question #1 – Can friends count, followers count, retweet count, and favorites count predict happiness by state?

Hypothesis Description: We looked to determine if applying Naive Bayes, K-nearest neighbors (KNN) and decision trees could identify patterns in the Twitter data.

Null Hypothesis: The counts of followers, friends, retweets, and favorites in each state are not significantly different from those in every other state.

Alternative Hypothesis: The counts of followers, friends, retweets, and favorites in each state are significantly different from those in other states.

Purpose: If the null hypothesis is rejected, meaning these counts differ by state, it suggests that the predictive models may be able to identify a pattern in {followers_count, friends_count, fav_count, retweet_count} to predict sentiment labels by state.

A one-way ANOVA test can check whether the counts of followers, friends, retweets, and favorites in each state are significantly different from those in other states.

The following is the one-way F-test result for each factor.

The count of friends differs significantly by state, as the p-value of the test on friends_count is very low.
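To make the F-test concrete, here is a minimal from-scratch sketch of the one-way ANOVA F statistic. The state codes and friends_count values are made up for illustration and are not the data used in this post; the function name one_way_anova_f is also just a placeholder.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)                              # number of groups (e.g. states)
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = fmean([x for g in groups for x in g])
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical friends_count samples for three states (illustration only)
friends_by_state = {
    "CA": [310, 420, 385, 510],
    "TX": [150, 210, 190, 230],
    "NY": [600, 580, 640, 700],
}
f_stat = one_way_anova_f(list(friends_by_state.values()))
print(f"F = {f_stat:.2f}")
```

A large F means the group means are far apart relative to the spread inside each group. Turning F into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom; in practice a library routine such as scipy.stats.f_oneway returns both the statistic and the p-value.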

Next, using the counts of followers, friends, retweets, and favorites as features, the researchers tested whether the compound sentiment score could be predicted with three methods: Naïve Bayes, K-nearest neighbors (KNN), and the Classification and Regression Tree (CART) algorithm.

Naïve Bayes is a family of algorithms that assume each feature is independent of every other feature, given the class.

In KNN, an object is classified by a majority vote of its neighbors: it is assigned to the class most common among its k nearest neighbors. KNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.

CART, a modern name for the decision tree method, creates a model that predicts the value of a target variable based on several input variables. Tree models whose target variable takes a discrete set of values are called classification trees, whereas those whose target variable takes continuous values are called regression trees.

Before running the predictions, the data was normalized for convenience, and each algorithm was then added to the comparison to make the results clearer.
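As a rough illustration of the normalization step and the KNN majority vote described above, here is a small pure-Python sketch. The feature rows, the "positive"/"negative" labels, and the choice of min-max scaling are assumptions made for the example; the post does not say which normalization was actually used.

```python
from collections import Counter
from math import dist

def min_max_normalize(rows):
    """Rescale each column of `rows` to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical rows: [followers_count, friends_count, fav_count, retweet_count]
features = [
    [1200, 300, 15, 4],
    [90, 150, 2, 0],
    [5000, 800, 40, 12],
    [60, 100, 1, 0],
    [3000, 650, 30, 9],
    [110, 90, 3, 1],
]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

scaled = min_max_normalize(features)
# Classify the last row using the other five rows as training data
prediction = knn_predict(scaled[:-1], labels[:-1], scaled[-1], k=3)
print(prediction)
```

Normalizing first matters for KNN in particular: without it, a column with a large range such as followers_count would dominate the distance calculation.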

Prediction Models

Comparing the three prediction models, we concluded that Naïve Bayes is the most accurate, as it has the highest cross-validation score.

Next, another ANOVA test was conducted on the average compound sentiment score of tweets to determine whether happiness scores differ by state, with the following result.
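For readers unfamiliar with cross-validation scores, the sketch below shows how one can be computed. It is only an illustration: the data is invented, the folds are simple contiguous slices (real workflows usually shuffle or stratify), and majority_baseline is a hypothetical stand-in for a real model such as Naïve Bayes, KNN, or CART.

```python
from collections import Counter

def k_fold_accuracy(xs, ys, fit_predict, k=3):
    """Mean accuracy of `fit_predict` over k contiguous folds."""
    n = len(xs)
    scores = []
    for fold in range(k):
        start, stop = fold * n // k, (fold + 1) * n // k
        # Hold out one fold for testing and train on the rest
        test_x, test_y = xs[start:stop], ys[start:stop]
        train_x, train_y = xs[:start] + xs[stop:], ys[:start] + ys[stop:]
        predicted = fit_predict(train_x, train_y, test_x)
        correct = sum(p == t for p, t in zip(predicted, test_y))
        scores.append(correct / len(test_y))
    return sum(scores) / k

def majority_baseline(train_x, train_y, test_x):
    """Hypothetical stand-in model: always predict the most common training label."""
    top_label = Counter(train_y).most_common(1)[0][0]
    return [top_label for _ in test_x]

# Made-up, already-normalized feature rows and sentiment labels
xs = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
ys = ["pos", "pos", "neg", "pos", "pos", "pos"]
score = k_fold_accuracy(xs, ys, majority_baseline, k=3)
print(f"mean accuracy = {score:.3f}")
```

Running the same function with each candidate model's fit-and-predict routine gives one score per model, and the model with the highest score is the one reported as most accurate.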

A p-value this small indicates strong evidence against the null hypothesis. There is a significant difference in happiness scores among the 50 states.

Next, we added state information to the features to improve the predictions. As before, we normalized the data first and then added each algorithm to the comparison to make the results clearer.
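The post does not say how the state was encoded before being passed to the models. One common option, shown here purely as an assumption, is one-hot encoding: each state becomes its own 0/1 column appended to the normalized counts. The state codes, feature values, and the helper name one_hot_states are all hypothetical.

```python
def one_hot_states(states):
    """Map each state code to a 0/1 indicator row, one column per distinct state."""
    categories = sorted(set(states))
    index = {state: i for i, state in enumerate(categories)}
    encoded = []
    for state in states:
        row = [0] * len(categories)
        row[index[state]] = 1
        encoded.append(row)
    return categories, encoded

# Made-up state codes and already-normalized count features
states = ["TX", "CA", "NY", "CA"]
counts = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.7], [0.3, 0.3]]

categories, indicators = one_hot_states(states)
# Append the state indicator columns to each row of count features
augmented = [c + ind for c, ind in zip(counts, indicators)]
print(categories)
print(augmented[0])
```

One-hot columns keep the models from treating states as ordered numbers, which would happen if each state were simply mapped to an integer.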

Prediction Models

Based on the prediction results, Naïve Bayes is again the most accurate. With the additional state information, we concluded that happiness levels differ by state.

Conclusion: With a p-value this small, there is significant evidence to reject the null hypothesis. Thus, we concluded that happiness scores differ significantly by state.

Prediction Question #2 – Can scores of Emotional and Physical Well-Being alone predict the overall happiness scores by state?

Hypothesis Description: If there is a linear relationship between Emotional and Physical Well-Being Scores and Overall Happiness Scores, linear regression can be used to build a prediction model. If the linear relationship is not significant or there is no linear relationship, then the slope of the linear model would be zero. Researchers tested whether the slope is equal to zero to evaluate the prediction.

Null Hypothesis: The slope of the linear equation, where Y is the Overall Happiness Score and X is the Emotional and Physical Well-Being Score, is equal to zero. There is no linear relationship between Emotional and Physical Well-Being Scores and Overall Happiness Scores.

Alternative Hypothesis: The slope is not equal to 0 and there is a linear relationship between Emotional and Physical Well-Being Scores and Overall Happiness Scores.

By using Python to build a model, researchers obtained the following linear equation:

Y = -0.5759 * X + 68.8598

Test Results:

Conclusion: With a p-value this small, the null hypothesis that there is no linear relationship between Overall Happiness Scores and Well-Being Scores is rejected. We can conclude that Emotional and Physical Well-Being Scores have a linear relationship with Overall Happiness Scores. This agrees with the correlation analysis between these two fields conducted in earlier analyses.
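The slope test described above can be sketched from scratch as follows. The numbers are invented for illustration and are not the data behind the reported equation; slope_t_test is a hypothetical helper name.

```python
from math import sqrt
from statistics import fmean

def slope_t_test(xs, ys):
    """Fit y = slope * x + intercept by least squares; return (slope, intercept, t)."""
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares, then the standard error of the slope
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    se_slope = sqrt(sse / (n - 2) / sxx)
    # t statistic for H0: slope == 0, with n - 2 degrees of freedom
    t_stat = slope / se_slope
    return slope, intercept, t_stat

# Made-up well-being (X) and overall happiness (Y) scores
well_being = [1, 2, 3, 4, 5]
happiness = [68.1, 67.9, 67.0, 66.7, 65.9]

slope, intercept, t_stat = slope_t_test(well_being, happiness)
print(f"Y = {slope:.4f} * X + {intercept:.4f}, t = {t_stat:.2f}")
```

A t statistic far from zero corresponds to a small p-value and so to rejecting the null hypothesis of a zero slope. A library routine such as scipy.stats.linregress reports the slope, intercept, and p-value together.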