Cluster Analysis

Multidimensional Scaling (MDS) Analysis

To visualize similarities and differences in tweet content across states, the cosine similarity between the tweet content of each pair of the 50 states was calculated; that is, the cosine of the angle between their tweet content was used as a measure of the distance between states. These distances were then plotted using Multidimensional Scaling (MDS), as shown in the figure below.
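The two steps described above (pairwise cosine distance, then a low-dimensional embedding) can be sketched as follows. This is only an illustration under stated assumptions: the term counts and state names are hypothetical toy data, and the embedding uses classical (Torgerson) MDS written directly in NumPy, which may differ from the MDS variant used to produce the figures in this report.

```python
import numpy as np


def cosine_distance_matrix(term_counts):
    """Pairwise cosine distance (1 - cosine similarity) between rows."""
    X = np.asarray(term_counts, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)  # remove floating-point noise on the diagonal
    return distance


def classical_mds(distance, n_components=2):
    """Classical (Torgerson) MDS: embed a distance matrix in n_components dims."""
    n = distance.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distance ** 2) @ centering  # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard against negatives
    return eigvecs[:, top] * scale


# Hypothetical term counts: one row per state, one column per vocabulary term.
states = ["State A", "State B", "State C", "State D"]
term_counts = np.array([
    [5, 1, 0, 2],
    [4, 2, 0, 1],
    [0, 1, 6, 3],
    [1, 0, 5, 4],
])

distance = cosine_distance_matrix(term_counts)
coords_2d = classical_mds(distance, n_components=2)
coords_3d = classical_mds(distance, n_components=3)

for name, (x, y) in zip(states, coords_2d):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

States whose rows point in similar directions land close together in the embedding, while states with unusual content fall farther from the center, which is the pattern the plots are read for.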

Two-Dimensional Scaling of Tweets by State

Looking at the distances between states in the two-dimensional plot above, the majority of states appear to have fairly similar tweet content. States falling farther from the center have tweet content that is more dissimilar from that of the remaining states.

Roughly eight states deviate most heavily from the centroid: Vermont, South Dakota, Wyoming, Montana, Alaska, North Dakota, New Hampshire, and Idaho. As shown in the Two-Dimensional Scaling of Tweets by State figure above, these states appear to have the tweet content most dissimilar from that of all other states.

Three-Dimensional Scaling of Tweets by State

The three-dimensional MDS plot above provides insight similar to that of the two-dimensional version: tweet content from most states appears similar, while a few states have more distinctive tweet content. To further investigate which, if any, of the states truly differed in tweet content, three methods of cluster analysis were applied to the Twitter data.

Ward’s Cluster Analysis

Using the cosine similarities of tweets across states, researchers applied Ward’s cluster analysis to further visualize which states were most similar in tweet content. The hierarchical clustering dendrogram of tweet content by state is shown in the figure below.

Ward’s Clustering of Tweets by State

Based on the hierarchical clustering dendrogram in the figure above, there are roughly two to four distinct groups of states based on tweet content. Consistent with the MDS plot, Vermont, South Dakota, Wyoming, Montana, Alaska, North Dakota, and Idaho appear to make up one hierarchical cluster.

The next most dissimilar group of states in the MDS plot consisted of Oregon, Hawaii, Maine, Rhode Island, Mississippi, New Hampshire, Louisiana, New Mexico, and Delaware; these same states make up another of the four key hierarchical clusters.
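A minimal sketch of Ward's hierarchical clustering, assuming SciPy is available. The term counts are hypothetical toy data, and applying Ward linkage to L2-normalized rows is an assumption made for illustration: for unit vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so this is one way to combine Ward's method with cosine similarity. The exact input used for the dendrogram in this report is not stated.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical term counts: one row per state, one column per vocabulary term.
states = ["State A", "State B", "State C", "State D", "State E", "State F"]
term_counts = np.array([
    [5, 1, 0, 2],
    [4, 2, 0, 1],
    [6, 1, 1, 2],
    [0, 1, 6, 3],
    [1, 0, 5, 4],
    [0, 2, 7, 3],
], dtype=float)

# L2-normalize rows so Euclidean geometry reflects cosine similarity.
unit = term_counts / np.linalg.norm(term_counts, axis=1, keepdims=True)

# Ward linkage merges, at each step, the pair of clusters whose union gives
# the smallest increase in within-cluster variance.
merges = linkage(unit, method="ward")

# Cut the tree into a fixed number of flat clusters (2 here; the text
# above considers anywhere from 2 to 4 groups).
labels = fcluster(merges, t=2, criterion="maxclust")

for name, label in zip(states, labels):
    print(f"{name}: cluster {label}")
```

Passing `merges` to `scipy.cluster.hierarchy.dendrogram` (with Matplotlib installed) draws the corresponding tree, and changing `t` shows how the grouping shifts as the tree is cut into more clusters.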

K-Means Cluster Analysis

Based on the visualization from the Ward’s cluster analysis, three or four clusters appeared most appropriate for the data. Using K-Means with 3 clusters, the states were grouped as shown in the figure and table below, resulting in a silhouette score of 0.555. Silhouette coefficients range from -1 to 1, with values near 1 indicating the best-separated clusters, so a score of this magnitude indicates that the clusters fit reasonably well.
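For reference, the standard definition of the silhouette coefficient for a single observation i is given below, where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. The reported silhouette score is the mean of s(i) over all observations.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

A value near 1 means the observation sits much closer to its own cluster than to the nearest other cluster; a value near 0 means it lies between two clusters; a negative value suggests it may be assigned to the wrong cluster.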

K-Means Clustering of States by Cosine Similarity into 3 Clusters

Because the number of clusters in K-Means analysis must be chosen by the analyst, additional analysis was performed using k=4 clusters. Clustering the states by cosine similarity into four clusters resulted in a lower silhouette score of 0.416, indicating weaker results when a greater number of clusters was used. The resulting visualization and the states in each cluster are shown in the figure and table below.

K-Means Clustering of States by Cosine Similarity into 4 Clusters
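The comparison of k=3 and k=4 by silhouette score can be sketched as follows, assuming scikit-learn is available. The feature matrix here is hypothetical toy data (three well-separated groups), not the state-level data from this report, so the printed scores will not match the values reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D features: three tight, well-separated groups of five points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.repeat(centers, 5, axis=0) + rng.normal(scale=0.5, size=(15, 2))

scores = {}
for k in (3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(features)
    scores[k] = silhouette_score(features, labels)  # mean silhouette over all points
    print(f"k={k}: silhouette = {scores[k]:.3f}")

# Pick the candidate k with the highest mean silhouette.
best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

When the data contain three natural groups, forcing a fourth cluster splits one of them, which lowers the mean silhouette. The same logic underlies preferring the k=3 solution here.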

DBSCAN Cluster Analysis

The DBSCAN clustering methodology produced two clusters with a silhouette coefficient of 0.609, higher than the scores from the K-Means methodology with either k=3 or k=4 clusters.

DBSCAN Clustering of States by Cosine Similarity
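A minimal sketch of DBSCAN on a precomputed cosine distance matrix, assuming scikit-learn is available. The term counts are hypothetical toy data, and the `eps` and `min_samples` values are illustrative choices for that toy data, not the settings used in this report. Excluding noise points before computing the silhouette score is also an assumption; the report does not state how noise was handled.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical term counts: one row per state, one column per vocabulary term.
term_counts = np.array([
    [5, 1, 0, 2],
    [4, 2, 0, 1],
    [6, 1, 1, 2],
    [0, 1, 6, 3],
    [1, 0, 5, 4],
    [0, 2, 7, 3],
    [0, 9, 0, 0],  # deliberately unlike every other row
], dtype=float)

# Pairwise cosine distance between rows.
unit = term_counts / np.linalg.norm(term_counts, axis=1, keepdims=True)
distance = 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
np.fill_diagonal(distance, 0.0)

# eps and min_samples are illustrative values chosen for this toy data.
labels = DBSCAN(eps=0.2, min_samples=2, metric="precomputed").fit_predict(distance)

clustered = labels != -1  # DBSCAN marks noise points with the label -1
n_clusters = len(set(labels[clustered]))

# Score only the points that were assigned to a cluster.
score = silhouette_score(
    distance[np.ix_(clustered, clustered)], labels[clustered], metric="precomputed"
)
print(f"clusters: {n_clusters}, noise points: {int((~clustered).sum())}")
print(f"silhouette (noise excluded): {score:.3f}")
```

Unlike K-Means, DBSCAN does not take the number of clusters as an input; the count follows from the density parameters, so a two-cluster result reflects the structure of the data at the chosen `eps`.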

Conclusions of Cluster Analysis

It is interesting to note that some similarities held across all clustering methodologies and the MDS plot: Cluster 4 of the Ward’s hierarchical clustering, Cluster 2 of the DBSCAN methodology, Cluster 1 of the K-Means clustering with 3 clusters, and Cluster 2 of the K-Means clustering with 4 clusters mirror one another exactly, containing the same states. This suggests that tweet content from Alaska, Montana, North Dakota, South Dakota, Vermont, and Wyoming may be truly similar. Researchers found this to make intuitive sense, as the populations of these six states are generally thought to have similar interests and would plausibly tweet about similar topics.

Looking into the Happiness Scores of these states, displayed in the table below, researchers noted that the scores spanned a range of 19.28. Two pairs of states, Alaska and Wyoming, and Vermont and North Dakota, displayed nearly identical happiness rankings. In the Ward’s hierarchical plot of these states in the figure below (Ward’s Clustering of States with Strong Cosine Similarity), Vermont and North Dakota appear to have the most similar tweet content, in line with their nearly identical Happiness Rankings.

Ward’s Clustering of States with Strong Cosine Similarity