Exploratory Data Analysis (EDA)

Basic Statistical Analysis

As part of the initial exploratory analysis, researchers gathered the summary statistics of the data attributes, including the mean, median, and standard deviation of key numerical attributes and count and mode of key categorical attributes.
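As a minimal sketch of how such summary statistics might be gathered with pandas, the example below uses a tiny made-up frame. The column names and values are illustrative placeholders, not the project's data.

```python
import pandas as pd

# Illustrative toy data only; column names and values are placeholders.
tweets = pd.DataFrame({
    "retweet_count": [0, 0, 3, 0, 42],
    "favorite_count": [0, 1, 0, 0, 119],
    "state_name": ["CA", "NY", "CA", "TX", "CA"],
})

numeric_cols = ["retweet_count", "favorite_count"]
categorical_cols = ["state_name"]

# Count, mean, median, standard deviation, minimum, and maximum per numeric column.
numeric_summary = tweets[numeric_cols].agg(
    ["count", "mean", "median", "std", "min", "max"]
)

# Count, number of unique values, and mode per categorical column.
categorical_summary = pd.DataFrame({
    "count": tweets[categorical_cols].count(),
    "unique": tweets[categorical_cols].nunique(),
    "mode": tweets[categorical_cols].mode().iloc[0],
})

print(numeric_summary)
print(categorical_summary)
```

In a skewed column such as the toy retweet_count, the median stays at zero while a single large value pulls the mean upward, which is why both statistics are worth reporting side by side.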

As shown in the table above, the summary statistics of the State Happiness data attributes Rank, Emotional/Physical Well-Being, Community Environment, and Work Environment are identical across Count, Mean, Median, Standard Deviation, Minimum, and Maximum. This is expected because these fields rank the 50 U.S. states from 1 to 50 with no repeated ranks. The summary statistics confirm that these fields contain no invalid values.

The summary statistics of the Twitter data attributes, including Followers Count, Friends Count, Retweet Count, Favorite Count, and Compound Sentiment, provide additional insight. The median Retweet Count and Favorite Count are both zero, yet on average the tweets in our sample were retweeted about 9 times and favorited roughly 24 times, which indicates that a small number of tweets received most of the engagement. The attribute Compound Sentiment represents the overall sentiment of a given tweet; its valid values range from -1 to 1, very close to the range observed in our sample. A mean and median close to 0 suggest that the majority of tweets have a neutral sentiment, with a smaller percentage displaying strongly negative or strongly positive sentiment.

As shown in the table above, the counts of the Twitter Data categorical attributes are all identical and match the counts of the Twitter Data numerical attributes, with the exception of Time Zone. This suggests that the attribute Time Zone contains some missing values, whereas the remaining attributes do not.

Looking at the Unique Count, a value of 50 is as expected for the State Name attribute, suggesting that tweets are available for all 50 U.S. states. The Tweet Date has a unique count of 11, reflecting that tweets were pulled from Twitter over a span of 11 days. It is interesting to note that the User Name field has a count of 84,377 but a Unique Count of only 57,338. The difference, 27,039, is the number of tweets posted under a user name that already appears elsewhere in the sample.

The modes of the categorical attributes provide additional insight. Looking at State Name and Time Zone together, researchers noted that California contributed more tweets than any other state, yet the most frequent time zone was Eastern Time. One possible explanation is that the Eastern Time zone spans a greater number of states, so the East Coast states collectively may account for more tweets than California alone. Additionally, it is possible that Californian tweets make up a disproportionate share of the tweets that did not retain Time Zone information.

The mode of the User Name is “?”, which suggests that many Twitter users may be using emojis or other characters not recognized by Python. As part of the data cleanliness analysis, we investigated the User Identification Numbers to determine whether the majority of tweets came from unique users. For additional information on this analysis, refer to Data Cleaning Insight: Cleaning and Preparing the Twitter Data.

Data Cleaning Insight

With these potential data issues in mind, Python 3 was used to clean the Happiest States Data and Twitter Data and to identify and address outliers. To download a copy of the raw and cleaned datasets, or the related Python programs, refer to the Reference Materials page of this site.

Cleaning and Preparing the Happiest States Data

After a visual assessment was completed on the Happiest States dataset, data quality measures were applied to confirm all fields contained valid values. The fields containing ranking information, including ‘Rank’, ‘Emotional and Physical Well-Being’, ‘Work Environment’, and ‘Community Environment’, are known to have valid values of only integers between 1 and 50, as they represent ordinal ranks. Because no rank can be duplicated within an attribute, each of these columns must also total exactly 1,275 (the sum of the integers from 1 to 50). This was confirmed again indirectly through the summary statistics of these attributes. The State attribute was also examined to confirm that all values represented valid states and that no duplicates existed. For simplicity and consistency with the Twitter Data, state names in the Happiest States Data longer than two characters were converted to the standard two-letter state abbreviation. All attributes within the Happiest States Data were analyzed to confirm no missing values existed.
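The rank checks described above can be sketched in a few lines. This is an illustrative helper rather than the project's actual validation code; the function name and the sample lists are hypothetical.

```python
def is_valid_ranking(ranks, n_states=50):
    """Return True when ranks is exactly the integers 1..n_states, each used once."""
    return sorted(ranks) == list(range(1, n_states + 1))

# With no duplicates, the ranks 1..n must sum to n(n + 1) / 2.
expected_total = 50 * 51 // 2

good = list(range(1, 51))
bad = good[:-1] + [49]  # rank 49 repeated, rank 50 missing

print(expected_total, is_valid_ranking(good), is_valid_ranking(bad))
```

Checking the full set of values is stricter than checking the column total alone, since a column with compensating errors could still produce the expected sum.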

Cleaning and Preparing the Twitter Data

After visually assessing the data quality issues within the Twitter Data, we preprocessed the raw tweets to remove noise. To improve sentiment analysis, non-textual elements of the raw tweets, including URLs, were treated as noise and removed using regular expression matching. Communication between Twitter Users in the form of “@username” was also considered unnecessary for sentiment analysis and systematically removed. Hashtags in the format #Hashtag were replaced with the same word without the octothorpe, reducing noise while retaining the intent behind the text. The raw text of all tweets was converted to lowercase. Lastly, question marks were removed to avoid excess noise and confusion with improperly formatted emoticons.
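A minimal sketch of these text-cleaning steps using Python's re module follows. The exact patterns used in the project are not shown in this report, so the regular expressions and the function name here are assumptions chosen for illustration.

```python
import re

# Assumed patterns; the project's actual expressions may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def clean_tweet(text):
    """Apply the cleaning steps described above to one raw tweet."""
    text = URL_RE.sub("", text)         # drop URLs
    text = MENTION_RE.sub("", text)     # drop @username mentions
    text = HASHTAG_RE.sub(r"\1", text)  # keep the hashtag word, drop the '#'
    text = text.lower()                 # normalize case
    text = text.replace("?", "")        # remove question marks
    # Collapsing leftover whitespace is an added convenience, not a step from the report.
    return " ".join(text.split())

print(clean_tweet("@sam Loving #Monday?? https://example.com/x"))
```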

Once the raw tweets were cleansed, the remaining Twitter Data attributes were cleaned and assessed for quality. The tweet_date attribute was converted into an organized, easy-to-read format, retaining only the month, year, and day of the original value. As the user_description contained many blank values due to Twitter Users not incorporating descriptions within their profiles, all missing values were systematically replaced with the string ‘No Description’.
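The two attribute fixes above might look like the following sketch. It assumes the raw timestamps follow Twitter's classic created_at layout and that the cleaned date is written as YYYY-MM-DD; neither format is stated in this report, and the helper names are hypothetical.

```python
from datetime import datetime

# Assumption: raw timestamps look like 'Wed Oct 10 20:19:24 +0000 2018'.
RAW_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def simplify_date(raw):
    """Keep only the year, month, and day of a raw timestamp."""
    return datetime.strptime(raw, RAW_FORMAT).strftime("%Y-%m-%d")

def fill_description(description):
    """Replace a missing or blank description with a fixed placeholder string."""
    if description is None or not description.strip():
        return "No Description"
    return description

print(simplify_date("Wed Oct 10 20:19:24 +0000 2018"))
print(fill_description(""))
```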

For increased accuracy in determining tweet location, the user-reported user_location attribute was deleted in favor of the auto-detected location, which was used for state information only. State names were extracted from the location_name column. The location_type attribute identifies the formatting of the location_name attribute, allowing for automated extraction of state information. For example, a location_type of ‘city’ yields a location_name in the format ‘San Francisco, CA’. Roughly 1,500 rows with a location_type of ‘poi’, ‘neighborhood’, or ‘country’ were dropped since these rows did not contain distinguishable state information, and the relatively low count is not expected to affect the overall analysis.
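The extraction logic can be sketched as a small function keyed on location_type. Only the ‘city’ format is documented above; the ‘admin’ branch, its assumed ‘California, USA’ format, and the partial lookup table are assumptions added for illustration.

```python
# Hypothetical lookup; only a few states are listed to keep the sketch short.
STATE_ABBREV = {"California": "CA", "New York": "NY", "Texas": "TX"}

def extract_state(location_type, location_name):
    """Return a two-letter state code, or None when no state can be recovered."""
    if location_type == "city":
        # e.g. 'San Francisco, CA' -> 'CA'
        return location_name.rsplit(",", 1)[-1].strip()
    if location_type == "admin":
        # Assumed format: 'California, USA' -> 'CA'
        return STATE_ABBREV.get(location_name.split(",", 1)[0].strip())
    # 'poi', 'neighborhood', and 'country' rows carry no usable state information.
    return None

print(extract_state("city", "San Francisco, CA"))
```

Rows for which the function returns None correspond to the rows that would be dropped.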

To maintain consistency with the Happiest States Data, tweets with locations outside the 50 U.S. states were considered invalid and removed from the Twitter Data. It was determined that 862 tweets contained a location of Washington, DC. To avoid excessive removal of valid data, the location of these tweets was updated from D.C. to Maryland, due to proximity and the assumption that sentiment is likely to be relatively homogeneous across these two locations.
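A short sketch of this location filter is given below, assuming locations have already been reduced to two-letter codes. The function name and the sample codes are hypothetical.

```python
US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)

def normalize_state(code):
    """Map DC to MD, keep the 50 states, and return None for anything else."""
    if code == "DC":
        return "MD"
    return code if code in US_STATES else None

raw_codes = ["CA", "DC", "PR", "MD"]  # illustrative input only
kept = [s for s in map(normalize_state, raw_codes) if s is not None]
print(kept)
```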

After implementation of all other cleansing procedures, rows containing multiple bad entries (for example 15 missing values in a single row) were dropped as these may have resulted from a Twitter API server error, or other data collection issues.
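One way to drop such rows with pandas is sketched below. The threshold for "multiple bad entries" is not stated in this report, so max_missing is a hypothetical parameter and the frame is toy data.

```python
import pandas as pd

def drop_sparse_rows(df, max_missing):
    """Drop rows that have more than max_missing missing values."""
    missing_per_row = df.isnull().sum(axis=1)
    return df[missing_per_row <= max_missing]

# Toy data: the second row is missing every field.
toy = pd.DataFrame({
    "text": ["hello", None, "bye"],
    "state": ["CA", None, None],
    "user": ["a", None, "c"],
})
cleaned = drop_sparse_rows(toy, max_missing=1)
print(len(cleaned))
```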

After cleaning, researchers gathered the Top 5 Twitter Users in the Twitter Data by Tweet Count (see table below) to help identify any outliers. The highest number of tweets by one user is 41. Relative to the total number of tweets, 41 is not excessively high, which indicates that the sample contains tweets from many Twitter Users rather than a very large number of tweets from a small group of users.
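A ranking of this kind can be produced with a simple frequency count. The sketch below counts by user identification number rather than user name; the function name and the sample IDs are illustrative.

```python
from collections import Counter

def top_users(user_ids, n=5):
    """Return the n most frequent user IDs with their tweet counts."""
    return Counter(user_ids).most_common(n)

ids = [101, 102, 101, 103, 101, 102]  # illustrative IDs, one per tweet
print(top_users(ids, n=2))
```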

To begin an analysis of the tweet content, the text of each tweet was tokenized into individual words. Existing Python packages were then used to assess the sentiment of each tweet's text. Using the VADER algorithm, each tweet was assigned four scores: positive, negative, neutral, and compound. The positive score represents the sum of the tweet’s positive characteristics divided by the total, and similarly for the negative and neutral scores. The compound score is a single aggregate score for the tweet's overall sentiment, ranging from -1 to 1 [4].
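To illustrate how a compound score ends up between -1 and 1, the sketch below reproduces the normalization step as it appears in the open-source vaderSentiment package, x / sqrt(x² + α) with α = 15, applied to the summed word valences. It is a standalone illustration and does not call the library; the valence values are hypothetical.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    return max(-1.0, min(1.0, score))

# Hypothetical word valences for a short tweet; real values come from the VADER lexicon.
valences = [1.9, 0.0, 3.2]
print(round(normalize_compound(sum(valences)), 4))
```

Because of this squashing, a tweet with no sentiment-bearing words scores exactly 0, while strongly worded tweets approach but never exceed -1 or 1.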

Analysis of Data Cleanliness

The cleanliness of both the State Happiness and Twitter Data was assessed both before and after cleaning procedures were implemented, through a data quality score. Since the data is comprised of tweets and rankings, the clearest indicator of cleanliness is the number of missing values. The Data Quality Score, below, checks for the missing values of a column and returns a score that represents the level of cleanliness, with 100 being the best and 0 being the worst.

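def checkCleanliness(df_col, row_count):
    """Return a 0-100 score: the percentage of non-missing values in a column."""
    score = 100
    # Count the missing (null/NaN) entries in the pandas column
    missing_values_counts = df_col.isnull().sum()
    # Scale the score by the fraction of rows that are present
    score = score * (1 - missing_values_counts / row_count)
    return score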

Investigation into the Twitter Data cleanliness scores before and after cleansing, by attribute (see table below), indicates that all attributes improved in cleanliness with the exception of time_zone. However, missing values in this field are unlikely to negatively impact sentiment analysis.

A similar investigation into the State Happiness Data cleanliness scores before and after cleansing, by attribute (see table below), indicates that all attributes maintained their cleanliness, with no missing data.
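A before-and-after comparison of this kind might be assembled as in the sketch below, which scores every column at once as the percentage of non-missing values. The helper name and the toy frames are hypothetical and only mimic the pattern described above.

```python
import pandas as pd

def quality_scores(df):
    """Score each column from 0 to 100 as the percentage of non-missing values."""
    return 100 * (1 - df.isnull().mean())

# Toy data standing in for the raw Twitter attributes.
before = pd.DataFrame({
    "user_description": ["runner", None, None, "chef"],
    "time_zone": ["Eastern", None, "Pacific", None],
})
# Cleaning step: fill missing descriptions, leave time_zone untouched.
after = before.assign(
    user_description=before["user_description"].fillna("No Description")
)

comparison = pd.DataFrame({
    "before": quality_scores(before),
    "after": quality_scores(after),
})
print(comparison)
```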

Binning the Twitter Data

As part of the continued preparation of the Twitter data, the compound sentiment score was binned into 5 categories: Very Negative, Little Negative, Neutral, Little Positive, and Very Positive, as displayed in the table below.

Binning of Compound Sentiment Scores of Twitter Data

| Bin Label       | Range of Compound Scores | Example Score |
| Very Negative   | -1.0 to -0.5             | -0.89         |
| Little Negative | -0.49 to 0               | -0.32         |
| Neutral         | 0                        | 0             |
| Little Positive | 0 to 0.49                | 0.30          |
| Very Positive   | 0.5 to 1.0               | 0.92          |

Before binning, the compound sentiment scores are approximately normally distributed within each of the ranges [-1, 0] and [0, 1], as shown in the left figure above. However, the distribution of the compound scores as a whole is neither normal nor smooth. The five bins are encoded as integers: -2 for Very Negative, -1 for Little Negative, 0 for Neutral, 1 for Little Positive, and 2 for Very Positive. After binning, the distribution is smoother and closer to normal, as shown in the right histogram above.
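The binning rule can be sketched as a small function. The table leaves the exact treatment of boundary values open (for example, scores between -0.5 and -0.49), so the cut-offs below are one reasonable reading rather than the project's confirmed rule.

```python
def bin_compound(score):
    """Map a compound score in [-1, 1] to an integer bin from -2 to 2."""
    if score <= -0.5:
        return -2  # Very Negative
    if score < 0:
        return -1  # Little Negative
    if score == 0:
        return 0   # Neutral
    if score < 0.5:
        return 1   # Little Positive
    return 2       # Very Positive

examples = [-0.89, -0.32, 0, 0.30, 0.92]  # example scores from the binning table
print([bin_compound(s) for s in examples])
```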

Histograms of Key Attributes

To further investigate the Twitter and State Happiness data, key attributes were plotted in histograms. First, we examined the distribution of the Twitter Data sample by state to check whether the sample of tweets was representative of the state populations, which supports the accuracy of the analysis results. By plotting the Number of Tweets by State (see figure below) and comparing it to the population of each state (shown below), we investigated whether the number of tweets per state roughly aligned with each state's population. The number of users per state was also plotted. Both visual analyses suggested that the proportions were roughly aligned, which provided confidence that the sample of tweets was representative by state.

In reviewing the histograms above, it was determined that the plots are all similarly distributed. This suggests that the sample of tweets being analyzed is representative of the true U.S. population by state in terms of proportion.

The rankings of happiness were plotted by state, as illustrated in the figure above. Visual analysis showed no obvious outliers in happiness scores, and the range of scores is not excessively wide. It appears that most states have relatively similar happiness rankings.

We plotted frequently used words from the raw text of the Twitter data to visually investigate the sentiment. Based on the results, shown above, Twitter users appear to use a mix of words with positive and negative sentiment, with positive words like “love” and “good” appearing frequently. Additional analysis would be required to draw further conclusions on sentiment patterns.
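A word-frequency count of this kind can be sketched with the standard library. The example splits on whitespace and does not remove stop words; both choices, along with the sample tweets, are simplifying assumptions for illustration.

```python
from collections import Counter

def top_words(cleaned_tweets, n=3):
    """Count whitespace-separated tokens across tweets and return the n most common."""
    counts = Counter()
    for text in cleaned_tweets:
        counts.update(text.split())
    return counts.most_common(n)

sample = ["love this good day", "good coffee", "love love mondays"]  # toy tweets
print(top_words(sample, n=2))
```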