Text is increasingly being explored as a rich data source. Quantifying textual data can reveal trends, patterns, and characteristics that a human reader might not notice at first. Combining quantitative analysis with modern computing power makes it possible to process very large amounts of text quickly.
Here we present a sentiment analysis of an intriguing form of text, podcast transcripts, to illustrate the process of text analysis.
Podcast transcripts are a unique form of text because they were originally meant to be heard, not read, which makes for a more intimate form of communication. The text used in this example is the set of transcripts from the “Serial” podcast, hosted by Sarah Koenig and developed by the creators of “This American Life”. “Serial” explores the investigation and trial of Adnan Syed, who was accused of the 1999 murder of his ex-girlfriend. The podcast consists of 12 episodes averaging 43 minutes and 7,480 words each. Here we examine the 12 episodes together as a single text.
Sentiment analysis involves processing text data to identify, quantify, and extract subjective information from the text. Using the tools available in the tidytext package in R, we can examine the polarity (positive or negative tone) and the emotional association (joy, anger, fear, trust, anticipation, surprise, disgust, and sadness) of the text. We present one method of sentiment analysis, which involves referencing a sentiment dictionary: a list of words coded according to the objective of the analysis. To examine polarity, each word is given a positive, negative, or neutral value; to examine emotions, each word is tagged with any associated emotions. For example, the word “murder” is coded as negative and tagged with fear and sadness. We chose the NRC sentiment dictionary for this analysis because it is the only one that includes emotions and because it was created as a general-purpose lexicon that works well for different types of text.
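The dictionary lookup described above can be sketched in a few lines. This is a minimal Python illustration rather than the R workflow used for the analysis, and `TOY_LEXICON`, `tokenize`, and `tag_words` are hypothetical names. The lexicon is a toy stand-in, not the NRC dictionary: only the coding of “murder” and the positive polarity of “friend” and “police” come from this article, and the remaining tags are assumptions made for the example.

```python
import re
from collections import Counter

# Toy lexicon for illustration only; NOT the real NRC entries.
# "murder" follows the coding described in the article; the emotion
# tags on "friend" and "police" are assumptions for this sketch.
TOY_LEXICON = {
    "murder": {"negative", "fear", "sadness"},
    "friend": {"positive", "joy", "trust"},
    "police": {"positive", "fear", "trust"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tag_words(text, lexicon):
    """Count how often each polarity/emotion label is matched in the text."""
    counts = Counter()
    for word in tokenize(text):
        # Words missing from the lexicon are treated as neutral and skipped.
        for label in lexicon.get(word, ()):
            counts[label] += 1
    return counts

sample = "My friend called the police after the murder."
label_counts = tag_words(sample, TOY_LEXICON)
print(label_counts)
```

Because a word can carry several labels at once, the label counts can add up to more than the number of matched words.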
We start with an overall visualization of the emotions and polarity of the podcast: a bar graph (Figure 1) displays the percentage of the text characterized by each emotion. Examined this way, the text yields a particularly intriguing discovery: the most common emotion is trust, which may be surprising for a podcast about a murder investigation and trial. The next most common emotion is anticipation. This matches what one might expect of podcasts: hosts want to keep their listeners interested in the story, so anticipation plays a key role in getting people to listen regularly.
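The percentages behind a bar graph like this can be computed as each emotion's share of all emotion tags. The following Python sketch assumes that denominator; the article does not state exactly how the percentages in Figure 1 were calculated. The function name `emotion_shares` and the sample data are hypothetical and are not taken from the transcripts.

```python
from collections import Counter

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

def emotion_shares(tagged_words):
    """Return each emotion's percentage of all emotion tags.

    tagged_words is an iterable of (word, labels) pairs, one pair per
    word occurrence in the text.
    """
    counts = Counter()
    for _word, labels in tagged_words:
        counts.update(label for label in labels if label in EMOTIONS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: 100.0 * counts[emotion] / total for emotion in EMOTIONS}

# Hypothetical tagged occurrences, made up for the example.
tagged = [
    ("friend", {"joy", "trust"}),
    ("police", {"fear", "trust"}),
    ("murder", {"fear", "sadness"}),
    ("trial", {"anticipation"}),
]
shares = emotion_shares(tagged)
for emotion, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{emotion:<12} {pct:5.1f}%")
```

A different denominator, such as the total number of words in the text, would change the percentages but not the ranking of the emotions.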
Figure 2 shows that the text is positive overall: a larger percentage of the words are coded as positive than as negative. To look more closely at which words occur most often within a specific sentiment or emotion, a sorted word cloud lets one visually identify the most commonly used words coded as positive or negative.
The most frequently used negative words are crime, murder, kill, and calls. The most frequently used positive words are friend, talk, police, and pretty. It is important to examine the context of the most common words. Consider the word “pretty”: in the text, it was used as an adverb, not as an adjective (e.g. “I’m pretty sure I was in school. I think– no?”). All 53 instances of “pretty” in the text were used to express uncertainty. However, the NRC dictionary defines and codes “pretty” as an adjective describing something as attractive. This mismatch between the usage in the text and the dictionary definition affects the sentiment analysis. One should carefully consider how to handle such words.
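One way to check how a frequent word is actually used is to list every occurrence together with its neighbouring words (a keyword-in-context listing). The Python sketch below is illustrative only: `keyword_in_context` is a hypothetical helper, and the sample is the single sentence quoted above, not the full transcripts.

```python
import re

def keyword_in_context(text, keyword, window=3):
    """Return (left context, match, right context) for each occurrence."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

sample = "I'm pretty sure I was in school. I think, no?"
hits = keyword_in_context(sample, "pretty")
for left, word, right in hits:
    print(f"{left} [{word}] {right}")
```

Reading through such a listing makes it easy to see whether a word is being used in the sense the dictionary assumes.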
Similarly, we can examine each emotion in more detail. These graphs show which words contribute most to each emotion.
This graph again illustrates the importance of critically examining the results. The word “don” appears as a top positive word because the NRC lexicon defines it as a gentleman or mentor. In this text, however, “Don” is the name of a person and, like the other names, should be coded as neutral. Similar concerns may arise for other words that have multiple meanings. Such words should be handled appropriately, particularly if they are among the most frequently used words in the text.
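One possible way to handle such words, though not necessarily the approach taken for this analysis, is to remove them from the lexicon before scoring so that they are treated as neutral. The Python sketch below uses a toy polarity lexicon whose four entries follow the codings mentioned in this article; `drop_words` and `count_label` are hypothetical helpers, and the token list is made up for the example.

```python
def drop_words(lexicon, excluded):
    """Return a copy of the lexicon without the excluded words."""
    excluded = {word.lower() for word in excluded}
    return {word: labels for word, labels in lexicon.items()
            if word.lower() not in excluded}

def count_label(tokens, lexicon, label):
    """Count tokens whose lexicon entry carries the given label."""
    return sum(1 for token in tokens if label in lexicon.get(token.lower(), ()))

# Toy polarity lexicon; not the full NRC dictionary.
toy_lexicon = {
    "don": {"positive"},
    "pretty": {"positive"},
    "friend": {"positive"},
    "murder": {"negative"},
}

# Made-up tokens for illustration, not a quote from the transcripts.
tokens = ["Don", "was", "pretty", "sure", "his", "friend", "called"]

before = count_label(tokens, toy_lexicon, "positive")
cleaned = drop_words(toy_lexicon, ["don", "pretty"])
after = count_label(tokens, cleaned, "positive")
print(before, after)
```

Building a cleaned copy instead of editing the lexicon in place keeps the original available, so results with and without the exclusions can be compared.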
These graphs show a few of the many ways to quantify and visualize text data through sentiment analysis in order to understand a text more objectively. As text analyses become more prevalent, it is imperative to engage actively in the process and to examine results critically, paying attention not only to the numbers and graphs but also to the subject matter of the text.