Did you read the next episode? Using textual cues for predicting podcast popularity, Joshi et al., 2020
We investigate the textual cues that help distinguish popular podcasts from unpopular ones. We employ a triplet-based training method to learn a text-based representation of a podcast, which is then used in a popularity prediction task.
There have been some attempts to learn general representations for media content, but only from the audio itself, not from textual cues.
Yang et al. (2019) propose a GAN-based model that learns podcast representations from non-textual features and show its usefulness in downstream tasks such as music retrieval and popularity detection.
We use the dataset from Yang et al., consisting of 6.5k episodes, with 837 popular and 5,674 unpopular podcasts; the average duration of a podcast is 9.83 minutes.
We use the transcripts provided with the audio and remove the timestamps, stop words, and verbal vocalizations. A podcast transcript contains 1,557 tokens on average.
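A minimal sketch of this preprocessing, assuming a simple bracketed timestamp pattern and a small hand-picked list of verbal vocalizations (both hypothetical; the paper does not specify them):

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))
# Hypothetical filler list; the paper does not enumerate its vocalizations.
VOCALIZATIONS = {'um', 'uh', 'hmm', 'mhm', 'ah', 'er'}
# Hypothetical timestamp format, e.g. "[00:01:23]".
TIMESTAMP_RE = re.compile(r'\[\d{2}:\d{2}:\d{2}\]')

def clean_transcript(text: str) -> list[str]:
    """Strip timestamps, then drop stop words and verbal vocalizations."""
    text = TIMESTAMP_RE.sub(' ', text.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS and t not in VOCALIZATIONS]
```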
We extract a polarity score for each podcast using TextBlob, computed by averaging the polarity values of pre-defined lexicon entries matched by the words in the podcast. The overall polarity of popular and unpopular podcasts is roughly the same, around 0.14 (anything above 0 is considered positive).
Using TextBlob, we also look at a subjectivity score for each podcast, obtained by averaging the subjectivity values of the matched lexicon entries. As with polarity, the subjectivity score is essentially the same for both classes.
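Both scores come directly from TextBlob's sentiment API (polarity in [-1, 1], subjectivity in [0, 1]); a minimal sketch:

```python
from textblob import TextBlob

def sentiment_scores(transcript: str) -> tuple[float, float]:
    """Return (polarity, subjectivity) averaged over the transcript's lexicon matches."""
    blob = TextBlob(transcript)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# A polarity above 0 counts as positive.
polarity, subjectivity = sentiment_scores("This episode was a surprisingly fun listen.")
```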
We use Empath (Fast et al., 2016) to analyze topical signals with the help of its 194 pre-defined lexical categories (social media, war, violence, money, alcohol, crime, etc.). We extract an Empath score for each category for each podcast. We observe that content centered around 'Politics', 'Crime', or 'Media' tends to be more popular than content on other topics.
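Per-category scores can be obtained with Empath's analyze call; a minimal sketch (the sample text is illustrative only):

```python
from empath import Empath

lexicon = Empath()
transcript_text = "the senate vote on the new crime bill dominated the news this week"
# normalize=True scales each category count by the document length,
# so episodes of different durations are comparable.
scores = lexicon.analyze(transcript_text, normalize=True)
print(scores['politics'], scores['crime'])
```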
We rank bigrams by their pointwise mutual information (PMI) scores. Popular podcasts commonly include keyword pairs such as Hillary Clinton, Donald Trump, and Gordon Hayward, hinting that categories like Politics, Sports, or Celebrities may be responsible for making a podcast popular.
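One way to compute such a ranking is NLTK's collocation finder; a minimal sketch, assuming the tokens are the cleaned transcript tokens from the preprocessing step above (the frequency cutoff is an illustrative choice, not the paper's setting):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def top_bigrams_by_pmi(tokens: list[str], min_freq: int = 5, k: int = 20):
    """Return the k highest-PMI bigrams, ignoring rare pairs."""
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(min_freq)  # drop bigrams seen fewer than min_freq times
    return finder.nbest(BigramAssocMeasures.pmi, k)
```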
See Section 5 of the paper.
We employ a triplet-based training procedure to counter the class imbalance problem in our data. In future work, we plan to explore this problem in a multi-modal setting by constructing multi-modal embeddings that leverage both audio and textual data. We also plan to leverage the temporal information associated with the transcripts, in the form of timestamps of the spoken words, for the task of popularity prediction.
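A minimal sketch of the triplet setup in PyTorch: the anchor and positive are drawn from the same popularity class and the negative from the other class, so the model needs no balanced batches. The encoder below (mean-pooled word embeddings plus a projection) and the random inputs are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative encoder: mean-pooled word embeddings followed by a projection."""
    def __init__(self, vocab_size: int, embed_dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(token_ids))

encoder = TextEncoder(vocab_size=30000)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Dummy batch of token-id sequences; anchor/positive share a popularity label,
# negative has the opposite label.
anchor, positive, negative = (torch.randint(0, 30000, (8, 200)) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
optimizer.step()
```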