Lab session #3: Extracting and Analyzing data from the Cloud

This Lab assignment builds on top of the previous ones to provide a basis for extracting interesting terms from a dataset of tweets while we “keep the connection open” and gather all the upcoming tweets about a particular event.

Tasks for Lab session #3

In case we want to “keep the connection open”, and gather all the upcoming tweets about a particular event, the Real-time tweets API is what we need.

The Real-time tweets API gives developers low-latency access to Twitter’s global stream of Tweet data. A properly implemented streaming client is pushed messages indicating that Tweets and other events have occurred.

Connecting to the real-time tweets API requires keeping a persistent HTTP connection open. In many cases, this involves thinking about your application differently than if you were interacting with the REST API. Visit the Real-time tweets API documentation for more details about the differences between the real-time and REST APIs.

The Real-time tweets API is one of the favorite ways of getting a massive amount of data without exceeding the rate limits. If you intend to conduct singular searches, read user profile information, or post Tweets, consider using the REST APIs instead.
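If singular searches are all you need, a single REST call suffices. As a hedged sketch only, assuming the pre-4.x Tweepy interface used in the listing below (newer Tweepy versions renamed search() to search_tweets()) and the same placeholder credentials:

import tweepy

# Sketch: a one-off REST search instead of a persistent stream (illustrative only)
auth = tweepy.OAuthHandler('YOUR-CONSUMER-KEY', 'YOUR-CONSUMER-SECRET')
auth.set_access_token('YOUR-ACCESS-TOKEN', 'YOUR-ACCESS-SECRET')
api = tweepy.API(auth)

# Fetch the most recent matching tweets; this call is subject to REST rate limits
for status in api.search(q='ArtificialIntelligence', count=10):
    print(status.text)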

We need to extend the StreamListener class to customize the way we process the incoming data. We will base our explanation on a working example (from Marco Bonzanini) that gathers all new tweets containing the term "ArtificialIntelligence":

import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):

    def on_data(self, data):
        # Called for every new tweet pushed by the stream: append the raw JSON to a file
        try:
            with open('ArtificialIntelligenceTweets.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        # Called when Twitter returns an error code (e.g. 420 when we are rate limited)
        print(status)
        return True

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['ArtificialIntelligence'])

The core of the streaming logic is implemented in the MyListener class, which extends StreamListener and overrides two methods: on_data() and on_error().

These are handlers that are triggered when new data comes through or when the API throws an error. If the problem is that we have been rate limited by the Twitter API, we need to wait before we can use the service again.

When data becomes available, the on_data() method is invoked. It simply appends the data to the ArtificialIntelligenceTweets.json file. Each line of that file will then contain a single tweet in JSON format. You can use the command wc -l ArtificialIntelligenceTweets.json from a Unix shell to check how many tweets you have gathered.
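If you prefer to check the number of collected tweets from Python rather than from the shell, a minimal sketch (it only assumes the file name used above):

# Count how many tweets (one JSON document per line) have been collected so far
with open('ArtificialIntelligenceTweets.json', 'r') as f:
    n_tweets = sum(1 for line in f if line.strip())
print('%d tweets collected so far' % n_tweets)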

Before continuing this hands-on, be sure that you have correctly generated the .json file.

Once your program works correctly, you can try tracking another term of your interest, as shown in the snippet below.
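Only the track list (and, ideally, the output file name hard-coded in MyListener) needs to change. The keyword below is just an illustration, not part of the assignment:

# Reuse the same auth and listener, but track a different (illustrative) term
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['MachineLearning'])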

We can leave the previous program running in the background, fetching tweets, while we write our first exploratory analysis in a separate Python file, TwitterAnalyzer.py: a simple word count. This lets us observe which terms are most commonly used in the dataset.

Let's first read the file with all the tweets to make sure that everything is correct:

import json

with open('ArtificialIntelligenceTweets.json', 'r') as json_file:
    for line in json_file:
        tweet = json.loads(line)
        print(tweet["text"])

Before tokenizing all these tweets, let's pretty-print the first one to examine its JSON structure:

import json

with open('ArtificialIntelligenceTweets.json', 'r') as f:
    line = f.readline()
    tweet = json.loads(line)
    print(json.dumps(tweet, indent=4))

Now, if we want to process (tokenize) all the tweets previously saved in the file:

with open('ArtificialIntelligenceTweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        print(tokens)

Remember that preprocess() was defined in the previous Lab in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions and URLs:

import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
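As a quick sanity check, preprocess() can be tried on a hand-written tweet; the sample string below is made up for illustration:

# Illustrative check of the tokenizer on a made-up tweet
sample = "RT @user: Artificial Intelligence is everywhere! #AI http://example.com :)"
print(preprocess(sample))
# Expected output (roughly):
# ['RT', '@user', ':', 'Artificial', 'Intelligence', 'is', 'everywhere', '!',
#  '#AI', 'http://example.com', ':)']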

In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter() which internally is a dictionary (term: count) with some useful methods like most_common():

import operator
import json
from collections import Counter

fname = 'ArtificialIntelligenceTweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    print(count_all.most_common(5))

As you can see, the above code produces words (or tokens) that are stop words. Given the nature of our data and our tokenization, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author), which are not in the default stop-word list.

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords") # download the stopword corpus on our computer
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'RT']

We can now substitute the variable terms_all in the first example with something like:

import operator
import json
from collections import Counter

fname = 'ArtificialIntelligenceTweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
        count_all.update(terms_stop)
    for word, count in count_all.most_common(5):
        print('%s : %s' % (word, count))

Besides stop-word removal, we can further customize the list of terms/tokens of our interest. For instance, if we want to count hashtags only, we can use:

terms_hash = [term for term in preprocess(tweet['text'])
              if term.startswith('#')]

In case we are interested in counting terms only, excluding hashtags and mentions:

terms_only = [term for term in preprocess(tweet['text'])
              if term not in stop and
              not term.startswith(('#', '@'))]

Mind the double parentheses (( )) in startswith(): it takes a tuple (not a list).
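A quick illustration of the difference, with made-up strings:

# startswith() accepts a tuple of prefixes...
'#AI'.startswith(('#', '@'))      # True
# ...but passing a list raises TypeError
# '#AI'.startswith(['#', '@'])    # TypeError: startswith first arg must be str or a tuple of str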

Although we do not use it in this Lab session, NLTK offers other convenient functions. For instance, to put things in context, some analyses examine sequences of two terms. In this case, we can use the bigrams() function, which takes a list of tokens and produces a list of tuples of adjacent tokens.
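A minimal sketch of how this could look, using a short hand-made token list (nothing in this Lab depends on it):

from nltk import bigrams

# Adjacent pairs of tokens from one (illustrative) tweet
tokens = ['artificial', 'intelligence', 'is', 'everywhere']
print(list(bigrams(tokens)))
# [('artificial', 'intelligence'), ('intelligence', 'is'), ('is', 'everywhere')]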

Create a program TwitterAnalyzer.py that reads the .json file generated by the listener process only once and prints (a possible skeleton is sketched after the list):

  • a list of the top ten most frequent tokens
  • a list of the top ten most frequent hashtags
  • a list of the top ten most frequent terms, skipping mentions and hashtags.
  • copy and paste the output to README.md
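One possible skeleton, meant only as a starting point: it assumes the preprocess() function and the stop list defined above, as well as the file name used by the listener; the variable names are not prescribed.

import json
from collections import Counter

fname = 'ArtificialIntelligenceTweets.json'
count_tokens = Counter()   # all tokens
count_hash = Counter()     # hashtags only
count_terms = Counter()    # terms only (no stop words, hashtags or mentions)

with open(fname, 'r') as f:
    for line in f:                       # the file is read only once
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_tokens.update(tokens)
        count_hash.update(t for t in tokens if t.startswith('#'))
        count_terms.update(t for t in tokens
                           if t not in stop and not t.startswith(('#', '@')))

print('Top 10 tokens:  ', count_tokens.most_common(10))
print('Top 10 hashtags:', count_hash.most_common(10))
print('Top 10 terms:   ', count_terms.most_common(10))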

At "Racó (course intranet)" you can find a small dataset as an example (please do not distribute due to Twitter licensing). This dataset contains 1060 tweets downloaded from around 18:05 to 18:15 on January 13. We used "Barcelona" as a track parameter at twitter_stream.filter function. Save the file in your folder but do not add it to the git repository.

We would like to get a rough idea of what people were saying about Barcelona. For example, we can count and sort the most commonly used hashtags:

fname = 'Lab3.CaseStudy.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_hash = [term for term in preprocess(tweet['text']) if term.startswith('#') and term not in stop]
        count_all.update(terms_hash)
# Print the 15 most frequent hashtags
print(count_all.most_common(15))

The output is (ordering may differ): [(u'#Barcelona', 68), (u'#Messi', 30), (u'#FCBLive', 17), (u'#UDLasPalmas', 13), (u'#VamosUD', 13), (u'#barcelona', 10), (u'#CopaDelRey', 8), (u'#empleo', 6), (u'#BCN', 6), (u'#riesgoimpago', 6), (u'#news', 5), (u'#LaLiga', 5), (u'#SportsCenter', 4), (u'#LionelMessi', 4), (u'#Informe', 4)]

To get a more visual description we can plot the counts. There are different options for creating plots in Python, using libraries such as matplotlib or ggplot. We have decided to use matplotlib and the following code, which uses the zip() function to separate the hashtags from their counts:

import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (6,6)
import matplotlib.pyplot as plt

sorted_x, sorted_y = zip(*count_all.most_common(15))
print(sorted_x, sorted_y)

plt.bar(range(len(sorted_x)), sorted_y, width=0.75, align='center')
plt.xticks(range(len(sorted_x)), sorted_x, rotation=60)
plt.axis('tight')

plt.savefig('CaseStudy.png')     # save it to a file (do this before show(), which may clear the figure)
plt.show()                       # show it in the IDE

We obtain the following plot:

[Lab03-Plot: bar chart of the 15 most frequent hashtags]

We can see that people were talking about football more than anything else! And it seems that they were mostly talking about the league match played the following day.

Create a "matplot" with the dataset generated by you in the previous task. Add the file CaseStudy.png to your repository.

We ask the student following this hands-on to create an example that shows a compelling insight from Twitter, using some realistic data obtained as explained above.

Using what you have learned in this and the previous Lab sessions, you can download some data using the real-time tweets API, pre-process the data in JSON format, and extract some interesting terms and hashtags from the tweets.

Create a file named StudentProposal.py with the solution and add it to your repository. Write in README.md a brief description of your proposal and the compelling insights that you have gained on this topic.

How to Submit this Assignment:  

Make sure that you have updated your remote GitHub repository with the files generated during this Lab. Before the deadline, submit to the RACO Practicals section a "Lab3.txt" file including:

  1. Group number
  2. Name and email of the members of the group
  3. GitHub URL containing your lab answers (as for Lab1 session)
  4. Link to your dataset created in task 3.4.
  5. Add any comment that you consider necessary