In this assignment we will use `requests` and `BeautifulSoup` to scrape Wikipedia's List of accidents and incidents involving commercial aircraft and analyze the data. Put your scraping code in a script called `scrape.py` and put your solutions to part B in scripts called `q1.py`, ..., `q6.py`. Put the answers to questions 1, 2, 4, and 5 in a file called `ANSWERS.txt`.
Here we will write code to scrape the list of flights along with the following characteristics for each flight (located in the "infobox" on the right of each accident's page):
- Date
- Operator
- Flight origin
- Destination
- Fatalities
Note that not all of the accident infoboxes contain all of these attributes, so some data will be missing in our results. After you have all of the data, put it into a pandas DataFrame and write that to a CSV file that we can load and analyze in part B. (16 points)
There are many different ways to perform this scraping. Here is an outline of one of them:
- Get the Wikipedia article and turn it into a `BeautifulSoup`.
- Find a way to identify all of the accident links. I noticed that all of the links are bold, i.e. wrapped in `<b>...</b>` tags. So you can start with `find_all('b')`, but beware that not all bold text in the article has a link to an accident. More on that below.
- Iterate over the bold tags, request the page for each link, and turn it into a `BeautifulSoup`. As mentioned above, not all of them have links. In fact one of the bold links isn't to an accident: `<b><a href="#cite_ref-1">^</a></b>` has a link, but not to an accident page. It's the very last one, so if you call the list of bold tags `bolds` then you can ignore it by simply using `bolds[:-1]`.

More annoyingly, there are some bold tags that don't contain links at all. Some are accidents without articles, others are not accidents. If you try to `find()` a tag with BeautifulSoup that doesn't exist, it will return `None`. Thus, you can test whether a bold tag `b` contains a link by checking whether `b.find('a') is None`.
- For each accident article, extract from the infobox the attributes listed above into a dictionary. Also include the name of the accident, which you can get from the text of the original link. For example, the first accident would produce the following dictionary:

```
{'Date': 'July 21, 1919',
 'Destination': 'White City amusement park, Chicago, Illinois',
 'Fatalities': '13 (2 passengers, 1 crew, 10 on ground)',
 'Flight origin': 'Grant Park, Chicago, Illinois',
 'Name': 'Goodyear dirigible Wingfoot Air Express catches fire and crashes',
 'Operator': 'Goodyear Tire and Rubber Company'}
```

By looking at the HTML source for an accident article, note that each datum is the text of the table cell (`<td>`) immediately after a table header (`<th>`) that describes it. For example, the date of the first accident:

```html
<tr>
  <th scope="row" style="line-height:1.3em; padding-right:1.0em; white-space:nowrap;">Date</th>
  <td style="line-height:1.3em;">July 21, 1919</td>
</tr>
```

In class we saw that `find('div', class_='nytint-detainee-fullcol')` finds the `<div>` whose `class="nytint-detainee-fullcol"`. We can similarly find a tag with particular text using the `text` argument to `find()`. So to find the Date table header you can use `a_page.find('th', text='Date')`. Then you'll need to find the next `<td>` and get its text.

Put this code into a function called `get_table_data(a_page, header)` that takes as its arguments an accident page and the name of a header (e.g. `'Date'`, `'Flight origin'`, etc.) and returns the corresponding value. Not every page has every piece of data; in those cases `get_table_data()` can just return `None`. To identify these cases, check whether the result of your `find('th', text=header)` is `None`.

Finally, iterate over the different header names, get each piece of data, and put it into a dictionary. Append that dictionary to a big list of dictionaries, one per accident. I recommend testing this process on a single accident page first, then on a small number, before scraping all 1000+ accidents. (A sketch of this whole pipeline appears after the outline.)
- Turn that list of dictionaries into a DataFrame simply by passing it to `pd.DataFrame()`. This is an alternative way to construct a DataFrame, e.g.:

```
>>> pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
   a  b
0  1  2
1  3  4
```

You'll also want to call `drop_duplicates()` on your DataFrame because a few accidents were linked multiple times in the original list article. Finally, write your results to a CSV file called `accidents.csv`.
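Putting the outline together, here is a minimal sketch of what `scrape.py` could look like. The list-article URL follows Wikipedia's usual naming, and the helpers `get_soup()` and `scrape_accident()` are my own hypothetical names; only `get_table_data()` is named by the assignment. Treat this as one possible shape, not the required one:

```python
# scrape.py -- a minimal sketch; the URL and helper names below are assumptions.
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = 'https://en.wikipedia.org'
LIST_URL = BASE + '/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft'
HEADERS = ['Date', 'Operator', 'Flight origin', 'Destination', 'Fatalities']

def get_soup(url):
    """Fetch a page and parse it into a BeautifulSoup."""
    return BeautifulSoup(requests.get(url).text, 'html.parser')

def get_table_data(a_page, header):
    """Return the text of the <td> following the <th> whose text is `header`, or None."""
    th = a_page.find('th', text=header)    # `string=` in newer bs4 versions
    if th is None:
        return None                        # this page's infobox lacks that row
    td = th.find_next('td')
    return td.get_text(' ', strip=True) if td is not None else None

def scrape_accident(bold):
    """Build the dictionary for one accident from its bold link in the list article."""
    link = bold.find('a')
    a_page = get_soup(BASE + link['href'])
    row = {'Name': link.get_text(strip=True)}
    for header in HEADERS:
        row[header] = get_table_data(a_page, header)
    return row

def main():
    list_page = get_soup(LIST_URL)
    bolds = list_page.find_all('b')[:-1]   # drop the final <b><a href="#cite_ref-1">^</a></b>
    rows = []
    for bold in bolds:                     # try a small slice like bolds[:5] while testing
        if bold.find('a') is None:         # bold text without a linked article
            continue
        rows.append(scrape_accident(bold))
    df = pd.DataFrame(rows).drop_duplicates()
    df.to_csv('accidents.csv', index=False)

if __name__ == '__main__':
    main()
```

Since this makes 1000+ HTTP requests, you may also want to pause briefly between requests (e.g. with `time.sleep()`) while developing.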
Now we will analyze the data from part A. In case you have trouble getting your scraper to work, I have posted the data here, so you can get partial credit by proceeding with those results. (Sketches of possible approaches to these questions appear after the question list.)
- What is the most common origin for accidents and how many accidents have originated there? (2 points) (Note: the origin names on Wikipedia are messy, so a naive count will not be accurate. However, you need not clean them up to receive full credit for this problem.)
- Which operator has had the most accidents and how many? (2 points)
- Extract the number of fatalities from each accident into a column called `'Fatalities count'`. Save this as `accidents2.csv` for use below. (4 points)

Hint: The fatalities can be strings like `'13 (2 passengers, 1 crew, 10 on ground)'`. We'll assume that the first number in each string is the total number of fatalities and use a regex to extract it. Write a function called `get_first_number(text)` that takes a string and uses a regular expression to return the first number in it. If the text is null or contains no numbers, return `None`. You can check if `text` is null using the `pd.isnull()` function. Apply this to `df['Fatalities']`.
- Which flight had the most fatalities and how many? (2 points)
- Which air operator has had the most fatalities and how many? Hint: Use `groupby`. (4 points)
- Make a line plot where the x axis is the year and the y axis is the number of accidents in that year. Save it as `years.png`. (4 points)

Hint: Some of the dates are missing or formatted poorly, so you can pass the argument `errors='coerce'` to `pd.to_datetime()` to simply convert them to `NaT` ("Not a Time", the equivalent for times of `NaN` for numbers).
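For questions 1 and 2, a `value_counts()` on the relevant column gives both the top value and its count. A sketch, assuming the `accidents.csv` produced in part A:

```python
import pandas as pd

df = pd.read_csv('accidents.csv')

# Q1: most common (uncleaned) origin and how many accidents started there.
origins = df['Flight origin'].value_counts()
print(origins.index[0], origins.iloc[0])

# Q2: the same pattern finds the operator with the most accidents.
operators = df['Operator'].value_counts()
print(operators.index[0], operators.iloc[0])
```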
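For question 3, one possible `get_first_number()` built on `re.search()`; stripping commas before matching is my own assumption about how larger counts might be formatted:

```python
import re
import pandas as pd

def get_first_number(text):
    """Return the first number in `text` as an int, or None if text is null/number-free."""
    if pd.isnull(text):
        return None
    match = re.search(r'\d+', text.replace(',', ''))  # drop thousands separators first
    return int(match.group()) if match else None

df = pd.read_csv('accidents.csv')
df['Fatalities count'] = df['Fatalities'].apply(get_first_number)
df.to_csv('accidents2.csv', index=False)
```

The same column then drives questions 4 and 5, e.g. `df.loc[df['Fatalities count'].idxmax()]` for the deadliest flight and `df.groupby('Operator')['Fatalities count'].sum()` for per-operator totals.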
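And for question 6, one way to go from the coerced dates to a per-year line plot; a sketch, assuming `matplotlib` for the plotting:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('accidents2.csv')
dates = pd.to_datetime(df['Date'], errors='coerce')   # bad/missing dates become NaT
per_year = dates.dt.year.value_counts().sort_index()  # NaT years are dropped automatically
per_year.plot()                                       # line plot: year vs. accident count
plt.xlabel('Year')
plt.ylabel('Number of accidents')
plt.savefig('years.png')
```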