homework5-part2-Valeeva.py
#!/usr/bin/env python
# coding: utf-8
# # Homework 5, Part 2: Answer questions with pandas
#
# **Use the Excel file to answer the following questions.** This is a little more typical of what your data exploration will look like with pandas.
# ## 0) Setup
#
# Import pandas **with the correct name**.
# In[1]:
import pandas as pd
# ## 1) Reading in an Excel file
#
# Use pandas to read in the `richpeople.xlsx` Excel file, saving it as a variable with the name we'll always use for a dataframe.
#
# > **TIP:** You will use `read_excel` instead of `read_csv`, *but you'll also need to install a new library*. You might need to restart your kernel afterward!
# In[2]:
# Not Python: run this once in a terminal, or as `%pip install openpyxl` in a notebook cell.
# pip install openpyxl
# In[5]:
import openpyxl
# In[6]:
df = pd.read_excel("richpeople.xlsx")
# ## 2) Checking your data
#
# Display the number of rows and columns in your data. Also display the names and data types of each column.
# In[10]:
df.info()
# ## 3) Who are the top 10 richest billionaires? Use the `networthusbillion` column.
# In[11]:
df.sort_values(by='networthusbillion', ascending=False).head(10)
# ## 4) How many male billionaires are there compared to the number of female billionaires? What percent is that? Do they have a different average wealth?
#
# > **TIP:** The last part uses `groupby`, but the count/percent part does not.
# > **TIP:** When I say "average," you can pick what kind of average you use.
# In[12]:
df.gender.value_counts()
# In[21]:
round(df.gender.value_counts(normalize=True)*100)
# In[26]:
df.groupby(by="gender").networthusbillion.median()
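# The tip above leaves the choice of "average" open. A minimal sketch of why
# that choice matters, assuming a made-up toy table (`toy_worth` and its values
# are hypothetical and are not taken from richpeople.xlsx): a single very large
# fortune pulls the mean up, while the median barely moves.
# In[ ]:
import pandas as pd

# Hypothetical data: three men (one far richer than the rest) and two women.
toy_worth = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female"],
    "networthusbillion": [1.0, 2.0, 30.0, 2.0, 4.0],
})

# Compute both kinds of average per group so they can be compared side by side.
avg_by_gender = toy_worth.groupby("gender").networthusbillion.agg(["mean", "median"])
print(avg_by_gender)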
# ## 5) What is the most common source/type of wealth? Is it different between males and females?
#
# > **TIP:** You know how to `groupby` and you know how to count how many times a value is in a column. Can you put them together???
# > **TIP:** Use percentages for this, it makes it a lot more readable.
# In[32]:
df.typeofwealth.value_counts()
# In[35]:
round(df.groupby(by="gender").typeofwealth.value_counts(normalize=True)*100)
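# One possible way to make the per-gender percentages easier to compare is to
# pivot them into a table with `.unstack()`. This is only a sketch on a made-up
# toy table (`toy_wealth` and its values are hypothetical); the same chain
# should apply to `df` if the column names match.
# In[ ]:
import pandas as pd

# Hypothetical data: four men and two women with two types of wealth.
toy_wealth = pd.DataFrame({
    "gender": ["male", "male", "male", "male", "female", "female"],
    "typeofwealth": ["inherited", "founder", "founder", "founder",
                     "inherited", "inherited"],
})

# Share of each wealth type within each gender, as percentages.
# unstack() turns the inner index level (typeofwealth) into columns;
# fill_value=0 covers combinations that never occur in the data.
shares = (
    toy_wealth.groupby("gender")
    .typeofwealth.value_counts(normalize=True)
    .mul(100)
    .unstack(fill_value=0)
)
print(shares)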
# ## 6) What companies have the most billionaires? Graph the top 5 as a horizontal bar graph.
#
# > **TIP:** First find the answer to the question, then just try to throw `.plot()` on the end
# >
# > **TIP:** You can use `.head()` on *anything*, not just your basic `df`
# >
# > **TIP:** You might feel like you should use `groupby`, but don't! There's an easier way to count.
# >
# > **TIP:** Make the largest bar be at the top of the graph
# >
# > **TIP:** If your chart seems... weird, think about where in the process you're sorting vs using `head`
# In[81]:
df.company.value_counts().head(5).sort_values(ascending=True).plot(kind="barh")
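# The last tip is about the order of `head` and `sort_values`. A small sketch
# on made-up data (`toy_companies` is hypothetical) of what goes wrong when the
# order is swapped: `value_counts()` already sorts from most to least common,
# so `head` must come first, and only then should the result be re-sorted
# ascending so that a horizontal bar chart draws the largest bar at the top.
# In[ ]:
import pandas as pd

# Hypothetical company column: D appears 4 times, A 3, B 2, C 1.
toy_companies = pd.Series(
    ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"], name="company"
)
counts = toy_companies.value_counts()

# Take the top 2 first, then sort ascending: the biggest value ends up last,
# which a barh plot draws as the top bar.
top_then_sorted = counts.head(2).sort_values(ascending=True)

# Sort ascending first, then take 2: this keeps the *least* common companies.
sorted_then_top = counts.sort_values(ascending=True).head(2)

print(top_then_sorted)
print(sorted_then_top)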
# ## 7) How much money do these billionaires have in total?
# In[63]:
print(f'{round(df.networthusbillion.sum())} billion USD')
# ## 8) What are the top 10 countries with the most money held by billionaires?
#
# I am **not** asking which country has the most billionaires - this is **total amount of money per country.**
#
# > **TIP:** Think about it in steps - "I want them organized by country," "I want their net worth," "I want to add it all up," and "I want 10 of them." Just chain it all together.
# In[70]:
df.groupby(by="citizenship").networthusbillion.sum().sort_values(ascending=False).head(10)
# ## 9) How old is an average billionaire? How old are self-made billionaires vs. non self-made billionaires?
# In[75]:
round(df.age.median())
# In[76]:
round(df.groupby(by="selfmade").age.median())
# ## 10) Who are the youngest billionaires? Who are the oldest? Make a graph of the distribution of ages.
#
# > **TIP:** You use `.plot()` to graph values in a column independently, but `.hist()` to draw a [histogram](https://www.mathsisfun.com/data/histograms.html) of the distribution of their values
# In[82]:
df.sort_values(by='age', ascending=False).head(5)
# In[83]:
df.sort_values(by='age', ascending=True).head(5)
# In[84]:
df.age.hist()
# ## 11) Make a scatterplot of net worth compared to age
# In[86]:
df.plot.scatter(x='age', y='networthusbillion')
# ## 12) Make a bar graph of the wealth of the top 10 richest billionaires
#
# > **TIP:** When you make your plot, you'll need to set the `x` and `y` or else your chart will look _crazy_
# >
# > **TIP:** x and y might be the opposite of what you expect them to be
# In[92]:
df.sort_values(by='networthusbillion', ascending=False).head(10)
# In[108]:
(
    df.sort_values(by='networthusbillion', ascending=False)
    .head(10)
    .sort_values(by='networthusbillion')  # ascending, so the richest bar is drawn at the top
    .plot(x='name', y='networthusbillion', kind='barh', legend=False)
)