#!/usr/bin/env python
# coding: utf-8
# # Homework 5, Part 1: Building a pandas cheat sheet
#
# **Use `animals.csv` to answer the following questions.** The data is small and the questions are pretty simple, so hopefully you can use this for pandas reference in the future.
# ## 0) Setup
#
# Import pandas **with the correct name**.
# In[1]:
import pandas as pd
# ## 1) Reading in a csv file
#
# Use pandas to read in the animals CSV file, saving it as a variable with the normal name for a dataframe.
# In[2]:
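import os
print(os.listdir("."))  # plain-Python equivalent of the IPython `ls` shorthand, which fails in a script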
# In[3]:
df = pd.read_csv("animals.csv")
# ## 2) Checking your data
#
# Display the number of rows and columns in your data. Also display the names and data types of each column.
# In[4]:
len(df)
# In[5]:
len(df.columns)
# In[6]:
df.info()
# In[7]:
df.dtypes
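# A minimal sketch of getting the row and column counts in one step with `.shape`,
# as an alternative to calling `len()` twice. The `demo` dataframe and its values
# are hypothetical stand-ins, not the contents of `animals.csv`.

```python
import pandas as pd

# Hypothetical stand-in data; not the contents of animals.csv.
demo = pd.DataFrame({
    "animal": ["cat", "dog", "cat"],
    "name": ["Anne", "Bob", "Charlie"],
    "length": [35, 45, 32],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# .dtypes lists each column name next to its data type.
print(demo.dtypes)
```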
# ## 3) Display the first 3 animals
#
# Hmmm, we know how to take the first 5, but what about the first 3? Maybe there is an option to change how many you get? Use `?` to check the documentation on the command.
# In[8]:
df.head(3)
# ## 4) Sort the animals to show me the 3 longest animals
#
# > **TIP:** You can use `.head()` after you sort things!
# In[9]:
df.sort_values(by='length', ascending=False).head(3)
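# For comparison, a sketch of the same "top 3" idea using `nlargest`, which sorts and
# truncates in a single call. The `demo` data is hypothetical, and the two approaches
# are only guaranteed to agree on row order when the lengths have no ties.

```python
import pandas as pd

# Hypothetical stand-in data with distinct lengths.
demo = pd.DataFrame({
    "name": ["Anne", "Bob", "Charlie", "Devon"],
    "length": [35, 45, 32, 65],
})

# Approach used above: sort descending, then keep the first 3 rows.
by_sort = demo.sort_values(by="length", ascending=False).head(3)

# Alternative: ask directly for the 3 rows with the largest length.
by_nlargest = demo.nlargest(3, "length")
print(by_nlargest)
```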
# ## 5) Get the mean and standard deviation of animal lengths
#
# You can do this with separate commands or with a single command.
# In[10]:
df.describe()
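# `describe()` reports the mean and standard deviation along with other statistics.
# A sketch of asking for just those two values, either separately or in one call,
# using a hypothetical `demo_lengths` series rather than the real data:

```python
import pandas as pd

# Hypothetical lengths in centimeters.
demo_lengths = pd.Series([35, 45, 32, 65], name="length")

# Separate commands.
mean_length = demo_lengths.mean()
std_length = demo_lengths.std()  # sample standard deviation (ddof=1) by default

# Single command: agg returns a Series indexed by the statistic names.
both = demo_lengths.agg(["mean", "std"])
print(both)
```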
# ## 6) How many cats do we have and how many dogs?
#
# You only need one command to do this
# In[11]:
df.animal.value_counts()
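# A short sketch of what `value_counts()` returns: a Series whose index holds the
# distinct values and whose values are the counts, so a single count can be looked up
# by label. The `demo_animals` series is hypothetical.

```python
import pandas as pd

# Hypothetical animal labels.
demo_animals = pd.Series(["cat", "dog", "cat", "cat", "dog"], name="animal")

counts = demo_animals.value_counts()
print(counts)

# Individual counts are available by label.
n_cats = counts["cat"]
n_dogs = counts["dog"]
```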
# ## 7) Only display the dogs
#
# > **TIP:** It's probably easiest to make it display the list of `True`/`False` first, then wrap the `df[]` around it.
# In[12]:
df[df.animal == 'dog']
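# The tip above can be sketched in two steps: first build the `True`/`False` series,
# then pass it to the dataframe. The names `demo`, `is_dog` and `demo_dogs` are
# hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in data.
demo = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Charlie", "Devon"],
})

# Step 1: the comparison yields one boolean per row.
is_dog = demo.animal == "dog"
print(is_dog)

# Step 2: indexing with that boolean series keeps only the True rows.
demo_dogs = demo[is_dog]
print(demo_dogs)
```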
# ## 8) Only display the animals that are longer than 40cm
# In[13]:
df[df.length > 40]  # strictly longer than 40cm, so > rather than >=
# ## 9) `length` is the animal's length in centimeters. Create a new column called `inches` that is the length in inches.
# In[14]:
df['inches']= df['length']/2.54
df.head()
# ## 10) Save the cats to a separate variable called `cats`. Save the dogs to a separate variable called `dogs`.
#
# This is the same as listing them, but you just save the result to a variable instead of looking at it. Be sure to use `.head()` to make sure your data looks right.
#
# Once you do this, every time you use `cats` you'll only be talking about the cats, and same for the dogs.
# In[15]:
cats = df[df.animal == 'cat']
# In[16]:
dogs = df[df.animal == 'dog']
# In[17]:
print(cats.head())
print(dogs.head())
# ## 11) Display all of the animals that are cats and above 12 inches long.
#
# First do it using the `cats` variable, then also do it using your `df` dataframe.
#
# > **TIP:** For multiple conditions, you use `df[(one condition) & (another condition)]`
# In[18]:
cats[cats['inches'] > 12]  # compare the inches column; `length` is in centimeters
# In[19]:
df[(df.animal == 'cat') & (df.inches > 12)]  # compare the inches column; `length` is in centimeters
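# A sketch checking that filtering a pre-filtered subset and combining two conditions
# on the full dataframe select the same rows. Each condition needs its own parentheses
# because `&` binds more tightly than `==` and `>`. The `demo` data is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data, lengths in centimeters.
demo = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "cat"],
    "name": ["Anne", "Bob", "Charlie", "Egg"],
    "length": [35, 45, 25, 32],
})
demo["inches"] = demo["length"] / 2.54

# Route 1: filter to cats first, then filter by inches.
demo_cats = demo[demo.animal == "cat"]
from_cats = demo_cats[demo_cats.inches > 12]

# Route 2: one filter with both conditions combined by &.
from_df = demo[(demo.animal == "cat") & (demo.inches > 12)]
print(from_df)
```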
# ## 12) What's the mean length of a cat? What's the mean length of a dog?
# In[20]:
cats.length.mean()
# In[21]:
dogs.length.mean()
# ## 13) If you didn't already, use `groupby` to do #12 all at once
# In[22]:
df.groupby(by="animal").length.mean()
# ## 14) Make a histogram of the length of dogs.
#
# We didn't talk about how to make a histogram in class! It **does not** use `plot()`. Imagine you're a programmer who doesn't want to type out `histogram` - what do you think you'd type instead?
#
# > **TIP:** The method is four letters long
# >
# > **TIP:** First you'll say "I want the length column," then you'll say "make a histogram"
# >
# > **TIP:** This is the worst histogram ever
# In[33]:
dogs.length.hist()
# ## 15) Make a horizontal bar graph of the length of the animals, with the animal's name as the label
#
# > **TIP:** It isn't `df['length'].plot()`, because it needs *both* columns. Think about how we did the scatterplot in class.
# >
# > **TIP:** Which is the `x` axis and which is the `y` axis? You'll notice pandas is kind of weird and wrong.
# >
# > **TIP:** Make sure you specify the `kind` of graph or else it will be a weird line thing
# >
# > **TIP:** If you want, you can set a custom size for your plot by sending it something like `figsize=(15,2)`
# In[52]:
df.plot(x='name', y='length', kind='barh')
# ## 16) Make a sorted horizontal bar graph of the cats, with the larger cats on top
#
# > **TIP:** Think in steps, even though it's all on one line - first make sure you can sort it, then try to graph it.
# In[109]:
cats.sort_values('length', ascending=True).plot(x='name', y='length', kind='barh', legend=False)
# ## 17) As a reward for getting down here: run the following code, then plot the number of dogs vs. the number of cats
#
# > **TIP:** Counting the number of dogs and number of cats does NOT use `.groupby`! That's only for calculations.
# >
# > **TIP:** You can set a title with `title="Number of animals"`
# In[85]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# In[96]:
df.animal.value_counts().plot(kind='barh', title='Number of animals')