Real-world data is messy, and you spend a lot of time dealing with raw data. Pandas is a very powerful library for data analysis.
In pandas there are lots of functions for transforming data. Some of them (merge, concat) are easy to understand. Others (melt, pivot, …) feel like a black box even after you read the documentation.
By the end of this talk, you’ll
- set up a data analysis platform (Jupyter Notebook)
- get some tips on Jupyter Notebook
- know how to get started with pandas if you are new to it
- have a better understanding of unstack, stack, melt, and pivot
- quick jupyter notebook intro
- tips
- must-install extensions
- quick pandas intro
- what can pandas do
- data transformation
- tech stack for data analysis
- clean data: example
The Jupyter Notebook is an interactive computing environment that enables users to
- Edit code in the browser
- Run code from the browser
- The default kernel runs Python code; other kernels exist for Julia, R, Ruby, Haskell, Scala, node.js, Go, C++
- Write documents in Markdown
- See the results of computations with rich media representations
- HTML
- PNG, SVG
- LaTeX
https://www.slideshare.net/e2m/introduction-to-ipython-jupyter-notebooks
- vim, emacs
- PyCharm, Eclipse
- jupyter notebook
Environments like Jupyter fit the needs of data analysis.
- Exploring data
- Visualization
**Sunsetting Python 2 support**
IPython
IPython Notebooks
Jupyter Notebook == IPython Notebook 3.x and on
Jupyter Notebook current version: 4.1
https://www.continuum.io/downloads
Anaconda is the easiest way to set up your environment!
In your terminal, type
jupyter notebook
If you want more, check out
h ESC a b Ctrl + Enter Shift + Enter ...
!ls *.csv
!pip install plotly
%cd %ls %env
access previous cell output
_ __ _7 _i7 (input)
%%bash
%%HTML
%%python2
%%python3
%%ruby
%%perl
%%bash
for i in {1..5}
do
    echo "i is $i"
done
Jupyter Notebook Best Practices for Data Science
Jupyter Version Control
https://github.com/ipython-contrib/jupyter_contrib_nbextensions
conda install -c conda-forge jupyter_contrib_nbextensions
https://github.com/captainsafia/notebook-toc
organize your thoughts, document structure
https://github.com/captainsafia/notebook-toc/raw/master/notebook-toc-screencast.gif
source: https://github.com/captainsafia/notebook-toc
Similar chrome extension: Smart TOC - Chrome Web Store
pip install ipython-sql
conda install -c damianavila82 rise
In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.
source: https://twitter.com/bigdataborat/status/306596352991830016
Powerful (Python) data analysis toolkit
- Explore your data (example)
- Tidy up your data
- …
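As a rough illustration of the explore and tidy steps, here is a minimal sketch. The toy table and its column names (`city`, `temp_c`) are made up for this example and are not taken from the talk's data.

```python
import pandas as pd

# Hypothetical toy data standing in for a messy real-world file.
raw = pd.DataFrame({
    "city": ["Taipei", "taipei ", "Tainan", None],
    "temp_c": ["31", "30", None, "28"],
})

# Explore: size, column types, and missing values per column.
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())

# Tidy: drop rows with no city, normalise the text, convert strings to numbers.
tidy = raw.dropna(subset=["city"]).copy()
tidy["city"] = tidy["city"].str.strip().str.title()
tidy["temp_c"] = pd.to_numeric(tidy["temp_c"])
print(tidy)
```

Real data usually needs more steps than this; the point is only that exploring and tidying tend to alternate.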
Python libraries in data analysis
numpy pandas matplotlib scikit-learn seaborn plotly …
- Start from practical/real world example
- Dig into the pandas API (like reading the official manual)
Learning by doing
Things in Pandas I Wish I’d Known Earlier
Common Excel Tasks Demonstrated in Pandas
Common Excel Tasks Demonstrated in Pandas - Part 2
Quick and Dirty Data Analysis with Pandas
A comprehensive introduction to data wrangling
Brandon Rhodes - Pandas From The Ground Up - PyCon 2015
https://github.com/brandon-rhodes/pycon-pandas-tutorial
http://pandas.pydata.org/pandas-docs/stable/tutorials.html
pandas' own "10 Minutes to pandas": http://pandas.pydata.org/pandas-docs/stable/10min.html#min
Key components:
- DataFrame
- Series
import pandas as pd
pd.DataFrame
pd.Series
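A minimal sketch of how the two objects relate, using made-up values: a Series is a labelled one-dimensional array, and a DataFrame is a table of Series that share one index.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# A DataFrame is a table whose columns are Series sharing one index.
df = pd.DataFrame({"score": s, "passed": s >= 20})

print(s["b"])        # label-based lookup on a Series
print(df["passed"])  # selecting one column gives back a Series
print(df.loc["c"])   # selecting one row by label also gives a Series
```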
Core concept:
- Series, DataFrame
- Index (multi index)
multi index
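To make the multi index idea concrete, here is a small sketch. The `year`, `quarter`, and `sales` columns are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 130, 150],
})

# set_index with two columns builds a two-level MultiIndex.
indexed = df.set_index(["year", "quarter"])
print(indexed.index.names)

# Select every row under one outer-level key.
print(indexed.loc[2017])

# Select a single value with a tuple of (outer, inner) keys.
print(indexed.loc[(2016, "Q2"), "sales"])
```

A multi index matters for this talk because stack and unstack move data between index levels and columns.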
- from csv
- from json
- from hdf5
- from SQL database
- from html
- from python dict
- from python list
- from numpy array
- …
There are a whole bunch of ways to create a DataFrame; don't dig into them too much at first.
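A few of the sources listed above, sketched with in-memory data so the example does not depend on any file. The column names and values are placeholders.

```python
import io

import numpy as np
import pandas as pd

# From a Python dict: keys become column names.
from_dict = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# From a list of rows, with column names given explicitly.
from_list = pd.DataFrame([["a", 1], ["b", 2]], columns=["name", "value"])

# From a NumPy array.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From CSV: an in-memory buffer stands in for a file path here.
csv_text = "name,value\na,1\nb,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# From JSON, again read from an in-memory buffer.
json_text = '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_dict.equals(from_list))
```

In practice `pd.read_csv` with a file path is probably the one to learn first; the others follow the same pattern.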
cheatsheet https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view
https://github.com/brandon-rhodes/pycon-pandas-tutorial/blob/master/cheat-sheet.txt
https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
http://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html
http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/
http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/
https://github.com/jminh/master_pandas_transformation/blob/master/stack_unstak_demo.ipynb
https://github.com/jminh/master_pandas_transformation/blob/master/groupby_pivottable.ipynb
https://github.com/jminh/master_pandas_transformation/blob/master/melt_pivot_demo.ipynb
- unstack
- stack
- set_index
- reset_index
- pivot
- pivot_table
- groupby
- melt
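The functions above can be tried on a tiny table. This is a sketch on made-up data (the `city`, `year`, and `sales` names are assumptions, not the data in the linked notebooks), meant to show how the operations undo one another.

```python
import pandas as pd

wide = pd.DataFrame({
    "city": ["Taipei", "Tainan"],
    "2016": [10, 20],
    "2017": [11, 21],
})

# melt: wide -> long. Columns not in id_vars become (year, sales) rows.
long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# pivot: long -> wide. It reverses melt when (index, columns) pairs are unique.
back = long.pivot(index="city", columns="year", values="sales")

# stack / unstack make the same kind of move, but on index levels.
indexed = long.set_index(["city", "year"])["sales"]  # Series with a 2-level index
unstacked = indexed.unstack("year")                  # inner index level -> columns
restacked = unstacked.stack()                        # columns -> inner index level

# reset_index turns index levels back into ordinary columns.
flat = restacked.reset_index(name="sales")

# pivot_table aggregates duplicates; groupby + unstack builds the same table.
by_pivot = long.pivot_table(index="city", columns="year", values="sales", aggfunc="sum")
by_group = long.groupby(["city", "year"])["sales"].sum().unstack("year")

print(long)
print(back)
print(flat)
```

One way to remember the pairs: melt and pivot work on columns, stack and unstack work on index levels, and set_index and reset_index move data between the two.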
not covered
- merge, join
- concat
- crosstab
- …
Bonus: drag and drop https://github.com/nicolaskruchten/jupyter_pivottablejs
http://nicolas.kruchten.com/content/2015/09/jupyter_pivottablejs/
machine learning stack
conda create -n mldm python=3.5 anaconda
source activate mldm
conda install seaborn
conda install -c conda-forge jupyter_contrib_nbextensions
conda install -c conda-forge jupyter_nbextensions_configurator
Scratchpad, Table of Contents, Skip-Traceback
conda install -c glemaitre imbalanced-learn
conda install -c damianavila82 rise
http://conda.pydata.org/docs/r-with-conda.html
conda install -c r r-essentials
pip install cufflinks #--upgrade
pip install plotly #--upgrade