jminh/master_pandas_transformation

Mastering pandas data transformation

Real-world data is messy, and you spend a lot of time dealing with raw data. Pandas is a very powerful library for data analysis.

Pandas has many functions for transforming data. Some of them (merge, concat) are easy to understand. Others (melt, pivot, …) feel like a black box even after you read the documentation.

By the end of this talk, you’ll

  • set up the data analysis platform (jupyter notebook)
  • get some tips on jupyter notebook
  • know how to get started with pandas if you are new to it
  • have a better understanding of unstack, stack, melt, and pivot

Outline

  • quick jupyter notebook intro
    • tips
    • must-install extensions
  • quick pandas intro
    • what can pandas do
    • data transformation
  • tech stack for data analysis
  • clean data: example

jupyter notebook quick intro

What is jupyter

The Jupyter Notebook is an interactive computing environment that enables users to

  • Edit code in the browser
  • Run code from the browser
  • Run Python code by default, with kernels also available for Julia, R, Ruby, Haskell, Scala, node.js, Go, C++
  • Write documents in Markdown
  • See the results of computations with rich media representations
    • HTML
    • PNG, SVG
    • PDF
    • LaTeX

https://www.slideshare.net/e2m/introduction-to-ipython-jupyter-notebooks

Why jupyter?

  • vim, emacs
  • PyCharm, Eclipse
  • jupyter notebook

Environments like jupyter fit the needs of data analysis.

  • Exploring data
    • visualization

New: IPython 6.0

**Sunsetting Python 2 support**

IPython

IPython Notebooks

Jupyter Notebook == IPython Notebook 3.x and on

Current Jupyter Notebook version: 4.1

Try jupyter online

https://try.jupyter.org/

Get jupyter on your own PC.

https://www.continuum.io/downloads

anaconda is the easiest way to set up your env!

In your terminal, type

jupyter notebook

Turn your git repo into an interactive notebook

http://mybinder.org/

Tips

If you want more, checkout

keyboard shortcuts

h
ESC
a
b
Ctrl + Enter
Shift + Enter
...

shell commands

!ls *.csv

!pip install plotly

magic commands

%cd
%ls
%env

input/output cells

access previous cell output

_
__

_7
_i7 (input)

combine code in different languages

%%bash

%%HTML

%%python2

%%python3

%%ruby

%%perl

%%bash
for i in {1..5}
do
   echo "i is $i"
done

practice

Jupyter Notebook Best Practices for Data Science

slide

Jupyter Version Control

Power up your jupyter env

nbextensions

https://github.com/ipython-contrib/jupyter_contrib_nbextensions

conda install -c conda-forge jupyter_contrib_nbextensions

https://github.com/captainsafia/notebook-toc

organize your thoughts, document structure

https://github.com/captainsafia/notebook-toc/raw/master/notebook-toc-screencast.gif

source: https://github.com/captainsafia/notebook-toc

Similar chrome extension: Smart TOC - Chrome Web Store

jupyter with sql

pip install ipython-sql

catherinedevlin/ipython-sql

qgrid

quantopian/qgrid

RISE: turn your notebook into slides

damianavila/RISE

conda install -c damianavila82 rise

pandas quick intro

why

In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.

source: https://twitter.com/bigdataborat/status/306596352991830016

pandas

Powerful (Python) data analysis toolkit

  • Explore your data (example)
  • Tidy up your data

data

Python libraries in data analysis

numpy pandas matplotlib scikit-learn seaborn plotly …

how to get started with pandas

  1. Start from a practical, real-world example
  2. Dig into the pandas API (like reading the official manual)

Practical example

Learning by doing

Things in Pandas I Wish I’d Known Earlier

Pandas DataFrame by Example

Pandas cookbook

Common Excel Tasks Demonstrated in Pandas

Common Excel Tasks Demonstrated in Pandas - Part 2

Quick and Dirty Data Analysis with Pandas

A comprehensive introduction to data wrangling

video

Brandon Rhodes - Pandas From The Ground Up - PyCon 2015

https://github.com/brandon-rhodes/pycon-pandas-tutorial

Official

http://pandas.pydata.org/pandas-docs/stable/tutorials.html

pandas' own "10 Minutes to pandas": http://pandas.pydata.org/pandas-docs/stable/10min.html#min

DataFrame and Series

Key components:

  • DataFrame
  • Series
import pandas as pd

pd.DataFrame

pd.Series
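As a minimal sketch of how the two components relate (the fruit data and variable names below are made up for illustration): a Series is a one-dimensional labelled array, and a DataFrame is a table whose columns are Series sharing one index.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series(
    [1.5, 2.0, 0.5], index=["apple", "banana", "cherry"], name="price"
)

# A DataFrame: several columns that share the same index.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 0.5], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)

# Selecting a single column gives back a Series.
stock = fruit["stock"]
print(type(stock).__name__)

# .loc looks values up by row label and column label.
banana_price = fruit.loc["banana", "price"]
print(banana_price)
```

Most of the transformation functions discussed later move data between these two shapes, so it helps to check `type(...)` of an intermediate result when something looks odd.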

Core concept:

  • Series, DataFrame
  • Index (multi index)

multi index

http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/pandas/03.05-Hierarchical-Indexing.ipynb
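A small sketch of what a multi index looks like in practice, assuming a made-up visitor count per city and year. The level names `city` and `year` are illustrative only.

```python
import pandas as pd

# Build a two-level index from the product of cities and years.
index = pd.MultiIndex.from_product(
    [["Taipei", "Tokyo"], [2016, 2017]], names=["city", "year"]
)
visitors = pd.Series([10, 12, 20, 24], index=index, name="visitors")

# Partial indexing on the outer level leaves the inner level as the index.
taipei = visitors.loc["Taipei"]

# A full key (one label per level) returns a single value.
tokyo_2017 = visitors.loc[("Tokyo", 2017)]

# xs takes a cross-section on any level, here the inner one.
all_2016 = visitors.xs(2016, level="year")

print(taipei)
print(tokyo_2017)
print(all_2016)
```

stack and unstack work by moving labels between the column axis and the levels of such an index, which is why the multi index is worth understanding first.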

creation

  • from csv
  • from json
  • from hdf5
  • from SQL database
  • from html
  • from python dict
  • from python list
  • from numpy array
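A few of the creation routes listed above, sketched with tiny made-up inputs. The CSV and JSON text is held in strings here so the example is self-contained; in practice you would usually pass a file path instead.

```python
import io

import numpy as np
import pandas as pd

# From a Python dict: keys become column names.
from_dict = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# From a Python list of records: one dict per row.
from_list = pd.DataFrame(
    [{"name": "a", "value": 1}, {"name": "b", "value": 2}]
)

# From a NumPy array: column names have to be supplied.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# From CSV text (a file-like object stands in for a file on disk).
csv_text = "name,value\na,1\nb,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# From JSON text holding a list of records.
json_text = '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_dict)
print(from_array)
```

The HDF5, SQL, and HTML readers (`pd.read_hdf`, `pd.read_sql`, `pd.read_html`) follow the same pattern but need extra dependencies or a live connection, so they are left out of this sketch.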

There are many ways to create a DataFrame; don’t dig into them too much at first.

cheat sheet

cheatsheet https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view

https://github.com/brandon-rhodes/pycon-pandas-tutorial/blob/master/cheat-sheet.txt

https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

http://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html

useful snippets

http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/

http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/

pandas data transformation

today’s topic

https://github.com/jminh/master_pandas_transformation/blob/master/stack_unstak_demo.ipynb

https://github.com/jminh/master_pandas_transformation/blob/master/groupby_pivottable.ipynb

https://github.com/jminh/master_pandas_transformation/blob/master/melt_pivot_demo.ipynb

  • unstack
  • stack
    • set_index
    • reset_index
  • pivot
  • pivot_table
  • groupby
  • melt
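Before opening the notebooks, here is a rough sketch of how several of the functions above relate to each other. It is not taken from the notebooks; the `wide` table and its column names are made up. The idea it tries to show: melt and pivot are inverses that work on columns, while stack and unstack do the same reshape on index levels.

```python
import pandas as pd

# "Wide" layout: one row per city, one column per year.
wide = pd.DataFrame(
    {"city": ["Taipei", "Tokyo"], "2016": [10, 20], "2017": [12, 24]}
)

# melt: wide -> long. Every column not in id_vars becomes a
# (year, visitors) pair on its own row.
long = wide.melt(id_vars="city", var_name="year", value_name="visitors")

# pivot: long -> wide. It spreads the "year" values back into columns.
back = long.pivot(index="city", columns="year", values="visitors")

# set_index + stack: the same wide -> long reshape, but the result is a
# Series whose index has two levels (city, then the old column label).
stacked = wide.set_index("city").stack()

# unstack: moves the innermost index level back into columns.
unstacked = stacked.unstack()

# reset_index: turns index levels back into ordinary columns, which
# gives the same long table that melt produced (rows in another order).
flat = (
    stacked.rename_axis(["city", "year"]).rename("visitors").reset_index()
)

print(long)
print(back)
print(flat)
```

A useful habit when one of these feels like a black box is to print the index and columns before and after the call, since each function only moves labels between the two.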

not covered

  • merge, join
  • concat
  • crosstab

Bonus: drag and drop https://github.com/nicolaskruchten/jupyter_pivottablejs

http://nicolas.kruchten.com/content/2015/09/jupyter_pivottablejs/
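The kind of summary you build by dragging fields around can also be written in code with pivot_table, or with groupby followed by unstack. The sketch below assumes a made-up `sales` table; it is only meant to show that the two spellings produce the same result.

```python
import pandas as pd

# Made-up data with repeated (region, product) pairs.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "product": ["tea", "coffee", "tea", "tea", "coffee"],
        "amount": [10, 20, 5, 15, 30],
    }
)

# pivot_table aggregates duplicates, unlike pivot, which raises on them.
table = sales.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)

# The same result: group by both keys, sum, then move "product"
# from the index into the columns.
grouped = sales.groupby(["region", "product"])["amount"].sum().unstack()

print(table)
print(grouped)
```

Seen this way, pivot_table is roughly a groupby plus an unstack, which is one way to take the mystery out of it.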

Example: clean data

Tidy Data in Python

Stack

machine learning stack

conda create -n mldm python=3.5 anaconda
source activate mldm
conda install seaborn
conda install -c conda-forge jupyter_contrib_nbextensions
conda install -c conda-forge jupyter_nbextensions_configurator

Scratchpad, Table of Contents, Skip-Traceback

conda install -c glemaitre imbalanced-learn
conda install -c damianavila82 rise

http://conda.pydata.org/docs/r-with-conda.html

conda install -c r r-essentials
pip install cufflinks #--upgrade
pip install plotly #--upgrade
