NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python.
While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. pandas also provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.
For example,
In [12]: import numpy as np
# Generate some random data
In [13]: data = np.random.randn(2, 3)
In [14]: data
Out[14]:
array([[-0.2047, 0.4789, -0.5194],
[-0.5557, 1.9658, 1.3934]])
In [16]: data + data
Out[16]:
array([[-0.4094, 0.9579, -1.0389],
[-1.1115, 3.9316, 2.7868]])
An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.
For example,
In [17]: data.shape
Out[17]: (2, 3)
In [18]: data.dtype
Out[18]: dtype('float64')
dtypes are a source of NumPy’s flexibility for interacting with data coming from other systems. In most cases they
provide a mapping directly onto an underlying disk or memory representation, which makes it easy to read and write
binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran. The
numerical dtypes are named the same way: a type name, like float or int, followed by a number indicating the number
of bits per element. A standard double-precision floating-point value (what’s used under the hood in Python’s float
object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64
.
np.save
and np.load
are the two workhorse functions for efficiently saving and loading array data on disk. Arrays
are saved by default in an uncompressed raw binary format with file extension .npy
.
pandas adopts significant parts of NumPy’s style of array-based computing, especially array-based functions and a preference for data processing without for loops.
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
For example,
In [13]: obj = pd.Series([4, 7, -5, 3])
In [14]: obj
Out[14]:
0 4
1 7
2 -5
3 3
dtype: int64
and
In [17]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [18]: obj2
Out[18]:
d 4
b 7
a -5
c 3
dtype: int64
In [19]: obj2.index
Out[19]: Index(['d', 'b', 'a', 'c'], dtype='object')
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values.
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
For example,
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
In [47]: frame
Out[47]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index.
In [78]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [79]: index = obj.index
In [80]: index
Out[80]: Index(['a', 'b', 'c'], dtype='object')
In [81]: index[1:]
Out[81]: Index(['b', 'c'], dtype='object')
https://www.databricks.com/session_na21/koalas-does-koalas-work-well-or-not
https://learning.oreilly.com/library/view/python-for-data/9781491957653/