Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation updates: simplify examples and add section on data sources #955

Merged
merged 7 commits into from
Nov 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .github/workflows/docs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,8 @@ jobs:
set -x
source venv/bin/activate
cd docs
curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
make html

- name: Copy & push the generated HTML
Expand Down
2 changes: 2 additions & 0 deletions docs/.gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
pokemon.csv
yellow_trip_data.parquet
yellow_tripdata_2021-01.parquet

11 changes: 10 additions & 1 deletion docs/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,17 @@
#

set -e

if [ ! -f pokemon.csv ]; then
curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
fi

if [ ! -f yellow_tripdata_2021-01.parquet ]; then
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
fi

rm -rf build 2> /dev/null
rm -rf temp 2> /dev/null
mkdir temp
cp -rf source/* temp/
make SOURCEDIR=`pwd`/temp html
make SOURCEDIR=`pwd`/temp html
Binary file added docs/source/images/jupyter_lab_df_view.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
25 changes: 6 additions & 19 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,27 +43,13 @@ Example

.. ipython:: python

import datafusion
from datafusion import col
import pyarrow
from datafusion import SessionContext

# create a context
ctx = datafusion.SessionContext()
ctx = SessionContext()

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]], name="batch_array")
df = ctx.read_csv("pokemon.csv")

# create a new statement
df = df.select(
col("a") + col("b"),
col("a") - col("b"),
)

df
df.show()


.. _toc.links:
Expand All @@ -85,9 +71,10 @@ Example

user-guide/introduction
user-guide/basics
user-guide/configuration
user-guide/data-sources
user-guide/common-operations/index
user-guide/io/index
user-guide/configuration
user-guide/sql


Expand Down
74 changes: 39 additions & 35 deletions docs/source/user-guide/basics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,72 +20,76 @@
Concepts
========

In this section, we will cover a basic example to introduce a few key concepts.
In this section, we will cover a basic example to introduce a few key concepts. We will use the same
source file as described in the :ref:`Introduction <guide>`, the Pokemon data set.

.. code-block:: python
.. ipython:: python

import datafusion
from datafusion import col
import pyarrow
from datafusion import SessionContext, col, lit, functions as f

# create a context
ctx = datafusion.SessionContext()
ctx = SessionContext()

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

# create a new statement
df = df.select(
col("a") + col("b"),
col("a") - col("b"),
"trip_distance",
col("total_amount").alias("total"),
(f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
)

# execute and collect the first (and only) batch
result = df.collect()[0]
df.show()

The first statement group:
Session Context
---------------

The first statement group creates a :py:class:`~datafusion.context.SessionContext`.

.. code-block:: python

# create a context
ctx = datafusion.SessionContext()

creates a :py:class:`~datafusion.context.SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state
of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:
A Session Context is the main interface for executing queries with DataFusion. It maintains the state
of the connection between a user and an instance of the DataFusion engine. Additionally it provides
the following functionality:

- Create a DataFrame from a CSV or Parquet data source.
- Register a CSV or Parquet data source as a table that can be referenced from a SQL query.
- Register a custom data source that can be referenced from a SQL query.
- Create a DataFrame from a data source.
- Register a data source as a table that can be referenced from a SQL query.
- Execute a SQL query

DataFrame
---------

The second statement group creates a :code:`DataFrame`,

.. code-block:: python

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])
# Create a DataFrame from a file
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

A DataFrame refers to a (logical) set of rows that share the same column names, similar to a `Pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_.
DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as :code:`read_csv`, and can then be modified by
calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.

The third statement uses :code:`Expressions` to build up a query definition.
Expressions
-----------

The third statement uses :code:`Expressions` to build up a query definition. You can find
explanations for what the functions below do in the user documentation for
:py:func:`~datafusion.col`, :py:func:`~datafusion.lit`, :py:func:`~datafusion.functions.round`,
and :py:func:`~datafusion.expr.Expr.alias`.

.. code-block:: python

df = df.select(
col("a") + col("b"),
col("a") - col("b"),
"trip_distance",
col("total_amount").alias("total"),
(f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
)

Finally the :py:func:`~datafusion.dataframe.DataFrame.collect` method converts the logical plan represented by the DataFrame into a physical plan and execute it,
collecting all results into a list of `RecordBatch <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.
Finally the :py:func:`~datafusion.dataframe.DataFrame.show` method converts the logical plan
represented by the DataFrame into a physical plan and execute it, collecting all results and
displaying them to the user. It is important to note that DataFusion performs lazy evaluation
of the DataFrame. Until you call a method such as :py:func:`~datafusion.dataframe.DataFrame.show`
or :py:func:`~datafusion.dataframe.DataFrame.collect`, DataFusion will not perform the query.
10 changes: 1 addition & 9 deletions docs/source/user-guide/common-operations/aggregations.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,7 @@ to form a single summary value. For performing an aggregation, DataFusion provid

.. ipython:: python

import urllib.request
from datafusion import SessionContext
from datafusion import col, lit
from datafusion import functions as f

urllib.request.urlretrieve(
"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
"pokemon.csv",
)
from datafusion import SessionContext, col, lit, functions as f

ctx = SessionContext()
df = ctx.read_csv("pokemon.csv")
Expand Down
6 changes: 0 additions & 6 deletions docs/source/user-guide/common-operations/functions.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,14 +25,8 @@ We'll use the pokemon dataset in the following examples.

.. ipython:: python

import urllib.request
from datafusion import SessionContext

urllib.request.urlretrieve(
"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
"pokemon.csv",
)

ctx = SessionContext()
ctx.register_csv("pokemon", "pokemon.csv")
df = ctx.table("pokemon")
Expand Down
2 changes: 2 additions & 0 deletions docs/source/user-guide/common-operations/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@
Common Operations
=================

The contents of this section are designed to guide a new user through how to use DataFusion.

.. toctree::
:maxdepth: 2

Expand Down
11 changes: 4 additions & 7 deletions docs/source/user-guide/common-operations/select-and-filter.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,18 +21,15 @@ Column Selections
Use :py:func:`~datafusion.dataframe.DataFrame.select` for basic column selection.

DataFusion can work with several file types, to start simple we can use a subset of the
`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_
`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_,
which you can download `here <https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet>`_.

.. ipython:: python

import urllib.request
from datafusion import SessionContext

urllib.request.urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
"yellow_trip_data.parquet")
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.read_parquet("yellow_trip_data.parquet")
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")
df.select("trip_distance", "passenger_count")

For mathematical or logical operations use :py:func:`~datafusion.col` to select columns, and give meaningful names to the resulting
Expand Down
6 changes: 0 additions & 6 deletions docs/source/user-guide/common-operations/windows.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,16 +30,10 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.

.. ipython:: python

import urllib.request
from datafusion import SessionContext
from datafusion import col
from datafusion import functions as f

urllib.request.urlretrieve(
"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
"pokemon.csv",
)

ctx = SessionContext()
df = ctx.read_csv("pokemon.csv")

Expand Down
Loading
Loading