From 0573f78c7e7a4d94c3204cee464b3860479e0afb Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Tue, 16 Apr 2024 09:58:16 -0400 Subject: [PATCH] Update datafusion-cli docs, split up (#10078) * Update datafusion-cli docs, split up * remove PiPI instructions --- datafusion-cli/README.md | 8 +- docs/source/index.rst | 2 +- docs/source/user-guide/cli.md | 660 --------------------- docs/source/user-guide/cli/datasources.md | 349 +++++++++++ docs/source/user-guide/cli/index.rst | 27 + docs/source/user-guide/cli/installation.md | 63 ++ docs/source/user-guide/cli/overview.md | 45 ++ docs/source/user-guide/cli/usage.md | 288 +++++++++ 8 files changed, 777 insertions(+), 665 deletions(-) delete mode 100644 docs/source/user-guide/cli.md create mode 100644 docs/source/user-guide/cli/datasources.md create mode 100644 docs/source/user-guide/cli/index.rst create mode 100644 docs/source/user-guide/cli/installation.md create mode 100644 docs/source/user-guide/cli/overview.md create mode 100644 docs/source/user-guide/cli/usage.md diff --git a/datafusion-cli/README.md b/datafusion-cli/README.md index 0afcd489f725..2aae6bfc3052 100644 --- a/datafusion-cli/README.md +++ b/datafusion-cli/README.md @@ -23,20 +23,20 @@ [DataFusion](https://arrow.apache.org/datafusion/) is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. -The DataFusion CLI is a command line utility that runs SQL queries using the DataFusion engine. +DataFusion CLI (`datafusion-cli`) is a small command line utility that runs SQL queries using the DataFusion engine. # Frequently Asked Questions ## Where can I find more information? -Answer: See the [`datafusion-cli` documentation](https://arrow.apache.org/datafusion/user-guide/cli.html) for further information. +See the [`datafusion-cli` documentation](https://arrow.apache.org/datafusion/user-guide/cli.html) for further information. ## How do I make my IDE work with `datafusion-cli`? -Answer: "open" the `datafusion/datafusion-cli` project as its own top level +"open" the `datafusion/datafusion-cli` project as its own top level project in my IDE (rather than opening `datafusion`) -The reason `datafusion-cli` is not listed as part of the workspace in the main +The reason `datafusion-cli` is not part of the main workspace in [`datafusion Cargo.toml`] file is that `datafusion-cli` is a binary and has a checked in `Cargo.lock` file to ensure reproducible builds. diff --git a/docs/source/index.rst b/docs/source/index.rst index 919a7ad7036f..8d8d5f7ef03a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -67,7 +67,7 @@ Please see the `developer’s guide`_ for contributing and `communication`_ for user-guide/introduction user-guide/example-usage - user-guide/cli + user-guide/cli/index user-guide/dataframe user-guide/expressions user-guide/sql/index diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md deleted file mode 100644 index 9c3fc8bd60c8..000000000000 --- a/docs/source/user-guide/cli.md +++ /dev/null @@ -1,660 +0,0 @@ - - -# Command line SQL console - -The DataFusion CLI is a command-line interactive SQL utility for executing -queries against any supported data files. It is a convenient way to -try DataFusion's SQL support with your own data. - -## Installation - -### Install and run using Cargo - -The easiest way to install DataFusion CLI a spin is via `cargo install datafusion-cli`. - -### Install and run using Homebrew (on MacOS) - -DataFusion CLI can also be installed via Homebrew (on MacOS). 
If you don't have Homebrew installed, you can check how to install it [here](https://docs.brew.sh/Installation). - -Install it as any other pre-built software like this: - -```bash -brew install datafusion -# ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/manifests/12.0.0 -# ######################################################################## 100.0% -# ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/blobs/sha256:9ecc8a01be47ceb9a53b39976696afa87c0a8 -# ==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:9ecc8a01be47ceb9a53b39976 -# ######################################################################## 100.0% -# ==> Pouring datafusion--12.0.0.big_sur.bottle.tar.gz -# 🍺 /usr/local/Cellar/datafusion/12.0.0: 9 files, 17.4MB - -datafusion-cli -``` - -### Install and run using PyPI - -DataFusion CLI can also be installed via PyPI. You can check how to install PyPI [here](https://pip.pypa.io/en/latest/installation/). - -Install it as any other pre-built software like this: - -```bash -pip3 install datafusion -# Defaulting to user installation because normal site-packages is not writeable -# Collecting datafusion -# Downloading datafusion-33.0.0-cp38-abi3-macosx_11_0_arm64.whl.metadata (9.6 kB) -# Collecting pyarrow>=11.0.0 (from datafusion) -# Downloading pyarrow-14.0.1-cp39-cp39-macosx_11_0_arm64.whl.metadata (3.0 kB) -# Requirement already satisfied: numpy>=1.16.6 in /Users/Library/Python/3.9/lib/python/site-packages (from pyarrow>=11.0.0->datafusion) (1.23.4) -# Downloading datafusion-33.0.0-cp38-abi3-macosx_11_0_arm64.whl (13.5 MB) -# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.5/13.5 MB 3.6 MB/s eta 0:00:00 -# Downloading pyarrow-14.0.1-cp39-cp39-macosx_11_0_arm64.whl (24.0 MB) -# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.0/24.0 MB 36.4 MB/s eta 0:00:00 -# Installing collected packages: pyarrow, datafusion -# Attempting uninstall: pyarrow -# Found existing installation: pyarrow 10.0.1 -# Uninstalling pyarrow-10.0.1: -# Successfully uninstalled pyarrow-10.0.1 -# Successfully installed datafusion-33.0.0 pyarrow-14.0.1 - -datafusion-cli -``` - -### Run using Docker - -There is no officially published Docker image for the DataFusion CLI, so it is necessary to build from source -instead. - -Use the following commands to clone this repository and build a Docker image containing the CLI tool. Note -that there is `.dockerignore` file in the root of the repository that may need to be deleted in order for -this to work. - -```bash -git clone https://github.com/apache/arrow-datafusion -cd arrow-datafusion -git checkout 12.0.0 -docker build -f datafusion-cli/Dockerfile . --tag datafusion-cli -docker run -it -v $(your_data_location):/data datafusion-cli -``` - -## Usage - -See the current usage using `datafusion-cli --help`: - -```bash -Apache Arrow -Command Line Client for DataFusion query engine. - -USAGE: - datafusion-cli [OPTIONS] - -OPTIONS: - -b, --batch-size - The batch size of each query, or use DataFusion default - - -c, --command ... - Execute the given command string(s), then exit - - --color - Enables console syntax highlighting - - -f, --file ... - Execute commands from file(s), then exit - - --format - [default: table] [possible values: csv, tsv, table, json, nd-json] - - -h, --help - Print help information - - -m, --memory-limit - The memory pool limitation (e.g. 
'10g'), default to None (no limit) - - --maxrows - The max number of rows to display for 'Table' format - [default: 40] [possible values: numbers(0/10/...), inf(no limit)] - - --mem-pool-type - Specify the memory pool type 'greedy' or 'fair', default to 'greedy' - - -p, --data-path - Path to your data, default to current directory - - -q, --quiet - Reduce printing other than the results and work quietly - - -r, --rc ... - Run the provided files on startup instead of ~/.datafusionrc - - -V, --version - Print version information -``` - -## Querying data from the files directly - -Files can be queried directly by enclosing the file, directory name -or a remote location in single `'` quotes as shown in the examples. - -## Examples - -Create a CSV file to query. - -```shell -$ echo "a,b" > data.csv -$ echo "1,2" >> data.csv -``` - -Query that single file (the CLI also supports parquet, compressed csv, avro, json and more) - -```shell -$ datafusion-cli -DataFusion CLI v17.0.0 -> select * from 'data.csv'; -+---+---+ -| a | b | -+---+---+ -| 1 | 2 | -+---+---+ -1 row in set. Query took 0.007 seconds. -``` - -You can also query directories of files with compatible schemas: - -```shell -$ ls data_dir/ -data.csv data2.csv -``` - -```shell -$ datafusion-cli -DataFusion CLI v16.0.0 -> select * from 'data_dir'; -+---+---+ -| a | b | -+---+---+ -| 3 | 4 | -| 1 | 2 | -+---+---+ -2 rows in set. Query took 0.007 seconds. -``` - -You can also query directly from any remote location supported by DataFusion without -registering the location as a table. -For example, to read from a remote parquet file via HTTP(S) you can use the following: - -```sql -select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' -+----------+ -| COUNT(*) | -+----------+ -| 1000000 | -+----------+ -1 row in set. Query took 0.595 seconds. -``` - -To read from an AWS S3 or GCS, use `s3` or `gs` as a protocol prefix. For example, this will read a file -in S3 bucket named `my-data-bucket`. Note that this is not a real file location and therefore the query -will fail, you need to use your own file location in S3. Also, you need to set the relevent access credentials -as environmental variables (e.g. for AWS S3 you need to at least `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) - -```sql -select count(*) from 's3://my-data-bucket/athena_partitioned/hits.parquet' -``` - -## Creating External Tables - -It is also possible to create a table backed by files by explicitly -via `CREATE EXTERNAL TABLE` as shown below. Filemask wildcards supported - -## Registering Parquet Data Sources - -Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. The schema information will be derived automatically. - -Register a single file parquet datasource - -```sql -CREATE EXTERNAL TABLE taxi -STORED AS PARQUET -LOCATION '/mnt/nyctaxi/tripdata.parquet'; -``` - -Register a single folder parquet datasource. All files inside must be valid parquet files! - -```sql -CREATE EXTERNAL TABLE taxi -STORED AS PARQUET -LOCATION '/mnt/nyctaxi/'; -``` - -Register a single folder parquet datasource by specifying a wildcard for files to read - -```sql -CREATE EXTERNAL TABLE taxi -STORED AS PARQUET -LOCATION '/mnt/nyctaxi/*.parquet'; -``` - -## Registering CSV Data Sources - -CSV data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. 
- -```sql -CREATE EXTERNAL TABLE test -STORED AS CSV -WITH HEADER ROW -LOCATION '/path/to/aggregate_test_100.csv'; -``` - -It is also possible to provide schema information. - -```sql -CREATE EXTERNAL TABLE test ( - c1 VARCHAR NOT NULL, - c2 INT NOT NULL, - c3 SMALLINT NOT NULL, - c4 SMALLINT NOT NULL, - c5 INT NOT NULL, - c6 BIGINT NOT NULL, - c7 SMALLINT NOT NULL, - c8 INT NOT NULL, - c9 BIGINT NOT NULL, - c10 VARCHAR NOT NULL, - c11 FLOAT NOT NULL, - c12 DOUBLE NOT NULL, - c13 VARCHAR NOT NULL -) -STORED AS CSV -LOCATION '/path/to/aggregate_test_100.csv'; -``` - -## Registering Remote Data Sources - -`datafusion-cli` can read from remote locations using a variety of protocols. -For example to read from a remote parquet file via HTTP(S) you can use the following: - -```sql -CREATE EXTERNAL TABLE hits -STORED AS PARQUET -LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'; -``` - -```sql -select count(*) from hits; -+----------+ -| COUNT(*) | -+----------+ -| 1000000 | -+----------+ -1 row in set. Query took 0.344 seconds. -``` - -## Registering S3 Data Sources - -[AWS S3](https://aws.amazon.com/s3/) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. - -```sql -CREATE EXTERNAL TABLE test -STORED AS PARQUET -OPTIONS( - 'aws.access_key_id' '******', - 'aws.secret_access_key' '******', - 'aws.region' 'us-east-2' -) -LOCATION 's3://bucket/path/file.parquet'; -``` - -The supported OPTIONS are: - -- access_key_id -- secret_access_key -- session_token -- region - -It is also possible to simplify sql statements by environment variables. - -```bash -$ export AWS_DEFAULT_REGION=us-east-2 -$ export AWS_SECRET_ACCESS_KEY=****** -$ export AWS_ACCESS_KEY_ID=****** - -$ datafusion-cli -DataFusion CLI v21.0.0 -> create external table test stored as parquet location 's3://bucket/path/file.parquet'; -0 rows in set. Query took 0.374 seconds. -> select * from test; -+----------+----------+ -| column_1 | column_2 | -+----------+----------+ -| 1 | 2 | -+----------+----------+ -1 row in set. Query took 0.171 seconds. -``` - -Details of the environment variables that can be used are: - -- AWS_ACCESS_KEY_ID -> access_key_id -- AWS_SECRET_ACCESS_KEY -> secret_access_key -- AWS_DEFAULT_REGION -> region -- AWS_ENDPOINT -> endpoint -- AWS_SESSION_TOKEN -> token -- AWS_CONTAINER_CREDENTIALS_RELATIVE_URI -> -- AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS -- AWS_PROFILE -> Support for using a [named profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to supply credentials - -## Registering OSS Data Sources - -[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. - -```sql -CREATE EXTERNAL TABLE test -STORED AS PARQUET -OPTIONS( - 'aws.access_key_id' '******', - 'aws.secret_access_key' '******', - 'aws.oss.endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com' -) -LOCATION 'oss://bucket/path/file.parquet'; -``` - -The supported OPTIONS are: - -- access_key_id -- secret_access_key -- endpoint - -Note that the `endpoint` format of oss needs to be: `https://{bucket}.{oss-region-endpoint}` - -## Registering COS Data Sources - -[Tencent cloud COS](https://cloud.tencent.com/product/cos) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. 
- -```sql -CREATE EXTERNAL TABLE test -STORED AS PARQUET -OPTIONS( - 'aws.access_key_id' '******', - 'aws.secret_access_key' '******', - 'aws.cos.endpoint' 'https://cos.ap-singapore.myqcloud.com' -) -LOCATION 'cos://bucket/path/file.parquet'; -``` - -The supported OPTIONS are: - -- access_key_id -- secret_access_key -- endpoint - -Note that the `endpoint` format of urls must be: `https://cos.{cos-region-endpoint}` - -## Registering GCS Data Sources - -[Google Cloud Storage](https://cloud.google.com/storage) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. - -```sql -CREATE EXTERNAL TABLE test -STORED AS PARQUET -OPTIONS( - 'gcp.service_account_path' '/tmp/gcs.json', -) -LOCATION 'gs://bucket/path/file.parquet'; -``` - -The supported OPTIONS are: - -- service_account_path -> location of service account file -- service_account_key -> JSON serialized service account key -- application_credentials_path -> location of application credentials file - -It is also possible to simplify sql statements by environment variables. - -```bash -$ export GOOGLE_SERVICE_ACCOUNT=/tmp/gcs.json - -$ datafusion-cli -DataFusion CLI v21.0.0 -> create external table test stored as parquet location 'gs://bucket/path/file.parquet'; -0 rows in set. Query took 0.374 seconds. -> select * from test; -+----------+----------+ -| column_1 | column_2 | -+----------+----------+ -| 1 | 2 | -+----------+----------+ -1 row in set. Query took 0.171 seconds. -``` - -Details of the environment variables that can be used are: - -- GOOGLE_SERVICE_ACCOUNT: location of service account file -- GOOGLE_SERVICE_ACCOUNT_PATH: (alias) location of service account file -- SERVICE_ACCOUNT: (alias) location of service account file -- GOOGLE_SERVICE_ACCOUNT_KEY: JSON serialized service account key -- GOOGLE_BUCKET: bucket name -- GOOGLE_BUCKET_NAME: (alias) bucket name - -## Commands - -Available commands inside DataFusion CLI are: - -- Quit - -```bash -> \q -``` - -- Help - -```bash -> \? 
-``` - -- ListTables - -```bash -> \d -``` - -- DescribeTable - -```bash -> \d table_name -``` - -- QuietMode - -```bash -> \quiet [true|false] -``` - -- list function - -```bash -> \h -``` - -- Search and describe function - -```bash -> \h function -``` - -## Supported SQL - -In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also -supports additional statements and commands: - -[sql supported in datafusion]: sql/index.rst - -- Show configuration options - -`SHOW ALL [VERBOSE]` - -```SQL -> show all; - -+-------------------------------------------------+---------+ -| name | value | -+-------------------------------------------------+---------+ -| datafusion.execution.batch_size | 8192 | -| datafusion.execution.coalesce_batches | true | -| datafusion.execution.time_zone | UTC | -| datafusion.explain.logical_plan_only | false | -| datafusion.explain.physical_plan_only | false | -| datafusion.optimizer.filter_null_join_keys | false | -| datafusion.optimizer.skip_failed_rules | true | -+-------------------------------------------------+---------+ - -``` - -- Show specific configuration option - -`SHOW xyz.abc.qwe [VERBOSE]` - -```SQL -> show datafusion.execution.batch_size; - -+-------------------------------------------------+---------+ -| name | value | -+-------------------------------------------------+---------+ -| datafusion.execution.batch_size | 8192 | -+-------------------------------------------------+---------+ - -``` - -- Set configuration options - -```SQL -> SET datafusion.execution.batch_size to 1024; -``` - -- `parquet_metadata` table function - -The `parquet_metadata` table function can be used to inspect detailed metadata -about a parquet file such as statistics, sizes, and other information. This can -be helpful to understand how parquet files are structured. - -For example, to see information about the `"WatchID"` column in the -`hits.parquet` file, you can use: - -```sql -SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size -FROM parquet_metadata('hits.parquet') -WHERE path_in_schema = '"WatchID"' -LIMIT 3; - -+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ -| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size | -+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ -| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 | -| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 | -| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 | -+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ -3 rows in set. Query took 0.053 seconds. -``` - -The returned table has the following columns for each row for each column chunk -in the file. Please refer to the [Parquet Documentation] for more information. 
- -[parquet documentation]: https://parquet.apache.org/ - -| column_name | data_type | Description | -| ----------------------- | --------- | --------------------------------------------------------------------------------------------------- | -| filename | Utf8 | Name of the file | -| row_group_id | Int64 | Row group index the column chunk belongs to | -| row_group_num_rows | Int64 | Count of rows stored in the row group | -| row_group_num_columns | Int64 | Total number of columns in the row group (same for all row groups) | -| row_group_bytes | Int64 | Number of bytes used to store the row group (not including metadata) | -| column_id | Int64 | ID of the column | -| file_offset | Int64 | Offset within the file that this column chunk's data begins | -| num_values | Int64 | Total number of values in this column chunk | -| path_in_schema | Utf8 | "Path" (column name) of the column chunk in the schema | -| type | Utf8 | Parquet data type of the column chunk | -| stats_min | Utf8 | The minimum value for this column chunk, if stored in the statistics, cast to a string | -| stats_max | Utf8 | The maximum value for this column chunk, if stored in the statistics, cast to a string | -| stats_null_count | Int64 | Number of null values in this column chunk, if stored in the statistics | -| stats_distinct_count | Int64 | Number of distinct values in this column chunk, if stored in the statistics | -| stats_min_value | Utf8 | Same as `stats_min` | -| stats_max_value | Utf8 | Same as `stats_max` | -| compression | Utf8 | Block level compression (e.g. `SNAPPY`) used for this column chunk | -| encodings | Utf8 | All block level encodings (e.g. `[PLAIN_DICTIONARY, PLAIN, RLE]`) used for this column chunk | -| index_page_offset | Int64 | Offset in the file of the [`page index`], if any | -| dictionary_page_offset | Int64 | Offset in the file of the dictionary page, if any | -| data_page_offset | Int64 | Offset in the file of the first data page, if any | -| total_compressed_size | Int64 | Number of bytes the column chunk's data after encoding and compression (what is stored in the file) | -| total_uncompressed_size | Int64 | Number of bytes the column chunk's data after encoding | - -+-------------------------+-----------+-------------+ - -[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md - -## Changing Configuration Options - -All available configuration options can be seen using `SHOW ALL` as described above. - -You can change the configuration options using environment -variables. `datafusion-cli` looks in the corresponding environment -variable with an upper case name and all `.` converted to `_`. - -For example, to set `datafusion.execution.batch_size` to `1024` you -would set the `DATAFUSION_EXECUTION_BATCH_SIZE` environment variable -appropriately: - -```shell -$ DATAFUSION_EXECUTION_BATCH_SIZE=1024 datafusion-cli -DataFusion CLI v12.0.0 -> show all; -+-------------------------------------------------+---------+ -| name | value | -+-------------------------------------------------+---------+ -| datafusion.execution.batch_size | 1024 | -| datafusion.execution.coalesce_batches | true | -| datafusion.execution.time_zone | UTC | -| datafusion.explain.logical_plan_only | false | -| datafusion.explain.physical_plan_only | false | -| datafusion.optimizer.filter_null_join_keys | false | -| datafusion.optimizer.skip_failed_rules | true | -+-------------------------------------------------+---------+ -8 rows in set. Query took 0.002 seconds. 
-```
-
-You can change the configuration options using `SET` statement as well
-
-```shell
-$ datafusion-cli
-DataFusion CLI v13.0.0
-> show datafusion.execution.batch_size;
-+---------------------------------+---------+
-| name                            | value   |
-+---------------------------------+---------+
-| datafusion.execution.batch_size | 8192    |
-+---------------------------------+---------+
-1 row in set. Query took 0.011 seconds.
-
-> set datafusion.execution.batch_size to 1024;
-0 rows in set. Query took 0.000 seconds.
-
-> show datafusion.execution.batch_size;
-+---------------------------------+---------+
-| name                            | value   |
-+---------------------------------+---------+
-| datafusion.execution.batch_size | 1024    |
-+---------------------------------+---------+
-1 row in set. Query took 0.005 seconds.
-```
diff --git a/docs/source/user-guide/cli/datasources.md b/docs/source/user-guide/cli/datasources.md
new file mode 100644
index 000000000000..c2c00b633479
--- /dev/null
+++ b/docs/source/user-guide/cli/datasources.md
@@ -0,0 +1,349 @@
+
+
+# Local Files / Directories
+
+Files can be queried directly by enclosing the file, directory name
+or a remote location in single `'` quotes as shown in the examples.
+
+Create a CSV file to query.
+
+```shell
+$ echo "a,b" > data.csv
+$ echo "1,2" >> data.csv
+```
+
+Query that single file (the CLI also supports parquet, compressed csv, avro, json and more):
+
+```shell
+$ datafusion-cli
+DataFusion CLI v17.0.0
+> select * from 'data.csv';
++---+---+
+| a | b |
++---+---+
+| 1 | 2 |
++---+---+
+1 row in set. Query took 0.007 seconds.
+```
+
+You can also query directories of files with compatible schemas:
+
+```shell
+$ ls data_dir/
+data.csv   data2.csv
+```
+
+```shell
+$ datafusion-cli
+DataFusion CLI v16.0.0
+> select * from 'data_dir';
++---+---+
+| a | b |
++---+---+
+| 3 | 4 |
+| 1 | 2 |
++---+---+
+2 rows in set. Query took 0.007 seconds.
+```
+
+# Remote Files / Directories
+
+You can also query any remote location supported by DataFusion directly,
+without registering the location as a table.
+For example, to read from a remote parquet file via HTTP(S) you can use the following:
+
+```sql
+select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
++----------+
+| COUNT(*) |
++----------+
+| 1000000  |
++----------+
+1 row in set. Query took 0.595 seconds.
+```
+
+To read from AWS S3 or GCS, use `s3` or `gs` as a protocol prefix. For
+example, to read a file in an S3 bucket named `my-data-bucket`, use the URL
+`s3://my-data-bucket` and set the relevant access credentials as environment
+variables (e.g. for AWS S3 you need at least `AWS_ACCESS_KEY_ID` and
+`AWS_SECRET_ACCESS_KEY`).
+
+```sql
+select count(*) from 's3://my-data-bucket/athena_partitioned/hits.parquet'
+```
+
+See the [`CREATE EXTERNAL TABLE`](#create-external-table) section for
+additional configuration options.
+
+# `CREATE EXTERNAL TABLE`
+
+It is also possible to create a table backed by files or remote locations via
+`CREATE EXTERNAL TABLE` as shown below. Note that wildcards (e.g. `*`) are also
+supported.
+
+For example, to create a table `hits` backed by a local parquet file, use:
+
+```sql
+CREATE EXTERNAL TABLE hits
+STORED AS PARQUET
+LOCATION 'hits.parquet';
+```
+
+To create a table `hits` backed by a remote parquet file via HTTP(S), use
+
+```sql
+CREATE EXTERNAL TABLE hits
+STORED AS PARQUET
+LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+```
+
+In both cases, `hits` can now be queried as a regular table:
+
+```sql
+select count(*) from hits;
++----------+
+| COUNT(*) |
++----------+
+| 1000000  |
++----------+
+1 row in set. Query took 0.344 seconds.
+```
+
+# Formats
+
+## Parquet
+
+The schema information for parquet will be derived automatically.
+
+Register a single file parquet datasource:
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/tripdata.parquet';
+```
+
+Register a single folder parquet datasource. Note: all files inside must be valid
+parquet files and have compatible schemas.
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/';
+```
+
+Register a single folder parquet datasource by specifying a wildcard for files to read:
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/*.parquet';
+```
+
+## CSV
+
+DataFusion will infer the CSV schema automatically, or you can provide it explicitly.
+
+Register a single file csv datasource with a header row:
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS CSV
+WITH HEADER ROW
+LOCATION '/path/to/aggregate_test_100.csv';
+```
+
+Register a single file csv datasource with an explicitly defined schema:
+
+```sql
+CREATE EXTERNAL TABLE test (
+    c1 VARCHAR NOT NULL,
+    c2 INT NOT NULL,
+    c3 SMALLINT NOT NULL,
+    c4 SMALLINT NOT NULL,
+    c5 INT NOT NULL,
+    c6 BIGINT NOT NULL,
+    c7 SMALLINT NOT NULL,
+    c8 INT NOT NULL,
+    c9 BIGINT NOT NULL,
+    c10 VARCHAR NOT NULL,
+    c11 FLOAT NOT NULL,
+    c12 DOUBLE NOT NULL,
+    c13 VARCHAR NOT NULL
+)
+STORED AS CSV
+LOCATION '/path/to/aggregate_test_100.csv';
+```
+
+# Locations
+
+## HTTP(s)
+
+To read from a remote parquet file via HTTP(S) you can use the following:
+
+```sql
+CREATE EXTERNAL TABLE hits
+STORED AS PARQUET
+LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+```
+
+## S3
+
+[AWS S3](https://aws.amazon.com/s3/) data sources must have connection credentials configured.
+
+To create an external table from a file in an S3 bucket:
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+  'aws.access_key_id' '******',
+  'aws.secret_access_key' '******',
+  'aws.region' 'us-east-2'
+)
+LOCATION 's3://bucket/path/file.parquet';
+```
+
+It is also possible to specify the access information using environment variables:
+
+```bash
+$ export AWS_DEFAULT_REGION=us-east-2
+$ export AWS_SECRET_ACCESS_KEY=******
+$ export AWS_ACCESS_KEY_ID=******
+
+$ datafusion-cli
+DataFusion CLI v21.0.0
+> create external table test stored as parquet location 's3://bucket/path/file.parquet';
+0 rows in set. Query took 0.374 seconds.
+> select * from test;
++----------+----------+
+| column_1 | column_2 |
++----------+----------+
+| 1        | 2        |
++----------+----------+
+1 row in set. Query took 0.171 seconds.
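+
+# Hypothetical illustration (not part of the minimal example above): the other
+# variables listed in the table below can be exported the same way, e.g. a
+# session token for temporary credentials or a custom endpoint such as a local
+# MinIO server:
+# $ export AWS_SESSION_TOKEN=******
+# $ export AWS_ENDPOINT=http://localhost:9000
+# $ export AWS_ALLOW_HTTP=true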
+```
+
+Supported configuration options are:
+
+| Environment Variable                     | Configuration Option    | Description                                                |
+| ---------------------------------------- | ----------------------- | ---------------------------------------------------------- |
+| `AWS_ACCESS_KEY_ID`                      | `aws.access_key_id`     |                                                            |
+| `AWS_SECRET_ACCESS_KEY`                  | `aws.secret_access_key` |                                                            |
+| `AWS_DEFAULT_REGION`                     | `aws.region`            |                                                            |
+| `AWS_ENDPOINT`                           | `aws.endpoint`          |                                                            |
+| `AWS_SESSION_TOKEN`                      | `aws.token`             |                                                            |
+| `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` |                         | See [IAM Roles]                                            |
+| `AWS_ALLOW_HTTP`                         |                         | set to "true" to permit HTTP connections without TLS       |
+| `AWS_PROFILE`                            |                         | Support for using a [named profile] to supply credentials  |
+
+[named profile]: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
+[iam roles]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html
+
+## OSS
+
+[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources must have connection credentials configured.
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+  'aws.access_key_id' '******',
+  'aws.secret_access_key' '******',
+  'aws.oss.endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com'
+)
+LOCATION 'oss://bucket/path/file.parquet';
+```
+
+The supported OPTIONS are:
+
+- access_key_id
+- secret_access_key
+- endpoint
+
+Note that the `endpoint` format for OSS needs to be: `https://{bucket}.{oss-region-endpoint}`
+
+## COS
+
+[Tencent cloud COS](https://cloud.tencent.com/product/cos) data sources must have connection credentials configured.
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+  'aws.access_key_id' '******',
+  'aws.secret_access_key' '******',
+  'aws.cos.endpoint' 'https://cos.ap-singapore.myqcloud.com'
+)
+LOCATION 'cos://bucket/path/file.parquet';
+```
+
+The supported OPTIONS are:
+
+- access_key_id
+- secret_access_key
+- endpoint
+
+Note that the `endpoint` URL format for COS must be: `https://cos.{cos-region-endpoint}`
+
+## GCS
+
+[Google Cloud Storage](https://cloud.google.com/storage) data sources must have connection credentials configured.
+
+For example, to create an external table from a file in a GCS bucket:
+
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+  'gcp.service_account_path' '/tmp/gcs.json'
+)
+LOCATION 'gs://bucket/path/file.parquet';
+```
+
+It is also possible to specify the access information using environment variables:
+
+```bash
+$ export GOOGLE_SERVICE_ACCOUNT=/tmp/gcs.json
+
+$ datafusion-cli
+DataFusion CLI v21.0.0
+> create external table test stored as parquet location 'gs://bucket/path/file.parquet';
+0 rows in set. Query took 0.374 seconds.
+> select * from test;
++----------+----------+
+| column_1 | column_2 |
++----------+----------+
+| 1        | 2        |
++----------+----------+
+1 row in set. Query took 0.171 seconds.
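+
+# Hypothetical illustration (not part of the minimal example above): instead of
+# a service account file, a serialized service account key or an application
+# credentials file can be supplied via the variables listed in the table below:
+# $ export GOOGLE_APPLICATION_CREDENTIALS=/tmp/application_credentials.json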
+``` + +Supported configuration options are: + +| Environment Variable | Configuration Option | Description | +| -------------------------------- | ---------------------------------- | ---------------------------------------- | +| `GOOGLE_SERVICE_ACCOUNT` | `gcp.service_account_path` | location of service account file | +| `GOOGLE_SERVICE_ACCOUNT_PATH` | `gcp.service_account_path` | (alias) location of service account file | +| `SERVICE_ACCOUNT` | `gcp.service_account_path` | (alias) location of service account file | +| `GOOGLE_SERVICE_ACCOUNT_KEY` | `gcp.service_account_key` | JSON serialized service account key | +| `GOOGLE_APPLICATION_CREDENTIALS` | `gcp.application_credentials_path` | location of application credentials file | +| `GOOGLE_BUCKET` | | bucket name | +| `GOOGLE_BUCKET_NAME` | | (alias) bucket name | diff --git a/docs/source/user-guide/cli/index.rst b/docs/source/user-guide/cli/index.rst new file mode 100644 index 000000000000..874cfc0eae86 --- /dev/null +++ b/docs/source/user-guide/cli/index.rst @@ -0,0 +1,27 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +DataFusion CLI +============== + +.. toctree:: + :maxdepth: 3 + + overview + installation + usage + datasources diff --git a/docs/source/user-guide/cli/installation.md b/docs/source/user-guide/cli/installation.md new file mode 100644 index 000000000000..f7fcd290729d --- /dev/null +++ b/docs/source/user-guide/cli/installation.md @@ -0,0 +1,63 @@ + + +# Installation + +## Install and run using Cargo + +To build and install the latest release of `datafusion-cli` from source, do: + +```shell +cargo install datafusion-cli +# Updating crates.io index +# Installing datafusion-cli v37.0.0 +# Updating crates.io index +# ... +``` + +## Install and run using Homebrew (on MacOS) + +`datafusion-cli` can also be installed via [Homebrew] (on MacOS) like this: + +[homebrew]: https://docs.brew.sh/Installation + +```bash +brew install datafusion +# ... +# ==> Pouring datafusion--37.0.0.arm64_sonoma.bottle.tar.gz +# 🍺 /opt/homebrew/Cellar/datafusion/37.0.0: 9 files, 63.0MB +# ==> Running `brew cleanup datafusion`... +``` + +## Run using Docker + +There is no officially published Docker image for the DataFusion CLI, so it is necessary to build from source +instead. + +Use the following commands to clone this repository and build a Docker image containing the CLI tool. Note +that there is `.dockerignore` file in the root of the repository that may need to be deleted in order for +this to work. + +```bash +git clone https://github.com/apache/arrow-datafusion +cd arrow-datafusion +git checkout 12.0.0 +docker build -f datafusion-cli/Dockerfile . 
--tag datafusion-cli
+docker run -it -v $(your_data_location):/data datafusion-cli
+```
diff --git a/docs/source/user-guide/cli/overview.md b/docs/source/user-guide/cli/overview.md
new file mode 100644
index 000000000000..86beea0e82d5
--- /dev/null
+++ b/docs/source/user-guide/cli/overview.md
@@ -0,0 +1,45 @@
+
+
+# Overview
+
+DataFusion CLI (`datafusion-cli`) is an interactive command-line utility for executing
+SQL queries against any supported data files.
+
+While intended as an example of how to use DataFusion, `datafusion-cli` offers a
+full range of SQL features and supports reading and writing CSV, Parquet, JSON, Arrow and
+Avro data from local files, directories, or remote locations such as S3.
+
+Here is an example of how to run a SQL query against a local file, `hits.parquet`:
+
+```shell
+$ datafusion-cli
+DataFusion CLI v37.0.0
+> select count(distinct "URL") from 'hits.parquet';
++----------------------------------+
+| COUNT(DISTINCT hits.parquet.URL) |
++----------------------------------+
+| 18342019                         |
++----------------------------------+
+1 row(s) fetched.
+Elapsed 1.969 seconds.
+```
+
+For more information, see the [Installation](installation), [Usage Guide](usage)
+and [Data Sources](datasources) sections.
diff --git a/docs/source/user-guide/cli/usage.md b/docs/source/user-guide/cli/usage.md
new file mode 100644
index 000000000000..617b462875c7
--- /dev/null
+++ b/docs/source/user-guide/cli/usage.md
@@ -0,0 +1,288 @@
+
+
+# Usage
+
+See the current usage with `datafusion-cli --help`:
+
+```bash
+Apache Arrow
+Command Line Client for DataFusion query engine.
+
+USAGE:
+    datafusion-cli [OPTIONS]
+
+OPTIONS:
+    -b, --batch-size
+            The batch size of each query, or use DataFusion default
+
+    -c, --command ...
+            Execute the given command string(s), then exit
+
+    --color
+            Enables console syntax highlighting
+
+    -f, --file ...
+            Execute commands from file(s), then exit
+
+    --format
+            [default: table] [possible values: csv, tsv, table, json, nd-json]
+
+    -h, --help
+            Print help information
+
+    -m, --memory-limit
+            The memory pool limitation (e.g. '10g'), default to None (no limit)
+
+    --maxrows
+            The max number of rows to display for 'Table' format
+            [default: 40] [possible values: numbers(0/10/...), inf(no limit)]
+
+    --mem-pool-type
+            Specify the memory pool type 'greedy' or 'fair', default to 'greedy'
+
+    -p, --data-path
+            Path to your data, default to current directory
+
+    -q, --quiet
+            Reduce printing other than the results and work quietly
+
+    -r, --rc ...
+            Run the provided files on startup instead of ~/.datafusionrc
+
+    -V, --version
+            Print version information
+```
+
+## Commands
+
+Available commands inside DataFusion CLI are:
+
+- Quit
+
+```bash
+> \q
+```
+
+- Help
+
+```bash
+> \?
+``` + +- ListTables + +```bash +> \d +``` + +- DescribeTable + +```bash +> \d table_name +``` + +- QuietMode + +```bash +> \quiet [true|false] +``` + +- list function + +```bash +> \h +``` + +- Search and describe function + +```bash +> \h function +``` + +## Supported SQL + +In addition to the normal [SQL supported in DataFusion], `datafusion-cli` also +supports additional statements and commands: + +[sql supported in datafusion]: ../sql/index.rst + +### `SHOW ALL [VERBOSE]` + +Show configuration options + +```SQL +> show all; + ++-------------------------------------------------+---------+ +| name | value | ++-------------------------------------------------+---------+ +| datafusion.execution.batch_size | 8192 | +| datafusion.execution.coalesce_batches | true | +| datafusion.execution.time_zone | UTC | +| datafusion.explain.logical_plan_only | false | +| datafusion.explain.physical_plan_only | false | +| datafusion.optimizer.filter_null_join_keys | false | +| datafusion.optimizer.skip_failed_rules | true | ++-------------------------------------------------+---------+ + +``` + +### `SHOW