Parameters
PBench uses JSON files to define each stage of a benchmark. These stage files contain JSON format parameters.
Use the JSON parameters defined here to write stage files. For more information about stage files, see Creating a Stage File.
Format
"abort_on_error": Boolean
Definition
Set abort_on_error
to true
to abort all running and future stages of the benchmark, and also any external running processes started by shell_scripts
when an error occurs.
This parameter and its value are inherited by child stages. Set the parameter to null
in a stage to unset the value inherited from a parent stage.
Example:
"abort_on_error": true
Format
"catalog": "catalog-name"
Definition
Set the catalog for queries in queries
and query_files
.
catalog
and schema
cannot be set to null
.
This parameter and its value are inherited by child stages.
New values for catalog
, schema
, session_params
, and timezone
assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true
.
Example:
"catalog": "iceberg"
Format
"cold_runs": integer
Definition
The number of cold runs to perform to populate the cache. The default is 1
.
This parameter and its value are inherited by child stages. Set the parameter to null
in a stage to unset the value inherited from a parent stage.
Example:
"cold_runs": 1
Format
"description": "Description of the JSON file."
Definition
JSON does not support comments, so comments in PBench are formatted as data pairs. For more information, see Comments Inside JSON - Commenting in a JSON File.
Begin every stage JSON file with a description
.
PBench ignores description
: it is not read, processed, or output in any way by PBench.
Example:
"description": "Specifies the catalog and the schema for TPC-DS Iceberg scale factor 1 TB partitioned."
Format
"expected_row_counts": {
"file1": [
1,
1
],
"file2": [
1,
1
]
}
Definition
A map from [catalog.schema] keys to arrays of integers, which are the expected row counts for the queries run under each schema.
The key of this map can be one of:
- [schema] - matches the schema name regardless of the catalog it is under
- [catalog.schema] - matches both the catalog and the schema
- [regular expression] - matched against [catalog.schema]
List the expected row counts for queries
first, then list the expected row counts for the queries in each query file listed in query_files
.
Example:
"expected_row_counts": {
"tpcds_sf10000_": [
100,
100
],
"tpcds_sf1000_": [
100,
100
]
}
Use regular expressions to match multiple [catalog.schema] pairs. In this example, .*\\.tpcds_sf10000
matches hive.tpcds_sf10000
and iceberg.tpcds_sf10000
.
"expected_row_counts": {
".*\\.tpcds_sf10000": [
100,
100
],
"tpcds_sf1000_": [
100,
100
]
}
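To illustrate the ordering rule above, here is a minimal sketch of a stage that combines one inline query with one query file. The schema name sf1, the file path queries/query_01.sql, the assumption that this file contains two queries, and the counts themselves are all hypothetical. Under that assumption, the first entry in the array is for the query in queries, and the next two are for the queries in the file.

```json
{
  "description": "Hypothetical stage: one inline query, then one query file assumed to hold two queries.",
  "catalog": "iceberg",
  "schema": "sf1",
  "queries": [
    "select 'query 1'"
  ],
  "query_files": [
    "queries/query_01.sql"
  ],
  "expected_row_counts": {
    "iceberg.sf1": [
      1,
      100,
      100
    ]
  }
}
```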
Format
"next": [
"stage_2.json",
"stage_3.json"
]
Definition
Specifies one or more child stages of the current stage. Child stages start after the parent stage finishes, and in parallel with each other. Child stages inherit some parameters from the parent stage if those parameters are not explicitly set in the child stage.
Example:
"next": [
"stage_2.json",
"stage_3.json"
]
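As a sketch of how inheritance might be used, the hypothetical parent stage below sets several parameters that this page describes as inherited (catalog, schema, abort_on_error, cold_runs, warm_runs) and names two child stages. The file names are the placeholders from the example above, and the values are illustrative only.

```json
{
  "description": "Hypothetical parent stage: sets inherited defaults, then starts two child stages in parallel.",
  "catalog": "iceberg",
  "schema": "sf1",
  "abort_on_error": true,
  "cold_runs": 1,
  "warm_runs": 2,
  "next": [
    "stage_2.json",
    "stage_3.json"
  ]
}
```

With a parent like this, stage_2.json and stage_3.json would not need to repeat catalog or schema. Going by the inheritance rules on this page, a child stage that should not inherit the abort behavior could set "abort_on_error": null.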
Format
"queries": [
"query_string"
]
Definition
Run the SQL query in query_string
. If a query is long or complex, or there are several queries, consider saving the queries in a SQL file to be run using query_files
.
Do not end the SQL query in query_string
with a semicolon.
SQL queries in queries
are executed first, then SQL queries in files listed in query_files
are read and executed, then external commands in shell_scripts
are run.
Example:
"queries": [
"select 'query 1'"
]
Format
"query_files": [
"file1",
"file2"
]
Definition
One or more files containing SQL queries.
SQL queries in queries
are executed first, then SQL queries in files listed in query_files
are read and executed, then external commands in shell_scripts
are run.
A relative file path in the query_files
array is evaluated based on the location of the stage JSON file.
Example:
"query_files": [
"queries/query_01.sql",
"queries/query_02.sql"
]
Format
"random_execution": Boolean
Definition
When random_execution
is set to false
, PBench runs the queries in queries
and query_files
sequentially.
When random_execution
is set to true
, PBench runs the queries
and query_files
randomly, until the duration
or integer
set using randomly_execute_until
is met.
Each query file counts as 1 regardless of the number of queries in that query file. For example, a stage has:
- 3 queries in
queries
- 2 query files in
query_files
, with 3 queries in each file
random_execution
selects from 5 (3 queries + 2 query files), not 9 (3 queries + 3 queries in one file + 3 queries in the other file).
If a query file is selected, all of the queries in the file are executed, and the file counts as 1 selection towards the integer specified in randomly_execute_until
.
Expected row counts are ignored when random_execution
is set to true
.
The default value of random_execution
is false
.
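The counting rule described above can be sketched as a stage with three inline queries and two query files. The query strings, file paths, and the limit of 10 are hypothetical. Following the description above, each random selection would be drawn from five candidates, and the stage would stop after 10 selections.

```json
{
  "description": "Hypothetical stage: random execution over 3 inline queries and 2 query files (5 candidates).",
  "queries": [
    "select 'query 1'",
    "select 'query 2'",
    "select 'query 3'"
  ],
  "query_files": [
    "queries/query_01.sql",
    "queries/query_02.sql"
  ],
  "random_execution": true,
  "randomly_execute_until": "10"
}
```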
Example:
"random_execution": true
Format
"randomly_execute_until": "duration"
"randomly_execute_until": "integer"
Definition
Specify one of the following to control how long PBench randomly runs SQL queries:
- a duration, like 15m, 1h, or 5d
- an integer, as the number of queries to run
Example:
"randomly_execute_until": "15m"
"randomly_execute_until": "700"
Format
"save_column_metadata": Boolean
Definition
When set to true, saves the query's column metadata, taken from the columns
field of Presto's query API response, as a JSON file.
Column metadata is saved once for a query on its first run, regardless of the number of cold_runs
and warm_runs
.
This parameter and its value are inherited by child stages. Set the parameter to null
in a stage to unset the value inherited from a parent stage.
The file name format uses the naming process as described in PBench Output File Name Format.
Example:
"save_column_metadata": true
Format
"save_json": Boolean
Definition
Set save_json
to true
to save a successful query's JSON after the query is executed. The file name is [query_name].json
. For example, ds_power_query_59.json
. This file is valuable when debugging a problem with a run of PBench.
A failed query also saves the error information for the query in a file named [query_name].error.json
.
This parameter and its value are inherited by child stages. Set the parameter to null
in a stage to unset the value inherited from a parent stage.
Example:
"save_json": true
Format
"save_output": Boolean
Definition
Set save_output
to true
to save the query result to files in raw form.
This parameter and its value are inherited by child stages. Set the parameter to null
in a stage to unset the value inherited from a parent stage.
The file name format uses the naming process as described in PBench Output File Name Format.
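For debugging a run, the three save parameters described on this page can be combined in one stage. A minimal sketch, with a hypothetical query:

```json
{
  "description": "Hypothetical debugging stage: keep query JSON, raw results, and column metadata.",
  "queries": [
    "select 'query 1'"
  ],
  "save_json": true,
  "save_output": true,
  "save_column_metadata": true
}
```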
Example:
"save_output": true
Format
"schema": "schema-name"
Definition
Set the schema for queries in queries
and query_files
.
catalog
and schema
cannot be set to null
.
This parameter and its value are inherited by child stages.
New values for catalog
, schema
, session_params
, and timezone
assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true
.
Example:
"schema": "sf1"
Format
"session_params": {
"session-property-name": "session-property-value"
}
Definition
Session properties passed to Presto.
This parameter and its value are inherited by child stages.
Set a session parameter to null
to unset the value inherited from a parent stage.
New values for catalog
, schema
, session_params
, and timezone
assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true
.
Example:
"session_params": {
"iceberg.hive_statistics_merge_strategy": "USE_NULLS_FRACTION_AND_NDV",
"hive.pushdown_filter_enabled": false
}
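As a sketch of overriding inherited session properties, the hypothetical child stage below unsets one property inherited from a parent by setting it to null and assigns a new value to another. It also sets start_on_new_client, since this page states that new session_params values are not applied to the Presto client otherwise. The property names are taken from the example above, and the values are illustrative.

```json
{
  "description": "Hypothetical child stage: unsets one inherited session property and changes another.",
  "session_params": {
    "iceberg.hive_statistics_merge_strategy": null,
    "hive.pushdown_filter_enabled": true
  },
  "start_on_new_client": true
}
```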
Format
"shell_scripts": [
"shell_command"
]
Definition
Run a shell script after executing all SQL queries in queries
and query_files
.
SQL queries in queries
are executed first, then SQL queries in files listed in query_files
are read and executed, then external commands in shell_scripts
are run.
Example:
"shell_scripts": [
"echo \"this is a script\"",
"python3 test_script.py",
"ls -l"
]
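The execution order described above can be sketched in a single stage. The query, file path, and command below are hypothetical. Per the ordering on this page, the inline query would run first, then the queries in the file, then the shell command.

```json
{
  "description": "Hypothetical stage: inline query, then a query file, then a shell command.",
  "queries": [
    "select 'query 1'"
  ],
  "query_files": [
    "queries/query_01.sql"
  ],
  "shell_scripts": [
    "echo \"queries finished\""
  ]
}
```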
Format
"start_on_new_client": Boolean
Definition
Set start_on_new_client
to true
to have this stage create a new client to execute itself. Each client has its own set of client information, tags, session properties, user credentials, and other parameters.
Example:
"start_on_new_client": true
Format
"timezone": "timezone_string"
Definition
The value of timezone_string
can be any value in the Time Zone ID column of Time Zone ID.
The default value of timezone
is the user's local timezone.
New values for catalog
, schema
, session_params
, and timezone
assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true
.
This parameter and its value are inherited by child stages.
Example:
"timezone": "America/Los_Angeles"
Format
"warm_runs": integer
Definition
The number of query runs to perform after the number of cold runs. The default value is 0
.
This parameter and its value are inherited by child stages. Set the parameter to null
in a stage to unset the value inherited from a parent stage.
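As a sketch of how the two run counts combine, the hypothetical stage below asks for one cold run and two warm runs. Reading the definitions on this page together, each query would then run three times in total. That total is an inference from the text, not a documented guarantee.

```json
{
  "description": "Hypothetical stage: 1 cold run followed by 2 warm runs per query.",
  "queries": [
    "select 'query 1'"
  ],
  "cold_runs": 1,
  "warm_runs": 2
}
```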
Example:
"warm_runs": 2