Faux Data is a library for generating data using configuration files.
The configuration files are called templates. Within a template, columns
define the structure of the data and targets
define where to load the data.
The main aims of Faux Data are:
- Make it easy to generate schematically correct datasets
- Provide easy integration with cloud services specifically on the Google Cloud Platform
- Support serverless generation of data e.g. using a Cloud Function invocation to generate data
Faux Data was originally a Python port of the scala application dunnhumby/data-faker, but has evolved from there. The templates are still similar but are not directly compatible.
Install faux-data locally via pip
> pip install faux-data
check the install has been successful with
> faux --help
Create a file mytable.yaml
with the following contents:
tables:
- name: mytable
rows: 100
targets: []
columns:
- name: id
column_type: Sequential
data_type: Int
start: 1
step: 1
- name: event_time
column_type: Random
data_type: Timestamp
min: '{{ start }}'
max: '{{ end }}'
You can render this template with:
> faux render mytable.yaml
====================== Rendered Template =======================
tables:
- name: mytable
rows: 100
targets: []
columns:
- name: id
column_type: Sequential
data_type: Int
start: 1
step: 1
- name: event_time
column_type: Random
data_type: Timestamp
min: '2022-05-20 00:00:00'
max: '2022-05-21 00:00:00'
Notice that {{ start }} and {{ end }} are replaced with start and end dates automatically. Start and end are built-in variables that you can use in templates. Start defaults to the start of yesterday and end defaults to the end of yesterday.
If you run:
> faux render mytable.yaml --start 2022-06-10
====================== Rendered Template =======================
...
- name: event_time
column_type: Random
data_type: Timestamp
min: '2022-06-10 00:00:00'
max: '2022-06-11 00:00:00'
Notice now that {{ start }} and {{ end }} are now based on the provided --start
value.
The two columns we have added so far use the long form syntax, which can get a bit verbose, there's a shorter syntax that can be used as well. Lets add another column using the more concise syntax add the following column to your file.
- col: currency Selection String
values:
- USD
- GBP
- EUR
Now let's test that the data is generated correctly run the following to see a sample of generated data.
> faux sample mytable.yaml
Table: mytable
Sample:
id event_time currency
0 1 2022-05-20 14:47:56.866 EUR
1 2 2022-05-20 09:24:11.971 GBP
2 3 2022-05-20 14:11:00.144 GBP
3 4 2022-05-20 22:32:35.579 EUR
4 5 2022-05-20 00:31:02.248 GBP
Schema:
id Int64
event_time datetime64[ns]
currency string
dtype: object
In order for the data to be useful we need to load it somewhere, let's add a target to load the data to bigquery.
Add the following into the template replacing targets: []
targets:
- target: BigQuery
dataset: mydataset
table: mytable
To run this step you will need a google cloud project and to have your environment set up with google application credentials.
Then run
> faux run mytable.yaml
This will create a dataset in your google cloud project named mydataset and a table within called mytable and will load 100 rows of data to it.
faux-data templates support the following column_type:
s:
-
Random - Generates columns of random data
Examples: Random Ints, Random Timestamps, Random Strings
-
Selection - Generates columns by selecting random values
Examples: Simple Selection, Selection with Weighting
-
Sequential - Generates a column by incrementing a field from a
start:
by astep:
for each row.Examples: Sequential Ints, Sequential Timestamps
-
MapValues - Maps the values in one column to values in another
Examples: A Simple MapValues, Mapping a subset of values, MapValues with Default
-
Series - Repeats a series of values to fill a column
Examples: Simple Series
-
Fixed - Generates a column with a single fixed value
Examples: A Fixed String
-
Empty - Generates and empty (null) column of data
Examples: An Empty Float
-
Map - Map columns create a record style field from other fields
Examples: A Simple Map, Map with Json Output, Nested Map, Usage of
select_one:
-
Array - Builds an array from the specified
source_columns:
.Examples: Simple Array with primitive types, Use of
drop: False
, With nulls insource_columns:
, Use ofdrop_nulls: True
, Outputting a Json array -
ExtractDate - Extracts a date from a timestamp
source_column:
Examples: Simple ExtractDate, ExtractDate with custom formatting, ExtractDate to an Int, ExtractDate concise syntax
-
TimestampOffset - TimestampOffset produces a Timestamp column by adding some timedelta onto an existing timestamp column.
Examples: Simple TimestampOffset, Fixed TimestampOffset
Generates a column of values uniformly between min:
and max:
.
Random supports the following data_types:
- Int, Bool, Float, Decimal, String, Timestamp and TimestampAsInt.
Usage:
- name: my_random_col
column_type: Random
data_type: Decimal
min: 1000
max: 5000
Concise syntax:
- col: my_random_col Random Timestamp "2022-01-01 12:00:00" 2022-04-05
time_unit: us
Required params:
min:
the lower boundmax:
the uppper bound
Optional params:
decimal_places:
applies to Decimal and Float, rounds the output values to the specified decimal places, default 4.time_unit:
one of 's', 'ms', 'us', 'ns'. Applies to Timestamp and TimestampAsInt, controls the resolution of the resulting Timestamps or Ints, default is 'ms'.str_max_chars:
when using the String data_type this column will generate random strings of length between min and max. This value provides an extra limit on how long the strings can be. It exists to prevent enormous strings being generated accidently. Default limit is 5000.
Random Ints
A Random column generates random values between min:
and max:
.
Template:
name: simple_random_int
column_type: Random
data_type: Int
min: 5
max: 200
Result:
simple_random_int | |
---|---|
0 | 180 |
1 | 30 |
2 | 72 |
3 | 156 |
4 | 108 |
Random Timestamps
You can generate random timestamps as well.
Template:
name: event_time
column_type: Random
data_type: Timestamp
min: 2022-01-01
max: 2022-01-02 12:00:00
Result:
event_time | |
---|---|
0 | 2022-01-02 11:12:27.440000 |
1 | 2022-01-01 19:23:09.064000 |
2 | 2022-01-01 18:02:25.212000 |
3 | 2022-01-01 02:35:37.826000 |
4 | 2022-01-01 09:39:49.691000 |
Random Strings
Or random strings, where min and max are the length of the string.
Template:
name: message_id
column_type: Random
data_type: String
min: 4
max: 12
Result:
message_id | |
---|---|
0 | EgVyEFV |
1 | nmZyRyHhHTB |
2 | IdNE |
3 | OQoEVuOw |
4 | YgtbznIOSvR |
Generates a column by randomly selecting from a list of values:
.
Random supports the following data_types:
- Int, Bool, Float, Decimal, String and Timestamp.
Usage:
- name: my_selection_col
column_type: Selection
data_type: String
values:
- foo
- bar
- baz
weights:
- 10
- 2
Concise syntax:
- col: my_selection_col Selection Int
values:
- 0
- 10
- 100
Required params:
values:
the list of values to pick from, if the Booldata_type
is specified thenvalues
is automatically set to [True, False].
Optional params:
weights:
increases the likelyhood that certainvalues
will be selected. Weights are applied in the same order as the list ofvalues
. Allvalues
are assigned a weight of 1 by default so only differing weights need to be specified.
Simple Selection
A simple Selection column fills a column with a random selection from the provided values
.
Template:
name: simple_selection
column_type: Selection
values:
- first
- second
Result:
simple_selection | |
---|---|
0 | second |
1 | first |
2 | first |
3 | first |
4 | second |
Selection with Weighting
You can apply weighting to the provided values
to make some more likely to be selected. In this example USD is 10 times more likely than GBP and EUR is 3 times more likely than GBP. A specific weighting is not needed for GBP since it defaults to 1.
Template:
name: weighted_selection
column_type: Selection
values:
- USD
- EUR
- GBP
weights:
- 10
- 3
Result:
weighted_selection | |
---|---|
0 | USD |
1 | EUR |
2 | USD |
3 | EUR |
4 | USD |
Generates a column starting at start:
and increasing by step:
for each row.
Sequential supports the following data_types:
- Int, Float, Decimal and Timestamp.
Usage:
- name: my_sequential_col
column_type: Sequential
data_type: Int
start: 0
step: 1
Concise syntax:
- col: my_sequential_col Sequential Timestamp 1H30min
Required params:
start:
the value to start fromstep:
the amount to increment by per row
For the Timestamp data_type the step parameter should be specified in pandas offset format see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
Sequential Ints
A simple Sequential column can be used for generating incrementing Ids.
Template:
name: id
column_type: Sequential
data_type: Int
start: 1
step: 1
Result:
id | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
Sequential Timestamps
They can also be used for timestamps and can be written in the following concise syntax.
Template:
col: event_time Sequential Timestamp "1999-12-31 23:56:29" 1min45S
Result:
event_time | |
---|---|
0 | 1999-12-31 23:56:29 |
1 | 1999-12-31 23:58:14 |
2 | 1999-12-31 23:59:59 |
3 | 2000-01-01 00:01:44 |
4 | 2000-01-01 00:03:29 |
A maps the source_column:
values to the output column.
MapValues supports the following data_types:
- Int, Bool, Float, Decimal, String and Timestamp.
Usage:
- col: currency Selection String
values:
- EUR
- GBP
- USD
- name: currency_symbol
column_type: MapValues
source_column: currency
values:
EUR: €
GBP: £
Concise syntax:
- col: my_fixed_col Fixed String banana
Required params:
source_column:
the column to base onvalues:
the mapping to use
Optional params:
default:
the output value to use for source values that don't exist invalues
A Simple MapValues
A simple mapping
Template:
name: mytbl
rows: 5
columns:
- col: currency Selection String
values:
- EUR
- USD
- GBP
- name: symbol
column_type: MapValues
source_column: currency
values:
EUR: €
USD: $
GBP: £
Result:
currency | symbol |
---|---|
GBP | £ |
GBP | £ |
EUR | € |
EUR | € |
USD | $ |
Mapping a subset of values
Any values not specified in the mapping are left empty / null
Template:
name: mytbl
rows: 5
columns:
- col: currency Selection String
values:
- EUR
- USD
- GBP
- name: symbol
column_type: MapValues
source_column: currency
data_type: String
values:
EUR: €
Result:
currency | symbol |
---|---|
EUR | € |
EUR | € |
USD | |
EUR | € |
USD |
MapValues with Default
You can provide a default value to fill any gaps
Template:
name: mytbl
rows: 5
columns:
- col: currency Selection String
values:
- EUR
- USD
- GBP
- name: symbol
column_type: MapValues
source_column: currency
data_type: String
values:
EUR: €
default: "n/a"
Result:
currency | symbol |
---|---|
EUR | € |
GBP | n/a |
EUR | € |
USD | n/a |
GBP | n/a |
Fills a column with a series of repeating values:
.
Usage:
- name: my_series_col
column_type: Series
values:
- A
- B
- C
Concise syntax:
- col: my_series_col Series Int
values:
- 1
- 10
- 100
Required params:
values:
the values to repeat
Simple Series
A simple series
Template:
name: group
column_type: Series
values:
- A
- B
Result:
group | |
---|---|
0 | A |
1 | B |
2 | A |
3 | B |
4 | A |
A column filled with a single fixed value:
.
Fixed supports the following data_types:
- Int, Bool, Float, Decimal, String and Timestamp.
Usage:
- name: my_fixed_col
column_type: Fixed
data_type: String
value: banana
Concise syntax:
- col: my_fixed_col Fixed String banana
A Fixed String
A fixed string
Template:
name: currency
column_type: Fixed
value: BTC
Result:
currency | |
---|---|
0 | BTC |
1 | BTC |
2 | BTC |
3 | BTC |
4 | BTC |
An empty column.
An Empty Float
A simple empty column
Template:
name: pending_balance
column_type: Empty
data_type: Float
Result:
pending_balance | |
---|---|
0 | nan |
1 | nan |
2 | nan |
3 | nan |
4 | nan |
Creates a Map based on the specified columns:
or source_columns:
.
Typically you would not specify data_type for this column, the only exception is if you want the output serialised as a JSON string, use data_type: String
.
You can either specify a list of source_columns:
that refer to previously created data, or defined further columns:
to generate them inline.
Usage with source_columns
:
# generate some data
- col: id Sequential Int 1 1
- col: name Random Sting 3 10
# use the columns in a Map col
- name: my_map_col
column_type: Map
source_columns: [id, name]
Usage with sub columns:
and concise syntax:
- col: my_map_col Map
columns:
- col: id Sequential Int 1 1
- col: name Random String 3 10
Required params:
source_columns:
orcolumns:
the columns to combine into a Map
Optional params:
drop:
whether to drop thesource_columns:
from the data after combining them into a Map, default False forsource_columns:
and True forcolumns:
.select_one:
whether to randomly pick one of the fields in the Map and set all the others to null. Default False.
A Simple Map
You can create a Map field by specifing sub columns:
. Note that the intermediate format here is a python dict and so it renders with single quotes.
Template:
name: mytbl
rows: 5
columns:
- col: mymap Map
columns:
- col: id Random Int 100 300
- col: name Random String 2 5
Result:
mymap |
---|
{'id': 138, 'name': 'Bvmr'} |
{'id': 104, 'name': 'SZ'} |
{'id': 237, 'name': 'KqC'} |
{'id': 187, 'name': 'Lk'} |
{'id': 295, 'name': 'odSZ'} |
Map with Json Output
If you want a valid json field specify data_type: String
.
Template:
name: mytbl
rows: 5
columns:
- col: mymap Map
data_type: String
columns:
- col: id Random Int 100 300
- col: name Random String 2 5
Result:
mymap |
---|
{"id":172,"name":"rd"} |
{"id":183,"name":"bwv"} |
{"id":276,"name":"cF"} |
{"id":101,"name":"tE"} |
{"id":292,"name":"AZqe"} |
Nested Map
Similarly you can nest to any depth by adding Map columns within Map columns.
Template:
name: mytbl
rows: 5
columns:
- col: mymap Map
data_type: String
columns:
- col: id Random Int 100 300
- col: nestedmap Map
columns:
- col: balance Sequential Float 5.4 1.15
- col: status Selection String
values:
- active
- inactive
Result:
mymap |
---|
{"id":244,"nestedmap":{"balance":5.4,"status":"active"}} |
{"id":218,"nestedmap":{"balance":6.55,"status":"active"}} |
{"id":259,"nestedmap":{"balance":7.7,"status":"inactive"}} |
{"id":193,"nestedmap":{"balance":8.85,"status":"inactive"}} |
{"id":272,"nestedmap":{"balance":10.0,"status":"active"}} |
Usage of `select_one:`
Specifying select_one: True
, picks one field and masks all the others.
Template:
name: mytbl
rows: 5
columns:
- col: mymap Map String
select_one: True
columns:
- col: id Random Int 100 300
- col: name Random String 3 6
- col: status Selection
values:
- Y
- N
Result:
mymap |
---|
{"id":null,"name":"XYp","status":null} |
{"id":null,"name":"icTqX","status":null} |
{"id":161,"name":null,"status":null} |
{"id":null,"name":"wLmpY","status":null} |
{"id":null,"name":null,"status":"Y"} |
Creates a Array based on the specified source_columns:
.
Typically you would not specify data_type for this column, the only exception is if you want the output serialised as a JSON string, use data_type: String
.
Usage:
# generate some data
- col: int1 Sequential Int 1 1
- col: int2 Random Sting 3 10
- col: int3 Random Sting 40 200
# use the columns in an Array col
- name: my_array_col
column_type: Array
source_columns: [int1, int2, int3]
Concise syntax:
- col: my_array_col Array String
source_columns: [int1, int2, int3]
Required params:
source_columns:
the columns to combine into an Array
Optional params:
drop:
whether to drop thesource_columns:
from the data after combining them into an Array, default True.drop_nulls:
whether to drop null values when combining them into an Array, some targets, like BigQuery do not accept null values within an Array. This can also be used in combination withnull_percentage:
to create variable length Arrays.
Simple Array with primitive types
A simple array of primitive types
Template:
name: mytbl
rows: 5
columns:
- col: int1 Random Int 20 50
- col: int2 Random Int 50 90
- name: array_col
column_type: Array
source_columns: [int1, int2]
Result:
array_col |
---|
[45 81] |
[25 52] |
[29 69] |
[49 68] |
[37 71] |
Use of `drop: False`
The source columns are removed by default but you can leave them with drop: False
Template:
name: mytbl
rows: 5
columns:
- col: int1 Random Int 20 50
- col: int2 Random Int 50 90
- name: array_col
drop: False
column_type: Array
source_columns: [int1, int2]
Result:
int1 | int2 | array_col |
---|---|---|
43 | 66 | [43 66] |
23 | 62 | [23 62] |
31 | 79 | [31 79] |
46 | 59 | [46 59] |
35 | 62 | [35 62] |
With nulls in `source_columns:`
Nulls in the source_columns are included in the array by default
Template:
name: mytbl
rows: 5
columns:
- col: int1 Random Int 20 50
- col: int2 Random Int 50 90
null_percentage: 90
- name: array_col
drop: False
column_type: Array
source_columns: [int1, int2]
Result:
int1 | int2 | array_col |
---|---|---|
23 | [23 ] | |
34 | [34 ] | |
34 | 54 | [34 54] |
48 | [48 ] | |
40 | [40 ] |
Use of `drop_nulls: True`
Add drop_nulls: True
to remove them from the array
Template:
name: mytbl
rows: 5
columns:
- col: int1 Random Int 20 50
- col: int2 Random Int 50 90
null_percentage: 90
- name: array_col
drop: False
drop_nulls: True
column_type: Array
source_columns: [int1, int2]
Result:
int1 | int2 | array_col |
---|---|---|
27 | [27] | |
23 | [23] | |
49 | [49] | |
32 | [32] | |
32 | 56 | [32 56] |
Outputting a Json array
You can get a Json formatted string using data_type: String
Template:
name: mytbl
rows: 5
columns:
- col: int1 Random Int 20 50
- col: int2 Empty Int
- col: str1 Fixed String foo
- name: array_col
data_type: String
column_type: Array
source_columns: [int1, int2, str1]
Result:
array_col |
---|
[35, null, "foo"] |
[25, null, "foo"] |
[24, null, "foo"] |
[40, null, "foo"] |
[29, null, "foo"] |
Extracts and manipulates Dates from a Timestamp source_column:
.
ExtractDate supports the following data_types:
- Date, String, Int.
Usage:
# given a source timestamp column
- col: event_time Random Timestamp 2022-01-01 2022-02-01
# extract the date from it
- name: dt
column_type: ExtractDate
data_type: Date
source_column: event_time
Concise syntax:
- col: my_date_col ExtractDate Date my_source_col
Required params:
source_column:
- the column to base on
Optional params:
date_format:
used whendata_type:
is String or Int to format the timestamp. Follows python's strftime syntax. ForInt
the result of the formatting must be castable to an Integer i.e%Y%m
but not%Y-%m
.
Simple ExtractDate
You may want to use a timestamp column to populate another column. For example populating a dt
column with the date. The ExtractDate column provides an easy way to do this.
Template:
name: mytbl
rows: 5
columns:
- col: event_time Random Timestamp 2022-02-02 2022-04-01
- name: dt
column_type: ExtractDate
data_type: Date
source_column: event_time
Result:
event_time | dt |
---|---|
2022-02-26 01:32:35.762000 | 2022-02-26 |
2022-02-27 23:47:02.238000 | 2022-02-27 |
2022-03-18 22:39:40.763000 | 2022-03-18 |
2022-03-20 04:34:33.079000 | 2022-03-20 |
2022-03-04 07:10:01.409000 | 2022-03-04 |
ExtractDate with custom formatting
You can also extract the date as a string and control the formatting
Template:
name: mytbl
rows: 5
columns:
- col: event_time Random Timestamp 2022-02-02 2022-04-01
- name: day_of_month
column_type: ExtractDate
data_type: String
date_format: "A %A in %B"
source_column: event_time
Result:
event_time | day_of_month |
---|---|
2022-02-28 17:11:50.540000 | A Monday in February |
2022-03-19 03:16:24 | A Saturday in March |
2022-03-25 11:06:22.395000 | A Friday in March |
2022-03-13 03:29:12.939000 | A Sunday in March |
2022-03-20 10:16:00.610000 | A Sunday in March |
ExtractDate to an Int
Or extract part of the date as an integer
Template:
name: mytbl
rows: 5
columns:
- col: event_time Random Timestamp 2000-01-01 2010-12-31
- name: year
column_type: ExtractDate
data_type: Int
date_format: "%Y"
source_column: event_time
Result:
event_time | year |
---|---|
2010-04-30 09:50:50.606000 | 2010 |
2000-06-12 07:32:42.994000 | 2000 |
2009-08-18 13:45:34.744000 | 2009 |
2003-01-15 22:53:33.679000 | 2003 |
2005-03-26 03:30:13.897000 | 2005 |
ExtractDate concise syntax
The concise format for an ExtractDate column
Template:
name: mytbl
rows: 5
columns:
- col: event_time Random Timestamp 2000-01-01 2010-12-31
- col: dt ExtractDate Date event_time
Result:
event_time | dt |
---|---|
2008-10-05 14:07:49.209000 | 2008-10-05 |
2007-11-21 03:53:20.865000 | 2007-11-21 |
2001-08-14 02:12:11.216000 | 2001-08-14 |
2007-03-31 04:36:07.737000 | 2007-03-31 |
2000-10-05 04:27:12.835000 | 2000-10-05 |
Create a new column by adding or removing random time deltas from a Timestamp source_column:
.
Usage:
- col: start_time Random Timestamp 2021-01-01 2021-12-31
- name: end_time
column_type: TimestampOffset
min: 4H
max: 30D
Concise syntax:
- col: end_time TimestampOffset Timestamp 4H 30D
Required params:
min:
the lower boundmax:
the uppper bound
min:
and max:
should be specified in pandas offset format see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
Simple TimestampOffset
This example shows adding a random timedelta onto a timestamp field
Template:
name: mytbl
rows: 5
columns:
- col: game_id Random Int 2 8
- col: game_start Random Timestamp 2022-02-02 2022-04-01
- name: game_end
column_type: TimestampOffset
source_column: game_start
min: 1min30s
max: 25min
Result:
game_id | game_start | game_end |
---|---|---|
8 | 2022-03-28 17:27:34.087000 | 2022-03-28 17:33:16.087000 |
2 | 2022-03-16 19:00:59.738000 | 2022-03-16 19:11:42.738000 |
3 | 2022-02-04 09:52:10.625000 | 2022-02-04 10:15:45.625000 |
7 | 2022-03-17 22:49:48.922000 | 2022-03-17 22:56:18.922000 |
7 | 2022-03-03 18:39:15.099000 | 2022-03-03 18:59:52.099000 |
Fixed TimestampOffset
If you want a fixed offset from the source timestamp you can set min and max to the same value
Template:
name: mytbl
rows: 5
columns:
- col: promo_id Sequential Int 100 1
- col: promo_start Sequential Timestamp 2022-02-02T12:00:00 1min15s
- name: promo_end
column_type: TimestampOffset
source_column: promo_start
min: 1min15s
max: 1min15s
Result:
promo_id | promo_start | promo_end |
---|---|---|
100 | 2022-02-02 12:00:00 | 2022-02-02 12:01:15 |
101 | 2022-02-02 12:01:15 | 2022-02-02 12:02:30 |
102 | 2022-02-02 12:02:30 | 2022-02-02 12:03:45 |
103 | 2022-02-02 12:03:45 | 2022-02-02 12:05:00 |
104 | 2022-02-02 12:05:00 | 2022-02-02 12:06:15 |
faux-data templates support the following targets:
:
Target that loads data to BigQuery tables.
This will create datasets / tables that don't currently exist, or load data to existing tables.
Usage:
targets:
- target: BigQuery
dataset: mydataset # the name of the dataset where the table belongs
table: mytable # the name of the table to load to
# Optional parameters
project: myproject # the GCP project where the dataset exists defaults to the system default
truncate: True # whether to clear the table before loading, defaults to False
post_generation_sql: "INSERT INTO xxx" # A query that will be run after the data has been inserted
Target that creates files in cloud storage.
Supports csv and parquet filetype
s.
Usage:
targets:
- target: CloudStorage
filetype: csv / parquet
bucket: mybucket # the cloud storage bucket to save to
prefix: my/prefix # the path prefix to give to all objects
filename: myfile.csv # the name of the file
# Optional params
partition_cols: [col1, col2] # Optional. The columns within the dataset to partition on.
If partition_cols is specified then data will be split into separate files and loaded to cloud storage with filepaths that follow the hive partitioning structure. e.g. If a dataset has dt and currency columns and these are specified as partition_cols then you might expect the following files to be created:
- gs://bucket/prefix/dt=2022-01-01/currency=USD/filename
- gs://bucket/prefix/dt=2022-01-01/currency=EUR/filename
Target that creates files on the local file system
Supports csv and parquet filetype
s.
Usage:
targets:
- target: LocalFile
filetype: csv / parquet
filepath: path/to/myfile # an absolute or relative base path
filename: myfile.csv # the name of the file
# Optional params
partition_cols: [col1, col2] # Optional. The columns within the dataset to partition on.
If partition_cols is specified then data will be split into separate files and separate files / directories will be created with filepaths that follow the hive partitioning structure. e.g. If a dataset has dt and currency columns and these are specified as partition_cols then you might expect the following files to be created:
- filepath/dt=2022-01-01/currency=USD/filename
- filepath/dt=2022-01-01/currency=EUR/filename
Target that publishes data to Pubsub.
This target converts the data into json format and publishes each row as a separate pubsub message. It expects the topic to already exist.
Usage:
targets:
- target: Pubsub
topic: mytopic # the name of the topic
# Optional parameters
project: myproject # the GCP project where the topic exists defaults to the system default
output_cols: [col1, col2] # the columns to convert to json and use for the message body
attribute_cols: [col3, col4] # the columns to pass as pubsub message attributes, these columns will be removed from the message body unless they are also specified in the output_cols
attributes: # additional attributes to add to the pbsub messages
key1: value1
key2: value2
delay: 0.01 # the time in seconds to wait between each publish, default is 0.01
date_format: iso # how timestamp fields should be formatted in the json eith iso or epoch
time_unit: s # the resolution to use for timestamps, s, ms, us etc.
The library can be used in various ways
- via the faux cli
- as a library by importing it into your python project, instantiating templates and calling the
.generate()
or.run()
methods on them - running the code in a cloud function, passing a template to the cloud function at call time, or using a template stored in cloud storage
To use Google Cloud targets do one of the following:
- ensure that the
GOOGLE_CLOUD_PROJECT
environment variable is set - set the
FAUXDATA_GCP_PROJECT_ID
environment variable - place a toml file at
~/.fauxdata/config.toml
with the following contents:
gcp_project_id = "myproject"
To deploy a cloud function
gcloud functions deploy faux-data \
--region europe-west2 \
--project XXX \
--runtime python310 \
--trigger-http \
--set-env-vars='FAUXDATA_DEPLOYMENT_MODE=cloud_function,FAUXDATA_TEMPLATE_BUCKET=df2test,FAUXDATA_TEMPLATE_LOCATION=templates' \
--entry-point generate
You can specify variables in a template to make parts of it modifyable at runtime. A common use case for this is to control the number of rows of data generated as follows:
variables:
row_count: 100
tables:
- name: mytable
rows: {{ row_count }}
columns:
...
By default the above will generate 100 rows, but when running the template you can specify row_count to override this number. For examples through the cli you might run faux run mytemplate.yaml --row_count 40000
.
You may want to contain hardcoded data for some or all columns of a table. You can do this by specifying the path to the file as the rows:
attribute. The path should be either a path relative to the template directory or an absolute or cloud storage path.
The following example will look for mytable.csv in the same directory as the template yaml file. This will work whether the template file is local or stored in cloud storage.
tables:
- name: mytable
rows: mytable.csv
columns:
...
The following example, being an absolute pathm, will load the csv from cloud storage regardless of where the template yaml is loaded from.
tables:
- name: mytable
rows: gs://mybucket/templates/data/mytable.csv
columns:
...