Fake Pandas / PySpark DataFrame creator.
pip install farsante
Here's how to quickly create a 7-row PySpark DataFrame with `first_name` and `last_name` fields.
import farsante
df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+
Here's how to create a 5-row PySpark DataFrame with first and last names in Mexican Spanish.
import farsante
from mimesis import Person
mx = Person('es-mx')
df = farsante.pyspark_df([mx.first_name, mx.last_name], 5)
df.show()
+-----------+---------+
| first_name|last_name|
+-----------+---------+
|     Connie|    Xicoy|
|  Oliverios|   Merino|
|     Castel|    Yáñez|
|Guillelmina|   Prieto|
|     Gezane|   Campos|
+-----------+---------+
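Both examples above pass provider methods (such as `mx.first_name`) rather than values, and the resulting column names match the method names. The following is a minimal standard-library sketch of that idea, not farsante's actual implementation: `TinyPerson` and `fake_rows` are hypothetical names invented for illustration, and the name lists are placeholders.

```python
import random


class TinyPerson:
    """Hypothetical stand-in for a mimesis-style provider (not the real class)."""

    FIRST = ["Ana", "Luis", "Sofia"]
    LAST = ["Merino", "Prieto", "Campos"]

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def first_name(self):
        return self._rng.choice(self.FIRST)

    def last_name(self):
        return self._rng.choice(self.LAST)


def fake_rows(providers, num_rows):
    """Call every provider once per row; name each column after its function."""
    columns = [provider.__name__ for provider in providers]
    return [
        dict(zip(columns, (provider() for provider in providers)))
        for _ in range(num_rows)
    ]


person = TinyPerson(seed=1)
rows = fake_rows([person.first_name, person.last_name], 5)
for row in rows:
    print(row)
```

Because the callables are passed uninvoked, each row gets a fresh value, which is presumably why the farsante examples pass `mx.first_name` and not `mx.first_name()`.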
Here's how to quickly create a 3-row Pandas DataFrame with `first_name` and `last_name` fields.
import farsante
df = farsante.quick_pandas_df(['first_name', 'last_name'], 3)
print(df)
  first_name last_name
0       Toby   Rosales
1      Gregg    Hughes
2    Terence       Ray
Here's how to create a 5-row Pandas DataFrame with Russian first and last names.
import farsante
from mimesis import Person
ru = Person('ru')
df = farsante.pandas_df([ru.first_name, ru.last_name], 5)
print(df)
  first_name  last_name
0      Амиль Ханженкова
1  Славентий Голумидова
2    Паладин  Волосиков
3       Акша   Бабашова
4       Ника   Синусова
Here's how to create a CSV file with some fake data:
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
person = Person()
address = Address()
datetime = Datetime()
df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_csv('./tmp/fake_data.csv', index=False)
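The `to_csv` call above writes into `./tmp`, and pandas does not create missing parent directories, so the directory has to exist first. The sketch below shows the same "create the folder, then write" pattern using only the standard library. `write_rows_to_csv` is a hypothetical helper and the rows are placeholder values, not farsante output.

```python
import csv
import tempfile
from pathlib import Path


def write_rows_to_csv(rows, path):
    """Create the parent directory if needed, then write dict rows as CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Placeholder rows standing in for generated fake data.
rows = [
    {"full_name": "Example Person", "city": "Example City"},
    {"full_name": "Another Person", "city": "Another City"},
]

# A temporary directory keeps the sketch from touching the working tree.
out = write_rows_to_csv(rows, Path(tempfile.mkdtemp()) / "tmp" / "fake_data.csv")
print(out.read_text(encoding="utf-8"))
```

With pandas, the equivalent preparation is a single `Path('./tmp').mkdir(parents=True, exist_ok=True)` before calling `df.to_csv(...)`.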
Here's how to create a Parquet file with fake data, reusing the `person`, `address`, and `datetime` providers from the CSV example:
df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
df.to_parquet('./tmp/fake_data.parquet', index=False)
The h2o benchmark (H2O.ai's db-benchmark) is widely used to compare data processing engines. Farsante uses Rust to generate h2o datasets.
The following datasets are currently supported:
| name | rows | cols | column types | nulls |
|---|---|---|---|---|
| groupby | n | 9 | 6 id cols, 2 int cols, 1 float col | optional |
| join_big | n | 7 | 6 id cols, 1 float col | no |
| join_big_na | n | 7 | 6 id cols, 1 float col | optional |
| join_medium | n / 1000 | 5 | 4 id cols, 1 float col | optional |
| join_small | n / 1_000_000 | 4 | 3 id cols, 1 float col | optional |
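As a worked example of the rows column, the sketch below computes the row count of each dataset for a given `n`. `h2o_row_counts` is a hypothetical helper, and the use of integer division is an assumption; the table only gives the ratios.

```python
def h2o_row_counts(n):
    """Row count per dataset for a given n, following the table's ratios."""
    return {
        "groupby": n,
        "join_big": n,
        "join_big_na": n,
        "join_medium": n // 1000,       # assumption: integer division
        "join_small": n // 1_000_000,   # assumption: integer division
    }


counts = h2o_row_counts(10_000_000)
for name, rows in counts.items():
    print(f"{name}: {rows}")
```

For `n = 10_000_000`, this gives 10,000 rows for `join_medium` and 10 rows for `join_small`.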
To create one of the above datasets, use the `generate_h2o_dataset()` function in `farsante.h2o_dataset_create`:
from farsante import generate_h2o_dataset

generate_h2o_dataset(
    ds_type="join_big",
    n=10_000_000,
    k=10,
    nas=10,
    seed=10,
)
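To make the parameters more concrete, here is a small pure-Python sketch of rows shaped like the `groupby` dataset from the table (6 id columns, 2 int columns, 1 float column). It is not how farsante generates data, which happens in Rust. The column names, the reading of `k` as the number of distinct groups, and the reading of `nas` as a percentage of nulls applied to the id columns are all assumptions modeled on the h2o benchmark, and `sketch_groupby_rows` is a hypothetical name.

```python
import random


def sketch_groupby_rows(n, k, nas, seed):
    """Return n dict rows with a groupby-like layout (illustrative only)."""
    rng = random.Random(seed)  # the same seed reproduces the same rows
    big = max(1, n // k)       # assumption: high-cardinality ids span n / k values

    def maybe_null(value):
        # Assumption: nas is the percentage of id values replaced by nulls.
        return None if rng.random() * 100 < nas else value

    rows = []
    for _ in range(n):
        rows.append({
            "id1": maybe_null(f"id{rng.randint(1, k):03d}"),
            "id2": maybe_null(f"id{rng.randint(1, k):03d}"),
            "id3": maybe_null(f"id{rng.randint(1, big):010d}"),
            "id4": maybe_null(rng.randint(1, k)),
            "id5": maybe_null(rng.randint(1, k)),
            "id6": maybe_null(rng.randint(1, big)),
            "v1": rng.randint(1, 5),               # int column
            "v2": rng.randint(1, 15),              # int column
            "v3": round(rng.uniform(0, 100), 6),   # float column
        })
    return rows


sample = sketch_groupby_rows(n=1_000, k=10, nas=10, seed=10)
print(sample[0])
```

Fixing the seed is what makes benchmark inputs reproducible: two calls with the same arguments return identical rows.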
To create all of the above datasets in parallel, use the `h2o_dataset_create_all.py` script:
python h2o_dataset_create_all.py --n 10000000 --k 10 --nas 10 --seed 42
To generate these datasets in Rust:
- Install Rust
- Install Cargo
- Install the Rust dependencies:
cargo install --path .
- Run the Rust program with `--help` to see the run options:
cargo run --release -- --help
- Then run it with the options you want, for example:
cargo run --release -- --n 10000000 --k 10 --nas 10 --seed 42
If you would like to help make Farsante better, take a look at our Contributing Guide.