# CSVAutoSchemer
To install all dependencies, run:

```
pip install -r requirements.txt
```
You may find it useful to create a separate database for testing. To do this:

```
user$ sudo su - postgres
postgres$ psql
psql# CREATE DATABASE db_name;
```

Now you have created a db called `db_name`. You can also choose to create a new postgres user and grant it access to the database:

```
psql# CREATE USER user_name WITH PASSWORD 'password';
psql# GRANT ALL PRIVILEGES ON DATABASE db_name TO user_name;
```
You'll need to set up environment variables for your postgres db credentials. This is so that pgdb can connect to your specified db and create the schema. The required variables are:

- `dbname`: the name of the db you're connecting to, as created above in the postgres configuration step
- `dbuser`: the name of a user with sufficient privileges to access the above `dbname`
- `dbhost`: this will be `localhost` for our purposes
- `dbpwd`: depending on the method of authentication you configured in `/etc/postgresql/x.x/main/pg_hba.conf`, the program will require the password corresponding to `dbuser`. If the method is `trust`, then `dbpwd` is not required.
An example script would look like:

```bash
#!/bin/bash
# filename: db_credentials.sh
export dbname=testdb      # custom db name
export dbuser=postgres    # typically postgres, or your custom username if you created one
export dbhost=localhost
export dbpwd=textpwd      # only needed if the authentication method is not trust
```
Do not push this file to GitHub. You can then run

```
. db_credentials.sh
```

to load these variables into your environment.
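As a rough illustration, here is a minimal sketch of how a script can pick up these credentials from the environment and open a connection with pgdb; the fallback defaults are assumptions for this example, not part of the project:

```python
import os
import pgdb  # PyGreSQL's DB-API 2.0 module

# Read the credentials exported by db_credentials.sh.
dbname = os.environ["dbname"]
dbuser = os.environ["dbuser"]
dbhost = os.environ.get("dbhost", "localhost")
dbpwd = os.environ.get("dbpwd")  # may be unset when the auth method is 'trust'

# Open and immediately close a connection to verify the credentials work.
conn = pgdb.connect(database=dbname, host=dbhost, user=dbuser, password=dbpwd)
print("connected to", dbname)
conn.close()
```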
To run the autoschemer, you can simply run:

```
$ python AutoSchemer.py csv-file
```
The main components are described below.

## Parser

Parses the CSV data file. Returns:

- distinct rows => (col_num, num_distinct_rows)
- column numbers ordered by most distinct rows
- a list of type guessers
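As a rough sketch of the distinct-row bookkeeping described above (the function name and use of Python's csv module are illustrative assumptions, not the project's actual code):

```python
import csv

def parse(csv_path):
    """Count distinct values per column and rank columns by distinctness."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))

    num_cols = len(rows[0]) if rows else 0
    # distinct rows => (col_num, num_distinct_rows)
    distinct = [(col, len({row[col] for row in rows})) for col in range(num_cols)]
    # column numbers ordered by most distinct rows
    ordered = [col for col, _ in sorted(distinct, key=lambda d: d[1], reverse=True)]
    return distinct, ordered
```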
## TypeGuesser

An aggregate histogram heuristic to guess what the type of a column is. Used in the Parser while iterating through each row/column.
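To illustrate the general idea of a histogram-based type heuristic (a hedged sketch of the technique, not the project's actual TypeGuesser implementation):

```python
class TypeGuesser:
    """Tally which type each cell value looks like; the most frequent wins."""

    def __init__(self):
        self.histogram = {"int": 0, "float": 0, "text": 0}

    def observe(self, value):
        # Try the most specific type first; fall through to broader ones.
        try:
            int(value)
            self.histogram["int"] += 1
            return
        except ValueError:
            pass
        try:
            float(value)
            self.histogram["float"] += 1
        except ValueError:
            self.histogram["text"] += 1

    def guess(self):
        # The type observed most often across the column is the guess.
        return max(self.histogram, key=self.histogram.get)
```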
## AutoSchemer

The wrapper for running the program.
## Table

Tables have ids, a list of columns, a list of foreign keys that point to other tables when there is a relation between them, and a list of primary keys, which is not necessarily distinct from the list of columns.
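A minimal sketch of such a structure (the field names are assumptions based on the description above, not the project's actual class):

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    id: int
    columns: list = field(default_factory=list)        # column names
    foreign_keys: list = field(default_factory=list)   # e.g. (column, referenced_table_id) pairs
    primary_keys: list = field(default_factory=list)   # may overlap with columns
```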
## Schema

For simplicity, the schema currently contains two objects: 1) a list of tables, and 2) a list of types corresponding to the columns of the dataset. The latter is computed with the TypeGuesser class while parsing, as mentioned above.
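Continuing the sketch above with equally hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    tables: list = field(default_factory=list)        # Table objects, as sketched above
    column_types: list = field(default_factory=list)  # one guessed type per dataset column
```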