# CSVAutoSchemer
To install all dependencies, run:

```
pip install -r requirements.txt
```
You may find it useful to create a separate database for testing. To do this:

```
user$ sudo su - postgres
postgres$ psql
psql# CREATE DATABASE db_name;
```

Now you have created a db called `db_name`. You can also choose to create a new postgres user and grant it access to the database:

```
psql# CREATE USER user_name WITH PASSWORD 'password';
psql# GRANT ALL PRIVILEGES ON DATABASE db_name TO user_name;
```
You'll need to set up environment variables for your postgres db credentials. This is so that pgdb can connect to your specified db and create the schema. The required variables are:

- `dbname`: the name of the db you're connecting to, as created above in the postgres configuration step
- `dbuser`: the name of a user with sufficient privileges to access the above `dbname`
- `dbhost`: this will be `localhost` for our purposes
- `dbpwd`: depending on the method of authentication you configured in `/etc/postgresql/x.x/main/pg_hba.conf`, the program will require the password corresponding to `dbuser`. If the method is `trust`, then `dbpwd` is not required.
An example script would look like:

```bash
#!/bin/bash
# filename: db_credentials.sh
export dbname=testdb      # custom db name
export dbuser=postgres    # typically postgres, or your custom username if you created one
export dbhost=localhost
export dbpwd=textpwd      # only needed if the authentication method is not trust
```
Do not push this file to GitHub. You can then run

```
. db_credentials.sh
```

to load these variables into your environment.
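As a rough illustration, here is a minimal sketch of how a script can pick up these credentials from the environment and open a connection with pgdb; the fallback defaults are assumptions for this example, not part of the project:

```python
import os
import pgdb  # PyGreSQL's DB-API 2.0 module

# Read the credentials exported by db_credentials.sh.
dbname = os.environ["dbname"]
dbuser = os.environ["dbuser"]
dbhost = os.environ.get("dbhost", "localhost")
dbpwd = os.environ.get("dbpwd")  # may be unset when the auth method is 'trust'

# Open and immediately close a connection to verify the credentials work.
conn = pgdb.connect(database=dbname, host=dbhost, user=dbuser, password=dbpwd)
print("connected to", dbname)
conn.close()
```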
To run the autoschemer, you can simply run:

```
$ python AutoSchemer.py csv-file
```
The main components are described below.

## Parser

Parses the CSV data file. Returns:

- distinct rows => (col_num, num_distinct_rows)
- column numbers ordered by most distinct rows
- a list of type guessers
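As a rough sketch of the distinct-row bookkeeping described above (the function name and use of Python's csv module are illustrative assumptions, not the project's actual code):

```python
import csv

def parse(csv_path):
    """Count distinct values per column and rank columns by distinctness."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))

    num_cols = len(rows[0]) if rows else 0
    # distinct rows => (col_num, num_distinct_rows)
    distinct = [(col, len({row[col] for row in rows})) for col in range(num_cols)]
    # column numbers ordered by most distinct rows
    ordered = [col for col, _ in sorted(distinct, key=lambda d: d[1], reverse=True)]
    return distinct, ordered
```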
## TypeGuesser

An aggregate histogram heuristic to guess what the type of a column is. Used in the Parser while iterating through each row/column.
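To illustrate the general idea of a histogram-based type heuristic (a hedged sketch of the technique, not the project's actual TypeGuesser implementation):

```python
class TypeGuesser:
    """Tally which type each cell value looks like; the most frequent wins."""

    def __init__(self):
        self.histogram = {"int": 0, "float": 0, "text": 0}

    def observe(self, value):
        # Try the most specific type first; fall through to broader ones.
        try:
            int(value)
            self.histogram["int"] += 1
            return
        except ValueError:
            pass
        try:
            float(value)
            self.histogram["float"] += 1
        except ValueError:
            self.histogram["text"] += 1

    def guess(self):
        # The type observed most often across the column is the guess.
        return max(self.histogram, key=self.histogram.get)
```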
## AutoSchemer

The wrapper for running the program.
## Table

Tables have ids, a list of columns, a list of foreign keys that point to other tables when there is a relation between them, and a list of primary keys, which is not necessarily distinct from the list of columns.
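A minimal sketch of such a structure (the field names are assumptions based on the description above, not the project's actual class):

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    id: int
    columns: list = field(default_factory=list)        # column names
    foreign_keys: list = field(default_factory=list)   # e.g. (column, referenced_table_id) pairs
    primary_keys: list = field(default_factory=list)   # may overlap with columns
```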
## Schema

For simplicity, the schema currently contains two objects: 1) a list of tables, and 2) a list of types corresponding to the columns of the dataset. The latter is computed with the TypeGuesser class while parsing, as mentioned above.
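Continuing the sketch above with equally hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    tables: list = field(default_factory=list)        # Table objects, as sketched above
    column_types: list = field(default_factory=list)  # one guessed type per dataset column
```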