Commit

Merge pull request #21 from Baukebrenninkmeijer/develop
Baukebrenninkmeijer authored Dec 3, 2021
2 parents 81b250e + be93859 commit 7896d1b
Showing 4 changed files with 144 additions and 90 deletions.
13 changes: 9 additions & 4 deletions README.md
@@ -3,8 +3,7 @@
[![Supported versions](https://img.shields.io/pypi/pyversions/table_evaluator.svg)](https://pypi.python.org/pypi/table_evaluator)
![Package deployment](https://github.com/Baukebrenninkmeijer/table-evaluator/actions/workflows/python-publish.yml/badge.svg?branch=master)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/table_evaluator)](https://pypistats.org/packages/table_evaluator)

-[Official documentation](https://baukebrenninkmeijer.github.io/table-evaluator/)
+[![Documentation](https://img.shields.io/badge/Documentation-%20-blue)](https://baukebrenninkmeijer.github.io/table-evaluator/)

TableEvaluator is a library to evaluate how similar a synthesized dataset is to real data. In other words, it tries to give an indication of how real your fake data is. With the rise of GANs specifically designed for tabular data, many applications are becoming possible. For industries like finance, healthcare and governments, having the capacity to create high-quality synthetic data that does **not** have the privacy constraints of normal data is extremely valuable. Since this field is still quite young and developing, I created this library to have a consistent evaluation method for your models.
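
As a rough illustration of what "how similar" can mean for a single numeric column, the sketch below compares the mean and standard deviation of a real and a synthetic sample using only the Python standard library. It is not TableEvaluator's implementation; the `amount` column, the sample rows, and the helper names `column_stats` and `compare_column` are hypothetical.

```python
from statistics import mean, stdev


def column_stats(rows, column):
    """Return (mean, sample standard deviation) of one numeric column."""
    values = [row[column] for row in rows]
    return mean(values), stdev(values)


def compare_column(real_rows, fake_rows, column):
    """Absolute differences in mean and std between real and fake data."""
    real_mean, real_std = column_stats(real_rows, column)
    fake_mean, fake_std = column_stats(fake_rows, column)
    return {
        "mean_diff": abs(real_mean - fake_mean),
        "std_diff": abs(real_std - fake_std),
    }


# Hypothetical toy data; real tables would be loaded from CSV files.
real = [{"amount": 100.0}, {"amount": 200.0}, {"amount": 300.0}]
fake = [{"amount": 110.0}, {"amount": 190.0}, {"amount": 300.0}]

result = compare_column(real, fake, "amount")
print(result)
```

Smaller differences suggest the synthetic column tracks the real one more closely. A full evaluation would also look at correlations between columns, privacy, and downstream model performance, which is what the library's report covers.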

@@ -19,9 +18,15 @@ The test can be run by cloning the repo and running:
```
pytest tests
```
+If this does not work, the package may not be installed where Python can find it. In that case, please install it locally with:
+
+```
+pip install -e .
+```

## Usage
-**Please see the example notebook for the most up-to-date examples. The README example is just that notebook, but sometimes a bit outdated.**
+**Please see the [example notebook](https://github.com/Baukebrenninkmeijer/table-evaluator/blob/master/example_table_evaluator.ipynb) for the most up-to-date examples. The README example is just that notebook as markdown.**

Start by importing the class
```Python
from table_evaluator import load_data, TableEvaluator
@@ -142,6 +147,6 @@ table_evaluator.evaluate(target_col='trans_type')
Please see the full documentation on [https://baukebrenninkmeijer.github.io/table-evaluator/](https://baukebrenninkmeijer.github.io/table-evaluator/).

## Motivation
-To see the motivation for my decisions, please have a look at my master's thesis, found at [https://www.ru.nl/publish/pages/769526/z04_master_thesis_brenninkmeijer.pdf](https://www.ru.nl/publish/pages/769526/z04_master_thesis_brenninkmeijer.pdf)
+To see the motivation for my decisions, please have a look at my master's thesis, found at the [Radboud University](https://www.ru.nl/publish/pages/769526/z04_master_thesis_brenninkmeijer.pdf)

If you have any tips or suggestions, please send me an email.
94 changes: 46 additions & 48 deletions example_table_evaluator.ipynb
@@ -10,17 +10,26 @@
},
{
"cell_type": "code",
-"execution_count": 1,
+"execution_count": 7,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"The autoreload extension is already loaded. To reload it, use:\n",
+" %reload_ext autoreload\n"
+]
+}
+],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
-"execution_count": 2,
+"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
@@ -29,7 +38,7 @@
},
{
"cell_type": "code",
-"execution_count": 3,
+"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -38,7 +47,7 @@
},
{
"cell_type": "code",
-"execution_count": 4,
+"execution_count": 10,
"metadata": {},
"outputs": [
{
@@ -148,7 +157,7 @@
"4 WITHDRAWAL_IN_CASH UNKNOWN 654 "
]
},
-"execution_count": 4,
+"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -159,7 +168,7 @@
},
{
"cell_type": "code",
-"execution_count": 5,
+"execution_count": 11,
"metadata": {},
"outputs": [
{
@@ -269,7 +278,7 @@
"4 REMITTANCE_TO_OTHER_BANK HOUSEHOLD 1211 "
]
},
-"execution_count": 5,
+"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
@@ -280,7 +289,7 @@
},
{
"cell_type": "code",
-"execution_count": 6,
+"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
@@ -289,7 +298,7 @@
},
{
"cell_type": "code",
-"execution_count": 44,
+"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@@ -305,49 +314,38 @@
},
{
"cell_type": "code",
-"execution_count": 45,
+"execution_count": 24,
"metadata": {},
"outputs": [
{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"\n",
-"Classifier F1-scores and their Jaccard similarities::\n",
-" f1_real f1_fake jaccard_similarity\n",
-"index \n",
-"LogisticRegression_real_testset 0.7800 0.7750 0.9704\n",
-"LogisticRegression_fake_testset 0.7550 0.7450 0.9048\n",
-"RandomForestClassifier_real_testset 0.9850 0.9850 1.0000\n",
-"RandomForestClassifier_fake_testset 0.9650 0.9650 1.0000\n",
-"DecisionTreeClassifier_real_testset 0.9800 0.9650 0.9512\n",
-"DecisionTreeClassifier_fake_testset 0.9600 0.9150 0.9139\n",
-"MLPClassifier_real_testset 0.4000 0.5000 0.5326\n",
-"MLPClassifier_fake_testset 0.4300 0.5450 0.4925\n",
-"\n",
-"Privacy results:\n",
-" result\n",
-"Duplicate rows between sets (real/fake) (0, 0)\n",
-"nearest neighbor mean 0.5655\n",
-"nearest neighbor std 0.3726\n",
-"\n",
-"Miscellaneous results:\n",
-" Result\n",
-"Column Correlation Distance RMSE 0.0399\n",
-"Column Correlation distance MAE 0.0296\n",
-"\n",
-"Results:\n",
-" result\n",
-"Basic statistics 0.9940\n",
-"Correlation column correlations 0.9904\n",
-"Mean Correlation between fake and real columns 0.9566\n",
-"1 - MAPE Estimator results 0.9251\n",
-"Similarity Score 0.9665\n"
-]
+"data": {
+"text/html": [
+"<h1 style=\"text-align: center\">Synthetic Data Report</h1>"
+],
+"text/plain": [
+"<IPython.core.display.HTML object>"
+]
+},
+"metadata": {},
+"output_type": "display_data"
+},
+{
+"data": {
+"application/vnd.jupyter.widget-view+json": {
+"model_id": "5674247a319f428a96b21a6d4b2dc626",
+"version_major": 2,
+"version_minor": 0
+},
+"text/plain": [
+"Tab(children=(VBox(children=(Output(),)), VBox(children=(Output(),)), VBox(children=(Output(),)), VBox(childre…"
+]
+},
+"metadata": {},
+"output_type": "display_data"
}
],
"source": [
-"evaluator.evaluate(target_col='trans_type')"
+"evaluator.evaluate(target_col='trans_type', notebook=True)"
]
},
{
@@ -437,7 +435,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.8"
+"version": "3.7.3"
}
},
"nbformat": 4,
