2011 census microdata play #68

Open
wants to merge 41 commits into
base: develop
41 commits
86366c2 census microdata run inputs (edwardchalstrey1, Jul 14, 2021)
fda7a59 add census microdata dataset (edwardchalstrey1, Jul 14, 2021)
3c5ff84 get synthpop working (edwardchalstrey1, Jul 15, 2021)
934f444 fix microdata (edwardchalstrey1, Jul 15, 2021)
dc3f9a5 specify classifier (edwardchalstrey1, Jul 15, 2021)
493eff6 fix column error (edwardchalstrey1, Jul 15, 2021)
9cb2bd7 small dataset (edwardchalstrey1, Jul 15, 2021)
dfc0cd5 hist plots notebook (edwardchalstrey1, Jul 15, 2021)
34ffb30 use small dataset census synthpop (edwardchalstrey1, Jul 15, 2021)
62baa6c synthpop with cart multiple synth datasets (edwardchalstrey1, Jul 15, 2021)
925c2b6 categorical columns (edwardchalstrey1, Jul 16, 2021)
cbee886 heatmaps (edwardchalstrey1, Jul 16, 2021)
8fe9d51 add categorical columns explanation (edwardchalstrey1, Jul 16, 2021)
c9167e6 some edits (edwardchalstrey1, Jul 16, 2021)
1a1dd4d fix ctgan import (edwardchalstrey1, Aug 3, 2021)
27f6021 make examples to run false by default (edwardchalstrey1, Aug 3, 2021)
d558f05 standardise census dataset example runs (edwardchalstrey1, Aug 3, 2021)
3d3c735 notebook comparing methods (edwardchalstrey1, Aug 3, 2021)
86f81cf add disclosure risk comparison (edwardchalstrey1, Aug 4, 2021)
782da0d utility comparison (edwardchalstrey1, Aug 4, 2021)
a2098ae notes (edwardchalstrey1, Aug 16, 2021)
12e4743 update columns (edwardchalstrey1, Aug 17, 2021)
e6a1809 Add helper 'run' targets to Makefile (ots22, Jan 19, 2021)
987929a Fix to Makefile run dependencies (ots22, Jan 20, 2021)
0ed0b5a Adjust dependencies of Makefile 'run' target (ots22, Aug 17, 2021)
57d0693 Switch to git+https protocol for DataSynthesizer requirement (to work… (ots22, Aug 17, 2021)
c3885a5 change utility metric and try variants census ds (edwardchalstrey1, Aug 19, 2021)
cde10bc extra input columns utility classifier (edwardchalstrey1, Aug 19, 2021)
7f48fc4 Add a 'privacy metric' that produces synthetic data with leaked records (ots22, Sep 2, 2021)
b2b72c7 Makefile rules for leaky output (ots22, Sep 2, 2021)
7cfff55 Update Makefile all target (ots22, Sep 2, 2021)
c4dd3a9 Fix typo in Makefile (ots22, Sep 2, 2021)
a940ddb Add introductory notebook (ots22, Sep 2, 2021)
258eb4c Move census notebooks (ots22, Sep 2, 2021)
9eb7492 Add Sharepoint links (ots22, Sep 2, 2021)
2e766e1 Fix Sharepoint links (ots22, Sep 6, 2021)
9f23739 Update Overview.ipynb (ots22, Sep 6, 2021)
a451f90 Add notebook example (ots22, Sep 15, 2021)
e616777 add some explainers (Feb 11, 2022)
a7e4348 additional explanation about file naming (Feb 11, 2022)
396590d Merge pull request #74 from callummole/2011-census-microdata (Feb 11, 2022)
26 changes: 25 additions & 1 deletion Makefile
@@ -25,10 +25,12 @@ SYNTH_OUTPUTS_CSV = $(addsuffix /synthetic_data_1.csv,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_PRIV_DISCL_RISK = $(addsuffix /privacy_disclosure_risk.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_UTIL_CLASS = $(addsuffix /utility_classifiers.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_UTIL_CORR = $(addsuffix /utility_correlations.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_LEAKY = $(addsuffix /synth_data_leaked_1.csv,$(SYNTH_OUTPUTS_PREFIX))


.PHONY: all all-synthetic generated-data clean

- all: $(SYNTH_OUTPUTS_PRIV_DISCL_RISK) $(SYNTH_OUTPUTS_UTIL_CLASS) $(SYNTH_OUTPUTS_UTIL_CORR)
+ all: $(SYNTH_OUTPUTS_PRIV_DISCL_RISK) $(SYNTH_OUTPUTS_UTIL_CLASS) $(SYNTH_OUTPUTS_UTIL_CORR) $(SYNTH_OUTPUTS_LEAKY)

all-synthetic: $(SYNTH_OUTPUTS_CSV)

@@ -81,6 +83,11 @@ synth-output/%/privacy_disclosure_risk.json : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/privacy-metrics/disclosure_risk.py -i $< -o $$(dirname $@)

$(SYNTH_OUTPUTS_LEAKY) : \
synth-output/%/synth_data_leaked_1.csv : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/privacy-metrics/leaky_output.py -i $< -o $$(dirname $@)
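
The diff does not show `leaky_output.py` itself, so the following is only an illustrative sketch of what "synthetic data with leaked records" (the wording of commit 7f48fc4) could mean: a number of synthetic rows are overwritten with real rows. The function name `leak_records` and its parameters are hypothetical and are not taken from the repository.

```python
import random


def leak_records(real_rows, synthetic_rows, n_leaked, seed=0):
    """Return a copy of synthetic_rows with n_leaked rows replaced by real rows.

    Hypothetical illustration only; the actual leaky_output.py may work differently.
    """
    rng = random.Random(seed)
    # Pick which real records to leak and which synthetic positions to overwrite.
    leaked = rng.sample(real_rows, n_leaked)
    positions = rng.sample(range(len(synthetic_rows)), n_leaked)
    output = list(synthetic_rows)
    for position, row in zip(positions, leaked):
        output[position] = row
    return output


real = [("real", i) for i in range(10)]
synthetic = [("synthetic", i) for i in range(10)]
leaky = leak_records(real, synthetic, n_leaked=3)
```

A dataset built this way has a known number of disclosed records, which is one way to check that a disclosure-risk metric responds to real leakage.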

$(SYNTH_OUTPUTS_UTIL_CLASS) : \
synth-output/%/utility_classifiers.json : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
@@ -92,6 +99,23 @@ run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/utility-metrics/correlations.py -i $< -o $$(dirname $@)


##-------------------------------------
## Helper targets for individual inputs
##-------------------------------------

## make run-example
##
## produces synthetic data and metrics from run-inputs/example.json with output in synth-output/example/

run-% :\
synth-output/%/synthetic_data_1.csv\
synth-output/%/utility_correlations.json\
synth-output/%/utility_classifiers.json\
synth-output/%/privacy_disclosure_risk.json\
synth-output/%/synth_data_leaked_1.csv\
;
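
The `run-%` pattern above maps a run-input name to a fixed set of output files. As a minimal sketch of that mapping, assuming the directory layout named in the Makefile rules (`run-inputs/` and `synth-output/`), the helper names `input_path` and `expected_outputs` below are hypothetical and only mirror the rule's prerequisites:

```python
from pathlib import PurePosixPath
from typing import List

# File names copied from the prerequisites of the `run-%` rule.
RUN_OUTPUT_FILES = [
    "synthetic_data_1.csv",
    "utility_correlations.json",
    "utility_classifiers.json",
    "privacy_disclosure_risk.json",
    "synth_data_leaked_1.csv",
]


def input_path(run_name: str, input_root: str = "run-inputs") -> str:
    """Path of the JSON run input that the pattern rules read."""
    return str(PurePosixPath(input_root) / f"{run_name}.json")


def expected_outputs(run_name: str, output_root: str = "synth-output") -> List[str]:
    """Paths that `make run-<run_name>` is expected to build."""
    base = PurePosixPath(output_root) / run_name
    return [str(base / filename) for filename in RUN_OUTPUT_FILES]
```

For example, `expected_outputs("example")` lists the five files under `synth-output/example/` that `make run-example` depends on.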


##-------------------------------------
## Clean
##-------------------------------------
569,743 changes: 569,743 additions & 0 deletions datasets-raw/2011-census-microdata/2011-census-microdata.csv

Large diffs are not rendered by default.

50,001 changes: 50,001 additions & 0 deletions datasets/2011-census-microdata/2011-census-microdata-small.csv

Large diffs are not rendered by default.

76 changes: 76 additions & 0 deletions datasets/2011-census-microdata/2011-census-microdata-small.json
@@ -0,0 +1,76 @@
{
"columns": [
{
"name": "Person ID",
"type": "DiscreteNumerical"
},
{
"name": "Region",
"type": "Categorical"
},
{
"name": "Residence Type",
"type": "Categorical"
},
{
"name": "Family Composition",
"type": "DiscreteNumerical"
},
{
"name": "Population Base",
"type": "DiscreteNumerical"
},
{
"name": "Sex",
"type": "Categorical"
},
{
"name": "Age",
"type": "DiscreteNumerical"
},
{
"name": "Marital Status",
"type": "Categorical"
},
{
"name": "Student",
"type": "Categorical"
},
{
"name": "Country of Birth",
"type": "Categorical"
},
{
"name": "Health",
"type": "Categorical"
},
{
"name": "Ethnic Group",
"type": "Categorical"
},
{
"name": "Religion",
"type": "Categorical"
},
{
"name": "Economic Activity",
"type": "Categorical"
},
{
"name": "Occupation",
"type": "Categorical"
},
{
"name": "Industry",
"type": "Categorical"
},
{
"name": "Hours worked per week",
"type": "DiscreteNumerical"
},
{
"name": "Approximated Social Grade",
"type": "Categorical"
}
]
}
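
A column metadata file of this shape can be grouped by type before synthesis or plotting. The sketch below assumes only the `columns` / `name` / `type` structure shown above; the function name `columns_by_type` is hypothetical, and the embedded JSON is a three-column excerpt rather than the full file.

```python
import json
from collections import defaultdict


def columns_by_type(metadata: dict) -> dict:
    """Group column names by their declared type, preserving file order."""
    grouped = defaultdict(list)
    for column in metadata["columns"]:
        grouped[column["type"]].append(column["name"])
    return dict(grouped)


# Excerpt of the metadata above, inlined so the example is self-contained.
EXAMPLE_METADATA = json.loads("""
{
  "columns": [
    {"name": "Region", "type": "Categorical"},
    {"name": "Age", "type": "DiscreteNumerical"},
    {"name": "Sex", "type": "Categorical"}
  ]
}
""")

grouped = columns_by_type(EXAMPLE_METADATA)
```

With the full file, the same call would separate the `Categorical` columns from the `DiscreteNumerical` ones such as `Age` and `Hours worked per week`.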