Installation

s3cmd get s3://passim-rebuilt/IMP/IMP-1900.jsonl.bz2 ../impresso-passim/sample_data/

s3cmd get s3://passim-rebuilt/GDL/GDL-1900.jsonl.bz2 ../impresso-passim/sample_data/

bzcat ../impresso-passim/sample_data/GDL-1900.jsonl.bz2|head|jq --slurp ".[2]" > ../impresso-passim/sample_data/GDL-1900-12-12-a-i0029.json
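For reference, the extraction step above (decompress, keep the first ten lines, pick the element at index 2) can be sketched in Python. This is only an illustration: the helper name `nth_record`, the temporary file, and the `id`/`text` fields are hypothetical stand-ins, not the actual GDL data.

```python
import bz2
import json
import os
import tempfile
from itertools import islice


def nth_record(path, index=2, limit=10):
    """Return record `index` among the first `limit` lines of a bz2 JSONL file.

    Mirrors `bzcat FILE | head | jq --slurp ".[2]"` with the default arguments.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        # `head` keeps the first 10 lines; islice does the same lazily.
        records = [json.loads(line) for line in islice(handle, limit)]
    return records[index]


# Hypothetical stand-in data so the sketch runs without the real dump.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.jsonl.bz2")
    with bz2.open(sample, "wt", encoding="utf-8") as handle:
        for i in range(5):
            record = {"id": f"doc-{i}", "text": "placeholder text"}
            handle.write(json.dumps(record) + "\n")

    picked = nth_record(sample)
    print(json.dumps(picked, indent=2))
```

Because only the first `limit` lines are read, the sketch does not decompress the whole archive, which matches the behaviour of piping through `head`.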

json-df-schema ../impresso-passim/sample_data/GDL-1900-12-12-a-i0029.json > ../impresso-passim/sample_data/passim.schema.orig

Running

SPARK_SUBMIT_ARGS='--master local[36] --driver-memory 50G --executor-memory 50G --conf spark.local.dir=/scratch/matteo/spark-tmp/' passim --schema-path=/home/romanell/impresso_code/impresso-passim/sample_data/passim.schema "/home/romanell/impresso_code/impresso-passim/sample_data/*.bz2" "/scratch/matteo/passim/impresso/"

cat /scratch/matteo/passim/impresso/out.json/part-00104-d7ac9716-69bb-442c-9c04-4e425433e21f-c000.json|jq --slurp ".[0:10]"

Parameters

--labelPropagation merges clusters with a message-passing (label propagation) algorithm; the default is connected components.
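To make the difference between the two clustering strategies concrete, here is a toy sketch in plain Python. It is not passim's implementation: the graph, the function names, the node visiting order, and the tie-breaking rule (largest label wins) are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(edges):
    """Group nodes that are reachable from one another by any path."""
    adj = build_adjacency(edges)
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def label_propagation(edges, max_rounds=20):
    """Each node repeatedly adopts the most common label among its neighbours.

    Assumptions: nodes are visited in sorted order, updates are applied
    immediately, and ties go to the largest label.
    """
    adj = build_adjacency(edges)
    labels = {node: node for node in adj}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            counts = Counter(labels[n] for n in adj[node])
            best = max(counts.values())
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return list(groups.values())


# Two triangles joined by a single bridge edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

print("connected components:", connected_components(EDGES))
print("label propagation:   ", label_propagation(EDGES))
```

On this toy graph, connected components returns a single cluster because the bridge edge links everything, while this label propagation variant keeps the two triangles apart. The label propagation result depends on the visiting order and the tie-breaking rule, so other variants can give different clusters.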