spark-submit \
--class ch.epfl.lts2.wikipedia.DumpProcessor \
--master 'local[*]' \
--executor-memory 4g \
--driver-memory 4g \
--packages \
org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
target/scala-2.11/sparkwiki_2.11-0.9.6.jar \
--dumpPath data/bzipped \
--outputPath data/processed \
--namePrefix enwiki-20190820
The format is kind of meh, but it seems appropriate for their use cases. The job dumps out several tab-separated CSV files with the following directory layout.
$ tree -d data/processed
data/processed
├── categorylinks
├── page
│   ├── category_pages
│   └── normal_pages
└── pagelinks
>>> spark.read.csv(
    "data/processed/categorylinks",
    sep="\t",
    schema="start_id INT, name STRING, end_id INT, type STRING"
).show(n=5)
+--------+--------------------+--------+------+
|start_id| name| end_id| type|
+--------+--------------------+--------+------+
| 2137402|1000_V_DC_railway...|57839957|subcat|
|51991420|1000_V_DC_railway...|57839957| page|
|25064564|1000_V_DC_railway...|57839957| page|
|57839948|1000_V_DC_railway...|57839957|subcat|
| 60340|1000_V_DC_railway...|57839957| page|
+--------+--------------------+--------+------+
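The pagelinks output reads back the same way. A minimal sketch, assuming it follows the same tab-separated layout with a source and a target page id; the column names and set here are my guess, not sparkwiki's documented schema, so check a few rows of your own output first:

>>> spark.read.csv(
    "data/processed/pagelinks",
    sep="\t",
    # assumed schema: source page id and target page id;
    # verify against the actual sparkwiki output before relying on it
    schema="start_id INT, end_id INT"
).show(n=5)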