# Notes

```bash
spark-submit \
--class ch.epfl.lts2.wikipedia.DumpProcessor \
--master 'local[*]' \
--executor-memory 4g \
--driver-memory 4g \
--packages \
org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
sparkwiki/target/scala-2.11/sparkwiki_2.11-0.9.6.jar \
--dumpPath data/bzipped \
--outputPath data/processed \
--namePrefix enwiki-20190820
```

The format is kind of meh, but it seems appropriate for their use cases. The processor
dumps out several tab-separated CSV files with the following directory layout.

```bash
$ tree -d data/processed

data/processed/
├── categorylinks
├── page
│ ├── category_pages
│ └── normal_pages
└── pagelinks
```

```python
>>> spark.read.csv(
"data/processed/categorylinks",
sep="\t",
schema="start_id INT,name STRING,end_id INT,type STRING"
).show(n=5)
+--------+--------------------+--------+------+
|start_id| name| end_id| type|
+--------+--------------------+--------+------+
| 2137402|1000_V_DC_railway...|57839957|subcat|
|51991420|1000_V_DC_railway...|57839957| page|
|25064564|1000_V_DC_railway...|57839957| page|
|57839948|1000_V_DC_railway...|57839957|subcat|
| 60340|1000_V_DC_railway...|57839957| page|
+--------+--------------------+--------+------+

```
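
The other outputs can be peeked at the same way. Without pinning a schema, Spark just assigns `_c0`, `_c1`, ... column names; the pagelinks layout is defined by sparkwiki, so treat this as a peek rather than a spec:

```python
>>> spark.read.csv("data/processed/pagelinks", sep="\t").show(n=5)
```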

```bash
pipenv shell
python -m site
pyspark \
--conf spark.driver.memory=4g \
--conf spark.executor.memory=4g
```

```bash
spark-submit \
--class ch.epfl.lts2.wikipedia.DumpParser \
--master 'local[*]' \
--executor-memory 4g \
--driver-memory 4g \
--packages \
org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
sparkwiki/target/scala-2.11/sparkwiki_2.11-0.9.6.jar \
--dumpFilePath data/bzipped/enwiki-20190820-page.sql.bz2 \
--dumpType page \
--outputPath data/processed/pagecount \
--outputFormat parquet
```
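
To sanity-check the DumpParser output, read the parquet back in the pyspark shell; the column names come from sparkwiki's page schema, so the exact output will vary:

```python
>>> df = spark.read.parquet("data/processed/pagecount")
>>> df.printSchema()
>>> df.show(n=5, truncate=False)
```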

Run a Cassandra daemon for processing pageviews:

```bash
docker run \
-p 9042:9042 \
-d cassandra:latest
```
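
Once the node has finished starting up, a minimal connectivity check against the published CQL port (assumes the default localhost:9042 mapping from the command above):

```python
import socket

# the docker run above publishes the CQL port on localhost:9042
with socket.create_connection(("localhost", 9042), timeout=5):
    print("cassandra is accepting connections on 9042")
```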

```python
from datetime import datetime as dt, timedelta

# work out how far back the pagecount backfill should start
dump_date = "2019-08-20"
days = 512

fmt = "%Y-%m-%d"
end = dt.strptime(dump_date, fmt)
start = end - timedelta(days=days)
print(dt.strftime(start, fmt))  # 2018-03-26
```
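
A 512-day window ending at the dump date works out to 2018-03-26, which is the `--startDate` passed to the PagecountProcessor below.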

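Download the merged pagecounts-ez dumps covering that window: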
```bash
wget -r -np -nH --cut-dirs=3 https://dumps.wikimedia.org/other/pagecounts-ez/merged/2018/
wget -r -np -nH --cut-dirs=3 https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/
```

```bash
spark-submit \
--class ch.epfl.lts2.wikipedia.PagecountProcessor \
--master 'local[*]' \
--executor-memory 4g \
--driver-memory 4g \
--packages \
org.rogach:scallop_2.11:3.1.5,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0,com.typesafe:config:1.2.1 \
sparkwiki/target/scala-2.11/sparkwiki_2.11-0.9.6.jar \
--config sparkwiki/config/pagecount.conf \
--basePath data/processed/pagecount-cassandra \
--pageDump data/processed/pagecount \
--startDate 2018-03-26 \
--endDate 2019-08-20
```
