Commit

Update README.md
mosuka committed Mar 30, 2019
1 parent e881d17 commit 339775a
Showing 1 changed file with 0 additions and 41 deletions.
41 changes: 0 additions & 41 deletions README.md
@@ -301,44 +301,3 @@ You can execute the command in docker container as follows:
```bash
$ docker exec -it cete-node1 cete node --grpc-addr=:5050
```


## Wikipedia example

This section explains how to index a Wikipedia dump into Blast.


### Install wikiextractor

```bash
$ cd ${HOME}
$ git clone git@github.com:attardi/wikiextractor.git
```


### Download Wikipedia dump

```bash
$ curl -o ~/tmp/enwiki-20190101-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20190101/enwiki-20190101-pages-articles.xml.bz2
```


### Parse Wikipedia dump

```bash
$ cd wikiextractor
$ ./WikiExtractor.py -o ~/tmp/enwiki --json ~/tmp/enwiki-20190101-pages-articles.xml.bz2
```


### Index Wikipedia dump

```bash
$ for FILE in $(find ~/tmp/enwiki -type f | sort)
  do
    echo "Indexing ${FILE}"
    # One UTC timestamp per file, stamped on every document in that file
    TIMESTAMP=$(date -u "+%Y-%m-%dT%H:%M:%SZ")
    # Move url/title/text under "fields", then collect all documents into one JSON array
    DOCS=$(jq --arg ts "${TIMESTAMP}" '. + {fields: {url: .url, title_en: .title, text_en: .text, timestamp: $ts, _type: "enwiki"}} | del(.url) | del(.title) | del(.text) | del(.fields.id)' "${FILE}" | jq -s '.')
    curl -s -X PUT -H 'Content-Type: application/json' "http://127.0.0.1:8080/documents" -d "${DOCS}"
  done
```
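
To see what the jq filter in the loop does to a single document, the sketch below applies the same filter to one hand-written record. The sample record, the fixed timestamp and the `transform` helper are made up for illustration. The filter itself only confirms the `url`, `title` and `text` keys, so the `id` key is an assumption about the WikiExtractor `--json` output. It needs only `jq`, not a running node.

```bash
# Hypothetical sample record; real input comes from WikiExtractor's --json output.
SAMPLE='{"id":"12","url":"https://en.wikipedia.org/wiki?curid=12","title":"Anarchism","text":"Anarchism is a political philosophy."}'

# Fixed timestamp so the result is reproducible (the loop uses the current UTC time).
TIMESTAMP="2019-01-01T00:00:00Z"

# Hypothetical helper: reads JSON documents on stdin and applies the loop's filter.
# The timestamp is passed with --arg instead of being spliced into the filter text.
transform() {
  jq -c --arg ts "$1" '. + {fields: {url: .url, title_en: .title, text_en: .text, timestamp: $ts, _type: "enwiki"}} | del(.url) | del(.title) | del(.text) | del(.fields.id)'
}

RESULT=$(printf '%s\n' "${SAMPLE}" | transform "${TIMESTAMP}")
echo "${RESULT}"
```

The top-level `url`, `title` and `text` keys end up under `fields` (as `url`, `title_en` and `text_en`), while any other top-level key, such as `id` here, is left in place.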
