diff --git a/README.md b/README.md
index 0f775fc..9e6f13f 100644
--- a/README.md
+++ b/README.md
@@ -301,44 +301,3 @@ You can execute the command in docker container as follows:
 ```bash
 $ docker exec -it cete-node1 cete node --grpc-addr=:5050
 ```
-
-
-## Wikipedia example
-
-This section explain how to index Wikipedia dump to Blast.
-
-
-### Install wikiextractor
-
-```bash
-$ cd ${HOME}
-$ git clone git@github.com:attardi/wikiextractor.git
-```
-
-
-### Download wikipedia dump
-
-```bash
-$ curl -o ~/tmp/enwiki-20190101-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20190101/enwiki-20190101-pages-articles.xml.bz2
-```
-
-
-### Parsing wikipedia dump
-
-```bash
-$ cd wikiextractor
-$ ./WikiExtractor.py -o ~/tmp/enwiki --json ~/tmp/enwiki-20190101-pages-articles.xml.bz2
-```
-
-
-### Indexing wikipedia dump
-
-```bash
-$ for FILE in $(find ~/tmp/enwiki -type f -name '*' | sort)
-  do
-    echo "Indexing ${FILE}"
-    TIMESTAMP=$(date -u "+%Y-%m-%dT%H:%M:%SZ")
-    DOCS=$(cat ${FILE} | jq -r '. + {fields: {url: .url, title_en: .title, text_en: .text, timestamp: "'${TIMESTAMP}'", _type: "enwiki"}} | del(.url) | del(.title) | del(.text) | del(.fields.id)' | jq -s)
-    curl -s -X PUT -H 'Content-Type: application/json' "http://127.0.0.1:8080/documents" -d "${DOCS}"
-  done
-```