Wikipedia Links

[Dataset Download]

Wikipedia articles cite many references from across the internet, so their external links are likely to point to some high-quality web content. This dataset contains 58k URLs taken from the Indonesian Wikipedia external links dump.

You will need to install lm_dataformat (for example with pip install lm_dataformat) to read it.

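from lm_dataformat import Reader

# Open the downloaded dataset at the path 'wikipedia-links'
rdr = Reader('wikipedia-links')

# Stream the documents one at a time and print each one
for doc in rdr.stream_data():
    print(doc)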
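
Once the documents are streaming, a quick sanity check is to see which sites the links point to. The snippet below is a minimal sketch, assuming each document yielded by `rdr.stream_data()` is a single URL string (the dataset description does not confirm this). The `count_domains` helper and the sample list are hypothetical and not part of the dataset or of lm_dataformat.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_domains(urls):
    """Count how often each host name appears in an iterable of URL strings."""
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url.strip()).hostname
        except ValueError:
            # Malformed URLs (for example a broken IPv6 literal) are skipped
            continue
        if host:  # entries without a host are not absolute URLs
            counts[host] += 1
    return counts


# Hypothetical sample standing in for documents from rdr.stream_data()
sample = [
    "https://example.com/a",
    "https://example.com/b",
    "http://example.org/",
    "not a url",
]
print(count_domains(sample).most_common(2))
```

With the real dataset, `count_domains(rdr.stream_data())` could be used in place of the sample list, provided the one-URL-per-document assumption holds.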