Wikipedia Links

[Dataset Download]

Wikipedia articles carry many references and citations that point to pages across the internet, and those pages are likely to include some high-quality web content. This dataset contains 58k URLs taken from the external links dump of Indonesian Wikipedia.

To read the dataset, install the lm_dataformat package first (for example, pip install lm_dataformat).

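from lm_dataformat import Reader

# Point the reader at the downloaded dataset directory
rdr = Reader('wikipedia-links')

# Stream the documents one at a time and print each one
for doc in rdr.stream_data():
    print(doc)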
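
Because the dataset covers about 58k URLs, printing every document may be more than you want for a first look. The following is a minimal sketch, not part of the dataset's tooling, that takes only the first few items from any iterator. The helper name preview and the stand-in sample data are hypothetical; with the real data you would pass rdr.stream_data() instead.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def preview(docs: Iterable[T], n: int = 5) -> List[T]:
    """Return at most the first n items of docs (hypothetical helper)."""
    # islice stops after n items, so a large stream is never read in full.
    return list(islice(docs, n))


# Stand-in data so the sketch runs without the dataset installed.
# With the real dataset, replace `sample` with rdr.stream_data().
sample = (f"document {i}" for i in range(58_000))

first = preview(sample, 3)
print(first)
```

Since stream_data() yields documents lazily, stopping early this way should avoid reading the whole archive.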