A small web crawler used to collect Kurdish text from the web.
It has these commands:
- Crawl: crawls web pages, extracts Kurdish text from them, and saves it to a folder on disk.
- Normalize: converts the text collected by the previous command to standard Unicode text.
- WordList: builds a wordlist from the text files produced by the previous command.
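The three commands are intended to be run in that order: crawl, then normalize, then build the wordlist. Below is a minimal sketch of the full pipeline as a Python script; the folder and file names are arbitrary examples, not names required by the tool.

```python
import subprocess

# Placeholder paths used only for illustration.
RAW_DIR = "./Data"
NORMALIZED_DIR = "./DataNormalized"
WORDLIST_FILE = "WORDLIST.txt"

# Step 1: crawl a site and save the extracted Kurdish text to RAW_DIR.
subprocess.run(["./crawler.exe", "crawl",
                "-url", "https://ckb.wikipedia.org",
                "-output", RAW_DIR,
                "-delay", "1000",
                "-pages", "250"], check=True)

# Step 2: normalize the collected text to standard Unicode.
subprocess.run(["./crawler.exe", "normalize",
                "-inputdir", RAW_DIR,
                "-outdir", NORMALIZED_DIR], check=True)

# Step 3: build a wordlist from the normalized text.
subprocess.run(["./crawler.exe", "wordlist",
                "-inputdir", NORMALIZED_DIR,
                "-outfile", WORDLIST_FILE], check=True)
```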
./crawler.exe crawl -url <url> -output <output> [-delay <delay>] [-pages <pages>]
- `url`: The absolute URL of the site you want to crawl.
- `output`: The folder where the crawled pages are saved. The crawler will also save a `$Stats.txt` file that contains the crawling stats.
- `delay`: Number of milliseconds to wait between crawling two pages. The default value is 1000.
- `pages`: Maximum number of pages to crawl. The default value is 250.
./crawler.exe crawl -url https://ckb.wikipedia.org -output ./Data
./crawler.exe crawl -url https://www.google.iq/ -output D:/CrawledPages/ -delay 250 -pages 1000
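The README does not spell out how the crawler decides which text counts as Kurdish. One common heuristic is to keep only text whose letters fall mostly in the Arabic-script Unicode ranges used by Central Kurdish (Sorani). The sketch below illustrates that idea only; it is an assumption, not the tool's actual extraction logic.

```python
import re

# Arabic-script ranges (U+0600-U+06FF and the supplement U+0750-U+077F);
# Central Kurdish (Sorani) is written with characters in these ranges.
# This is an illustrative heuristic, not the crawler's own rule.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")

def looks_kurdish(text: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the letters are Arabic-script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if ARABIC_SCRIPT.match(c))
    return arabic / len(letters) >= threshold

print(looks_kurdish("زمانی کوردی"))   # True
print(looks_kurdish("Hello world"))   # False
```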
./crawler.exe normalize -inputdir <inputdirectory> -outdir <outputdirectory>
- `inputdirectory`: Path of the folder that contains the text collected from the website.
- `outputdirectory`: Folder where the normalized files are saved; files with a size of 0 are discarded.
./crawler.exe normalize -inputdir ./myInputFolder -outdir ./myOutputFolder
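The exact normalization rules are not documented here. A common approach for Kurdish text is Unicode normalization (NFC/NFKC) combined with mapping Arabic letter variants (for example ي to ی, and ك to ک) to their standard Kurdish code points. The Python sketch below shows that general idea; the character map is an assumption, not the tool's own table.

```python
import unicodedata

# Illustrative map of Arabic letter variants, sometimes found in crawled text,
# to the standard Kurdish (Sorani) code points. Assumed for this example only.
KURDISH_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> FARSI YEH (ی)
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> KEHEH (ک)
})

def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization, then map letter variants."""
    return unicodedata.normalize("NFKC", text).translate(KURDISH_MAP)

print(normalize_text("كوردي"))  # -> "کوردی"
```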
./crawler.exe wordlist -inputdir <inputdirectory> -outfile <outputfile>
- `inputdirectory`: Path of the folder that contains the normalized text produced by the previous command.
- `outputfile`: Output file that will contain the wordlist created from the previous step.
./crawler.exe wordlist -inputdir ./NormalizedFolderData -outfile WORDLIST.txt
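As with the other steps, the tool's exact tokenization is not described; conceptually, a wordlist is the set of unique words found across the normalized text files. A minimal Python sketch of that idea follows; the paths and the tokenization rule are placeholders, not the tool's implementation.

```python
import re
from pathlib import Path

def build_wordlist(input_dir: str, output_file: str) -> None:
    """Collect unique words from all .txt files in input_dir and write them sorted."""
    words = set()
    for path in Path(input_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        # \w+ matches letters, digits, and underscore; purely numeric tokens
        # are filtered out below.
        for token in re.findall(r"\w+", text):
            if not token.isdigit():
                words.add(token)
    Path(output_file).write_text("\n".join(sorted(words)), encoding="utf-8")

# Example with placeholder paths:
build_wordlist("./NormalizedFolderData", "WORDLIST.txt")
```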