scrapy + mongodb + proxy + user_agent: crawl stock data from https://gupiao.baidu.com/stock. For details, see the official Scrapy documentation (Chinese version): https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html#intro-tutorial
- requests and re: use the 'requests' and 're' modules to extract each stock's code from http://quote.eastmoney.com/stocklist.html
- scrapy: use scrapy to get detailed stock information from https://gupiao.baidu.com/stock
- beautiful soup: use beautiful soup to extract the information of interest from the HTML pages.
- MongoDB: use MongoDB to store the data
- request header: many websites will reject your request if you use the default request header
  - solution: pick a random header from a header pool for each request
- proxy: scrapy sends many requests concurrently, so if you always use the same IP address, there is a good chance that your IP will be banned.
  - solution: to avoid being banned, collect proxies from https://free-proxy-list.net and put the usable ones into an IP pool; if a request is rejected, switch to a different IP.
- pipeline: after the data has been scraped, use item pipelines to filter it
- settings: the scrapy framework offers many configuration options, so choose the settings that suit your own project.
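
The stock-code extraction step from the list above could look roughly like the sketch below. The regular expression assumes that codes appear on the list page as an exchange prefix (`sh` or `sz`) followed by six digits; that pattern, and the helper names `extract_stock_codes` and `fetch_stock_codes`, are illustrative assumptions rather than this project's actual code.

```python
import re

STOCK_LIST_URL = "http://quote.eastmoney.com/stocklist.html"

# Assumed pattern: exchange prefix (sh or sz) followed by a six-digit code.
STOCK_CODE_RE = re.compile(r"s[hz]\d{6}")


def extract_stock_codes(html):
    """Return the unique stock codes found in html, in first-seen order."""
    seen = set()
    codes = []
    for code in STOCK_CODE_RE.findall(html):
        if code not in seen:
            seen.add(code)
            codes.append(code)
    return codes


def fetch_stock_codes(url=STOCK_LIST_URL, timeout=10):
    """Download the list page and extract the stock codes from it."""
    # Third-party dependency, imported here so that the parsing helper
    # above can be used without 'requests' installed.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_stock_codes(response.text)
```

Each extracted code can then be appended to the https://gupiao.baidu.com/stock base URL to build the start URLs for the spider; the exact URL layout depends on the site and is not shown here.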
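
The header pool and IP pool described above can be sketched as a single downloader middleware. The class name `RandomHeaderProxyMiddleware` and the pool contents are hypothetical placeholders; the sketch relies only on Scrapy's documented `process_request` and `process_exception` hooks and on the `proxy` key in `request.meta`, which Scrapy's built-in proxy middleware reads.

```python
import random

# Placeholder pools (assumptions): replace with real User-Agent strings
# and with proxies that were checked to be usable.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
PROXIES = [
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
]


class RandomHeaderProxyMiddleware:
    """Pick a random User-Agent and proxy for every outgoing request."""

    def __init__(self, user_agents=None, proxies=None):
        self.user_agents = list(user_agents or USER_AGENTS)
        self.proxies = list(proxies or PROXIES)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request

    def process_exception(self, request, exception, spider):
        # Remove the proxy that just failed so it is not chosen again.
        bad_proxy = request.meta.get("proxy")
        if bad_proxy in self.proxies:
            self.proxies.remove(bad_proxy)
        return None  # leave retrying to the other middlewares
```

To take effect, a class like this has to be registered in the project's `DOWNLOADER_MIDDLEWARES` setting. When a failed request is retried, `process_request` runs again and selects a different proxy from what remains in the pool.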
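
The filtering and storage steps can be combined in one item pipeline, sketched below. The field name `code`, the database and collection names, and the class name `MongoStockPipeline` are assumptions made for illustration; the MongoDB calls (`MongoClient`, `update_one` with `upsert=True`) are standard pymongo API.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be tried without Scrapy installed.
    class DropItem(Exception):
        pass


class MongoStockPipeline:
    """Drop items without a stock code and upsert the rest into MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="stocks", collection_name="stock_info"):
        # All three defaults are hypothetical values.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        import pymongo  # third-party; only needed once the spider starts

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Filter step: discard empty fields, then require a stock code.
        record = {k: v for k, v in dict(item).items() if v not in (None, "")}
        if "code" not in record:
            raise DropItem("missing stock code")
        # Storage step: one document per stock, updated on re-crawl.
        self.collection.update_one(
            {"code": record["code"]}, {"$set": record}, upsert=True
        )
        return item
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting. Upserting on the stock code means that crawling the same stock twice updates its document instead of creating a duplicate.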