pep-pig/get-stocks-information-by-web-crawler

get stocks information by web crawler

scrapy+mongodb+proxy+user_agent to crawl stock data from https://gupiao.baidu.com/stock. For details, see the official Scrapy documentation; a Chinese version is available at https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html#intro-tutorial

technical route

issues and solutions

  • request header: many websites will reject your requests if you use the default request header
    --solutions: pick a header at random from a header pool for each request
  • proxy: Scrapy sends many requests concurrently, so if you always use the same IP address, there is a good chance that your IP will be banned.
    --solutions: to avoid being banned, we can collect proxies from https://free-proxy-list.net and put the usable IPs into our IP pool; if a request is rejected, we can switch to a new IP.
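
The two solutions above can be sketched as Scrapy downloader middlewares. This is a minimal sketch, not the repository's actual code: the class names, the `USER_AGENTS` and `PROXIES` pools, and the placeholder proxy addresses are all assumptions made for illustration. It relies only on Scrapy's downloader middleware hook `process_request(request, spider)` and on the `proxy` key of `request.meta`, which Scrapy's built-in HTTP proxy middleware reads.

```python
import random

# Hypothetical header pool; replace with the User-Agent strings you want to rotate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Hypothetical IP pool; these are placeholder addresses, not working proxies.
# In practice, fill it with proxies you have checked to be usable.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
    "http://203.0.113.12:8000",
]


class RandomUserAgentMiddleware:
    """Set a randomly chosen User-Agent header on every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing this request


class RandomProxyMiddleware:
    """Route every outgoing request through a randomly chosen proxy."""

    def process_request(self, request, spider):
        # Overwrites any proxy set earlier, so a re-issued request
        # gets a freshly chosen proxy as well.
        request.meta["proxy"] = random.choice(PROXIES)
        return None
```

Requests that Scrapy re-issues after a failure pass through the downloader middlewares again, so under this sketch a retried request would be assigned a new randomly chosen proxy rather than keeping the old one.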

data postprocess

  • pipeline technique: after the data has been scraped, item pipelines can be used to filter it
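
As an illustration of filtering in a pipeline, here is a minimal sketch under stated assumptions: the class name `StockFilterPipeline` and the field names `code`, `name`, and `price` are hypothetical and not taken from the repository. It uses Scrapy's `process_item(item, spider)` hook and the `DropItem` exception to discard incomplete or duplicate items.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch can also be run without Scrapy installed.
    class DropItem(Exception):
        pass


class StockFilterPipeline:
    """Drop incomplete or duplicate stock items and normalize the price."""

    # Hypothetical field names; adapt them to the fields your spider yields.
    REQUIRED_FIELDS = ("code", "name", "price")

    def __init__(self):
        self.seen_codes = set()

    def process_item(self, item, spider):
        # Filter 1: discard items with a missing or empty required field.
        for field in self.REQUIRED_FIELDS:
            if item.get(field) in (None, ""):
                raise DropItem(f"missing field: {field}")

        # Filter 2: discard stocks that were already seen in this crawl.
        if item["code"] in self.seen_codes:
            raise DropItem(f"duplicate stock: {item['code']}")
        self.seen_codes.add(item["code"])

        # Normalize the price from text to a number before storage.
        item["price"] = float(item["price"])
        return item
```

Items that are returned continue to the next pipeline (for example, one that writes to MongoDB), while items that raise `DropItem` are discarded.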

configuration

  • the Scrapy framework offers many configuration options; choose the settings appropriate for your own project.
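
For example, a project's `settings.py` might contain a fragment like the one below. This is only a sketch: the setting names are standard Scrapy settings, but the values are arbitrary and the module paths under `stocks.` are hypothetical placeholders for your own middleware and pipeline classes.

```python
# settings.py (fragment); values are illustrative, not recommendations.

# Throttle the crawl to lower the chance of being banned.
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1.0        # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True

# Retry failed requests a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3

# Enable custom downloader middlewares (hypothetical module paths) and
# disable the built-in User-Agent middleware they replace.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "stocks.middlewares.RandomUserAgentMiddleware": 400,
    "stocks.middlewares.RandomProxyMiddleware": 410,
}

# Enable item pipelines (hypothetical module path); lower numbers run first.
ITEM_PIPELINES = {
    "stocks.pipelines.StockFilterPipeline": 300,
}
```

The numbers in the two dictionaries set the order in which the components run; the proxy middleware is given a low number here so that `request.meta["proxy"]` is set before Scrapy's built-in proxy handling sees the request.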