Video download only retrieves a small number of bytes #4

Open
yangqihua opened this issue Jun 15, 2017 · 2 comments

@yangqihua

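Hello. Your project says it uses one thread per video, but each video is only partly downloaded: an exception is raised after just a fragment of the binary file has been fetched. Is there a good way to download every video completely? Part of the exception output is below: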

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 520384 out of 47830612 bytes

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532059 out of 13004076 bytes

Exception in thread Thread-18:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 585460 out of 6128527 bytes

当前下载进度:---------------->>>>>>>> 6.47%Exception in thread Thread-48:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 582540 out of 24403607 bytes

Exception in thread Thread-36:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532065 out of 10005207 bytes

Exception in thread Thread-35:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532058 out of 49727052 bytes

Exception in thread Thread-40:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586084 out of 62159002 bytes

当前下载进度:---------------->>>>>>>> 6.50%Exception in thread Thread-7:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532063 out of 20505701 bytes

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532065 out of 61492854 bytes

Exception in thread Thread-46:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 527684 out of 14292045 bytes

当前下载进度:---------------->>>>>>>> 6.53%Exception in thread Thread-2:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586084 out of 10502982 bytes

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586087 out of 9053251 bytes
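
One way to keep a short read from leaving a truncated video behind is to catch `ContentTooShortError`, delete the partial file, and try again. The following is only a minimal sketch under stated assumptions, not the project's code: `download_with_retry`, `retries`, `delay`, and the injectable `fetch` parameter are hypothetical names, and it is written for Python 3, where the exception lives in `urllib.error` (in the Python 2.7 traceback above it is `urllib.ContentTooShortError`). Retrying does not remove the underlying cause of the short reads.

```python
import os
import time
from urllib.error import ContentTooShortError
from urllib.request import urlretrieve


def download_with_retry(fileurl, filepath, retries=3, delay=1.0, fetch=urlretrieve):
    """Call fetch(fileurl, filepath), retrying when the transfer comes up short.

    Returns the number of attempts used; re-raises after the last attempt.
    All parameter names here are illustrative assumptions.
    """
    for attempt in range(1, retries + 1):
        try:
            fetch(fileurl, filepath)
            return attempt
        except ContentTooShortError:
            # Remove the truncated file so a partial video is never kept.
            if os.path.exists(filepath):
                os.remove(filepath)
            if attempt == retries:
                raise
            time.sleep(delay)
```

Inside the thread's `run` method, the direct `urlretrieve` call could be swapped for `download_with_retry(fileurl, filepath)`; the progress callback is omitted here for brevity.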

@yangqihua
Author

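The cause is probably local network bandwidth. Could the program be changed to crawl with a single thread, or to limit the number of threads? Suppose a course on imooc (慕课网) has 100 lessons and your local bandwidth is only 200 kb/s: each video then gets only about 2 kb/s, which will inevitably produce the errors above. So could you consider capping the maximum number of download threads? Unlike crawling text or images, where the bottleneck is CPU utilization, the bottleneck when crawling video is network bandwidth.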
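
The thread cap suggested above could look roughly like this. It is a sketch under assumptions, not the project's implementation: `download_all`, `max_workers`, and the injectable `fetch` parameter are hypothetical names, and it uses Python 3's `concurrent.futures` (on Python 2.7 a similar cap could be built with `threading.Semaphore`). With the figures from the example, 4 workers sharing 200 kb/s would each get about 50 kb/s instead of 2 kb/s.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def download_all(tasks, max_workers=4, fetch=urlretrieve):
    """Download (fileurl, filepath) pairs with at most max_workers at once.

    Returns a list of (fileurl, error) tuples in task order; error is None
    on success. All names here are illustrative assumptions.
    """
    def run_one(task):
        fileurl, filepath = task
        try:
            fetch(fileurl, filepath)
            return fileurl, None
        except Exception as exc:
            # Report the failure to the caller instead of losing it in a thread.
            return fileurl, exc

    # The pool never runs more than max_workers downloads concurrently;
    # the remaining tasks wait in its queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, tasks))
```

Setting `max_workers=1` gives the single-threaded behaviour also mentioned above; any failed URLs come back in the result list and can be retried afterwards.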

@qiyeboy
Owner

qiyeboy commented Jun 15, 2017

I'll take a look at the program tomorrow and make some adjustments.
