Using Generator2 leads to fetcher failure #32

Open · whjshj opened this issue Nov 13, 2024 · 13 comments

whjshj commented Nov 13, 2024

When I use Generator2 to generate fetch requests and download web pages with the number of threads set to 1, the timeout (long timeout = conf.getInt("mapreduce.task.timeout", 10 * 60 * 1000)) is triggered, causing the download task to terminate.

@sebastian-nagel

Hi @whjshj, could you share your configuration (at least all custom-set properties related to Generator2, Generator, URLPartitioner, and Fetcher)? Sharing the log files (job client stdout and hadoop.log or task logs) would also help to debug the issue, thanks!

Two comments so far:

  • Generator2 (like Generator) is the wrong tool if the goal is a short fetch list of a single or a few web pages fetched by a single fetcher task - I assume this because it would be quite inefficient to run two tasks with a single thread each instead of one task with two threads. The FreeGenerator tool lets you generate a segment (fetch list) quickly from a list of URLs (a minimal invocation sketch follows below).
  • by default, Nutch is configured to be polite. This includes slowing down if a server responds with a server error or a "Slow Down", see the properties fetcher.exceptions.per.queue.delay and http.robots.503.defer.visits. When the fetch list contains only URLs from such a site, crawling stalls and the timeout is reached. However, unless configured otherwise, the Fetcher already shuts down at 50% of the MapReduce task timeout to prevent the task from failing.
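
For illustration, a minimal FreeGenerator sketch, assuming a local seeds directory and the default crawl layout (all paths and the seed URL are placeholders):

    # Build a segment (fetch list) directly from a plain list of URLs,
    # bypassing the CrawlDb-based Generator/Generator2.
    mkdir -p seeds
    echo 'https://example.org/' > seeds/urls.txt
    bin/nutch freegen seeds crawl/segments

    # Fetch the newly created segment with a single thread.
    SEGMENT=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch "$SEGMENT" -threads 1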


whjshj commented Nov 14, 2024

Hello @sebastian-nagel, I'm using the settings from https://github.com/commoncrawl/cc-nutch-example and only changed the number of threads to 1. In the initial fetching phase it runs normally. However, after some time there is only one thread alive, and it just waits even though there is still data in the queue. The remaining data isn't being selected because it exceeds the maximum number of threads, and the only active thread isn't processing it. This seems quite strange. Then the timeout you mentioned, the map task timeout, is triggered. Have you encountered this situation before?
Uploading hadoop.log…

@sebastian-nagel

only one thread alive, but it's just waiting, even though there is still data in the queue

Then the queue is blocked because the host of this queue responded (repeatedly) with an HTTP status code indicating a server error. This is quite common for wider web crawls, but it shouldn't happen if you crawl your intranet or own server.

There are two options to ensure that the fetcher keeps making progress:

  • a hard time limit in minutes: fetcher.timelimit.mins
  • a throughput threshold: fetcher.throughput.threshold.pages. By default it is checked after 5 minutes; fetching is stopped if the throughput in fetched pages per second drops below the threshold.

If either the time limit or the throughput threshold is hit, the current fetching cycle is stopped and the output is written to disk/HDFS. The script will then continue (a command-line sketch follows below).
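
As a sketch, both properties can be passed on the command line of the fetch job (values and the segment path are placeholders; the example's crawl.sh sets fetcher.timelimit.mins dynamically, as noted further below):

    # Stop the fetcher after a hard time limit, or when throughput drops
    # below one page per second (by default checked after 5 minutes).
    bin/nutch fetch \
      -D fetcher.timelimit.mins=180 \
      -D fetcher.throughput.threshold.pages=1 \
      "$SEGMENT" -threads 1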

In order to figure out the reason for the slow fetching, I need the log file.


whjshj commented Nov 18, 2024

hadoop.log
Uploading hadoop.log…
I have set the two parameters you mentioned, but I feel they are not the cause. Even if a server error occurs and the HTTP request times out, the fetcher should move on to download the next web page instead of just sitting there. In my previous reply I already uploaded the log. Can you see it? I will upload the log again when I'm back.

@sebastian-nagel

Hi @whjshj, according to the hadoop.log, the fetch job fails in the reduce phase when writing the WARC files. The native libraries for the language detector are not installed:

  • please see the README.md for how to install them; in short: sudo apt install libcld2-0 libcld2-dev. See also the README of cc-nutch-example
  • alternatively, disable language identification via the configuration property warc.detect.language. In the example, you'd need to modify the crawl.sh script, line 45 (both options are sketched below)
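
A rough sketch of both options (the package names are those from the README; passing warc.detect.language with -D on the fetch job is an assumption about how to override it from the command line, in the example it is set in crawl.sh as noted above):

    # Option 1: install the native CLD2 libraries (Debian/Ubuntu).
    sudo apt install libcld2-0 libcld2-dev

    # Option 2: switch off language identification for the WARC writer;
    # "$SEGMENT" is a placeholder for the segment being fetched.
    bin/nutch fetch -D warc.detect.language=false "$SEGMENT" -threads 1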


whjshj commented Nov 19, 2024

Hello @sebastian-nagel, I have identified the cause of the error when writing the WARC file, and I've already resolved it. Please take a look at the part before the WARC file is written. I have attached a screenshot. Can you see it?
Screenshot 2024-11-19 10 14 45

@sebastian-nagel

I have identified the cause of the error when writing the WARC file, and I've already resolved the issue.

Great!

Ok, I see:

  • when fetching the robots.txt of 1-dot-name-meaning.appspot.com, the server responded with an error indicating that it temporarily cannot be crawled, so fetches for this host are deferred by 5 minutes:

    2024-11-13 19:41:31,060 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 54 fetching http://1-dot-name-meaning.appspot.com/numerology/expression/Goundamani%200 (queue crawl delay=5000ms)
    2024-11-13 19:41:31,435 INFO o.a.n.f.FetcherThread [FetcherThread] Defer visits for queue 1-dot-name-meaning.appspot.com : http://1-dot-name-meaning.appspot.com/numerology/expression/Goundamani%200
    2024-11-13 19:41:31,436 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 1-dot-name-meaning.appspot.com >> delayed next fetch by 300000 ms
    
  • note: to get more details logged, set the log level for org.apache.nutch.fetcher to DEBUG. I also recommend setting org.apache.nutch.protocol to DEBUG. This should give you more information about why the fetching failed (see the sketch after this list).

  • the last fetch:

    2024-11-13 19:45:34,062 INFO o.a.n.f.FetchItemQueues [FetcherThread] Fetching http://1027kord.com/high-school-teenage-contraception/%200
    
  • about one minute later the fetch is aborted. The 50 slots in the queues are all occupied with URLs from the host with the deferred visits:

    2024-11-13 19:46:34,793 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 50 queued fetch items in 34 queues (queue feeder still alive).
    2024-11-13 19:46:34,794 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] * queue: 1-dot-name-meaning.appspot.com >> dropping!
    2024-11-13 19:46:34,794 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] Emptied all queues: 1 queues with 50 items
    
  • the total capacity of all queues is rather small because you have only a single thread: the property fetcher.queue.depth.multiplier (default: 50) is multiplied by the number of threads.

    • if you have a rather diverse fetch list with potentially slow or delayed hosts:
      • increase the number of fetcher threads
      • also increase fetcher.queue.depth.multiplier (a sketch follows after this list)
    • this prevents one or a few hosts from occupying all queue slots
    • having a single thread is typical for intranet or site-search crawls
  • there is an open issue and PR (not yet merged) which improves how the fetcher shuts down its threads in such situations, see NUTCH-3072. But it does not prevent the situation itself.

  • one point I do not understand: if mapreduce.task.timeout is configured to be 10 minutes and fetcher.threads.timeout.divisor is 2 (both defaults), then the "aborting" should happen 5 minutes after the last fetch.

  • otherwise: please use fetcher.timelimit.mins and fetcher.throughput.threshold.pages to ensure that a slow fetcher shuts down, see my comment from a few days ago. Note that fetcher.timelimit.mins is set dynamically in the example's crawl.sh script.
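
A sketch of the two tuning knobs from the list above, i.e. more verbose logging and deeper queues (the logging snippet assumes a Nutch build that ships conf/log4j2.xml, older builds use conf/log4j.properties; numbers and the segment path are placeholders):

    # More verbose fetcher/protocol logging: add the following loggers to
    # conf/log4j2.xml (inside <Loggers>), assuming a log4j2-based build:
    #   <Logger name="org.apache.nutch.fetcher" level="DEBUG"/>
    #   <Logger name="org.apache.nutch.protocol" level="DEBUG"/>

    # Deeper queues together with more threads, so that a single deferred
    # host cannot occupy all queue slots.
    bin/nutch fetch \
      -D fetcher.queue.depth.multiplier=200 \
      "$SEGMENT" -threads 20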


whjshj commented Nov 20, 2024

"Thank you for your response; my confusion has been resolved. May I ask about the current situation of using Nutch-cc to crawl web pages? For example, in an iterative download, if a total of 1000 web pages need to be downloaded, how many of them are successfully downloaded in the end?"

@sebastian-nagel

This totally depends on the fetch list:

  • 0 pages - if it's a single site disallowed by robots.txt
  • very few - if the site implements anti-bot measures
  • all 1000 - with careful and polite settings, and if the crawled site generally admits crawling
  • 40-80% is realistic for a mixed fetch list. For recent Common Crawl crawls it's about 70% successful fetches, see https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlermetrics


whjshj commented Nov 20, 2024

Thank you very much for your response. Regarding the map tasks in the fetch stage, i.e. the downloading process, do you have any recommendations for the map-related configuration? Currently I have allocated 1 core and 2 GB of memory to each map task. Is this configuration reasonable?

@sebastian-nagel

I have set each map task to 1 core and 2GB of memory.

Yes, that's possible, under the assumption that

  • you are fine without any parallelization and, consequently, with overall slow fetching
  • you have only a single map task, that is, it runs in local mode rather than as a job on a Hadoop cluster with multiple map tasks.

If you want to scale up, it's more efficient to parallelize first using threads (up to several hundred threads). Of course, more threads mean higher memory requirements to buffer the incoming data. Scaling up also requires adjusting many more parameters to your setup and context: connection pools, timeouts, etc.
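
For illustration only, the 1 core / 2 GB setting discussed above expressed with standard Hadoop properties on a distributed fetch job (the mapreduce.* names are generic Hadoop settings, not Nutch-specific; values are placeholders and would need to grow with the thread count):

    # Per-map-task resources for a fetch job on a Hadoop cluster; the heap is
    # set somewhat below the container size, as usual.
    bin/nutch fetch \
      -D mapreduce.map.memory.mb=2048 \
      -D mapreduce.map.cpu.vcores=1 \
      -D mapreduce.map.java.opts=-Xmx1600m \
      "$SEGMENT" -threads 1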


whjshj commented Nov 21, 2024

Thank you for your response. I am currently looking to deploy Nutch-cc at a larger scale. Could you suggest some recommended configurations? For example, how many CPU cores and how much memory should be allocated to each map task and each reduce task? Also, during the fetch phase, what would be an appropriate number of concurrent download threads?

@sebastian-nagel

It is difficult to recommend a final cluster configuration because it depends on the kind of crawl and the Hadoop cluster setup. A few tips:

  • a throughput of 500k pages per hour on a 4-core machine with 32 GiB RAM is easily possible
  • disk and RAM matter more during fetching, while operations on the CrawlDb are CPU-bound
  • scale up step by step, at most doubling the size in every step
  • keep monitoring, profiling, etc., and review your configuration with every step, especially
    • throughput thresholds and time limits
    • the okhttp connection pool settings
  • try different compression codecs (I'd recommend zstd)
  • over time, as the CrawlDb fills up, updating it and generating the fetch lists takes longer and longer: you might change your workflow to generate multiple segments in one turn, fetch them in a row, and then update the CrawlDb with all of them (sketched below)
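
A sketch of that multi-segment workflow using the stock Generator's -maxNumSegments option (whether Generator2 exposes the same option is not checked here; paths, -topN and thread counts are placeholders):

    # Generate several fetch lists in one run, fetch them one after another,
    # then update the CrawlDb with all segments below the segments directory.
    bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 4
    for SEGMENT in $(ls -d crawl/segments/* | tail -4); do
      bin/nutch fetch "$SEGMENT" -threads 40
    done
    bin/nutch updatedb crawl/crawldb -dir crawl/segments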

Also important: choose a unique agent name together with contact information. You'll receive feedback from angry webmasters! Scaling up and staying polite is a challenge, but can be mastered.
