AbstractFileQueueCommitter committing 1 file, but AbstractCrawler processed 14 #10
After reading another issue ("Ensure committer queue uniqueness to avoid queue collisions" #9), I decided to try deleting the work directories. Deleting the committer-queue directory had no effect; however, deleting the work directory seems to fix the problem. So now, upon application startup, the work directory is removed before the crawlers are started (a sketch of that cleanup follows below). I cannot be sure this is the actual solution, so I will keep the issue open for a few more days.
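For reference, a minimal sketch of that startup cleanup, assuming a hypothetical `./workdir` location (adjust it to whatever work directory your crawler configuration actually uses):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class WorkDirCleanup {

    // Hypothetical location; point this at the work directory your crawler uses.
    private static final Path WORK_DIR = Paths.get("./workdir");

    /** Recursively deletes the crawler work directory before the crawlers start. */
    public static void deleteWorkDir() throws IOException {
        if (!Files.exists(WORK_DIR)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(WORK_DIR)) {
            // Sort in reverse order so children are deleted before their parents.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        deleteWorkDir();
        // ... start the collector and its crawlers here ...
    }
}
```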
This is not a bug. Processed documents and committed ones are separate things. Every URL that is queued for processing will be marked as processed when the crawler is done with it. Amongst other things, the crawler keeps track of "processed" URLs so they are not crawled again if encountered in another HTML page. So they are not an indication of success, but rather of how many URLs passed through the crawler. Many of these URLs can be rejected by your filter directives, checksums, etc., and will never reach your Committer. Ultimately, the number of documents "committed" to your target repository should match the number of documents that actually reach your Committer.

In your case, the difference in numbers seems to reflect running incremental crawls. By default, the crawler keeps a checksum of every document it processes. If a document has not changed on a subsequent run, the crawler rejects it (and does not send it to your Committer) to save load on a target repository that should already have it. Deleting the working directory is a good way to get rid of that URL database cache and force it to crawl again "from scratch". That is why all your documents get sent to your Committer again when you do that.

If you do not want to reject documents that were not modified since the last run, and would rather keep sending the same documents over to Elasticsearch, you can disable the checksum feature:

<metadataChecksummer disabled="true" />
<documentChecksummer disabled="true" />
Of course... and thanks for the complete response. While looking at the source code, I noticed a commitCompleted method called by the committer implementation. I'd like to request a feature where an event listener is available for capturing the commitCompleted event. Right now we can only capture the CrawlerEvent.CRAWLER_FINISHED event and assume the committer's state (a sketch of that workaround follows below). Thanks again.
Feature request accepted :-)
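For context, the workaround described in the previous comment could look roughly like the sketch below. It is only an illustration written against my recollection of the 2.x collector-core API (ICrawlerEventListener, CrawlerEvent.CRAWLER_FINISHED); package, interface, and method names are assumptions and may differ in your version.

```java
// Assumed 2.x collector-core packages; verify against the version you are using.
import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

/**
 * Workaround sketch: since there is no commitCompleted event yet, listen for
 * CRAWLER_FINISHED and assume the committer has flushed its queue by then.
 */
public class AssumeCommitCompletedListener implements ICrawlerEventListener {

    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        if (CrawlerEvent.CRAWLER_FINISHED.equals(event.getEventType())) {
            // All we can do today is assume the commit completed once the
            // crawler itself reports that it is finished.
            System.out.println("Crawler finished; assuming commit completed.");
        }
    }
}
```

If I remember correctly, such a listener can be registered in the crawler XML configuration (e.g. under a crawlerListeners element), but please double-check the exact element name for your version.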
I found what appears to be random behavior in the Committer: running a crawl against the same URL gives different results. Here is a section of the output when the behavior is correct:

Here is the output when the bug appears:

I don't understand why the crawler would process 14 items while AbstractFileQueueCommitter commits only 1 file.
Is there some sort of JVM memory or disk space resource involved?
This is the configuration: