AbstractFileQueueCommitter committing 1 file, but AbstractCrawler processed 14 #10

Closed
yvesnyc opened this issue Jun 30, 2015 · 4 comments

Comments


yvesnyc commented Jun 30, 2015

I found some non-deterministic behavior in the Committer: running a crawl against the same URL gives different results from run to run. Here is a section of the output when the behavior is correct:

INFO  - AbstractCrawler            - Crawler #1: 100% completed (14 processed/14 total)
INFO  - AbstractCrawler            - Crawler #1: Deleting orphan references (if any)...
INFO  - AbstractCrawler            - Crawler #1: Deleted 0 orphan URLs...
INFO  - AbstractCrawler            - Crawler #1: Crawler finishing: committing documents.
INFO  - AbstractFileQueueCommitter - Committing 14 files
INFO  - ElasticsearchCommitter     - Sending 14 operations to Elasticsearch.

Here is the output when the bug appears.

INFO  - AbstractCrawler            - Crawler #1: 100% completed (14 processed/14 total)
INFO  - AbstractCrawler            - Crawler #1: Deleting orphan references (if any)...
INFO  - AbstractCrawler            - Crawler #1: Deleted 0 orphan URLs...
INFO  - AbstractCrawler            - Crawler #1: Crawler finishing: committing documents.
INFO  - AbstractFileQueueCommitter - Committing 1 files
INFO  - ElasticsearchCommitter     - Sending 1 operations to Elasticsearch.

I don't understand why the crawler would process 14 items while the AbstractFileQueueCommitter commits only 1 file.

Is there some sort of JVM memory or space resource involved?

This is the configuration.

<httpcollector id="Complex Crawl">

    #set($http = "com.norconex.collector.http")
    #set($core = "com.norconex.collector.core")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")

    <progressDir>${workdir}/progress</progressDir>
    <logsDir>${workdir}/logs</logsDir>

    <crawlerDefaults>

        <urlNormalizer class="$urlNormalizer" />
        <numThreads>4</numThreads>
        <maxDepth>2</maxDepth>
        <workDir>$workdir</workDir>
        <orphansStrategy>DELETE</orphansStrategy>
        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <sitemap ignore="true" />

        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
            <indexName>crawl</indexName>
            <typeName>webdoc</typeName>
            <clusterName>es-crawler</clusterName>
        </committer>

        <referenceFilters>
            <!-- Note: we ignore PowerPoint and other possible repository descriptors -->
            <filter class="$filterExtension" onMatch="exclude">tar,TAR,zip,ZIP,rpm,RPM,gz,GZ,tgz,TGZ,ppt,PPT,mpg,MPG,jpg,JPG,gif,GIF,png,PNG,ico,ICO,css,CSS,js,JS,sit,SIT,eps,EPS,wmf,WMF,xls,XLS,mov,MOV,exe,EXE,jpeg,JPEG,bmp,BMP</filter>
            <filter class="$filterRegexRef" onMatch="include">http://([a-z0-9]+\.)*redis\.io(/.*)?</filter>
        </referenceFilters>

        <importer>
            <preParseHandlers>
                <!-- Normally comment out the tagger below -->
                <!--
                <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="WARN" />
                -->
                <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
                    <replace>
                        <fromValue><![CDATA[<a .*href=['"]([^"']*)['"][^>]*>]]></fromValue>
                        <toValue>_ $1 _</toValue>
                    </replace>
                </transformer>
            </preParseHandlers>
        </importer>

    </crawlerDefaults>

    <crawlers>

        <crawler id="Norconex Redis Crawl">
            <startURLs>
                <url>http://redis.io</url>
            </startURLs>
        </crawler>

    </crawlers>

</httpcollector>

yvesnyc commented Jul 2, 2015

After reading another issue ("Ensure committer queue uniqueness to avoid queue collisions", #9), I decided to try deleting the work directories. Deleting the committer-queue directory had no effect. However, deleting the work directory returned by

crawlerConfig.getWorkDir().getAbsolutePath()

seems to fix the problem. So now, on application startup, the work directory is removed before the crawlers are started.
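
For reference, this is roughly what that startup cleanup looks like (a minimal sketch using plain java.nio and a hypothetical WorkDirCleaner helper; adapt it to however you wire up your crawler configurations):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public final class WorkDirCleaner {

    /** Recursively deletes the crawler work directory before the crawlers start. */
    public static void deleteWorkDir(String workDirPath) throws IOException {
        Path workDir = Paths.get(workDirPath);
        if (!Files.exists(workDir)) {
            return; // nothing to clean on a first run
        }
        // Walk the tree and delete children before their parent directories.
        try (Stream<Path> paths = Files.walk(workDir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(path -> path.toFile().delete());
        }
    }
}

I call it with WorkDirCleaner.deleteWorkDir(crawlerConfig.getWorkDir().getAbsolutePath()) right before starting the collector.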

I cannot be sure this is the solution, so I will keep the issue open for a few more days.

essiembre (Contributor) commented

This is not a bug. Processed documents and committed ones are separate things.

Every URL that is queued for processing will be marked as processed when the crawler is done with it. Among other things, the crawler keeps track of "processed" URLs so they are not crawled again if encountered in another HTML page. So "processed" is not an indication of success, but rather of how many URLs passed through the crawler. Many of these URLs can be rejected by your filter directives, checksums, etc., and will never reach your Committer.

Ultimately, the number of documents "committed" to your target repository should match the number of DOCUMENT_COMMITTED_ADD and DOCUMENT_COMMITTED_REMOVE entries that appear in the logs (depending on your log level).

In your case, the difference in numbers seems to reflect running incremental crawls. By default, the crawler keeps a checksum of each document it processes. If a document has not changed on a subsequent run, it is rejected (and never sent to your Committer) to save load on your target repository, which should already have it. Deleting the working directory is a good way to get rid of that URL database cache and force the crawler to start again "from scratch". That's why all your documents get sent to your Committer again when you do that.

If you instead want documents that have not changed since the last run to keep being sent to Elasticsearch, you can disable the checksum features:

<metadataChecksummer disabled="true" />
<documentChecksummer disabled="true" />


yvesnyc commented Jul 2, 2015

Of course... and thanks for the complete response.

While looking at the source code, I noticed a commitCompleted method called by the committer implementation. I'd like to request a feature: a listener for capturing the commitCompleted event. Right now we can only capture the CrawlerEvent.CRAWLER_FINISHED event and assume the committer's state, roughly as in the sketch below.
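
This is a minimal sketch of the current workaround, written from memory of the collector-core 1.x ICrawlerEventListener API (package locations and signatures may differ slightly in your version):

import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

public class AssumeCommitDoneListener implements ICrawlerEventListener {

    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        // All we can observe is that the crawler finished; we have to assume
        // the committer has already flushed its queue at this point.
        if (CrawlerEvent.CRAWLER_FINISHED.equals(event.getEventType())) {
            System.out.println("Crawler finished; assuming the commit completed.");
        }
    }
}

The listener would then be registered on the crawler configuration (via setCrawlerListeners(...), if I recall correctly), but there is still no event that fires specifically when commitCompleted is reached.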

Thanks again.

essiembre (Contributor) commented

Feature request accepted :-)
