AbstractFileQueueCommitter committing 1 file, but AbstractCrawler processed 14 #10

Closed
yvesnyc opened this issue Jun 30, 2015 · 4 comments

Comments


yvesnyc commented Jun 30, 2015

I found some non-deterministic behavior in the Committer: running a crawl against the same URL gives different results from run to run. Here is a section of the output when the behavior is correct:

INFO  - AbstractCrawler            - Crawler #1: 100% completed (14 processed/14 total)
INFO  - AbstractCrawler            - Crawler #1: Deleting orphan references (if any)...
INFO  - AbstractCrawler            - Crawler #1: Deleted 0 orphan URLs...
INFO  - AbstractCrawler            - Crawler #1: Crawler finishing: committing documents.
INFO  - AbstractFileQueueCommitter - Committing 14 files
INFO  - ElasticsearchCommitter     - Sending 14 operations to Elasticsearch.

Here is the output when the bug appears.

INFO  - AbstractCrawler            - Crawler #1: 100% completed (14 processed/14 total)
INFO  - AbstractCrawler            - Crawler #1: Deleting orphan references (if any)...
INFO  - AbstractCrawler            - Crawler #1: Deleted 0 orphan URLs...
INFO  - AbstractCrawler            - Crawler #1: Crawler finishing: committing documents.
INFO  - AbstractFileQueueCommitter - Committing 1 files
INFO  - ElasticsearchCommitter     - Sending 1 operations to Elasticsearch.

I don't understand why the crawler would process 14 items while the AbstractFileQueueCommitter commits only 1 file.

Is there some sort of JVM memory or space resource involved?

This is the configuration.

<httpcollector id="Complex Crawl">

    #set($http = "com.norconex.collector.http")
    #set($core = "com.norconex.collector.core")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")

    <progressDir>${workdir}/progress</progressDir>
    <logsDir>${workdir}/logs</logsDir>

    <crawlerDefaults>

        <urlNormalizer class="$urlNormalizer" />
        <numThreads>4</numThreads>
        <maxDepth>2</maxDepth>
        <workDir>$workdir</workDir>
        <orphansStrategy>DELETE</orphansStrategy>
        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <sitemap ignore="true" />

        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
            <indexName>crawl</indexName>
            <typeName>webdoc</typeName>
            <clusterName>es-crawler</clusterName>
        </committer>

        <referenceFilters>
            <!-- Note: we ignore PowerPoint and other possible repository descriptors -->
            <filter class="$filterExtension" onMatch="exclude">tar,TAR,zip,ZIP,rpm,RPM,gz,GZ,tgz,TGZ,ppt,PPT,mpg,MPG,jpg,JPG,gif,GIF,png,PNG,ico,ICO,css,CSS,js,JS,sit,SIT,eps,EPS,wmf,WMF,xls,XLS,mov,MOV,exe,EXE,jpeg,JPEG,bmp,BMP</filter>
            <filter class="$filterRegexRef" onMatch="include">http://([a-z0-9]+\.)*redis\.io(/.*)?</filter>
        </referenceFilters>

        <importer>
            <preParseHandlers>
                <!-- Normally comment out the tagger below -->
                <!--
                <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="WARN" />
                -->
                <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
                    <replace>
                        <fromValue><![CDATA[<a .*href=['"]([^"']*)['"][^>]*>]]></fromValue>
                        <toValue>_ $1 _</toValue>
                    </replace>
                </transformer>
            </preParseHandlers>
        </importer>

    </crawlerDefaults>

    <crawlers>

        <crawler id="Norconex Redis Crawl">
            <startURLs>
                <url>http://redis.io</url>
            </startURLs>
        </crawler>

    </crawlers>

</httpcollector>

yvesnyc commented Jul 2, 2015

After reading another issue ("Ensure committer queue uniqueness to avoid queue collisions", #9), I decided to try deleting the work directories. Deleting the committer-queue directory had no effect. However, deleting the work directory returned by

crawlerConfig.getWorkDir().getAbsolutePath()

seems to fix the problem. So now, on application startup, the work directory is removed before the crawlers are started.
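
For reference, this is roughly what that startup cleanup looks like (a minimal sketch using plain java.nio and a hypothetical WorkDirCleaner helper; adapt it to however you wire up your crawler configurations):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public final class WorkDirCleaner {

    /** Recursively deletes the crawler work directory before the crawlers start. */
    public static void deleteWorkDir(String workDirPath) throws IOException {
        Path workDir = Paths.get(workDirPath);
        if (!Files.exists(workDir)) {
            return; // nothing to clean on a first run
        }
        // Walk the tree and delete children before their parent directories.
        try (Stream<Path> paths = Files.walk(workDir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(path -> path.toFile().delete());
        }
    }
}

I call it with WorkDirCleaner.deleteWorkDir(crawlerConfig.getWorkDir().getAbsolutePath()) right before starting the collector.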

I cannot be sure this is the solution, so I will keep the issue open for a few more days.

essiembre (Contributor) commented

This is not a bug. Processed documents and committed ones are separate things.

Every URL that is queued for processing will be marked as processed when the crawler is done with it. Among other things, the crawler keeps track of "processed" URLs so they are not crawled again if encountered in another HTML page. So "processed" is not an indication of success, but rather of how many URLs passed through the crawler. Many of these URLs can be rejected by your filter directives, checksums, etc., and will never reach your Committer.

Ultimately, the number of documents "committed" to your target repository should match the number of DOCUMENT_COMMITTED_ADD and DOCUMENT_COMMITTED_REMOVE entries that appear in the logs (depending on your log level).

In your case, the difference in numbers seems to reflect running incremental crawls. By default, the crawler keeps a checksum of each document it processes. If a document has not changed on a subsequent run, it is rejected (and never sent to your Committer) to save load on your target repository, which should already have it. Deleting the working directory is a good way to get rid of that URL database cache and force the crawler to start again "from scratch". That's why all your documents get sent to your Committer again when you do that.

If you instead want documents that have not changed since the last run to keep being sent to Elasticsearch, you can disable the checksum features:

<metadataChecksummer disabled="true" />
<documentChecksummer disabled="true" />


yvesnyc commented Jul 2, 2015

Of course... and thanks for the complete response.

While looking at the source code, I noticed a commitCompleted method called by the committer implementation. I'd like to request a feature: a listener for capturing the commitCompleted event. Right now we can only capture the CrawlerEvent.CRAWLER_FINISHED event and assume the committer's state, roughly as in the sketch below.
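
This is a minimal sketch of the current workaround, written from memory of the collector-core 1.x ICrawlerEventListener API (package locations and signatures may differ slightly in your version):

import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

public class AssumeCommitDoneListener implements ICrawlerEventListener {

    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        // All we can observe is that the crawler finished; we have to assume
        // the committer has already flushed its queue at this point.
        if (CrawlerEvent.CRAWLER_FINISHED.equals(event.getEventType())) {
            System.out.println("Crawler finished; assuming the commit completed.");
        }
    }
}

The listener would then be registered on the crawler configuration (via setCrawlerListeners(...), if I recall correctly), but there is still no event that fires specifically when commitCompleted is reached.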

Thanks again.

essiembre (Contributor) commented

Feature request accepted :-)
