ERROR - Could not commit batched operations #4

Open
akumar251 opened this issue Dec 26, 2018 · 3 comments

akumar251 commented Dec 26, 2018

Hi,
I am trying to crawl a sitemap XML file that includes a large number of URLs. The crawl itself completes - 100% (563 processed/563 total) - but I get an error when committing to Azure.
I have tried running Norconex many times, using this command: collector-http.bat -a start -c collectorconfig.xml

Please find below the error details from the logs:

Crawler : 2018-12-25 23:34:04 INFO - Azure Search REST API Http Client closed.
Crawler : 2018-12-25 23:34:04 INFO - Azure Search REST API Http Client closed.
Crawler : 2018-12-25 23:34:04 ERROR - Could not commit batched operations.
com.norconex.committer.core.CommitterException: Invalid HTTP response: "HTTP/1.1 413 Request Entity Too Large". Azure Response: The page was not displayed because the request entity is too large.
	at com.norconex.committer.azuresearch.AzureSearchCommitter.handleResponse(AzureSearchCommitter.java:509)
	at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:478)
	at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
	at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
	at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
	at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
	at com.norconex.committer.azuresearch.AzureSearchCommitter.commit(AzureSearchCommitter.java:405)
	at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
	at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
	at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
	at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
	at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
	at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
	at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
	at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
	at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
	at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)

Can you please advise what needs to be done to resolve this?

Br,
Akash

akumar251 (Author) commented Dec 26, 2018

Collector Config file:

<httpcollector id="Collector1">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($urlFilter = "com.norconex.collector.http.filter.impl.RegexURLFilter")

  <crawlerDefaults>

    <urlNormalizer class="$urlNormalizer" />

    <numThreads>4</numThreads>
    <maxDepth>1</maxDepth>
    <maxDocuments>-1</maxDocuments>
    <workDir>./norconexcollector</workDir>
    <orphansStrategy>DELETE</orphansStrategy>

    <delay default="0" />
    <sitemapResolverFactory ignore="false" />
    <robotsTxt ignore="true" />

    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,jpeg,svg,gif,png,ico,css,js,xlsx,pdf,zip,xml</filter>
    </referenceFilters>
  </crawlerDefaults>

  <crawlers>
    <crawler id="CrawlerID">
      <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <sitemap>https://*******.com/sitemap.xml</sitemap>
      </startURLs>

      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference,title,description,content</fields>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="document.reference" toField="reference" />
          </tagger>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
            <!-- carriage return -->
            <reduce>\r</reduce>
            <!-- new line -->
            <reduce>\n</reduce>
            <!-- tab -->
            <reduce>\t</reduce>
            <!-- whitespaces -->
            <reduce>\s</reduce>
          </transformer>

          <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
            <replace>
              <fromValue>\n</fromValue>
              <toValue></toValue>
            </replace>
            <replace>
              <fromValue>\t</fromValue>
              <toValue></toValue>
            </replace>
          </transformer>
        </postParseHandlers>
      </importer>

      <!-- Azure committer settings -->
      <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
        <endpoint>********</endpoint>
        <apiKey>***********</apiKey>
        <indexName>**********</indexName>
        <maxRetries>3</maxRetries>
        <targetContentField>content</targetContentField>
        <queueDir>./queuedir</queueDir>
        <queueSize>6000</queueSize>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

essiembre (Contributor) commented
This error is coming from Azure. Is it possible you have large documents? Online research suggests this happens when uploading something too big. I would suggest you try setting commitBatchSize to 10 (or lower) on your committer to see if it makes a difference (the default is 100).
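
For example, the option would go inside the committer element of the config posted above (a minimal sketch; the value 10 is illustrative, not a recommendation):

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  ...
  <!-- Send fewer documents to Azure Search per HTTP request so each
       batch stays under the request size limit (default is 100). -->
  <commitBatchSize>10</commitBatchSize>
</committer>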

You can find many Azure/IIS users reporting this problem, and the upload limit appears to be configurable. For instance, this Microsoft thread gives you a few options: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/d729a842-8ed9-466e-9ba8-4256ea294548/http11-413-request-entity-too-large?forum=biztalkgeneral

An excerpt:

Check the IIS Request Filtering settings and set the maximum allowed content length to a higher value. Also, there is an IIS setting, "UploadReadAheadSize", that prevents upload and download of data greater than 49 KB. The default value is 49152 bytes and can be increased up to 4 GB.
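
If the 413 is coming from an IIS server you control, those two limits typically live in web.config. A sketch, assuming a self-hosted IIS front end (the byte values are illustrative only):

<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <!-- Maximum request body size accepted by IIS request filtering,
             in bytes (here roughly 100 MB). -->
        <requestLimits maxAllowedContentLength="104857600" />
      </requestFiltering>
    </security>
    <!-- Number of bytes IIS reads ahead from the request entity body;
         the 49152-byte default is the 49 KB limit mentioned above. -->
    <serverRuntime uploadReadAheadSize="10485760" />
  </system.webServer>
</configuration>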

Hopefully this gives you a few pointers; otherwise, you will have to ask Azure support how to increase the limit.

akumar251 (Author) commented

Hello @essiembre,

Thanks for the quick update.
I suspect the issue is indeed the size: I have many sitemap files that get uploaded successfully, and the largest one that uploaded (as checked in the log files) is 72 MB, while the file causing the issue is over 90 MB.

I will check that setting and will contact Azure support if it does not solve the problem.

Br,
Akash
