V3 to V4

All notes we can take for users, to facilitate migration from V3 to V4.

Renamed

General

All interfaces prefixed with "I" were renamed to drop the "I".
All configurable classes with a "caseSensitive" attribute are having a "ignoreCase" attribute instead (defaults to false).

Packages:

com.norconex.collector → com.norconex.crawler
com.norconex.committer.core3 → com.norconex.committer.core

Classes/methods:

Collector* → CrawlSession*
CollectorEvent#COLLECTOR_* → CrawlSessionEvent#SESSION_*
Collector#maxMemoryPool → CrawlSession#maxStreamCachePoolSize
Collector#maxMemoryInstance → CrawlSession#maxStreamCacheSize
CollectorLifeCycleListener#onCollector* → CrawlSessionLifeCycleListener#onCrawlSession*
ImporterConfig#maxMemoryPool → ImporterConfig#maxStreamCachePoolSize
ImporterConfig#maxMemoryInstance → ImporterConfig#maxStreamCacheSize
ImporterConfig#DEFAULT_MAX_MEM_POOL → ImporterConfig#DEFAULT_MAX_STREAM_CACHE_POOL_SIZE
ImporterConfig#DEFAULT_MAX_MEM_INSTANCE → ImporterConfig#DEFAULT_MAX_STREAM_CACHE_SIZE
ImporterConfig#maxMemoryPool → ImporterConfig#maxStreamCachePoolSize

Removed

Removed classes methods deprecated in previous major release.
Removed "tempDir". Only "workDir" is configurable now. Classes in need of a temporary directory would derive it from the work dir (or use the OS-defined temporary directory).
Removed collection setters accepting both "vargars" and a collection to now only accept a collection.
Removed CrawlerConfigLoader.

XML Changes

collector → crawlSession
maxMemoryPool → maxStreamCachePoolSize
maxMemoryInstance → maxStreamCacheSize

Misc. Changes

Minimum Java version: 17
Different features originally found in HTTP Collector and Filesystem Collector were moved to Crawler Core.
Removed configuration flags to ignore/disabled a given feature when setting the corresponding object to null has the same effect.

Committer Core

MemoryCommitter#clean will now clear the cached requests.
New CommitterService and CommitterServiceEvent classes.
Unless explicitly overwritten by a committer, each committer defined will now have a working directory named after their simple class name. In case there are more than one of the same class defined, they'll be appended with a number (e.g., "XMLFileCommitter_2").
DataStoreEngine moved from crawler to crawl session.

Importer

Added Apache Velocity JSR 223 Script Engine.
JavaScript JSR 223 Script Engine now using GraalVM implementation.
The Operator inner class on DateMetadataFilter and NumericMetadataFilter were removed in favor of com.norconex.commons.lang.Operator
Renamed DocInfo to DocRecord
New SaveDocumentTagger class.
CommonMatchers pattern constants are now Collections instead of arrays.
Classes dealing with time zones now default to UTC when zone is not declared.
Removed GenericDocumentParserFactory (merged into core classes)
FallbackParser is now DefaultParser.
has been replaced with new section.
now under
now
and now under
, , and are now , , and , respectively.
Handlers now passed DocContext.
Taggers have been merged into Transformers.
Filters removed in favor of conditions.
Most handlers can now target either content or fields.
New "discardOriginal" flag for splitters.

Crawler Core

Renamed CrawlDocInfo to CrawlDocRecord.
Renamed .cmdline package to .cli
Store export no longer prepare the store for a new crawl (exports as is).
CrawlerCommitterService has been migrated from Crawler Core to Committer Core
CrawlState renamed to CrawlDocState
CrawlerLifeCycleListener is now abstract
New CRAWLER_ERROR event.
New crawler "idleTimeout" configuration option.
New crawler "minProgressLoggingInterval" configuration option.
MetadataFilter and ReferenceFilter renamed to GenericMetadataFilter and GenericReferenceFilter.
Renamed CollectorCommandLcuncher to CliLauncher.
New MVStoreDataStoreConfig#ephemeral property for in-memory storage.
The "maxDocuments" feature now represents the number of document processed within a crawling session. If the crawler did not reach completion, the next session will resume where it last ended.

Crawler Web

com.norconex.collector.http → com.norconex.crawler.web
*.Http* → *.Web*
Removed Crawler configuration option "keepDownloads" and corresponding CrawlerEvent.DOCUMENT_SAVED event in favor of new Importer SaveDocumentTagger.
References to HttpCollector or Collector changed to WebCrawlSession and CrawlSession, respectively.
GenericRecrawlableResolver minimum frequencies now expect TextMatcher instead of regular expressions.
RobotsTxt now instantiated via builder factory method.
SitemapChangeFrequency#getSitemapChangeFrequency renamed to #of.
URLNormalizer was renamed WebURLNormalizer to distinguish from com.norconex.commons.lang.url.URLNormalizer.
GenericURLNormalizer enum constants are now uppercase.
Moved fetchers to crawler-core: "httpFetchers" now just "fetchers".
Moved startURLs* configuration options to crawler-core.
Now supports HTTP/v2 thanks to Apache HttpClient upgrade to version 5.x
New stayOnSitemapWhenPresent option on .

Crawler File System

Too many changes to list. Major refactor to bring up to speed with V4-stack features.

Misc. Committers

Now part of the same project and share the same version.

Crawler Server

New, experimental project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

V4_MIGRATION.md

V4_MIGRATION.md

V3 to V4

Renamed

General

Packages:

Classes/methods:

Removed

XML Changes

Misc. Changes

Committer Core

Importer

Crawler Core

Crawler Web

Crawler File System

Misc. Committers

Crawler Server

Files

V4_MIGRATION.md

Latest commit

History

V4_MIGRATION.md

File metadata and controls

V3 to V4

Renamed

General

Packages:

Classes/methods:

Removed

XML Changes

Misc. Changes

Committer Core

Importer

Crawler Core

Crawler Web

Crawler File System

Misc. Committers

Crawler Server