All notes we can take for users, to facilitate migration from V3 to V4.
- All interfaces prefixed with "I" were renamed to drop the "I".
- All configurable classes with a "caseSensitive" attribute now have an "ignoreCase" attribute instead (defaults to false).
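As a rough illustration of the "caseSensitive" to "ignoreCase" change, the sketch below assumes the new flag is simply the logical inverse of the old one. The class and method names are hypothetical and not part of any Norconex API.

```java
// Hypothetical helper for illustration only; not part of any Norconex API.
final class IgnoreCaseMigration {

    // Assumption: V4 "ignoreCase" is the logical inverse of V3 "caseSensitive".
    static boolean toIgnoreCase(boolean caseSensitive) {
        return !caseSensitive;
    }

    // Plain-Java illustration of what the flag means when comparing two values.
    static boolean matches(String expected, String actual, boolean ignoreCase) {
        return ignoreCase
                ? expected.equalsIgnoreCase(actual)
                : expected.equals(actual);
    }

    public static void main(String[] args) {
        System.out.println(toIgnoreCase(true));               // prints false
        System.out.println(matches("Title", "title", false)); // prints false
        System.out.println(matches("Title", "title", true));  // prints true
    }
}
```

Since "ignoreCase" defaults to false, a configuration that relied on an implicit case-insensitive default in V3 may need the flag set explicitly in V4.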
com.norconex.collector
→com.norconex.crawler
com.norconex.committer.core3
→com.norconex.committer.core
-
Collector*
→CrawlSession*
-
CollectorEvent#COLLECTOR_*
→CrawlSessionEvent#SESSION_*
-
Collector#maxMemoryPool
→CrawlSession#maxStreamCachePoolSize
-
Collector#maxMemoryInstance
→CrawlSession#maxStreamCacheSize
-
CollectorLifeCycleListener#onCollector*
→CrawlSessionLifeCycleListener#onCrawlSession*
-
ImporterConfig#maxMemoryPool
→ImporterConfig#maxStreamCachePoolSize
-
ImporterConfig#maxMemoryInstance
→ImporterConfig#maxStreamCacheSize
-
ImporterConfig#DEFAULT_MAX_MEM_POOL
→ImporterConfig#DEFAULT_MAX_STREAM_CACHE_POOL_SIZE
-
ImporterConfig#DEFAULT_MAX_MEM_INSTANCE
→ImporterConfig#DEFAULT_MAX_STREAM_CACHE_SIZE
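The package renames above could be applied to existing source files with a simple prefix rewrite. The following is a minimal sketch, assuming plain string replacement on import lines is good enough for a first pass. The class name `PackageRenameSketch` and the `Example` class in the usage are hypothetical. It covers packages only, so renamed classes and members still need manual attention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of any Norconex API.
final class PackageRenameSketch {

    // Old-to-new package prefixes taken from these notes.
    // More specific prefixes come first so they win over the general one.
    static final Map<String, String> PREFIXES = new LinkedHashMap<>();
    static {
        // Listed further down in these notes (web crawler section).
        PREFIXES.put("com.norconex.collector.http", "com.norconex.crawler.web");
        PREFIXES.put("com.norconex.committer.core3", "com.norconex.committer.core");
        PREFIXES.put("com.norconex.collector", "com.norconex.crawler");
    }

    // Rewrites every known old package prefix found in the given line.
    static String migrateLine(String line) {
        String result = line;
        for (Map.Entry<String, String> e : PREFIXES.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
                migrateLine("import com.norconex.committer.core3.Example;"));
    }
}
```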
- Removed classes and methods deprecated in the previous major release.
- Removed "tempDir". Only "workDir" is configurable now. Classes in need of a temporary directory now derive it from the work dir (or use the OS-defined temporary directory).
- Collection setters that previously accepted both varargs and a collection now accept only a collection.
- Removed CrawlerConfigLoader.
collector
→crawlSession
maxMemoryPool
→maxStreamCachePoolSize
maxMemoryInstance
→maxStreamCacheSize
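If the three names above are read as XML configuration element names (an assumption, since the notes do not say so explicitly), an existing configuration file could be updated with a small rename pass. The sketch below uses a hypothetical `ConfigRenameSketch` class and only renames tags. It does not check whether the values or nesting are still valid in V4.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any Norconex API.
final class ConfigRenameSketch {

    // Old-to-new names taken from these notes (assumed to be XML elements).
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("collector", "crawlSession");
        RENAMES.put("maxMemoryPool", "maxStreamCachePoolSize");
        RENAMES.put("maxMemoryInstance", "maxStreamCacheSize");
    }

    // Renames opening and closing tags whose name matches exactly.
    static String migrate(String xml) {
        String result = xml;
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            // "<name" or "</name", followed by whitespace, '/' or '>',
            // so that longer names sharing the same prefix are left alone.
            Pattern p = Pattern.compile(
                    "(</?)" + Pattern.quote(e.getKey()) + "(?=[\\s/>])");
            result = p.matcher(result).replaceAll(
                    "$1" + Matcher.quoteReplacement(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String v3 = "<collector><maxMemoryPool>100</maxMemoryPool></collector>";
        System.out.println(migrate(v3));
    }
}
```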
- Minimum Java version: 17
- Different features originally found in HTTP Collector and Filesystem Collector were moved to Crawler Core.
- Removed configuration flags to ignore/disable a given feature when setting the corresponding object to null has the same effect.
- MemoryCommitter#clean will now clear the cached requests.
- New CommitterService and CommitterServiceEvent classes.
- Unless explicitly overwritten by a committer, each committer defined now has a working directory named after its simple class name. When more than one committer of the same class is defined, a number is appended to the name (e.g., "XMLFileCommitter_2").
- DataStoreEngine moved from crawler to crawl session.
- Added Apache Velocity JSR 223 Script Engine.
- JavaScript JSR 223 Script Engine now uses the GraalVM implementation.
- The Operator inner classes on DateMetadataFilter and NumericMetadataFilter were removed in favor of com.norconex.commons.lang.Operator.
- Renamed DocInfo to DocRecord
- New SaveDocumentTagger class.
- CommonMatchers pattern constants are now Collections instead of arrays.
- Classes dealing with time zones now default to UTC when zone is not declared.
- Removed GenericDocumentParserFactory (merged into core classes)
- FallbackParser is now DefaultParser.
- has been replaced with new section.
- now under
- now
- and now under
- , , and are now , , and , respectively.
- Handlers are now passed a DocContext.
- Taggers have been merged into Transformers.
- Filters removed in favor of conditions.
- Most handlers can now target either content or fields.
- New "discardOriginal" flag for splitters.
- Renamed CrawlDocInfo to CrawlDocRecord.
- Renamed .cmdline package to .cli
- Store export no longer prepares the store for a new crawl (exports as is).
- CrawlerCommitterService has been migrated from Crawler Core to Committer Core
- CrawlState renamed to CrawlDocState
- CrawlerLifeCycleListener is now abstract
- New CRAWLER_ERROR event.
- New crawler "idleTimeout" configuration option.
- New crawler "minProgressLoggingInterval" configuration option.
- MetadataFilter and ReferenceFilter renamed to GenericMetadataFilter and GenericReferenceFilter.
- Renamed CollectorCommandLauncher to CliLauncher.
- New MVStoreDataStoreConfig#ephemeral property for in-memory storage.
- The "maxDocuments" feature now represents the number of documents processed within a crawling session. If the crawler did not reach completion, the next session will resume where the last one ended.
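The per-session meaning of "maxDocuments" can be pictured with a toy model. The sketch below is not the crawler's implementation: `MaxDocumentsSketch` is a hypothetical class, and an in-memory queue stands in for the crawler's persisted state. It only shows a cap that applies to each session, with the next session picking up the remaining references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model for illustration only; not the crawler's actual implementation.
final class MaxDocumentsSketch {

    // Stand-in for the references still waiting to be processed.
    private final Deque<String> queue = new ArrayDeque<>();

    MaxDocumentsSketch(List<String> references) {
        queue.addAll(references);
    }

    // Processes at most maxDocuments references in this session and
    // returns the ones processed; the rest stay queued for the next session.
    List<String> runSession(int maxDocuments) {
        List<String> processed = new ArrayList<>();
        while (!queue.isEmpty() && processed.size() < maxDocuments) {
            processed.add(queue.poll());
        }
        return processed;
    }

    // True once every queued reference has been processed.
    boolean isComplete() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        MaxDocumentsSketch crawler =
                new MaxDocumentsSketch(List.of("a", "b", "c", "d", "e"));
        System.out.println(crawler.runSession(2)); // prints [a, b]
        System.out.println(crawler.runSession(2)); // prints [c, d]
        System.out.println(crawler.runSession(2)); // prints [e]
        System.out.println(crawler.isComplete());  // prints true
    }
}
```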
com.norconex.collector.http
→com.norconex.crawler.web
*.Http*
→*.Web*
- Removed Crawler configuration option "keepDownloads" and corresponding CrawlerEvent.DOCUMENT_SAVED event in favor of new Importer SaveDocumentTagger.
- References to HttpCollector or Collector changed to WebCrawlSession and CrawlSession, respectively.
- GenericRecrawlableResolver minimum frequencies now expect TextMatcher instead of regular expressions.
- RobotsTxt now instantiated via builder factory method.
- SitemapChangeFrequency#getSitemapChangeFrequency renamed to #of.
- URLNormalizer was renamed WebURLNormalizer to distinguish from com.norconex.commons.lang.url.URLNormalizer.
- GenericURLNormalizer enum constants are now uppercase.
- Moved fetchers to crawler-core: "httpFetchers" now just "fetchers".
- Moved startURLs* configuration options to crawler-core.
- Now supports HTTP/2 thanks to the Apache HttpClient upgrade to version 5.x.
- New stayOnSitemapWhenPresent option on .
- Too many changes to list. Major refactor to bring up to speed with V4-stack features.
- Now part of the same project and share the same version.
- New, experimental project.