NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator #888
lewismc wants to merge 14 commits into apache:master
This reverts commit c3e1a6e.
… by reference, fixed with clone
…ion in Generator Add note that the current solution still loads the HostDb into memory, but every selector reducer loads only the part of the HostDb containing the hosts processed in this reducer.
Hi @lewismc, I've resolved the merge conflicts and clarified the Javadoc regarding the memory requirements: the current solution still loads the HostDb into memory, but every reducer loads into its HashMap only the part it needs for the hosts it processes.
Fixed the logging of Generator counters. It now shows:
@sebastian-nagel I'm fine to leave this out of 1.22 to permit time for more peer review. wdyt?
…ion in Generator Make HostDatum in SelectorEntry optional for faster serialization
Agreed. This is an important feature and makes running Generator with a HostDb scalable. But since Generator is one of the core parts, more testing is recommended. There's a performance regression if Generator is used without HostDb - about 20% longer runtime when generating from a large CrawlDb.
Async-profiler flamegraphs:
This PR is proposed as a fix for NUTCH-2455 and is also intended to supersede #254.
In essence this PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, eliminating the need to load the entire HostDb into memory.
Problem
The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:
Solution
Use MapReduce secondary sorting to stream HostDb entries through the pipeline:
FloatTextPair: Combines score and hostname to enable sorting
ScoreHostKeyComparator: Ensures HostDb entries arrive before CrawlDb entries
Key Components
FloatTextPair
ScoreHostKeyComparator
Sorting order:
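As a rough illustration of how a composite key can force this kind of ordering, the plain-Java sketch below sorts by host and then by descending score, tagging HostDb records with a marker score so that they sort ahead of the URLs of the same host. It is not the PR's code: `HostScoreKey`, `HOSTDB_MARKER`, and `ORDER` are hypothetical names, the marker convention is an assumption, and the real `FloatTextPair` and `ScoreHostKeyComparator` are Hadoop types whose exact ordering may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: a stand-in for a composite key and its sort comparator. */
class SecondarySortSketch {

    /** Hypothetical composite key: host name plus score. */
    static final class HostScoreKey {
        /** Assumed convention: this score value tags a HostDb record. */
        static final float HOSTDB_MARKER = Float.POSITIVE_INFINITY;
        final String host;
        final float score;

        HostScoreKey(String host, float score) {
            this.host = host;
            this.score = score;
        }

        static HostScoreKey hostDb(String host) {
            return new HostScoreKey(host, HOSTDB_MARKER);
        }

        boolean isHostDb() {
            return score == HOSTDB_MARKER;
        }

        @Override
        public String toString() {
            return host + (isHostDb() ? " [HostDb]" : " score=" + score);
        }
    }

    /** Order by host first, then by descending score, so the marker sorts first per host. */
    static final Comparator<HostScoreKey> ORDER = (a, b) -> {
        int byHost = a.host.compareTo(b.host);
        return byHost != 0 ? byHost : Float.compare(b.score, a.score);
    };

    /** Returns a sorted copy, mimicking what the shuffle phase would deliver. */
    static List<HostScoreKey> sorted(List<HostScoreKey> keys) {
        List<HostScoreKey> copy = new ArrayList<>(keys);
        Collections.sort(copy, ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<HostScoreKey> keys = new ArrayList<>();
        keys.add(new HostScoreKey("b.example", 0.5f));
        keys.add(new HostScoreKey("a.example", 0.2f));
        keys.add(HostScoreKey.hostDb("a.example"));
        keys.add(new HostScoreKey("a.example", 0.9f));
        keys.add(HostScoreKey.hostDb("b.example"));
        for (HostScoreKey k : sorted(keys)) {
            System.out.println(k);
        }
    }
}
```

With the sample keys in `main`, each host's HostDb record is printed first, followed by that host's URLs from highest to lowest score.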
HostDbReaderMapper
Reads HostDb and emits with special key to ensure sorting before CrawlDb entries:
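To show why this arrival order matters, here is a minimal plain-Java sketch of a consumer that reads an already sorted stream and keeps only the current host's settings in memory. It is a simplified model under stated assumptions, not the PR's mapper or reducer: `Rec`, `select`, and the per-host `maxCount` limit are hypothetical, and the input is assumed to be sorted so that a host's HostDb record precedes its URLs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: models a reducer-side pass over sorted records. */
class HostDbStreamSketch {

    /** Hypothetical record: a HostDb entry (url == null) or a CrawlDb URL. */
    static final class Rec {
        final String host;
        final String url;   // null marks a HostDb entry
        final int maxCount; // only meaningful for a HostDb entry

        private Rec(String host, String url, int maxCount) {
            this.host = host;
            this.url = url;
            this.maxCount = maxCount;
        }

        static Rec hostDb(String host, int maxCount) {
            return new Rec(host, null, maxCount);
        }

        static Rec crawlDb(String host, String url) {
            return new Rec(host, url, 0);
        }

        boolean isHostDb() {
            return url == null;
        }
    }

    /** Selects URLs per host, using the host's limit if a HostDb record came first. */
    static List<String> select(Iterable<Rec> sorted, int defaultMax) {
        List<String> selected = new ArrayList<>();
        String currentHost = null;
        int limit = defaultMax;
        int taken = 0;
        for (Rec r : sorted) {
            if (!r.host.equals(currentHost)) {
                // New host: forget the previous host's state.
                currentHost = r.host;
                limit = defaultMax;
                taken = 0;
            }
            if (r.isHostDb()) {
                limit = r.maxCount; // applies to the URLs that follow
                continue;
            }
            if (taken < limit) {
                selected.add(r.url);
                taken++;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Rec> sorted = Arrays.asList(
            Rec.hostDb("a.example", 1),
            Rec.crawlDb("a.example", "http://a.example/1"),
            Rec.crawlDb("a.example", "http://a.example/2"),
            Rec.crawlDb("b.example", "http://b.example/1"),
            Rec.crawlDb("b.example", "http://b.example/2"));
        System.out.println(select(sorted, 2));
    }
}
```

In this model `a.example` is capped at one URL by its HostDb record, while `b.example` has no record and falls back to the default of two. Because only one host's state is held at a time, memory use does not grow with the number of hosts.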
Configuration
generate.hostdb
generate.max.count.expr
generate.fetch.delay.expr
Example JEXL Expressions
Performance
Backward Compatibility
If generate.hostdb is not set, behavior is unchanged
Testing