NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator #888

Open
lewismc wants to merge 14 commits into apache:master from lewismc:NUTCH-2455
Conversation

@lewismc
Member

@lewismc lewismc commented Jan 13, 2026

This PR proposes a fix for NUTCH-2455 and supersedes #254.

In essence, this PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, eliminating the need to load the entire HostDb into memory.

Problem

The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:

  • High memory consumption (O(HostDb size) per reducer)
  • OutOfMemoryError for large HostDbs
  • Startup latency while loading data

Solution

Use MapReduce secondary sorting to stream HostDb entries through the pipeline:

  1. Composite Key (FloatTextPair): Combines score and hostname to enable sorting
  2. Custom Comparator (ScoreHostKeyComparator): Ensures HostDb entries arrive before CrawlDb entries
  3. MultipleInputs: Reads both HostDb and CrawlDb in a single MapReduce job
  4. Streaming Reducer: Processes HostDb entries as they arrive, no preloading required
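
The streaming idea in step 4 can be illustrated with a minimal JDK-only sketch. It is not the Generator's reducer: the `Rec` type, the `DEFAULT_MAX` fallback, and the `select` method are hypothetical names introduced for illustration. It only assumes that, thanks to the sort order, every HostDb record reaches the reducer before any CrawlDb record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: consume a pre-sorted stream, HostDb records first. */
final class StreamingSelectSketch {

    /** Hypothetical input record; maxCount is only meaningful for HostDb records. */
    static final class Rec {
        final boolean fromHostDb;
        final String host;
        final String url;
        final int maxCount;

        private Rec(boolean fromHostDb, String host, String url, int maxCount) {
            this.fromHostDb = fromHostDb;
            this.host = host;
            this.url = url;
            this.maxCount = maxCount;
        }

        static Rec host(String host, int maxCount) {
            return new Rec(true, host, null, maxCount);
        }

        static Rec url(String host, String url) {
            return new Rec(false, host, url, 0);
        }
    }

    /** Hypothetical fallback limit for hosts without a HostDb record. */
    static final int DEFAULT_MAX = 2;

    /** Selects URLs while honoring the per-host limit seen earlier in the stream. */
    static List<String> select(Iterable<Rec> sorted) {
        Map<String, Integer> limits = new HashMap<>(); // filled from HostDb records
        Map<String, Integer> counts = new HashMap<>(); // URLs seen so far per host
        List<String> selected = new ArrayList<>();
        for (Rec r : sorted) {
            if (r.fromHostDb) {
                limits.put(r.host, r.maxCount); // remember the limit, emit nothing
                continue;
            }
            int limit = limits.getOrDefault(r.host, DEFAULT_MAX);
            int seen = counts.merge(r.host, 1, Integer::sum);
            if (seen <= limit) {
                selected.add(r.url);
            }
        }
        return selected;
    }
}
```

Only the limits of hosts that actually pass through this reducer end up in the map, which is the property the approach relies on.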

Key Components

FloatTextPair

public static class FloatTextPair implements WritableComparable<FloatTextPair> {
    public FloatWritable first;  // score (negative for HostDb)
    public Text second;          // hostname (empty for CrawlDb)
}
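
For readers unfamiliar with composite keys, here is a JDK-only sketch of how a (score, hostname) pair might be serialized and read back in a fixed field order. It is an illustration under assumptions, not the PR's class: `ScoreHostKey` and `roundTrip` are hypothetical names, a plain `float` and `String` stand in for `FloatWritable` and `Text`, and `writeUTF` stands in for whatever encoding `Text` uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

/** Hypothetical JDK-only stand-in for a (score, hostname) composite key. */
final class ScoreHostKey {
    float score;   // stands in for the FloatWritable field
    String host;   // stands in for the Text field; "" for CrawlDb entries

    ScoreHostKey(float score, String host) {
        this.score = score;
        this.host = host;
    }

    /** Writes both fields in a fixed order. */
    void write(DataOutput out) throws IOException {
        out.writeFloat(score);
        out.writeUTF(host); // assumption: a simplification of Text's encoding
    }

    /** Reads the fields back in the same order they were written. */
    void readFields(DataInput in) throws IOException {
        score = in.readFloat();
        host = in.readUTF();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ScoreHostKey)) {
            return false;
        }
        ScoreHostKey other = (ScoreHostKey) o;
        return Float.compare(score, other.score) == 0 && host.equals(other.host);
    }

    @Override
    public int hashCode() {
        return Objects.hash(score, host);
    }

    /** Serializes a key to bytes and deserializes it into a fresh instance. */
    static ScoreHostKey roundTrip(ScoreHostKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            ScoreHostKey copy = new ScoreHostKey(0f, "");
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```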

ScoreHostKeyComparator

Sorting order:

  1. HostDb entries first (non-empty hostname), sorted by hostname
  2. CrawlDb entries second (empty hostname), sorted by score descending
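
The ordering described above can be written as a plain `java.util.Comparator`. This is a sketch of the stated rules only, not the PR's `ScoreHostKeyComparator`; the `Key` type and `sortedLabels` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortOrderSketch {

    /** Hypothetical key: host is "" for CrawlDb entries, non-empty for HostDb entries. */
    static final class Key {
        final float score;
        final String host;

        Key(float score, String host) {
            this.score = score;
            this.host = host;
        }

        boolean isHostDb() {
            return !host.isEmpty();
        }

        @Override
        public String toString() {
            return isHostDb() ? "host:" + host : "url:" + score;
        }
    }

    /** HostDb keys first (by hostname), then CrawlDb keys (by score, descending). */
    static final Comparator<Key> ORDER = (a, b) -> {
        if (a.isHostDb() != b.isHostDb()) {
            return a.isHostDb() ? -1 : 1;
        }
        if (a.isHostDb()) {
            return a.host.compareTo(b.host);
        }
        return Float.compare(b.score, a.score); // arguments swapped: descending
    };

    /** Returns the string labels of the keys in sorted order. */
    static List<String> sortedLabels(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> labels = new ArrayList<>();
        for (Key k : copy) {
            labels.add(k.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key(0.5f, ""),
            new Key(-Float.MAX_VALUE, "b.example"),
            new Key(2.0f, ""),
            new Key(-Float.MAX_VALUE, "a.example"));
        // prints [host:a.example, host:b.example, url:2.0, url:0.5]
        System.out.println(sortedLabels(keys));
    }
}
```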

HostDbReaderMapper

Reads the HostDb and emits each entry with a special key so that it sorts before all CrawlDb entries:

context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);

Configuration

| Property | Description |
|---|---|
| generate.hostdb | Path to HostDb (enables feature) |
| generate.max.count.expr | JEXL expression for per-host URL limit |
| generate.fetch.delay.expr | JEXL expression for per-host fetch delay |

Example JEXL Expressions

<!-- Limit hosts with many failures to 10 URLs -->
<property>
  <name>generate.max.count.expr</name>
  <value>connectionFailures > 100 ? 10 : 1000</value>
</property>

<!-- Increase delay for unreliable hosts -->
<property>
  <name>generate.fetch.delay.expr</name>
  <value>connectionFailures > 50 ? 5000 : 1000</value>
</property>
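
To make the two expressions concrete, the following plain-Java sketch computes the same values as the ternaries above. It does not use the JEXL library or any Nutch code; the class and method names are hypothetical, and `connectionFailures` is assumed to be a numeric per-host counter.

```java
/** Hypothetical plain-Java equivalents of the two example expressions. */
final class HostLimitsSketch {

    /** Mirrors: connectionFailures > 100 ? 10 : 1000 */
    static int maxCount(long connectionFailures) {
        return connectionFailures > 100 ? 10 : 1000;
    }

    /** Mirrors: connectionFailures > 50 ? 5000 : 1000 */
    static long fetchDelay(long connectionFailures) {
        return connectionFailures > 50 ? 5000 : 1000;
    }
}
```

Note that both comparisons are strict, so a host with exactly 100 failures still gets the higher URL limit, and one with exactly 50 failures still gets the shorter delay.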

Performance

| Aspect | Before | After |
|---|---|---|
| Memory per reducer | O(H) where H = total hosts | O(P) where P = hosts in partition |
| Startup time | Load entire HostDb | None (streaming) |
| Scalability | Limited by JVM heap | Scales with cluster size |

Backward Compatibility

  • When generate.hostdb is not set, behavior is unchanged
  • Existing configurations continue to work
  • JEXL expressions are evaluated only when a HostDb is provided

Testing

  • Unit tests (9): FloatTextPair serialization, equality, comparison; ScoreHostKeyComparator ordering
  • Integration tests (3): Variable max count, variable fetch delay, backward compatibility

@sebastian-nagel
Contributor

Hi @lewismc, I've resolved the merge conflicts and added to and clarified the Javadoc regarding the memory requirements: the current solution still loads the HostDb into memory, but each reducer loads only the part it needs for the hosts it processes into the HashMap hostDomainCounts. Testing is in progress...

@sebastian-nagel
Contributor

Fixed the logging of Generator counters. It now shows:

2026-02-08 23:24:36,595 INFO crawl.Generator: Generator: number of items rejected during selection:
2026-02-08 23:24:36,598 INFO crawl.Generator: Generator:      1  schedule_rejected_total
2026-02-08 23:24:36,598 INFO crawl.Generator: Generator:  31249  score_too_low_total

Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks @lewismc for getting this PR ready for merge. It has been waiting for quite a long time. Thanks also to @okedoki!

Successfully tested in both local and distributed mode. For distributed mode:

  • 3 fetchers
  • 2 segments
  • with and without HostDb
  • with a JEXL expression in generate.max.count.expr to set the limit per host

@lewismc
Member Author

lewismc commented Feb 10, 2026

@sebastian-nagel I'm fine to leave this out of 1.22 to permit time for more peer review. wdyt?

…ion in Generator

Make HostDatum in SelectorEntry optional for faster serialization
@sebastian-nagel
Contributor

leave this out of 1.22 to permit time for more peer review.

Agreed.

This is an important feature and makes running Generator with a HostDb scalable. But since Generator is one of the core parts, more testing is recommended.

There's a performance regression if Generator is used without a HostDb: about 20% longer runtime when generating from a large CrawlDb.

  1. The HostDatum in SelectorEntry is not optional and its serialization is not trivial. This is addressed in 7dcc2f4.
  2. Deserializing and comparing floats from FloatTextPair when sorting has an even heavier impact. The FloatWritable class appears to be optimized in this respect. Eventually, we can use two Selector job definitions, depending on whether there is a HostDb or not.
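
One common way to keep sort-time comparison cheap is to compare keys directly on their serialized bytes instead of building key objects first. The following is a JDK-only sketch of that idea for a single big-endian float. It is not Nutch or Hadoop code and is not claimed to be how FloatWritable is implemented; all names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: compare serialized floats without creating key objects. */
final class RawFloatCompareSketch {

    /** Encodes a float as 4 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] encode(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Decodes a big-endian float at the given offset, with no allocation. */
    static float readFloat(byte[] buf, int off) {
        int bits = ((buf[off] & 0xff) << 24)
                 | ((buf[off + 1] & 0xff) << 16)
                 | ((buf[off + 2] & 0xff) << 8)
                 | (buf[off + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    /** Compares two serialized floats in ascending numeric order. */
    static int compareRaw(byte[] a, int offA, byte[] b, int offB) {
        return Float.compare(readFloat(a, offA), readFloat(b, offB));
    }
}
```

A composite key adds a variable-length hostname after the float, so a raw comparator for it would also have to decode or compare the text bytes, which is one plausible reason the combined key costs more to sort.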

[Figure: async-profiler flamegraphs]
