NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator by lewismc · Pull Request #888 · apache/nutch

lewismc · 2026-01-13T05:12:34Z

This PR is proposed as a fix for NUTCH-2455 and is intended to supersede #254.

In essence, this PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, eliminating the need to load the entire HostDb into memory.

Problem

The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:

High memory consumption (O(HostDb size) per reducer)
OutOfMemoryError for large HostDbs
Startup latency while loading data

Solution

Use MapReduce secondary sorting to stream HostDb entries through the pipeline:

Composite Key (FloatTextPair): Combines score and hostname to enable sorting
Custom Comparator (ScoreHostKeyComparator): Ensures HostDb entries arrive before CrawlDb entries
MultipleInputs: Reads both HostDb and CrawlDb in a single MapReduce job
Streaming Reducer: Processes HostDb entries as they arrive, no preloading required
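
To illustrate the streaming idea, here is a minimal plain-Java sketch with no Hadoop dependency. It assumes the sort has already placed the HostDb records of a partition ahead of its CrawlDb records. The class `StreamingSelectorSketch`, the `Entry` type, and the `select` method are hypothetical names for illustration and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch: HostDb records precede CrawlDb records in one sorted stream. */
class StreamingSelectorSketch {

    /** Hypothetical record: either a HostDb entry (with a limit) or a CrawlDb URL. */
    static final class Entry {
        final boolean fromHostDb;
        final String host;
        final String url;   // null for HostDb entries
        final int maxCount; // per-host limit, only meaningful for HostDb entries

        private Entry(boolean fromHostDb, String host, String url, int maxCount) {
            this.fromHostDb = fromHostDb;
            this.host = host;
            this.url = url;
            this.maxCount = maxCount;
        }

        static Entry hostDb(String host, int maxCount) {
            return new Entry(true, host, null, maxCount);
        }

        static Entry crawlDb(String host, String url) {
            return new Entry(false, host, url, 0);
        }
    }

    /** Consumes entries in arrival order; nothing is loaded before the loop starts. */
    static List<String> select(List<Entry> sortedEntries, int defaultMaxCount) {
        Map<String, Integer> limits = new HashMap<>(); // filled from the stream itself
        Map<String, Integer> counts = new HashMap<>(); // URLs seen so far per host
        List<String> selected = new ArrayList<>();
        for (Entry e : sortedEntries) {
            if (e.fromHostDb) {
                limits.put(e.host, e.maxCount);
                continue;
            }
            int limit = limits.getOrDefault(e.host, defaultMaxCount);
            int seen = counts.merge(e.host, 1, Integer::sum);
            if (seen <= limit) {
                selected.add(e.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> stream = Arrays.asList(
            Entry.hostDb("a.example", 1),
            Entry.crawlDb("a.example", "http://a.example/1"),
            Entry.crawlDb("a.example", "http://a.example/2"),
            Entry.crawlDb("b.example", "http://b.example/1"));
        // prints [http://a.example/1, http://b.example/1]
        System.out.println(select(stream, 2));
    }
}
```

In this sketch the map only ever holds hosts that appeared in the reducer's own input, which is the property the secondary sort is meant to provide.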

Key Components

FloatTextPair

public static class FloatTextPair implements WritableComparable<FloatTextPair> {
    public FloatWritable first;  // score (negative for HostDb)
    public Text second;          // hostname (empty for CrawlDb)
}

ScoreHostKeyComparator

Sorting order:

HostDb entries first (non-empty hostname), sorted by hostname
CrawlDb entries second (empty hostname), sorted by score descending
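
The ordering above can be sketched in plain Java as follows. This is an illustration under assumptions, not the comparator from the PR: `Key` is a hypothetical stand-in for FloatTextPair, and `String.compareTo` stands in for whatever hostname comparison the real comparator uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java sketch of the described ordering (no Hadoop dependency). */
class ScoreHostOrderSketch {

    /** Hypothetical stand-in for the composite key: a score plus a hostname. */
    static final class Key {
        final float score;
        final String host; // empty for CrawlDb entries

        Key(float score, String host) {
            this.score = score;
            this.host = host;
        }

        boolean isHostDb() {
            return !host.isEmpty();
        }

        @Override
        public String toString() {
            return isHostDb() ? "host:" + host : "score:" + score;
        }
    }

    static final Comparator<Key> ORDER = (a, b) -> {
        // HostDb entries (non-empty hostname) sort before CrawlDb entries.
        if (a.isHostDb() != b.isHostDb()) {
            return a.isHostDb() ? -1 : 1;
        }
        // Among HostDb entries: ascending by hostname.
        if (a.isHostDb()) {
            return a.host.compareTo(b.host);
        }
        // Among CrawlDb entries: descending by score.
        return Float.compare(b.score, a.score);
    };

    static List<String> sortedLabels(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> labels = new ArrayList<>();
        for (Key k : copy) {
            labels.add(k.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        // prints [host:a.example, host:b.example, score:2.0, score:0.5]
        System.out.println(sortedLabels(Arrays.asList(
            new Key(0.5f, ""),
            new Key(-Float.MAX_VALUE, "b.example"),
            new Key(2.0f, ""),
            new Key(-Float.MAX_VALUE, "a.example"))));
    }
}
```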

HostDbReaderMapper

Reads the HostDb and emits each entry with a special key so that it sorts before CrawlDb entries:

context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);

Configuration

Property	Description
`generate.hostdb`	Path to HostDb (enables feature)
`generate.max.count.expr`	JEXL expression for per-host URL limit
`generate.fetch.delay.expr`	JEXL expression for per-host fetch delay

Example JEXL Expressions

<!-- Limit hosts with many failures to 10 URLs -->
<property>
  <name>generate.max.count.expr</name>
  <value>connectionFailures > 100 ? 10 : 1000</value>
</property>

<!-- Increase delay for unreliable hosts -->
<property>
  <name>generate.fetch.delay.expr</name>
  <value>connectionFailures > 50 ? 5000 : 1000</value>
</property>
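
For readers unfamiliar with JEXL ternaries, the two expressions above behave like the following plain-Java methods. This is only an equivalent-logic sketch: the class and method names are hypothetical, JEXL itself is not used, and the unit of the delay value is not stated in this description.

```java
/** Plain-Java equivalent of the two example expressions (not JEXL itself). */
class HostDbExpressionSketch {

    // Mirrors: connectionFailures > 100 ? 10 : 1000
    static int maxCount(long connectionFailures) {
        return connectionFailures > 100 ? 10 : 1000;
    }

    // Mirrors: connectionFailures > 50 ? 5000 : 1000
    static long fetchDelay(long connectionFailures) {
        return connectionFailures > 50 ? 5000L : 1000L;
    }

    public static void main(String[] args) {
        System.out.println(maxCount(100));  // prints 1000 (the comparison is strict)
        System.out.println(maxCount(101));  // prints 10
        System.out.println(fetchDelay(51)); // prints 5000
    }
}
```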

Performance

Aspect	Before	After
Memory per reducer	O(H) where H = total hosts	O(P) where P = hosts in partition
Startup time	Load entire HostDb	None (streaming)
Scalability	Limited by JVM heap	Scales with cluster size

Backward Compatibility

When generate.hostdb is not set, behavior is unchanged
Existing configurations continue to work
JEXL expressions only evaluated when HostDb is provided

Testing

Unit tests (9): FloatTextPair serialization, equality, comparison; ScoreHostKeyComparator ordering
Integration tests (3): Variable max count, variable fetch delay, backward compatibility

sebastian-nagel · 2026-02-08T21:59:17Z

Hi @lewismc, I've resolved the merge conflicts and clarified the Javadoc regarding the memory requirements: the current solution still loads the HostDb into memory, but each reducer loads only the part it needs, for the hosts it processes, into the HashMap hostDomainCounts. Testing is in progress...

sebastian-nagel · 2026-02-08T22:35:40Z

Fixed the logging of Generator counters. It now shows:

2026-02-08 23:24:36,595 INFO crawl.Generator: Generator: number of items rejected during selection:
2026-02-08 23:24:36,598 INFO crawl.Generator: Generator:      1  schedule_rejected_total
2026-02-08 23:24:36,598 INFO crawl.Generator: Generator:  31249  score_too_low_total

sebastian-nagel

Thanks @lewismc for getting this PR ready for merge. It has been waiting for quite a long time. Thanks also to @okedoki!

Successfully tested in both local and distributed mode. Distributed mode was tested with:

3 fetchers
2 segments
with and without HostDb
a JEXL expression setting generate.max.count.expr per host

src/java/org/apache/nutch/crawl/Generator.java

lewismc · 2026-02-10T01:11:49Z

@sebastian-nagel I'm fine to leave this out of 1.22 to permit time for more peer review. wdyt?

sebastian-nagel · 2026-02-10T11:03:43Z

leave this out of 1.22 to permit time for more peer review.

Agreed.

This is an important feature and makes running Generator with a HostDb scalable. But since Generator is one of the core parts, more testing is recommended.

There's a performance regression if Generator is used without HostDb - about 20% longer runtime when generating from a large CrawlDb.

The HostDatum in SelectorEntry is not optional and its serialization is not trivial. This is addressed in 7dcc2f4.
Deserializing and comparing floats from FloatTextPair when sorting has an even heavier impact. The FloatWritable class appears to be optimized in this respect. Eventually, we could use two Selector job definitions, depending on whether there is a HostDb or not.
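
As background on the second point, the general idea of comparing serialized floats without first building key objects can be sketched in plain Java. This is an illustration of the technique only, assuming a big-endian 4-byte float encoding; the class and method names are hypothetical and the code is taken neither from this PR nor from FloatWritable.

```java
import java.nio.ByteBuffer;

/** Sketch: compare two serialized floats directly on their bytes. */
class RawFloatCompareSketch {

    /** Reads a big-endian IEEE 754 float from a buffer at the given offset. */
    static float readFloat(byte[] buf, int offset) {
        int bits = ((buf[offset] & 0xff) << 24)
                 | ((buf[offset + 1] & 0xff) << 16)
                 | ((buf[offset + 2] & 0xff) << 8)
                 | (buf[offset + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    /** Compares two serialized floats without allocating any key object. */
    static int compareRaw(byte[] a, int aOffset, byte[] b, int bOffset) {
        return Float.compare(readFloat(a, aOffset), readFloat(b, bOffset));
    }

    /** Helper for the demo: big-endian encoding (ByteBuffer's default order). */
    static byte[] serialize(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    public static void main(String[] args) {
        // Negative result: 1.5 sorts before 2.5 in ascending order.
        System.out.println(compareRaw(serialize(1.5f), 0, serialize(2.5f), 0) < 0);
    }
}
```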

Async-profiler flamegraphs:

generator.selector.nutch-2455.20260210102545.flamegraph.html (this PR)
generator.selector.20260210103253.flamegraph.html (recent master, for comparison)

okedoki and others added 12 commits December 8, 2017 16:54

fix for NUTCH-2455 more efficient usage of hostdb in generate

c1ce018

added id to output files

c3e1a6e

Revert "added id to output files"

e20973c

This reverts commit c3e1a6e.

fix of the partitioner bug for NUTCH-2455

16f26f1

Merge branch 'master' into NUTCH-2455

d2451af

formating change #3

709aa0e

master conflict solved for NUTCH-2455

d608868

bug fix for NUTCH-2455 hostdatum in generate wasnot coppied correctly by reference, fixed with clone

767e2e7

fix for NUTCH-2455 lost line hostDatum = entry.hostdatum in the merging process

6fe1afd

NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator

a50c958

Merge branch 'master' into NUTCH-2455

b4fae43

NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator

c7cfa39

Add note that the current solution still loads the HostDb into memory, but every selector reducer loads only the part of the HostDb containing the hosts processed in this reducer.

Use metrics counter group constants introduced in NUTCH-3132

fe29303

sebastian-nagel approved these changes Feb 8, 2026

lewismc mentioned this pull request Feb 9, 2026

fix for NUTCH-2455 more efficient usage of hostdb in generate #254

Closed

NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator

7dcc2f4

Make HostDatum in SelectorEntry optional for faster serialization

lewismc wants to merge 14 commits into apache:master from lewismc:NUTCH-2455

3 participants
