-
Notifications
You must be signed in to change notification settings - Fork 219
Open
Description
Description
In Java VarOptItemsSketch heapify, full‑preamble deserialization currently only checks that rCount > 0, but does not validate the totalRWeight field. C++ (var_opt_sketch_impl.hpp) rejects NaN or non‑positive total_wt_r, and Go was updated to follow C++ behavior. This creates a cross‑language inconsistency and allows corrupted or malicious bytes to silently produce NaN/negative tau and invalid estimates.
Current Java code
File: src/main/java/org/apache/datasketches/sampling/VarOptItemsSketch.java
Method: heapify (around where numPreLongs == Family.VAROPT.getMaxPreLongs())
Current logic:
double totalRWeight = 0.0;
if (numPreLongs == Family.VAROPT.getMaxPreLongs()) {
if (rCount > 0) {
totalRWeight = extractTotalRWeight(srcSeg);
} else {
throw new SketchesArgumentException(
"Possible Corruption: " + Family.VAROPT.getMaxPreLongs()
+ " preLongs but no items in R region");
}
}Behavior in other languages
C++ (sampling/include/var_opt_sketch_impl.hpp):
- Full preamble → if
r == 0ortotal_wt_ris NaN or <= 0, throw"Possible corruption: deserializing in full mode but r = 0 or invalid R weight. Found r = ..., R region weight = ..."
Go (sampling/varopt_items_sketch.go):
- Full preamble → reject NaN or <= 0 totalRWeight (aligned with C++)
Proposed change
Add validation after extractTotalRWeight in Java:
double totalRWeight = 0.0;
if (numPreLongs == Family.VAROPT.getMaxPreLongs()) {
if (rCount > 0) {
totalRWeight = extractTotalRWeight(srcSeg);
if (Double.isNaN(totalRWeight) || totalRWeight <= 0.0) {
throw new SketchesArgumentException(
"Possible Corruption: deserializing in full mode but r = 0 or invalid R weight. "
+ "Found r = " + rCount + ", R region weight = " + totalRWeight);
}
} else {
throw new SketchesArgumentException(
"Possible Corruption: " + Family.VAROPT.getMaxPreLongs()
+ " preLongs but no items in R region");
}
}Why this matters
- Prevents NaN/negative tau from silently propagating
- Aligns Java behavior with C++
- Only rejects invalid/corrupted bytes
Metadata
Metadata
Assignees
Labels
No labels