Add warnings for Invalid language escape sequences in text strings#689
Add warnings for Invalid language escape sequences in text strings#689MaximPlusov merged 3 commits intointegrationfrom
Conversation
WalkthroughAdds private validation of language escape sequences for BOM-prefixed text strings in Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/main/java/org/verapdf/cos/COSString.java (1)
260-260: Fix grammar in warning message.The warning message uses "have" but should use "has" for grammatical correctness.
🔎 Apply this diff:
- LOGGER.warning("Text string language escape sequence have invalid length"); + LOGGER.warning("Text string language escape sequence has invalid length");- LOGGER.warning("Text string language escape sequence have invalid length"); + LOGGER.warning("Text string language escape sequence has invalid length");Also applies to: 291-291
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/main/java/org/verapdf/cos/COSString.java(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/java/org/verapdf/cos/COSString.java (1)
src/main/java/org/verapdf/tools/PDFDocEncoding.java (1)
PDFDocEncoding(28-112)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (2)
src/main/java/org/verapdf/cos/COSString.java (2)
298-301: LGTM!The helper method correctly identifies ASCII letters (A-Z and a-z) with proper bit masking to handle Java's signed bytes.
237-265: No changes needed—the language code encoding is correct per PDF specification.The PDF specification defines the escape sequence as: the Unicode value U+001B (byte sequence 0 followed by 27), followed by a 2-character ISO 639 language code—for example, en for English or ja for Japanese. Language escape sequences in UTF-16BE strings use a 2-byte language tag with an optional 2-byte country code.
The current implementation is correct. Language codes in UTF-16BE strings are stored as raw ASCII bytes (not UTF-16BE encoded characters). For example, the language code "en" is represented as two single bytes (0x65, 0x6E), not as UTF-16BE encoded characters that would require four bytes (0x00 0x65, 0x00 0x6E). The method correctly validates this format by checking individual bytes as ASCII letters.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/main/java/org/verapdf/cos/COSString.java (1)
222-236: Critical: Validation return values are ignored (issue not resolved from previous review).The
checkUTF16BEEscapeSequenceandcheckUTF8EscapeSequencemethods returnfalsewhen invalid escape sequences are detected, butisTextStringignores these return values and always returnstruefor BOM-prefixed strings. This completely defeats the purpose of the validation—strings with invalid language escape sequences are incorrectly classified as valid text strings.🔎 Apply this diff to fix the issue:
public boolean isTextString() { if (value.length > 2) { if ((value[0] & 0xFF) == 0xFE && (value[1] & 0xFF) == 0xFF) { - checkUTF16BEEscapeSequence(value); - return true; + return checkUTF16BEEscapeSequence(value); } } if (value.length > 3) { if ((value[0] & 0xFF) == 0xEF && (value[1] & 0xFF) == 0xBB && (value[2] & 0xFF) == 0xBF) { - checkUTF8EscapeSequence(value); - return true; + return checkUTF8EscapeSequence(value); } } return PDFDocEncoding.isPDFDocEncodingString(value);
🧹 Nitpick comments (3)
src/main/java/org/verapdf/as/warnings/StringWarnings.java (1)
3-7: Consider following utility class best practices.The class contains only static constants but can still be instantiated or subclassed. Consider making the class
finaland adding a private constructor to prevent instantiation, following standard utility class patterns.🔎 View suggested enhancement:
-public class StringWarnings { +public final class StringWarnings { + private StringWarnings() { + // Prevent instantiation + } + public static final String NOT_ASCII_LETTER = "Text string language escape sequence contains not ASCII letter"; public static final String INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH = "Text string language escape sequence has invalid length"; public static final String NOT_SUPPORTED_UTF16LE_ENCODING = "String object uses encoding UTF16-LE not supported by PDF"; }src/main/java/org/verapdf/cos/COSString.java (2)
261-261: Add space after comma.- LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH); + LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
292-292: Add space after comma.- LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH); + LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/main/java/org/verapdf/as/warnings/StringWarnings.java(1 hunks)src/main/java/org/verapdf/cos/COSString.java(3 hunks)
🔇 Additional comments (5)
src/main/java/org/verapdf/cos/COSString.java (5)
23-23: LGTM!Import correctly added to reference the new warning constants.
115-115: LGTM!Refactoring to use a centralized constant improves maintainability.
268-297: UTF-8 escape sequence validation logic looks correct.The validation properly handles UTF-8 encoding where ASCII characters are single bytes. The pattern matching for 2 or 4 letter sequences followed by ESC markers is appropriate.
299-302: LGTM!The helper method correctly identifies ASCII letters (A-Z, a-z) by checking the appropriate hex ranges.
238-266: Fix minor formatting issue: add space after comma in log statement (line 261)Line 261 is missing a space after the comma in the log statement:
LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);should beLOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
Summary by CodeRabbit
Bug Fixes
Chores
✏️ Tip: You can customize this high-level summary in your review settings.