Add warnings for Invalid language escape sequences in text strings by LonelyMidoriya · Pull Request #689 · veraPDF/veraPDF-parser

LonelyMidoriya · 2025-12-18T09:33:54Z

Summary by CodeRabbit

Bug Fixes
- Improved validation for BOM‑prefixed text strings (UTF‑16BE and UTF‑8), with stricter checks for language escape sequences and preserved behavior for valid/absent sequences.
- Now emits explicit warnings for invalid language escape sequences and for unsupported UTF‑16LE encoding.
Chores
- Added named warnings for clearer diagnostics when encoding or escape issues are detected.

coderabbitai · 2025-12-18T09:34:04Z

Walkthrough

Adds private validators for language escape sequences in BOM-prefixed text strings, called from COSString.isTextString; introduces warning constants in StringWarnings; and replaces the inline UTF‑16LE warning message with one of those constants. The validators log a specific warning and make isTextString return false when a sequence is invalid.

Changes

Cohort / File(s)	Summary
Text String Escape Sequence Validation `src/main/java/org/verapdf/cos/COSString.java`	Extends `isTextString` to detect BOM-prefixed UTF‑16BE and UTF‑8 strings and invoke new private validators `checkUTF16BEEscapeSequence(byte[])` and `checkUTF8EscapeSequence(byte[])`; adds `isASCIILetter(byte)` helper; validators check language escape sequences (0x00 0x1B for UTF‑16BE, 0x1B for UTF‑8), require 2 or 4 ASCII letters, log warnings and return false for invalid patterns.
String Warning Constants `src/main/java/org/verapdf/as/warnings/StringWarnings.java`	Adds `StringWarnings` class with `NOT_ASCII_LETTER`, `INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH`, and `NOT_SUPPORTED_UTF16LE_ENCODING` public static final string constants; `COSString` now uses these constants for logged warnings.
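
For readers without the diff at hand, the UTF‑8 check summarized in the table above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the class `Utf8LanguageEscapeSketch` and all of its method names are hypothetical, and it assumes each sequence is an ESC byte (0x1B), then 2 or 4 ASCII letters, then a closing ESC, as the summary and the later review comment describe. The real validators also log warnings; the sketch only returns a boolean.

```java
// Illustrative sketch; names are hypothetical and not taken from veraPDF-parser.
final class Utf8LanguageEscapeSketch {

    private static final int ESC = 0x1B;

    private Utf8LanguageEscapeSketch() {
        // static helpers only
    }

    /** Helper for examples: builds a byte array from int or char literals. */
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    /** True when the array starts with the UTF-8 BOM EF BB BF. */
    static boolean hasUtf8Bom(byte[] v) {
        return v.length >= 3
                && (v[0] & 0xFF) == 0xEF
                && (v[1] & 0xFF) == 0xBB
                && (v[2] & 0xFF) == 0xBF;
    }

    /** True for A-Z and a-z; the mask undoes sign extension of Java's signed byte. */
    static boolean isAsciiLetter(byte b) {
        int c = b & 0xFF;
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    /**
     * Returns false if any ESC-introduced sequence after the BOM is not
     * 2 or 4 ASCII letters followed by a closing ESC. Strings without a
     * UTF-8 BOM are not examined by this sketch and yield true.
     */
    static boolean hasValidLanguageEscapes(byte[] v) {
        if (!hasUtf8Bom(v)) {
            return true;
        }
        int i = 3; // skip the BOM
        while (i < v.length) {
            if ((v[i] & 0xFF) != ESC) {
                i++;
                continue;
            }
            // Find the closing ESC of this sequence.
            int close = i + 1;
            while (close < v.length && (v[close] & 0xFF) != ESC) {
                close++;
            }
            int letters = close - (i + 1);
            if (close >= v.length || (letters != 2 && letters != 4)) {
                return false; // unterminated, or invalid length
            }
            for (int k = i + 1; k < close; k++) {
                if (!isAsciiLetter(v[k])) {
                    return false; // non-letter inside the language code
                }
            }
            i = close + 1; // continue after the closing ESC
        }
        return true;
    }

    public static void main(String[] args) {
        // ESC "en" ESC "Hi" after the BOM
        System.out.println(hasValidLanguageEscapes(
                bytes(0xEF, 0xBB, 0xBF, ESC, 'e', 'n', ESC, 'H', 'i'))); // true
        // One-letter code is rejected
        System.out.println(hasValidLanguageEscapes(
                bytes(0xEF, 0xBB, 0xBF, ESC, 'e', ESC, 'H', 'i'))); // false
    }
}
```

Scanning byte by byte is safe here because in UTF‑8 the value 0x1B can only occur as the ESC character itself; every byte of a multi-byte sequence has its high bit set.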

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Check BOM detection bytes and that validators only run when BOMs are present.
Review escape-sequence parsing offsets, length checks (2 vs 4 letters) and ASCII-letter validation.
Confirm logging uses the new StringWarnings constants and UTF‑16LE message replacement.

Poem

🐰 I nibble bytes beneath the moonlit stack,

I sniff for BOMs and mark the letters back,
Two or four I count with careful paw,
I flag the wrongs and leave the right in awe,
A tiny hop — strings tidy on my track 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately summarizes the main change: adding warning constants and validation logic for invalid language escape sequences in text strings (UTF-16BE and UTF-8 BOMs).

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/main/java/org/verapdf/cos/COSString.java (1)

260-260: Fix grammar in warning message.

The warning message uses "have" but should use "has" for grammatical correctness.

🔎 Apply this diff:

-                LOGGER.warning("Text string language escape sequence have invalid length");
+                LOGGER.warning("Text string language escape sequence has invalid length");

Also applies to: 291-291

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcfffe0 and 43aace0.

📒 Files selected for processing (1)

src/main/java/org/verapdf/cos/COSString.java (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/main/java/org/verapdf/cos/COSString.java (1)

src/main/java/org/verapdf/tools/PDFDocEncoding.java (1)

PDFDocEncoding (28-112)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

🔇 Additional comments (2)

src/main/java/org/verapdf/cos/COSString.java (2)

298-301: LGTM!

The helper method correctly identifies ASCII letters (A-Z and a-z) with proper bit masking to handle Java's signed bytes.
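
The masking point is easy to miss, since Java's `byte` is signed and a literal such as 0xFE is an `int`. A small sketch of the pitfall, with hypothetical names (`SignedByteSketch`, `equalsUnsigned`) that are not from the PR; `isAsciiLetter` is only a stand-in for the helper under review. For the letter ranges the mask is mostly defensive, because negative bytes fall below 0x41 either way, but for BOM bytes such as 0xFE it is required.

```java
// Illustrative sketch; names are hypothetical and not taken from veraPDF-parser.
final class SignedByteSketch {

    private SignedByteSketch() {
        // static helpers only
    }

    /** Stand-in for an ASCII-letter check: A-Z is 0x41-0x5A, a-z is 0x61-0x7A. */
    static boolean isAsciiLetter(byte b) {
        int c = b & 0xFF; // widen to int without sign extension
        return (c >= 0x41 && c <= 0x5A) || (c >= 0x61 && c <= 0x7A);
    }

    /** Compares a signed byte with an unsigned value in the range 0..255. */
    static boolean equalsUnsigned(byte b, int unsignedValue) {
        return (b & 0xFF) == unsignedValue;
    }

    public static void main(String[] args) {
        byte bom = (byte) 0xFE;                        // stored as -2
        System.out.println(bom == 0xFE);               // false: -2 is compared with 254
        System.out.println(equalsUnsigned(bom, 0xFE)); // true
        System.out.println(isAsciiLetter((byte) 'n')); // true
        System.out.println(isAsciiLetter((byte) '1')); // false
    }
}
```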

237-265: No changes needed—the language code encoding is correct per PDF specification.

The PDF specification defines the escape sequence as: the Unicode value U+001B (byte sequence 0 followed by 27), followed by a 2-character ISO 639 language code—for example, en for English or ja for Japanese. Language escape sequences in UTF-16BE strings use a 2-byte language tag with an optional 2-byte country code.

The current implementation is correct. Language codes in UTF-16BE strings are stored as raw ASCII bytes (not UTF-16BE encoded characters). For example, the language code "en" is represented as two single bytes (0x65, 0x6E), not as UTF-16BE encoded characters that would require four bytes (0x00 0x65, 0x00 0x6E). The method correctly validates this format by checking individual bytes as ASCII letters.
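
To make the byte layout described in this comment concrete, here is a minimal sketch that assembles such a string. It follows the comment's reading of the format (BOM, 0x00 0x1B, raw ASCII language code, 0x00 0x1B, then UTF‑16BE text) and is not taken from the PR; the class `Utf16beLanguageTagSketch` and the method `withLanguage` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; names are hypothetical and not taken from veraPDF-parser.
final class Utf16beLanguageTagSketch {

    private Utf16beLanguageTagSketch() {
        // static helpers only
    }

    /**
     * Builds: FE FF | 00 1B | language code as raw ASCII bytes | 00 1B | text in UTF-16BE.
     * The language code is assumed to be plain ASCII, for example "en" or "enUS".
     */
    static byte[] withLanguage(String languageCode, String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFE); // UTF-16BE BOM
        out.write(0xFF);
        out.write(0x00); // U+001B opens the escape sequence
        out.write(0x1B);
        byte[] code = languageCode.getBytes(StandardCharsets.US_ASCII);
        out.write(code, 0, code.length); // one byte per letter, not two
        out.write(0x00); // U+001B closes the escape sequence
        out.write(0x1B);
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE); // no BOM is added here
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : withLanguage("en", "Hi")) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        // FE FF 00 1B 65 6E 00 1B 00 48 00 69
    }
}
```

Note that "en" occupies two bytes (0x65 0x6E) while each character of "Hi" occupies two bytes on its own, which is why the validator inspects the language code byte by byte.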

src/main/java/org/verapdf/cos/COSString.java

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

src/main/java/org/verapdf/cos/COSString.java (1)
222-236: Critical: Validation return values are ignored (issue not resolved from previous review).

The checkUTF16BEEscapeSequence and checkUTF8EscapeSequence methods return false when invalid escape sequences are detected, but isTextString ignores these return values and always returns true for BOM-prefixed strings. This completely defeats the purpose of the validation—strings with invalid language escape sequences are incorrectly classified as valid text strings.
🔎 Apply this diff to fix the issue:
     public boolean isTextString() {
         if (value.length > 2) {
             if ((value[0] & 0xFF) == 0xFE && (value[1] & 0xFF) == 0xFF) {
-                checkUTF16BEEscapeSequence(value);
-                return true;
+                return checkUTF16BEEscapeSequence(value);
             }
         }
         if (value.length > 3) {
             if ((value[0] & 0xFF) == 0xEF && (value[1] & 0xFF) == 0xBB && (value[2] & 0xFF) == 0xBF) {
-                checkUTF8EscapeSequence(value);
-                return true;
+                return checkUTF8EscapeSequence(value);
             }
         }
         return PDFDocEncoding.isPDFDocEncodingString(value);

🧹 Nitpick comments (3)

src/main/java/org/verapdf/as/warnings/StringWarnings.java (1)

3-7: Consider following utility class best practices.

The class contains only static constants but can still be instantiated or subclassed. Consider making the class final and adding a private constructor to prevent instantiation, following standard utility class patterns.
🔎 View suggested enhancement:
-public class StringWarnings {
+public final class StringWarnings {
+    private StringWarnings() {
+        // Prevent instantiation
+    }
+    
     public static final String NOT_ASCII_LETTER = "Text string language escape sequence contains not ASCII letter";
     public static final String INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH = "Text string language escape sequence has invalid length";
     public static final String NOT_SUPPORTED_UTF16LE_ENCODING = "String object uses encoding UTF16-LE not supported by PDF";
 }

src/main/java/org/verapdf/cos/COSString.java (2)

261-261: Add space after comma.

-                LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
+                LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);

292-292: Add space after comma.

-                LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
+                LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d90e11 and 656b2d3.

📒 Files selected for processing (2)

src/main/java/org/verapdf/as/warnings/StringWarnings.java (1 hunks)
src/main/java/org/verapdf/cos/COSString.java (3 hunks)

🔇 Additional comments (5)

src/main/java/org/verapdf/cos/COSString.java (5)

23-23: LGTM!

Import correctly added to reference the new warning constants.

115-115: LGTM!

Refactoring to use a centralized constant improves maintainability.

268-297: UTF-8 escape sequence validation logic looks correct.

The validation properly handles UTF-8 encoding where ASCII characters are single bytes. The pattern matching for 2 or 4 letter sequences followed by ESC markers is appropriate.

299-302: LGTM!

The helper method correctly identifies ASCII letters (A-Z, a-z) by checking the appropriate hex ranges.

238-266: Fix minor formatting issue: add space after comma in log statement (line 261)

Line 261 is missing a space after the comma in the log statement: LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH); should be LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);

Add warnings for Invalid language escape sequences in text strings

43aace0

LonelyMidoriya requested a review from MaximPlusov December 18, 2025 09:33

coderabbitai bot reviewed Dec 18, 2025

src/main/java/org/verapdf/cos/COSString.java (resolved)

LonelyMidoriya added 2 commits December 18, 2025 12:43

Update COSString.java

7d90e11

Add StringWarnings

656b2d3

coderabbitai bot reviewed Dec 18, 2025

MaximPlusov approved these changes Dec 18, 2025

MaximPlusov merged commit 08a131a into integration Dec 18, 2025
9 checks passed

MaximPlusov deleted the language-escape-sequence branch December 18, 2025 11:43

MaximPlusov merged 3 commits into integration from language-escape-sequence

2 participants
