port UTF8 string manipulation modules from bstrlib#142

Merged
rdmark merged 6 commits into main from port-utf8
Feb 25, 2026

@rdmark
Collaborator

@rdmark rdmark commented Feb 23, 2026

Port the UTF-8 string manipulation modules from bstrlib to this fork. Credits to Paul Hsieh.

utf8util is a standalone low-level module providing a forward iterator over UTF-8 byte sequences (utf8IteratorInit, utf8IteratorGetNextCodePoint, utf8IteratorGetCurrCodePoint, utf8ScanBackwardsForCodePoint) along with the cpUcs4/cpUcs2 type definitions and the isLegalUnicodeCodePoint macro.
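
For readers who have not worked with the iterator, here is a minimal standalone sketch of what forward UTF-8 decoding with error recovery involves. It is not the utf8util code: `cp_t` and `decode_next` are hypothetical names, and the recovery policy (consume one byte and return a caller-supplied error value) is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a UCS-4 code point type. */
typedef uint32_t cp_t;

/*
 * Decode one code point starting at buf[*pos] and advance *pos.
 * On malformed input, consume exactly one byte and return err_ch so the
 * caller can resynchronize. At end of input, return err_ch without advancing.
 */
static cp_t decode_next(const unsigned char *buf, size_t len, size_t *pos,
                        cp_t err_ch)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const cp_t min_for_len[5] = {0, 0, 0x80U, 0x800U, 0x10000U};
    size_t i = *pos;
    size_t need, k;
    unsigned char b0;
    cp_t cp;

    if (i >= len) return err_ch;
    b0 = buf[i];
    if (b0 < 0x80U) { *pos = i + 1; return b0; }            /* ASCII */

    if ((b0 & 0xE0U) == 0xC0U)      { need = 2; cp = b0 & 0x1FU; }
    else if ((b0 & 0xF0U) == 0xE0U) { need = 3; cp = b0 & 0x0FU; }
    else if ((b0 & 0xF8U) == 0xF0U) { need = 4; cp = b0 & 0x07U; }
    else { *pos = i + 1; return err_ch; }     /* stray continuation or bad lead */

    /* Bounds check before reading continuation bytes (truncated input). */
    if (need > len - i) { *pos = i + 1; return err_ch; }

    for (k = 1; k < need; k++) {
        if ((buf[i + k] & 0xC0U) != 0x80U) { *pos = i + 1; return err_ch; }
        cp = (cp << 6) | (cp_t)(buf[i + k] & 0x3FU);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min_for_len[need] || cp > 0x10FFFFU ||
        (cp >= 0xD800U && cp <= 0xDFFFU)) {
        *pos = i + 1;
        return err_ch;
    }

    *pos = i + need;
    return cp;
}
```

Calling `decode_next` in a loop until `*pos == len` walks the buffer one code point at a time; each malformed byte yields one `err_ch`.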

buniutil builds on top of it and bstrlib to provide four higher-level functions: buIsUTF8Content, buAppendBlkUcs4, buGetBlkUTF16, and buAppendBlkUTF16. Both modules are compiled into the main libbstring binary, enabled by default and controlled by the new enable-utf8 build option.
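
As background for the UTF-16 functions, the sketch below shows the surrogate pair arithmetic that any UCS-4 to UTF-16 conversion has to perform. It is a standalone illustration, not the buniutil implementation; `utf16_encode` and `utf16_decode_pair` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is a
 * surrogate value or lies above U+10FFFF.
 */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFFU || (cp >= 0xD800U && cp <= 0xDFFFU)) return 0;
    if (cp < 0x10000U) {
        out[0] = (uint16_t)cp;                  /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000U;                             /* 20 bits remain */
    out[0] = (uint16_t)(0xD800U | (cp >> 10));  /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00U | (cp & 0x3FFU)); /* low surrogate: low 10 bits */
    return 2;
}

/*
 * Combine a high/low surrogate pair into a code point.
 * Returns 0 on success, -1 if either unit is outside its surrogate range.
 */
static int utf16_decode_pair(uint16_t hi, uint16_t lo, uint32_t *out)
{
    if (out == NULL) return -1;
    if (hi < 0xD800U || hi > 0xDBFFU) return -1;
    if (lo < 0xDC00U || lo > 0xDFFFU) return -1;
    *out = 0x10000U + (((uint32_t)hi - 0xD800U) << 10)
                    + ((uint32_t)lo - 0xDC00U);
    return 0;
}
```

For example, U+1F600 encodes to the pair 0xD83D 0xDE00, and decoding that pair returns the original value.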

Two adaptations were made to fit bstring's conventions: const_bstring was replaced with const bstring throughout (bstring dropped that typedef), and BSTR_PUBLIC visibility attributes were added to all public declarations.
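
For context on the second adaptation, a visibility macro of this kind is commonly defined along the following lines. This is a generic sketch, not the definition of BSTR_PUBLIC from this repository; `DEMO_PUBLIC` and `demo_is_ascii` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical export macro; the real BSTR_PUBLIC may be defined differently. */
#if defined(_WIN32)
#  define DEMO_PUBLIC __declspec(dllexport)
#elif defined(__GNUC__) && __GNUC__ >= 4
#  define DEMO_PUBLIC __attribute__((visibility("default")))
#else
#  define DEMO_PUBLIC
#endif

/* Public declaration, as it would appear in a header. */
DEMO_PUBLIC int demo_is_ascii(const unsigned char *data, int len);

/* Returns 1 if every byte is below 0x80, 0 otherwise, -1 on bad arguments. */
DEMO_PUBLIC int demo_is_ascii(const unsigned char *data, int len)
{
    int i;

    if (data == NULL || len < 0) return -1;
    for (i = 0; i < len; i++) {
        if (data[i] >= 0x80U) return 0;
    }
    return 1;
}
```

Marking only the public declarations this way lets a shared library be built with hidden default visibility while still exporting its API.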

A new test module tests/testutf8.c was written from scratch, covering the full API surface including ASCII and multi-byte iteration, error recovery, surrogate pair encoding/decoding, BOM handling, and null/invalid-argument guards.
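
To make the BOM case concrete, here is a small standalone sketch of how a byte order mark at the start of a UTF-16 block can be interpreted. It does not reproduce the library's logic or the tests in tests/testutf8.c; `utf16_bom_order` and `swap16` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Inspect the first 16-bit unit of a UTF-16 block.
 * Returns  1 if it is a BOM in native byte order (0xFEFF),
 *         -1 if it is a byte-swapped BOM (0xFFFE),
 *          0 if the block is empty, NULL, or does not start with a BOM.
 */
static int utf16_bom_order(const uint16_t *units, size_t n)
{
    if (units == NULL || n == 0) return 0;
    if (units[0] == 0xFEFFU) return 1;
    if (units[0] == 0xFFFEU) return -1;
    return 0;
}

/* Swap the two bytes of a 16-bit unit. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)(((unsigned)v << 8) | ((unsigned)v >> 8));
}
```

When the result is -1, every following unit needs `swap16` applied before it is interpreted as a code unit.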

Compared to the original code by Paul Hsieh, the following additional improvements have been made:

  • use uppercase integer literal notation to prevent ambiguity
  • return when encountering invalid continuation bytes in utf8ScanBackwardsForCodePoint
  • fix a truncation bounds check bug and use a flag for error handling in the utf8Iterator* functions
  • refactor surrogate substitution with flatter control flow
  • address a handful of static analysis issues flagged by SonarQube
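
To illustrate the backward-scan item in the list above, here is a minimal standalone sketch of locating the lead byte of the code point that contains a given position, bailing out early when the byte structure is invalid. It is not the code from this PR: `scan_back_for_lead` is a hypothetical function, and it only checks structure up to the given position (it does not validate bytes after it or reject overlong forms).

```c
#include <stddef.h>

/*
 * Return the index of the lead byte of the code point containing buf[pos],
 * or -1 if the bytes at and before pos do not form a valid prefix of a
 * UTF-8 sequence that reaches pos.
 */
static long scan_back_for_lead(const unsigned char *buf, size_t len, size_t pos)
{
    size_t i, steps;

    if (buf == NULL || pos >= len) return -1;
    if (buf[pos] < 0x80U) return (long)pos;     /* ASCII starts itself */

    i = pos;
    for (steps = 0; steps < 4; steps++) {       /* a sequence has at most 4 bytes */
        unsigned char b = buf[i];

        if ((b & 0xC0U) != 0x80U) {
            /* Not a continuation byte: it must be a lead byte whose
             * sequence is long enough to cover pos. */
            size_t need;

            if ((b & 0xE0U) == 0xC0U)      need = 2;
            else if ((b & 0xF0U) == 0xE0U) need = 3;
            else if ((b & 0xF8U) == 0xF0U) need = 4;
            else return -1;                 /* ASCII or invalid lead byte */

            return (pos - i < need) ? (long)i : -1;
        }
        if (i == 0) return -1;              /* ran off the start of the buffer */
        i--;
    }
    return -1;                              /* too many continuation bytes */
}
```

With the buffer `61 E2 82 AC 62`, positions 1, 2 and 3 all map back to index 1, while a continuation byte that follows an ASCII byte yields -1.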

@rdmark rdmark requested a review from msteinert as a code owner February 23, 2026 22:17
@github-actions

github-actions bot commented Feb 23, 2026

File                 Coverage  Lines  Branches
All files            68%       73%    62%
bstring/bstraux.c    55%       64%    46%
bstring/bstrlib.c    73%       76%    69%
bstring/buniutil.c   80%       86%    73%
bstring/utf8util.c   60%       71%    50%

Minimum allowed coverage is 50%

Generated by 🐒 cobertura-action against 01c0a19

@msteinert
Owner

LGTM

@rdmark
Collaborator Author

rdmark commented Feb 24, 2026

Thanks for the review, Mike. I plan to squash a few more of the static analysis issues before merging.

@rdmark
Collaborator Author

rdmark commented Feb 24, 2026

I am also toying with the idea of adding a build system flag for the UTF8 feature. It would be enabled by default, but a leaner library could be built without it:

meson setup build -Denable-utf8=false

@rdmark rdmark force-pushed the port-utf8 branch 2 times, most recently from cd14867 to 83372fd February 24, 2026 19:20
@rdmark
Collaborator Author

rdmark commented Feb 24, 2026

I'm happy with the static analysis situation now. What remains are cognitive complexity and nested control structure findings, plus a handful of spurious errors caused by SonarQube not understanding libcheck's START_TEST macros.

I refactored a few more of the tangled error-handling paths that used goto to jump between blocks. This uncovered a couple of corner-case bugs, which are now fixed and covered by new unit tests.
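
As an illustration of the general idea (not the actual diff), a goto-based error path can often be flattened into a single status flag with one exit point. The function below is a hypothetical example, `check_continuations`, that validates a run of UTF-8 continuation bytes this way.

```c
#include <stddef.h>

/*
 * Check that buf[start .. start+expected-1] are all UTF-8 continuation
 * bytes (10xxxxxx). Returns 0 on success, -1 on any failure.
 * A single flag replaces the jumps to a shared error label.
 */
static int check_continuations(const unsigned char *buf, size_t len,
                               size_t start, size_t expected)
{
    size_t k;
    /* Argument and bounds checks fold into the initial flag value;
     * start <= len is tested first so len - start cannot wrap. */
    int ok = (buf != NULL && start <= len && expected <= len - start);

    /* The loop stops as soon as the flag is cleared. */
    for (k = 0; ok && k < expected; k++) {
        ok = ((buf[start + k] & 0xC0U) == 0x80U);
    }

    return ok ? 0 : -1;
}
```

Because every failure funnels through the same flag, each condition can be unit tested on its own, which is the kind of coverage that tends to expose corner cases.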

@rdmark rdmark changed the title from "port UTF8 data handling modules from bstrlib" to "port UTF8 string manipulation modules from bstrlib" Feb 24, 2026
@rdmark rdmark force-pushed the port-utf8 branch 3 times, most recently from 127e1fd to 24f9f6d February 25, 2026 05:36

@rdmark rdmark merged commit fe97cc9 into main Feb 25, 2026
19 of 20 checks passed
@rdmark rdmark deleted the port-utf8 branch February 25, 2026 21:34