Batch concat_ws calls to support tables with 100+ columns #80

mason-sharp merged 1 commit into main from
Conversation
PostgreSQL limits functions to 100 arguments. The row-hashing expression
used concat_ws('|', col1, ..., colN), which fails on wide tables. Split
column expressions into groups of 99 with nested concat_ws calls. This
affects both table-diff and merkle tree hash queries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

Added batching logic for PostgreSQL concat_ws function calls to handle the 100-argument limit. Introduced a

Changes
🧹 Nitpick comments (3)
db/queries/queries.go (1)
583-599: Document the single-depth nesting assumption and guard against empty input.

Two minor gaps worth addressing:

Single-depth nesting: The outer `concat_ws` itself has no overflow protection: if `len(batches)` ever exceeded 99, it would hit the same PostgreSQL 100-arg limit. In practice this can't happen because PostgreSQL's column limit (~1600) means at most `⌈1600/99⌉ = 17` batches, but the assumption is implicit. A brief comment would prevent confusion if this function is ever reused outside its current context.

Empty-slice precondition: `concatWSBatched([]string{})` emits `concat_ws('|', )`, which is invalid SQL. The only call site (`buildRowTextExpr`) guards this with an early return, but the function itself carries no documented precondition.

♻️ Proposed improvements

```diff
 // concatWSBatched produces a concat_ws('|', ...) expression. When len(exprs)
 // exceeds 99 (the max value-arguments per concat_ws call, since the separator
 // takes one slot), it splits the expressions into batches and nests the calls.
+//
+// Only one level of nesting is applied. This is safe because PostgreSQL's
+// maximum column count (~1600) means at most ceil(1600/99)=17 inner batches,
+// keeping the outer concat_ws well within the 100-argument limit.
+//
+// Precondition: exprs must not be empty.
 func concatWSBatched(exprs []string) string {
 	const maxArgs = 99 // 100 total - 1 for the separator
+	if len(exprs) == 0 {
+		return "''"
+	}
+
 	if len(exprs) <= maxArgs {
 		return fmt.Sprintf("concat_ws('|', %s)", strings.Join(exprs, ", "))
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@db/queries/queries.go` around lines 583 - 599: The concatWSBatched function needs a short doc comment stating it assumes only single-depth nesting (the outer concat_ws is not batched and therefore must not exceed PostgreSQL's ~100-arg limit in this usage) and must guard against an empty input slice; update concatWSBatched to return an empty string immediately when len(exprs) == 0 (or otherwise no-op) to avoid emitting invalid SQL like "concat_ws('|', )", and add the comment near the const maxArgs / batches logic referencing concatWSBatched and batches so future readers understand the single-depth assumption.

db/queries/queries_test.go (2)
457-468: Add expression-presence check to the 200-expression subtest.

Unlike the 100-expression subtest, this case only verifies the number of `concat_ws` calls, not that all 200 expressions appear in the output. A bug that drops the third batch entirely would still pass.

♻️ Proposed addition

```diff
 	if strings.Count(result, "concat_ws(") != 4 {
 		t.Errorf("expected 4 concat_ws calls for 200 exprs, got %d in: %s",
 			strings.Count(result, "concat_ws("), result)
 	}
+	for _, e := range exprs {
+		if !strings.Contains(result, e) {
+			t.Errorf("missing expression %s in result", e)
+		}
+	}
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@db/queries/queries_test.go` around lines 457 - 468: Update the "200 expressions nests into 3 inner batches" subtest to also assert that all expressions are present in the result: after calling concatWSBatched(exprs) iterate the expected expression names (e.g., "col0".."col199") and assert each appears in result (or assert the total occurrences of "col" labels equals 200). Modify the test that calls concatWSBatched to include this presence check so a dropped batch would fail; reference the concatWSBatched call and the local exprs slice when adding the assertions.

449-455: Substring collision weakens the expression-presence check.

Column names `col0`–`col99` include prefix collisions: `"col1"` is a substring of `"col10"`–`"col19"`, so `strings.Contains(result, "col1")` passes even if `col1` is absent but any of `col10`–`col19` is present. The test will not catch a regression where a specific column is silently dropped.

♻️ More precise presence check

```diff
-	// Every expression should be present
-	for _, e := range exprs {
-		if !strings.Contains(result, e) {
-			t.Errorf("missing expression %s in result", e)
-		}
-	}
+	// Every expression should be present — check for the exact token
+	// (comma or parenthesis delimited) to avoid substring false-positives.
+	for _, e := range exprs {
+		// A column appears as ", colN" or "(colN" inside the concat_ws calls.
+		if !strings.Contains(result, ", "+e) && !strings.Contains(result, "("+e) {
+			t.Errorf("missing expression %s in result: %s", e, result)
+		}
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@db/queries/queries_test.go` around lines 449 - 455, The check using strings.Contains(result, e) is vulnerable to substring collisions (e.g., "col1" matching "col10"); update the loop that iterates over exprs and checks result to use a word-boundary regex match instead: compile regexp.MustCompile(`\b` + regexp.QuoteMeta(e) + `\b`) and use its MatchString(result) in place of strings.Contains; ensure you add the regexp import and use exprs/result as before so each column name is matched as a whole token.