Skip to content

SPOC-311: Fix inconsistent LSN tracking during subscription SYNC.#351

Open
ibrarahmad wants to merge 3 commits intopgEdge:mainfrom
ibrarahmad:SPOC-311
Open

SPOC-311: Fix inconsistent LSN tracking during subscription SYNC.#351
ibrarahmad wants to merge 3 commits intopgEdge:mainfrom
ibrarahmad:SPOC-311

Conversation

@ibrarahmad
Copy link
Contributor

Move adjust_progress_info() after ensure_replication_slot_snapshot() and execute it within the same exported snapshot transaction, so that the progress LSNs are consistent with the data included in the COPY.

Move adjust_progress_info() after ensure_replication_slot_snapshot() and
execute it within the same exported snapshot transaction, so that the
progress LSNs are consistent with the data included in the COPY.
@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

Progress read during initial subscription sync is moved to occur after creating the COPY snapshot and after starting a dedicated copy transaction; version-conditional injection-point markers were added; origin connection rollback and PQfinish handling were introduced; a new TAP test verifies behavior across a 3-node cluster.

Changes

Cohort / File(s) Summary
Sync logic
src/spock_sync.c
Conditionally include injection_point.h for PG >= 170000; move progress read to after COPY snapshot creation and after starting a copy transaction; add injection-point markers (version-conditional for 170000+/180000+); start a copy-origin transaction, perform a rollback on origin_conn with warning on failure, and ensure PQfinish is called.
Test schedule & counts
tests/tap/schedule, tests/tap/t/011_zodan_sync_third.pl
Insert new test 018_inject_sync_progress into the test schedule and adjust the Test::More plan count in 011_zodan_sync_third.pl to reflect the added test.
New injection-point test
tests/tap/t/018_inject_sync_progress.pl
Add TAP script that requires injection-point support, sets up a 3-node cluster, injects a pause between snapshot creation and progress read on one node, performs updates on another node while paused, resumes the paused node, advances the replication slot using observed progress LSN, and verifies LSN-consistent progress and data equality across nodes.

Poem

🐇
I paused the copy mid-hop and peered,
read progress when the snapshot appeared.
If rollback stumbles, I thump and warn,
three nodes sync up at the break of dawn.
Carrots, commits, and a tidy cheer! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: fixing inconsistent LSN tracking during subscription synchronization, which matches the core objective of making progress LSNs consistent with COPY data.
Description check ✅ Passed The description directly addresses the changeset by explaining the key modification: moving adjust_progress_info() after ensure_replication_slot_snapshot() to ensure consistency between progress LSNs and COPY data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
src/spock_sync.c (1)

1223-1234: Consider reusing finish_copy_origin_tx() to avoid duplicating the ROLLBACK + PQfinish pattern.

finish_copy_origin_tx() (line 636) already performs ROLLBACK + PQfinish, which is exactly what lines 1225–1234 do manually. Replacing this block with a single call reduces duplication and keeps the cleanup pattern consistent.

♻️ Suggested simplification
 		start_copy_origin_tx(origin_conn, snapshot);
 		progress_entries_list = adjust_progress_info(origin_conn);
-		{
-			PGresult   *res = PQexec(origin_conn, "ROLLBACK");
-
-			if (PQresultStatus(res) != PGRES_COMMAND_OK)
-				elog(WARNING, "ROLLBACK on origin node failed: %s",
-					 PQresultErrorMessage(res));
-			PQclear(res);
-		}
-
-		PQfinish(origin_conn);
+		finish_copy_origin_tx(origin_conn);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_sync.c` around lines 1223 - 1234, The block after calling
start_copy_origin_tx(origin_conn, snapshot) that sets progress_entries_list =
adjust_progress_info(origin_conn) then manually runs ROLLBACK + PQfinish should
be replaced by a single call to finish_copy_origin_tx(origin_conn) to reuse the
existing cleanup sequence; ensure you still call
adjust_progress_info(origin_conn) to set progress_entries_list before invoking
finish_copy_origin_tx(), and remove the manual PQexec("ROLLBACK") / PQfinish()
lines to avoid duplicating the cleanup logic implemented by
finish_copy_origin_tx().
tests/tap/t/018_inject_sync_progress.pl (1)

64-64: sleep 10 is a fragile timing assumption.

A fixed 10-second sleep to wait for the sync worker to reach the injection point is race-prone: on a slow CI machine it may not be enough, and on a fast machine it wastes time. Consider polling for evidence that the injection point has been reached (e.g., checking pg_stat_activity or an injection-point status function) with a bounded timeout instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_inject_sync_progress.pl` at line 64, Replace the fragile
fixed delay "sleep 10" with a bounded polling loop that checks for the sync
worker having reached the injection point instead of sleeping: repeatedly query
a reliable signal (e.g., SELECT count(*) FROM pg_stat_activity WHERE query ILIKE
'%<expected sync query snippet>%' or call your injection-point helper like
injection_point_status() or a test-only function) with a short interval and a
timeout (e.g., poll every 200–500ms up to N seconds), and proceed only once the
query indicates the injection point is reached or fail the test after the
timeout; update the test's "sleep 10" site to use this polling pattern so it is
robust and faster on CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/spock_sync.c`:
- Line 1212: The injection point name is misleading because
INJECTION_POINT("spock-before-replication-slot-snapshot", "wait") is invoked
after ensure_replication_slot_snapshot() returns; rename the injection string to
a name that reflects it fires after snapshot creation (for example
"spock-after-replication-slot-snapshot" or "spock-before-sync-progress-read")
and update any tests or callers that reference the old name so they use the new
identifier; ensure the INJECTION_POINT invocation and any matching test fixtures
(e.g., those expecting the delay between snapshot creation and progress read)
remain consistent.
- Line 61: Add a PG version guard around the unguarded include of
utils/injection_point.h and the INJECTION_POINT(...) usage: wrap the `#include`
"utils/injection_point.h" with `#if` PG_VERSION_NUM >= 170000 ... `#endif`, and wrap
the INJECTION_POINT(...) call (the macro invocation found in src/spock_sync.c)
with the same `#if/`#endif so the header and macro are only used when
PG_VERSION_NUM >= 170000; keep the guard pattern consistent with other guards in
the codebase.

In `@tests/tap/t/011_zodan_sync_third.pl`:
- Line 3: The test plan declared as "use Test::More tests => 30;" does not match
the actual number of assertions produced by this file (only ~24 from explicit
ok() calls plus helper-created assertions in create_cluster, cross_wire, and
destroy_cluster); fix by either adjusting the Test::More plan to the correct
total (count explicit ok() calls plus the assertions added by create_cluster,
cross_wire, and destroy_cluster) or add the missing assertions to reach 30
(e.g., add explicit ok() checks or additional expectations) and ensure the
symbols to inspect are the Test::More plan line and the helper functions
create_cluster, cross_wire, and destroy_cluster.

In `@tests/tap/t/018_inject_sync_progress.pl`:
- Around line 58-59: The test attaches the wrong injection point name
'spock-before-replication-slot-snapshot' in psql_or_bail; update the attached
name to the post-snapshot injection point used in spock_sync.c (e.g. replace
with 'spock-after-replication-slot-snapshot') so the test delays after the
replication slot snapshot is created and matches the C code behavior.

---

Nitpick comments:
In `@src/spock_sync.c`:
- Around line 1223-1234: The block after calling
start_copy_origin_tx(origin_conn, snapshot) that sets progress_entries_list =
adjust_progress_info(origin_conn) then manually runs ROLLBACK + PQfinish should
be replaced by a single call to finish_copy_origin_tx(origin_conn) to reuse the
existing cleanup sequence; ensure you still call
adjust_progress_info(origin_conn) to set progress_entries_list before invoking
finish_copy_origin_tx(), and remove the manual PQexec("ROLLBACK") / PQfinish()
lines to avoid duplicating the cleanup logic implemented by
finish_copy_origin_tx().

In `@tests/tap/t/018_inject_sync_progress.pl`:
- Line 64: Replace the fragile fixed delay "sleep 10" with a bounded polling
loop that checks for the sync worker having reached the injection point instead
of sleeping: repeatedly query a reliable signal (e.g., SELECT count(*) FROM
pg_stat_activity WHERE query ILIKE '%<expected sync query snippet>%' or call
your injection-point helper like injection_point_status() or a test-only
function) with a short interval and a timeout (e.g., poll every 200–500ms up to
N seconds), and proceed only once the query indicates the injection point is
reached or fail the test after the timeout; update the test's "sleep 10" site to
use this polling pattern so it is robust and faster on CI.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/tap/t/018_inject_sync_progress.pl (1)

4-4: IPC::Run is imported but not directly used.

IPC::Run is loaded at line 4 but no function from it (run, start, harness, etc.) appears in this file. If it is only needed transitively by SpockTest, the use IPC::Run here is dead code. Remove it to keep imports honest.

♻️ Proposed fix
-use IPC::Run;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_inject_sync_progress.pl` at line 4, The file imports IPC::Run
via the statement "use IPC::Run;" but no symbols (run, start, harness, etc.)
from IPC::Run are used; remove the "use IPC::Run;" line from
tests/tap/t/018_inject_sync_progress.pl unless IPC::Run is required transitively
by SpockTest or other test helpers—if it is required, replace the direct import
with a comment explaining why or move the dependency to the module that actually
needs it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/spock_sync.c`:
- Around line 1229-1240: The inline ROLLBACK + PQfinish pattern after
start_copy_origin_tx can leak origin_conn if start_copy_origin_tx calls
elog(ERROR); replace that inline block with a call to the existing
finish_copy_origin_tx(origin_conn, snapshot or relevant state) to perform both
rollback and PQfinish, and ensure the call site is protected by PG_TRY/PG_CATCH
(or ensure the PG_ENSURE_ERROR_CLEANUP handler calls PQfinish) so origin_conn is
always PQfinished even when start_copy_origin_tx throws; update
spock_sync_worker_cleanup_error_cb if needed to avoid duplicating cleanup so
finish_copy_origin_tx is the single responsibility for ending the remote
transaction and closing origin_conn.

In `@tests/tap/t/018_inject_sync_progress.pl`:
- Line 64: Replace the fragile sleep 10 with a polling wait that verifies the
sync worker is actually paused at the injection point before running the DML:
after calling sub_create, poll pg_stat_activity (using scalar_query) for
wait_event_type = 'InjectionPoint' and wait_event =
'spock-before-sync-progress-read' (or call injection_points_wait_until_attached
if that helper exists) with a sensible timeout (e.g. 60s) and die on timeout;
only proceed to the UPDATE/injection_points_wakeup once the poll reports the
worker is attached.
- Around line 105-107: The test creates a 3-node cluster with create_cluster(3,
...) but never tears it down; add a call to destroy_cluster with the same
cluster identifier (e.g., destroy_cluster(3, ...) or destroy_cluster($cluster)
matching how create_cluster was invoked) after the final assertions (after the
ok(...) lines) to stop node processes and remove temp directories so the test
cleans up after itself.

---

Duplicate comments:
In `@src/spock_sync.c`:
- Around line 61-63: The version-guarding of the injection_point.h include is
correct; keep the conditional compilation around the include (the `#if`
PG_VERSION_NUM >= 170000 ... `#endif` block) so that injection_point.h is only
included for PG_VERSION_NUM >= 170000, and ensure the symbol injection_point.h
and the macro PG_VERSION_NUM are used exactly as in src/spock_sync.c to preserve
build compatibility with PG 15/16.

In `@tests/tap/t/018_inject_sync_progress.pl`:
- Around line 58-59: The injection point name mismatch has been resolved: ensure
the call to injection_points_attach in tests/tap/t/018_inject_sync_progress.pl
uses the exact string 'spock-before-sync-progress-read' to match the C-level
INJECTION_POINT (used in INJECTION_POINT(...) or injection point macro) and that
no other occurrences use the old name; no functional code change required beyond
keeping injection_points_attach and the C INJECTION_POINT in sync.

---

Nitpick comments:
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Line 4: The file imports IPC::Run via the statement "use IPC::Run;" but no
symbols (run, start, harness, etc.) from IPC::Run are used; remove the "use
IPC::Run;" line from tests/tap/t/018_inject_sync_progress.pl unless IPC::Run is
required transitively by SpockTest or other test helpers—if it is required,
replace the direct import with a comment explaining why or move the dependency
to the module that actually needs it.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Around line 66-74: The SQL check in the polling loop uses the wrong column:
change the scalar_query call inside the while loop (the SELECT that currently
tests wait_event = 'InjectionPoint') to instead check wait_event_type =
'InjectionPoint' and wait_event = 'spock-before-sync-progress-read' so the loop
actually detects the backend blocked at the named injection point; keep the loop
structure and $waited variable as-is and only update the WHERE clause in the
scalar_query invocation so ok($waited < 60, ...) behaves as intended.
- Line 24: The test plan declares 15 tests but only yields 9 assertions; update
the Plan or add assertions: either change the "plan tests => 15;" line to "plan
tests => 9;" to match the actual assertions, or add six additional assertions
(e.g., more ok() calls or explicit pass() checks) so the total equals 15; ensure
any added assertions are placed where appropriate relative to the existing ok()
calls and helper-generated assertions from create_cluster(), cross_wire(), and
destroy_cluster().

plan skip_all => 'Injection points not supported by this build';
}

plan tests => 15;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Count ok() / pass/fail assertions emitted by SpockTest helpers used in this file,
# and verify the total matches `plan tests => 15`.

# 1. How many ok() calls in this test file itself
echo "=== Explicit ok() calls in 018_inject_sync_progress.pl ==="
rg -n '\bok\s*\(' tests/tap/t/018_inject_sync_progress.pl

# 2. Find SpockTest module
echo -e "\n=== Finding SpockTest module ==="
fd -e pm | grep -i spocktest

# 3. Check what assertions SpockTest helpers emit
echo -e "\n=== Assertions inside create_cluster ==="
rg -n -A10 'sub create_cluster' tests/tap/SpockTest.pm || echo "Not found in default location"

echo -e "\n=== Assertions inside cross_wire ==="
rg -n -A10 'sub cross_wire' tests/tap/SpockTest.pm || echo "Not found in default location"

echo -e "\n=== Assertions inside destroy_cluster ==="
rg -n -A10 'sub destroy_cluster' tests/tap/SpockTest.pm || echo "Not found in default location"

echo -e "\n=== Total ok/pass/fail/is calls in SpockTest.pm ==="
rg -c '\b(?:ok|pass|fail|is|like|unlike|cmp_ok)\s*\(' tests/tap/SpockTest.pm || echo "File not found"

Repository: pgEdge/spock

Length of output: 1149


🏁 Script executed:

#!/bin/bash
# Check the SpockTest helpers at the correct path
echo "=== Assertions inside create_cluster ==="
rg -n -A15 'sub create_cluster' tests/tap/t/SpockTest.pm

echo -e "\n=== Assertions inside cross_wire ==="
rg -n -A15 'sub cross_wire' tests/tap/t/SpockTest.pm

echo -e "\n=== Assertions inside destroy_cluster ==="
rg -n -A15 'sub destroy_cluster' tests/tap/t/SpockTest.pm

echo -e "\n=== Total assertion calls in SpockTest.pm ==="
rg '\b(?:ok|pass|fail|is|like|unlike|cmp_ok)\s*\(' tests/tap/t/SpockTest.pm | wc -l

Repository: pgEdge/spock

Length of output: 3174


🏁 Script executed:

#!/bin/bash
# Check which SpockTest helpers are called in 018_inject_sync_progress.pl
echo "=== Calls to SpockTest helpers in 018_inject_sync_progress.pl ==="
rg -n 'create_cluster|cross_wire|destroy_cluster' tests/tap/t/018_inject_sync_progress.pl

# Get full content of 018_inject_sync_progress.pl to understand structure
echo -e "\n=== Full content of 018_inject_sync_progress.pl ==="
cat tests/tap/t/018_inject_sync_progress.pl

# Check all ok() calls in SpockTest.pm with context
echo -e "\n=== All assertion calls in SpockTest.pm with context ==="
rg -n -B2 -A2 '\b(?:ok|pass|fail)\s*\(' tests/tap/t/SpockTest.pm

Repository: pgEdge/spock

Length of output: 6974


Fix plan count mismatch.

The plan declares 15 tests, but only 9 assertions are generated:

  • 6 explicit ok() calls (lines 74, 108, 109, 115, 116, 117)
  • 3 pass() calls from SpockTest helpers:
    • create_cluster() emits 1 assertion
    • cross_wire() emits 1 assertion
    • destroy_cluster() emits 1 assertion

Update plan tests => 15; to plan tests => 9; or add 6 more assertions to match the declared count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_inject_sync_progress.pl` at line 24, The test plan declares
15 tests but only yields 9 assertions; update the Plan or add assertions: either
change the "plan tests => 15;" line to "plan tests => 9;" to match the actual
assertions, or add six additional assertions (e.g., more ok() calls or explicit
pass() checks) so the total equals 15; ensure any added assertions are placed
where appropriate relative to the existing ok() calls and helper-generated
assertions from create_cluster(), cross_wire(), and destroy_cluster().

Comment on lines +66 to +74
my $waited = 0;
while ($waited < 60) {
my $hit = scalar_query(3,
"SELECT 1 FROM pg_stat_activity WHERE wait_event = 'InjectionPoint' AND datname = current_database()");
last if ($hit eq '1');
sleep 1;
$waited++;
}
ok($waited < 60, "Sync worker reached injection point within 60s");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

wait_event = 'InjectionPoint' is the wrong column — polling loop will always time out.

In pg_stat_activity, wait_event_type is the wait category and wait_event is the specific reason a client is waiting for. When a backend is blocked at an injection point named spock-before-sync-progress-read, PostgreSQL sets wait_event_type = 'InjectionPoint' and wait_event = 'spock-before-sync-progress-read'. The condition on line 69 tests wait_event = 'InjectionPoint', which will never match any injection-point wait — it would only match if there were a point literally named 'InjectionPoint'. As a result, $waited always reaches 60, so:

  1. ok($waited < 60, ...) at line 74 always emits not ok.
  2. The 60-second sleep lets the sync worker advance past the injection point before the UPDATE on line 80, making the test vacuous.
🐛 Proposed fix
-    my $hit = scalar_query(3,
-        "SELECT 1 FROM pg_stat_activity WHERE wait_event = 'InjectionPoint' AND datname = current_database()");
+    my $hit = scalar_query(3,
+        "SELECT 1 FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'"
+        . " AND wait_event = 'spock-before-sync-progress-read'"
+        . " AND datname = current_database()");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
my $waited = 0;
while ($waited < 60) {
my $hit = scalar_query(3,
"SELECT 1 FROM pg_stat_activity WHERE wait_event = 'InjectionPoint' AND datname = current_database()");
last if ($hit eq '1');
sleep 1;
$waited++;
}
ok($waited < 60, "Sync worker reached injection point within 60s");
my $waited = 0;
while ($waited < 60) {
my $hit = scalar_query(3,
"SELECT 1 FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'"
. " AND wait_event = 'spock-before-sync-progress-read'"
. " AND datname = current_database()");
last if ($hit eq '1');
sleep 1;
$waited++;
}
ok($waited < 60, "Sync worker reached injection point within 60s");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_inject_sync_progress.pl` around lines 66 - 74, The SQL check
in the polling loop uses the wrong column: change the scalar_query call inside
the while loop (the SELECT that currently tests wait_event = 'InjectionPoint')
to instead check wait_event_type = 'InjectionPoint' and wait_event =
'spock-before-sync-progress-read' so the loop actually detects the backend
blocked at the named injection point; keep the loop structure and $waited
variable as-is and only update the WHERE clause in the scalar_query invocation
so ok($waited < 60, ...) behaves as intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant