SPOC-311: Fix inconsistent LSN tracking during subscription SYNC.#351
SPOC-311: Fix inconsistent LSN tracking during subscription SYNC.#351ibrarahmad wants to merge 3 commits intopgEdge:mainfrom
Conversation
Move adjust_progress_info() after ensure_replication_slot_snapshot() and execute it within the same exported snapshot transaction, so that the progress LSNs are consistent with the data included in the COPY.
📝 WalkthroughWalkthroughProgress read during initial subscription sync is moved to occur after creating the COPY snapshot and after starting a dedicated copy transaction; version-conditional injection-point markers were added; origin connection rollback and PQfinish handling were introduced; a new TAP test verifies behavior across a 3-node cluster. Changes
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
src/spock_sync.c (1)
1223-1234: Consider reusingfinish_copy_origin_tx()to avoid duplicating the ROLLBACK + PQfinish pattern.
finish_copy_origin_tx()(line 636) already performsROLLBACK+PQfinish, which is exactly what lines 1225–1234 do manually. Replacing this block with a single call reduces duplication and keeps the cleanup pattern consistent.♻️ Suggested simplification
start_copy_origin_tx(origin_conn, snapshot); progress_entries_list = adjust_progress_info(origin_conn); - { - PGresult *res = PQexec(origin_conn, "ROLLBACK"); - - if (PQresultStatus(res) != PGRES_COMMAND_OK) - elog(WARNING, "ROLLBACK on origin node failed: %s", - PQresultErrorMessage(res)); - PQclear(res); - } - - PQfinish(origin_conn); + finish_copy_origin_tx(origin_conn);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/spock_sync.c` around lines 1223 - 1234, The block after calling start_copy_origin_tx(origin_conn, snapshot) that sets progress_entries_list = adjust_progress_info(origin_conn) then manually runs ROLLBACK + PQfinish should be replaced by a single call to finish_copy_origin_tx(origin_conn) to reuse the existing cleanup sequence; ensure you still call adjust_progress_info(origin_conn) to set progress_entries_list before invoking finish_copy_origin_tx(), and remove the manual PQexec("ROLLBACK") / PQfinish() lines to avoid duplicating the cleanup logic implemented by finish_copy_origin_tx().tests/tap/t/018_inject_sync_progress.pl (1)
64-64:sleep 10is a fragile timing assumption.A fixed 10-second sleep to wait for the sync worker to reach the injection point is race-prone: on a slow CI machine it may not be enough, and on a fast machine it wastes time. Consider polling for evidence that the injection point has been reached (e.g., checking
pg_stat_activityor an injection-point status function) with a bounded timeout instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/tap/t/018_inject_sync_progress.pl` at line 64, Replace the fragile fixed delay "sleep 10" with a bounded polling loop that checks for the sync worker having reached the injection point instead of sleeping: repeatedly query a reliable signal (e.g., SELECT count(*) FROM pg_stat_activity WHERE query ILIKE '%<expected sync query snippet>%' or call your injection-point helper like injection_point_status() or a test-only function) with a short interval and a timeout (e.g., poll every 200–500ms up to N seconds), and proceed only once the query indicates the injection point is reached or fail the test after the timeout; update the test's "sleep 10" site to use this polling pattern so it is robust and faster on CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/spock_sync.c`:
- Line 1212: The injection point name is misleading because
INJECTION_POINT("spock-before-replication-slot-snapshot", "wait") is invoked
after ensure_replication_slot_snapshot() returns; rename the injection string to
a name that reflects it fires after snapshot creation (for example
"spock-after-replication-slot-snapshot" or "spock-before-sync-progress-read")
and update any tests or callers that reference the old name so they use the new
identifier; ensure the INJECTION_POINT invocation and any matching test fixtures
(e.g., those expecting the delay between snapshot creation and progress read)
remain consistent.
- Line 61: Add a PG version guard around the unguarded include of
utils/injection_point.h and the INJECTION_POINT(...) usage: wrap the `#include`
"utils/injection_point.h" with `#if` PG_VERSION_NUM >= 170000 ... `#endif`, and wrap
the INJECTION_POINT(...) call (the macro invocation found in src/spock_sync.c)
with the same `#if/`#endif so the header and macro are only used when
PG_VERSION_NUM >= 170000; keep the guard pattern consistent with other guards in
the codebase.
In `@tests/tap/t/011_zodan_sync_third.pl`:
- Line 3: The test plan declared as "use Test::More tests => 30;" does not match
the actual number of assertions produced by this file (only ~24 from explicit
ok() calls plus helper-created assertions in create_cluster, cross_wire, and
destroy_cluster); fix by either adjusting the Test::More plan to the correct
total (count explicit ok() calls plus the assertions added by create_cluster,
cross_wire, and destroy_cluster) or add the missing assertions to reach 30
(e.g., add explicit ok() checks or additional expectations) and ensure the
symbols to inspect are the Test::More plan line and the helper functions
create_cluster, cross_wire, and destroy_cluster.
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Around line 58-59: The test attaches the wrong injection point name
'spock-before-replication-slot-snapshot' in psql_or_bail; update the attached
name to the post-snapshot injection point used in spock_sync.c (e.g. replace
with 'spock-after-replication-slot-snapshot') so the test delays after the
replication slot snapshot is created and matches the C code behavior.
---
Nitpick comments:
In `@src/spock_sync.c`:
- Around line 1223-1234: The block after calling
start_copy_origin_tx(origin_conn, snapshot) that sets progress_entries_list =
adjust_progress_info(origin_conn) then manually runs ROLLBACK + PQfinish should
be replaced by a single call to finish_copy_origin_tx(origin_conn) to reuse the
existing cleanup sequence; ensure you still call
adjust_progress_info(origin_conn) to set progress_entries_list before invoking
finish_copy_origin_tx(), and remove the manual PQexec("ROLLBACK") / PQfinish()
lines to avoid duplicating the cleanup logic implemented by
finish_copy_origin_tx().
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Line 64: Replace the fragile fixed delay "sleep 10" with a bounded polling
loop that checks for the sync worker having reached the injection point instead
of sleeping: repeatedly query a reliable signal (e.g., SELECT count(*) FROM
pg_stat_activity WHERE query ILIKE '%<expected sync query snippet>%' or call
your injection-point helper like injection_point_status() or a test-only
function) with a short interval and a timeout (e.g., poll every 200–500ms up to
N seconds), and proceed only once the query indicates the injection point is
reached or fail the test after the timeout; update the test's "sleep 10" site to
use this polling pattern so it is robust and faster on CI.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/tap/t/018_inject_sync_progress.pl (1)
4-4:IPC::Runis imported but not directly used.
IPC::Runis loaded at line 4 but no function from it (run,start,harness, etc.) appears in this file. If it is only needed transitively bySpockTest, theuse IPC::Runhere is dead code. Remove it to keep imports honest.♻️ Proposed fix
-use IPC::Run;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/tap/t/018_inject_sync_progress.pl` at line 4, The file imports IPC::Run via the statement "use IPC::Run;" but no symbols (run, start, harness, etc.) from IPC::Run are used; remove the "use IPC::Run;" line from tests/tap/t/018_inject_sync_progress.pl unless IPC::Run is required transitively by SpockTest or other test helpers—if it is required, replace the direct import with a comment explaining why or move the dependency to the module that actually needs it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/spock_sync.c`:
- Around line 1229-1240: The inline ROLLBACK + PQfinish pattern after
start_copy_origin_tx can leak origin_conn if start_copy_origin_tx calls
elog(ERROR); replace that inline block with a call to the existing
finish_copy_origin_tx(origin_conn, snapshot or relevant state) to perform both
rollback and PQfinish, and ensure the call site is protected by PG_TRY/PG_CATCH
(or ensure the PG_ENSURE_ERROR_CLEANUP handler calls PQfinish) so origin_conn is
always PQfinished even when start_copy_origin_tx throws; update
spock_sync_worker_cleanup_error_cb if needed to avoid duplicating cleanup so
finish_copy_origin_tx is the single responsibility for ending the remote
transaction and closing origin_conn.
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Line 64: Replace the fragile sleep 10 with a polling wait that verifies the
sync worker is actually paused at the injection point before running the DML:
after calling sub_create, poll pg_stat_activity (using scalar_query) for
wait_event_type = 'InjectionPoint' and wait_event =
'spock-before-sync-progress-read' (or call injection_points_wait_until_attached
if that helper exists) with a sensible timeout (e.g. 60s) and die on timeout;
only proceed to the UPDATE/injection_points_wakeup once the poll reports the
worker is attached.
- Around line 105-107: The test creates a 3-node cluster with create_cluster(3,
...) but never tears it down; add a call to destroy_cluster with the same
cluster identifier (e.g., destroy_cluster(3, ...) or destroy_cluster($cluster)
matching how create_cluster was invoked) after the final assertions (after the
ok(...) lines) to stop node processes and remove temp directories so the test
cleans up after itself.
---
Duplicate comments:
In `@src/spock_sync.c`:
- Around line 61-63: The version-guarding of the injection_point.h include is
correct; keep the conditional compilation around the include (the `#if`
PG_VERSION_NUM >= 170000 ... `#endif` block) so that injection_point.h is only
included for PG_VERSION_NUM >= 170000, and ensure the symbol injection_point.h
and the macro PG_VERSION_NUM are used exactly as in src/spock_sync.c to preserve
build compatibility with PG 15/16.
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Around line 58-59: The injection point name mismatch has been resolved: ensure
the call to injection_points_attach in tests/tap/t/018_inject_sync_progress.pl
uses the exact string 'spock-before-sync-progress-read' to match the C-level
INJECTION_POINT (used in INJECTION_POINT(...) or injection point macro) and that
no other occurrences use the old name; no functional code change required beyond
keeping injection_points_attach and the C INJECTION_POINT in sync.
---
Nitpick comments:
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Line 4: The file imports IPC::Run via the statement "use IPC::Run;" but no
symbols (run, start, harness, etc.) from IPC::Run are used; remove the "use
IPC::Run;" line from tests/tap/t/018_inject_sync_progress.pl unless IPC::Run is
required transitively by SpockTest or other test helpers—if it is required,
replace the direct import with a comment explaining why or move the dependency
to the module that actually needs it.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/tap/t/018_inject_sync_progress.pl`:
- Around line 66-74: The SQL check in the polling loop uses the wrong column:
change the scalar_query call inside the while loop (the SELECT that currently
tests wait_event = 'InjectionPoint') to instead check wait_event_type =
'InjectionPoint' and wait_event = 'spock-before-sync-progress-read' so the loop
actually detects the backend blocked at the named injection point; keep the loop
structure and $waited variable as-is and only update the WHERE clause in the
scalar_query invocation so ok($waited < 60, ...) behaves as intended.
- Line 24: The test plan declares 15 tests but only yields 9 assertions; update
the Plan or add assertions: either change the "plan tests => 15;" line to "plan
tests => 9;" to match the actual assertions, or add six additional assertions
(e.g., more ok() calls or explicit pass() checks) so the total equals 15; ensure
any added assertions are placed where appropriate relative to the existing ok()
calls and helper-generated assertions from create_cluster(), cross_wire(), and
destroy_cluster().
| plan skip_all => 'Injection points not supported by this build'; | ||
| } | ||
|
|
||
| plan tests => 15; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Count ok() / pass/fail assertions emitted by SpockTest helpers used in this file,
# and verify the total matches `plan tests => 15`.
# 1. How many ok() calls in this test file itself
echo "=== Explicit ok() calls in 018_inject_sync_progress.pl ==="
rg -n '\bok\s*\(' tests/tap/t/018_inject_sync_progress.pl
# 2. Find SpockTest module
echo -e "\n=== Finding SpockTest module ==="
fd -e pm | grep -i spocktest
# 3. Check what assertions SpockTest helpers emit
echo -e "\n=== Assertions inside create_cluster ==="
rg -n -A10 'sub create_cluster' tests/tap/SpockTest.pm || echo "Not found in default location"
echo -e "\n=== Assertions inside cross_wire ==="
rg -n -A10 'sub cross_wire' tests/tap/SpockTest.pm || echo "Not found in default location"
echo -e "\n=== Assertions inside destroy_cluster ==="
rg -n -A10 'sub destroy_cluster' tests/tap/SpockTest.pm || echo "Not found in default location"
echo -e "\n=== Total ok/pass/fail/is calls in SpockTest.pm ==="
rg -c '\b(?:ok|pass|fail|is|like|unlike|cmp_ok)\s*\(' tests/tap/SpockTest.pm || echo "File not found"Repository: pgEdge/spock
Length of output: 1149
🏁 Script executed:
#!/bin/bash
# Check the SpockTest helpers at the correct path
echo "=== Assertions inside create_cluster ==="
rg -n -A15 'sub create_cluster' tests/tap/t/SpockTest.pm
echo -e "\n=== Assertions inside cross_wire ==="
rg -n -A15 'sub cross_wire' tests/tap/t/SpockTest.pm
echo -e "\n=== Assertions inside destroy_cluster ==="
rg -n -A15 'sub destroy_cluster' tests/tap/t/SpockTest.pm
echo -e "\n=== Total assertion calls in SpockTest.pm ==="
rg '\b(?:ok|pass|fail|is|like|unlike|cmp_ok)\s*\(' tests/tap/t/SpockTest.pm | wc -lRepository: pgEdge/spock
Length of output: 3174
🏁 Script executed:
#!/bin/bash
# Check which SpockTest helpers are called in 018_inject_sync_progress.pl
echo "=== Calls to SpockTest helpers in 018_inject_sync_progress.pl ==="
rg -n 'create_cluster|cross_wire|destroy_cluster' tests/tap/t/018_inject_sync_progress.pl
# Get full content of 018_inject_sync_progress.pl to understand structure
echo -e "\n=== Full content of 018_inject_sync_progress.pl ==="
cat tests/tap/t/018_inject_sync_progress.pl
# Check all ok() calls in SpockTest.pm with context
echo -e "\n=== All assertion calls in SpockTest.pm with context ==="
rg -n -B2 -A2 '\b(?:ok|pass|fail)\s*\(' tests/tap/t/SpockTest.pmRepository: pgEdge/spock
Length of output: 6974
Fix plan count mismatch.
The plan declares 15 tests, but only 9 assertions are generated:
- 6 explicit
ok()calls (lines 74, 108, 109, 115, 116, 117) - 3
pass()calls from SpockTest helpers:create_cluster()emits 1 assertioncross_wire()emits 1 assertiondestroy_cluster()emits 1 assertion
Update plan tests => 15; to plan tests => 9; or add 6 more assertions to match the declared count.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/tap/t/018_inject_sync_progress.pl` at line 24, The test plan declares
15 tests but only yields 9 assertions; update the Plan or add assertions: either
change the "plan tests => 15;" line to "plan tests => 9;" to match the actual
assertions, or add six additional assertions (e.g., more ok() calls or explicit
pass() checks) so the total equals 15; ensure any added assertions are placed
where appropriate relative to the existing ok() calls and helper-generated
assertions from create_cluster(), cross_wire(), and destroy_cluster().
| my $waited = 0; | ||
| while ($waited < 60) { | ||
| my $hit = scalar_query(3, | ||
| "SELECT 1 FROM pg_stat_activity WHERE wait_event = 'InjectionPoint' AND datname = current_database()"); | ||
| last if ($hit eq '1'); | ||
| sleep 1; | ||
| $waited++; | ||
| } | ||
| ok($waited < 60, "Sync worker reached injection point within 60s"); |
There was a problem hiding this comment.
wait_event = 'InjectionPoint' is the wrong column — polling loop will always time out.
In pg_stat_activity, wait_event_type is the wait category and wait_event is the specific reason a client is waiting for. When a backend is blocked at an injection point named spock-before-sync-progress-read, PostgreSQL sets wait_event_type = 'InjectionPoint' and wait_event = 'spock-before-sync-progress-read'. The condition on line 69 tests wait_event = 'InjectionPoint', which will never match any injection-point wait — it would only match if there were a point literally named 'InjectionPoint'. As a result, $waited always reaches 60, so:
ok($waited < 60, ...)at line 74 always emitsnot ok.- The 60-second sleep lets the sync worker advance past the injection point before the
UPDATEon line 80, making the test vacuous.
🐛 Proposed fix
- my $hit = scalar_query(3,
- "SELECT 1 FROM pg_stat_activity WHERE wait_event = 'InjectionPoint' AND datname = current_database()");
+ my $hit = scalar_query(3,
+ "SELECT 1 FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'"
+ . " AND wait_event = 'spock-before-sync-progress-read'"
+ . " AND datname = current_database()");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| my $waited = 0; | |
| while ($waited < 60) { | |
| my $hit = scalar_query(3, | |
| "SELECT 1 FROM pg_stat_activity WHERE wait_event = 'InjectionPoint' AND datname = current_database()"); | |
| last if ($hit eq '1'); | |
| sleep 1; | |
| $waited++; | |
| } | |
| ok($waited < 60, "Sync worker reached injection point within 60s"); | |
| my $waited = 0; | |
| while ($waited < 60) { | |
| my $hit = scalar_query(3, | |
| "SELECT 1 FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'" | |
| . " AND wait_event = 'spock-before-sync-progress-read'" | |
| . " AND datname = current_database()"); | |
| last if ($hit eq '1'); | |
| sleep 1; | |
| $waited++; | |
| } | |
| ok($waited < 60, "Sync worker reached injection point within 60s"); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/tap/t/018_inject_sync_progress.pl` around lines 66 - 74, The SQL check
in the polling loop uses the wrong column: change the scalar_query call inside
the while loop (the SELECT that currently tests wait_event = 'InjectionPoint')
to instead check wait_event_type = 'InjectionPoint' and wait_event =
'spock-before-sync-progress-read' so the loop actually detects the backend
blocked at the named injection point; keep the loop structure and $waited
variable as-is and only update the WHERE clause in the scalar_query invocation
so ok($waited < 60, ...) behaves as intended.
Move adjust_progress_info() after ensure_replication_slot_snapshot() and execute it within the same exported snapshot transaction, so that the progress LSNs are consistent with the data included in the COPY.