Support multi-process debugging: sync breakpoints and coordinate instances #1172
Draft
st0012 wants to merge 5 commits into ruby:master
…ances

Two fixes for debugging multi-process Ruby applications:

1. Breakpoint synchronization across forked processes (fixes ruby#714): Store serialized breakpoint specs in a shared JSON tempfile alongside the existing flock tempfile. Publish on subsession leave, check on subsession enter, and in the socket reader retry paths for both DAP and console protocols. Breakpoints define to_sync_data for serialization. Only LineBreakpoint and CatchBreakpoint are synced.

2. Coordination of independent debugger instances: When parallel test runners fork workers before the debugger loads, each worker gets its own SESSION with no coordination. Add a well-known lock file keyed by process group ID (/tmp/ruby-debug-{uid}-pgrp-{getpgrp}.lock) that all sibling instances discover automatically. On enter_subsession, acquire the lock (blocking flock) so only one process enters the debugger at a time. While blocked, no prompt is shown and IRB/Reline never reads STDIN.
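The descriptor round trip described in item 1 can be sketched as follows. This is a minimal illustration, not the PR's code: only the method name to_sync_data comes from the commit message, while the LineBP and CatchBP structs, the bp_from_spec helper, and the JSON field names are hypothetical stand-ins.

```ruby
require "json"

# Hypothetical stand-ins for the debugger's breakpoint classes.
# Only the method name to_sync_data is taken from the commit message.
LineBP = Struct.new(:path, :line, :cond) do
  def to_sync_data
    { "type" => "line", "path" => path, "line" => line, "cond" => cond }
  end
end

CatchBP = Struct.new(:pattern) do
  def to_sync_data
    { "type" => "catch", "pattern" => pattern }
  end
end

# Rebuild a breakpoint from a descriptor. Unknown types yield nil so a
# malformed or foreign entry in the shared file is skipped, not trusted.
def bp_from_spec(spec)
  case spec["type"]
  when "line"  then LineBP.new(spec["path"], spec["line"], spec["cond"])
  when "catch" then CatchBP.new(spec["pattern"])
  end
end

bps = [LineBP.new("app.rb", 10, nil), CatchBP.new("ZeroDivisionError")]

# What one process would publish ...
payload = JSON.generate(bps.map(&:to_sync_data))

# ... and what a sibling process would reconstruct on subsession enter.
restored = JSON.parse(payload).filter_map { |spec| bp_from_spec(spec) }
```

Serializing plain descriptors rather than the breakpoint objects themselves keeps the shared file free of process-local state, which is presumably why the PR syncs specs instead of marshalled objects.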
❌ 2/707 Tests Failed
/home/runner/work/debug/debug/test/protocol/hover_raw_dap_test.rb#test_hover_works_correctly
/home/runner/work/debug/debug/test/protocol/hover_raw_dap_test.rb#test_1641198331
Tests for breakpoint sync (fork_bp_sync_test.rb):
- Breakpoint set/deleted after fork syncs to child
- Multiple children receive synced breakpoints
- Catch breakpoint syncs to child
- Late-forked child catches up
- Stress test with binding.break

Tests for well-known lock (wk_lock_test.rb):
- Single-process debugging unaffected
- fork_mode: :both uses ProcessGroup not well-known lock
- Independent workers serialized by well-known lock
1395d43 to cdd7e8e
- Fix version counter drift: read file version before writing to prevent processes from missing each other's updates
- Add MethodBreakpoint sync support (to_sync_data + reconciliation)
- Fix CatchBreakpoint sync to preserve command and path attributes
- Add syncable? predicate to avoid unnecessary hash allocation
- Add type validation in create_bp_from_spec for defense-in-depth
- Use Dir.tmpdir instead of hardcoded /tmp for portability
- Set explicit 0600 permissions on temp state file writes
- Broaden error handling to SystemCallError in read/write state
- Add error handling to ensure_wk_lock! for disk-full/read-only
- Publish breakpoint changes on DAP disconnect
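Several of the fixes above (re-reading the version before writing, 0600 permissions, rescuing SystemCallError, and the atomic rename mentioned in the PR description) can be sketched together. The SharedBpState class, its method names, and the JSON layout are assumptions made for illustration; the real state file format is not shown in this conversation.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper; class name, method names and JSON layout are
# assumptions, not the PR's implementation.
class SharedBpState
  EMPTY = { "version" => 0, "bps" => [] }.freeze

  def initialize(path)
    @path = path
  end

  # Returns the parsed state, or an empty state if the file is missing,
  # unreadable, or not a JSON object.
  def read
    data = JSON.parse(File.read(@path))
    data.is_a?(Hash) ? data : EMPTY
  rescue SystemCallError, JSON::ParserError
    EMPTY
  end

  # Re-read the on-disk version right before writing, so the counter is
  # based on the latest published state rather than a stale local copy.
  # Returns the new version, or nil if the write failed.
  def publish(bps)
    next_version = read.fetch("version", 0).to_i + 1
    tmp = "#{@path}.#{Process.pid}.tmp"
    File.open(tmp, File::WRONLY | File::CREAT | File::TRUNC, 0o600) do |f|
      f.write(JSON.generate({ "version" => next_version, "bps" => bps }))
    end
    File.rename(tmp, @path) # readers see the old or the new file, never a partial one
    next_version
  rescue SystemCallError
    nil # e.g. disk full or read-only directory
  end
end

# Usage: two handles on the same file stand in for two processes.
Dir.mktmpdir do |dir|
  path = File.join(dir, "bps.json")
  a = SharedBpState.new(path)
  b = SharedBpState.new(path)
  a.publish([{ "type" => "line", "path" => "app.rb", "line" => 10 }])
  b.publish([])
  puts a.read["version"]
end
```

The read-then-write in publish is not atomic on its own. The sketch assumes, as the PR describes, that a separate flock already ensures only one process is inside a subsession at a time.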
When multiple independent workers share the well-known lock, releasing it on step/next/finish allowed a sibling worker to grab the lock before the stepping worker could re-enter its subsession. As a result, the user needed two next commands to advance: the first one inadvertently drove the other worker. Only release wk_lock on :continue, which is expected to run for an extended period. Step commands hold the lock so the same worker immediately re-enters without yielding.
The previous fix only held the lock during step commands, but continue between breakpoints had the same ping-pong problem: another worker could grab the lock before the current one hit its next breakpoint. Now the wk_lock is never released in leave_subsession. Each worker keeps exclusive debugger access for its entire lifetime. Other workers queue up and get their turn when the current one exits. The kernel releases flock automatically on process exit.
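The last point, that a flock held until exit needs no explicit unlock, can be checked with a small fork experiment. This is a standalone sketch for POSIX systems where fork is available; the tempfile and pipe names are illustrative and unrelated to the debugger's own files.

```ruby
require "tempfile"

lock = Tempfile.new("wk-lock-demo")
ready_r, ready_w = IO.pipe # child -> parent: "I hold the lock"
done_r, done_w = IO.pipe   # parent -> child: "you may exit now"

pid = fork do
  ready_r.close
  done_w.close
  held = File.open(lock.path, File::RDWR)
  held.flock(File::LOCK_EX) # the worker takes the lock and never unlocks it
  ready_w.write("1")
  ready_w.close
  done_r.read               # returns at EOF, once the parent closes done_w
  exit!(0)                  # exit without any explicit LOCK_UN
end

ready_w.close
done_r.close
ready_r.read(1)             # wait until the child holds the lock

mine = File.open(lock.path, File::RDWR)
# Non-blocking attempt while the child is alive: flock returns false.
while_child_alive = mine.flock(File::LOCK_EX | File::LOCK_NB)

done_w.close                # let the child exit
Process.wait(pid)

# The child's descriptor was closed by the kernel, so the lock is free: returns 0.
after_child_exit = mine.flock(File::LOCK_EX | File::LOCK_NB)

puts "while child alive: #{while_child_alive.inspect}"
puts "after child exit:  #{after_child_exit.inspect}"
```

A queued worker in the PR would use a blocking flock(LOCK_EX) instead of the non-blocking probe used here, and would simply wake up at the point where the second call succeeds.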
Problem 1: Breakpoints not shared between forked processes (#714)
Closes #714
Summary
When a Ruby app forks workers (Puma, Unicorn), breakpoints set in the debugger only fire in some workers. Users must toggle breakpoints multiple times to get them to register, and hits are inconsistent.
Cause
In fork_mode: :both (the default), after fork() each process gets an independent copy of @bps. The existing ProcessGroup flock only serializes which process talks to the debugger; it never synchronizes breakpoint state.
Solution
Store serialized breakpoint specs in a shared JSON tempfile alongside the existing flock tempfile. Publish on subsession leave, check on subsession enter. LineBreakpoint, CatchBreakpoint, and MethodBreakpoint are synced as descriptors. Writes are atomic (tmp file + File.rename).
Problem 2: Multiple debugger instances competing for STDIN
Summary
When running parallel test workers (parallel_tests, ci-queue), all workers that hit debugger enter the debug prompt simultaneously. Output is clobbered and input goes to random processes.
Cause
Parallel test runners fork workers before the debugger loads. Each worker creates its own SESSION with no shared coordination, so they all compete for the same STDIN.
Solution
Add a well-known lock file keyed by process group ID. On enter_subsession, acquire with blocking flock(LOCK_EX). The lock is held for the worker's entire session and released by the kernel on process exit. Skipped when MultiProcessGroup is active (fork_mode: :both already handles coordination).
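A minimal sketch of the well-known lock follows. The file name pattern comes from the commit message (uid plus process group ID, under Dir.tmpdir after the later fix); the helper names wk_lock_path and acquire_wk_lock are hypothetical, and the real code lives behind ensure_wk_lock!, whose body is not shown here.

```ruby
require "tmpdir"

# Every sibling worker in the same process group computes the same path,
# so they find each other without any explicit handshake.
def wk_lock_path(dir = Dir.tmpdir)
  File.join(dir, "ruby-debug-#{Process.uid}-pgrp-#{Process.getpgrp}.lock")
end

# Blocks until no other worker holds the lock, then returns the open File.
# The caller must keep the File object alive: closing it releases the lock.
# Returns nil if the lock file cannot be created or locked, in which case
# the worker would run uncoordinated.
def acquire_wk_lock(path = wk_lock_path)
  file = File.open(path, File::RDWR | File::CREAT, 0o600)
  file.flock(File::LOCK_EX)
  file
rescue SystemCallError
  nil
end
```

In this sketch a worker would call acquire_wk_lock once, on its first enter_subsession, and store the result for the rest of its lifetime. Keying on the process group rather than the parent PID means workers spawned through intermediate processes still share one lock, as long as they stay in the same group.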