Revert internals destruction and add test for internals recreation#5972
Revert internals destruction and add test for internals recreation#5972rwgk merged 41 commits intopybind:masterfrom
Conversation
|
@XuehaiPan Is your plan to work on adding a new unit test here after you have the fix? — Short-term the fix is most important, long-term the test will be more important. I could try to use some LLM to generate a new test, based on what you provided under 5958. Do you think that'd be useful? |
|
It might be better to mostly revert #5958 if the behavior it has is not what we want. The problem described is expected (but undesirable) behavior from that PR. It is not a use-after-free. It's also possible for One option is to go back to something more like the first commit on #5958, before I went into releasing internals itself. The original goal of that PR was to get rid of the leaks of the |
|
I think releasing some part of the internal but keeping the internal leaked would be a safest choice. |
@XuehaiPan did you see my question above (https://github.com/pybind/pybind11/pull/5972#issuecomment-3786298624)? Subinterpreter support added a lot of complexity; if we don’t capture these issues in unit tests, pybind11 will regress over time. I’m very skeptical of fixes that don’t start from a failing unit test. |
Those leaks are currently dwarfed by CPython leaks documented here (pure‑C reproducer): Until those are fixed in CPython, or we find a workaround that sidesteps them, the savings from #5958 are marginal. If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise. |
This makes sense to me because most users only use the main interpreter. The leak will be freed anyway on program shutdown, while the correctness is more important.
I am trying to add a repro. It is hard to find a consistent repro for the order bugs since there is no guarantee. |
I added a test that fails on the master branch. But I don't think it can test all cases since there is no guarantee for the GC order. |
|
This looks great at first glance. I'll start understanding/reviewing this now with LLM assistance. |
|
Below is the result of me discussing this PR with Cursor (GPT-5.2 Codex High). I didn't study the code changes in this PR in detail. In my mind, these are the most important goals:
High level approach: pare this PR back, to more-or-less roll back 5958, but keep as much of the new unit test as possible. Based on that, Cursor came up with the Options A, B, C below. @XuehaiPan @b-pass, could you please scrutinize Cursor's analysis, does it make sense? Based on the analysis, Option A looks best to me. WDYT? PR 5972 internals version bump considerations (v3.0.2)Executive summaryFor v3.0.2 you want no internals version bump and no new ABI breaks, but you also want to
Nothing else in 5972 inherently requires a bump (no struct layout changes or ABI‑critical type So the decision point is:
Below are the minimal patch sets and recommendations. Root cause (why the crash happens)PR 5958 introduced What in 5972 is related to the crash fixCore behavioral change (the shutdown‑safety fix)These changes allow pybind11 types to hold a reference to the internals capsules, keeping them
This prevents Non‑essential or unrelated changesThese are not required for the crash fix and add risk for a patch release:
Mixed v3.0.1 / v3.0.2 (no bump) analysisIf you remove the bump, v3.0.1 and v3.0.2 modules will share internals (same capsule key). Implications:
Minimal patch sets for v3.0.2 (no bump)Option A (safest for mixed environments, no bump)Fully avoid destroying internals at shutdown (i.e., revert the risky part of 5958):
Pros
Cons
This is the only way to guarantee safety in mixed 3.0.1/3.0.2 processes without a bump. Option B (partial fix, no bump)Keep the shutdown behavior from 5958, but add the capsule‑ref protection from 5972: Keep
Drop
Pros
Cons
Option C (post‑release, with bump)After v3.0.2, keep the shutdown‑safety fix and bump internals version so old modules do not Recommended course (for v3.0.2)Given the requirement of full ABI compatibility and avoiding regressions for mixed builds, the
Then, after v3.0.2, do the larger cleanup:
Notes on test strategyThe current new test in 5972 uses a weakref callback to confirm the capsule is still alive at type
This directly targets the failure mode without enforcing a particular lifetime mechanism. Final answer to “what in 5972 requires an internals bump?”Strictly speaking, nothing in 5972 requires a bump. There is no ABI‑level change to the |
|
This is a very complex PR. In an attempt to understand it, I worked with Cursor (GPT-5.2 Codex High) on a draft comprehensive PR description (below). This all makes sense to me high-level. @XuehaiPan @b-pass could you please check for correctness? @XuehaiPan could you please add the corrected version to the actual PR description? I'll look in more detail after the comprehensive PR description is posted. DRAFT comprehensive PR descriptionSummaryThis PR changes pybind11’s interpreter‑shutdown behavior to avoid failures when Importantly, the internals version bump has been reverted; the default Motivation / backgroundPR #5958 introduced Runtime behavior changes (net effect vs
|
rwgk
left a comment
There was a problem hiding this comment.
A couple very easy suggestions for changes to comments.
include/pybind11/detail/internals.h
Outdated
| // Detect re-creation of internals after destruction during interpreter shutdown. | ||
| // If pybind11 code (e.g., tp_traverse/tp_clear calling py::cast) runs after internals have | ||
| // been destroyed, a new empty internals would be created, causing type lookup failures. | ||
| // See also get_or_create_pp_in_state_dict() comments. |
There was a problem hiding this comment.
This PR reverted the "re-creation during shutdown" scenario, so why is this code necessary now?
There was a problem hiding this comment.
I think there are two reasons:
- Add an explicit check to prevent potential bugs in the future. Maybe we can eventually not leak the internals.
- The reverted code still creates the internals raw pointers multiple times (
pp->reset(new InternalsType())). This PR adds a lock to ensure consistency.
There was a problem hiding this comment.
I don't think this code should have this unordered map and the associated book keeping around for an unreachable failure condition. At least make it debug only, or have its own ifdef, or completely remove it.
There was a problem hiding this comment.
The overhead is very small, and ideally, it should only run once. I think it is acceptable.
an unreachable failure condition
We do run into this here:
https://github.com/pybind/pybind11/actions/runs/21558860079/job/62119622188?pr=5979#step:12:1536
There was a problem hiding this comment.
We do run into this here:
I'm glad you understand this already! (I didn't)
@XuehaiPan do you have ideas how we should take care of that failure? This PR, or a follow-on? — We definitely need to bring back the commit I backed out (91189c9).
There was a problem hiding this comment.
The concurrent.interpreters.NotShareableError: func not shareable is unrelated. And it is unexpected because the func function is not using any global variables (it uses local imports). It should be shareable between interpreters.
UPDATE: the root cause is:
E Exception: Traceback (most recent call last):
E File "<frozen importlib._bootstrap>", line 1371, in _find_and_load
E File "<frozen importlib._bootstrap>", line 1333, in _find_and_load_unlocked
E File "<frozen importlib._bootstrap>", line 1267, in _find_spec
E File "<frozen importlib._bootstrap_external>", line 1292, in find_spec
E File "<frozen importlib._bootstrap_external>", line 1266, in _get_spec
E File "<frozen importlib._bootstrap_external>", line 1369, in find_spec
E File "<frozen importlib._bootstrap_external>", line 1412, in _fill_cache
E BlockingIOError: [Errno 11] Resource temporarily unavailable: '/opt/python/cp314-cp314t/lib/python3.14t/concurrent'
E
E The above exception was the direct cause of the following exception:
E
E concurrent.interpreters.NotShareableError: object could not be unpickledThere was a problem hiding this comment.
So it looks like it's flaky, at best. I'll try reruns under 5850 to see if I can get it to pass with enough trials.
It worked only on the 10th attempt!
https://github.com/pybind/pybind11/actions/runs/21580597316/job/62255061573?pr=5850
@XuehaiPan @b-pass we need to handle this somehow, it'll be super distracting. What should we do? Create an issue and add an xfail pointing to it?
There was a problem hiding this comment.
I cannot reproduce this. Does this only fail with C++11? If so, we can add a skipif.
There was a problem hiding this comment.
When I tried reproducing before, with C++20, I saw another kind of failure, and only when running test_multiple_interpreters.py in isolation.
I tried again just now, with C++11, and it's the same behavior. See below, JIC it's useful somehow.
I have to give up for now; I don't have a lot of spare time during the week.
Probably, if I had the free bandwidth, I'd try to set up a Manylinux container, identically to what we have in the CI.
( cd /wrk/forked/pybind11/tests && PYTHONPATH=/wrk/bld/pybind11_gcc_v3.14.2_df793163d58_freethreaded/lib /wrk/bld/pybind11_gcc_v3.14.2_df793163d58_freethreaded/TestVenv/bin/python3 -m pytest -v test_multiple_interpreters.py )
============================================================================= test session starts ==============================================================================
platform linux -- Python 3.14.2, pytest-9.0.2, pluggy-1.6.0 -- /wrk/bld/pybind11_gcc_v3.14.2_df793163d58_freethreaded/TestVenv/bin/python3
cachedir: .pytest_cache
installed packages of interest: build==1.4.0 numpy==2.4.2 scipy==1.17.0
C++ Info: 13.3.0 C++11 __pybind11_internals_v11_system_libstdcpp_gxx_abi_1xxx_use_cxx11_abi_1__ PYBIND11_SIMPLE_GIL_MANAGEMENT=False
free-threaded Python build
rootdir: /wrk/forked/pybind11/tests
configfile: pytest.ini
plugins: timeout-2.4.0, xdist-3.8.0
collected 7 items
test_multiple_interpreters.py::test_independent_subinterpreters PASSED [ 14%]
test_multiple_interpreters.py::test_independent_subinterpreters_modern PASSED [ 28%]
test_multiple_interpreters.py::test_dependent_subinterpreters FAILED [ 42%]
test_multiple_interpreters.py::test_import_module_with_singleton_per_interpreter PASSED [ 57%]
test_multiple_interpreters.py::test_import_in_subinterpreter_after_main PASSED [ 71%]
test_multiple_interpreters.py::test_import_in_subinterpreter_before_main PASSED [ 85%]
test_multiple_interpreters.py::test_import_in_subinterpreter_concurrently PASSED [100%]
=================================================================================== FAILURES ===================================================================================
________________________________________________________________________ test_dependent_subinterpreters ________________________________________________________________________
@pytest.mark.skipif(
sys.platform.startswith("emscripten"), reason="Requires loadable modules"
)
def test_dependent_subinterpreters():
"""Makes sure the internals object differs across subinterpreters"""
sys.path.insert(0, os.path.dirname(pybind11_tests.__file__))
run_string, create = get_interpreters(modern=False)
> import mod_shared_interpreter_gil as m
E RuntimeWarning: The global interpreter lock (GIL) has been enabled to load module 'mod_shared_interpreter_gil', which has not declared that it can run safely without the GIL. To override this behavior and keep the GIL disabled (at your own risk), run with PYTHON_GIL=0 or -Xgil=0.
create = <function get_interpreters.<locals>.create at 0x4f3e3193a80>
run_string = <function get_interpreters.<locals>.run_string at 0x4f3e3193c00>
test_multiple_interpreters.py:193: RuntimeWarning
=========================================================================== short test summary info ============================================================================
FAILED test_multiple_interpreters.py::test_dependent_subinterpreters - RuntimeWarning: The global interpreter lock (GIL) has been enabled to load module 'mod_shared_interpreter_gil', which has not declared that it can run safely without the G...
========================================================================= 1 failed, 6 passed in 19.00s =========================================================================
ERROR: completed_process.returncode=1
There was a problem hiding this comment.
I marked the test as flaky on musllinux in:
- Add fallback implementation of
PyCriticalSection_BeginMutexfor Python 3.13t #5981 - https://github.com/pybind/pybind11/pull/5981/changes#diff-a74aadc300d8c6861a71f50412018bd8c9be4df2a83641c674c6fbc4f1904c91R411
The test passes on manylinux but is flaky on musllinux.
This reverts commit f3197de. After pybind#5972 is/was merged, tests should pass (already tested under pybind#5980). See also pybind#5972 (comment)
* Fix race condition with py::make_key_iterator in free threading The creation of the iterator class needs to be synchronized. * style: pre-commit fixes * Use PyCriticalSection_BeginMutex instead of recursive mutex * style: pre-commit fixes * Make pycritical_section non-copyable and non-movable The pycritical_section class is a RAII wrapper that manages a Python critical section lifecycle: - Acquires the critical section in the constructor via PyCriticalSection_BeginMutex - Releases it in the destructor via PyCriticalSection_End - Holds a reference to a pymutex Allowing copy or move operations would be dangerous: 1. Copy: Both the original and copied objects would call PyCriticalSection_End on the same PyCriticalSection object in their destructors, leading to double-unlock and undefined behavior. 2. Move: The moved-from object's destructor would still run and attempt to end the critical section, while the moved-to object would also try to end it, again causing double-unlock. This follows the same pattern used by other RAII lock guards in the codebase, such as gil_scoped_acquire and gil_scoped_release, which also explicitly delete copy/move operations to prevent similar issues. By explicitly deleting these operations, we prevent accidental misuse and ensure the critical section is properly managed by a single RAII object throughout its lifetime. * Drop Python 3.13t support from CI Python 3.13t was experimental, while Python 3.14t is not. This PR uses PyCriticalSection_BeginMutex which is only available in Python 3.14+, making Python 3.13t incompatible with the changes. Removed all Python 3.13t CI jobs: - ubuntu-latest, 3.13t (standard-large matrix) - macos-15-intel, 3.13t (standard-large matrix) - windows-latest, 3.13t (standard-large matrix) - manylinux job testing 3.13t This aligns with the decision to drop Python 3.13t support as discussed in PR #5971. * Add Python 3.13 (default) replacement jobs for removed 3.13t jobs After removing Python 3.13t support (incompatible with PyCriticalSection_BeginMutex which requires Python 3.14+), we're adding replacement jobs using Python 3.13 (default) to maintain test coverage in key dimensions: 1. ubuntu-latest, Python 3.13: C++20 + DISABLE_HANDLE_TYPE_NAME_DEFAULT_IMPLEMENTATION - Replaces: ubuntu-latest, 3.13t with same config - Maintains coverage for this specific configuration combination 2. macos-15-intel, Python 3.13: C++11 - Replaces: macos-15-intel, 3.13t with same config - Maintains macOS coverage for Python 3.13 3. manylinux (musllinux), Python 3.13: GIL testing - Replaces: manylinux, 3.13t job - Maintains manylinux/musllinux container testing coverage These additions are proposed to get feedback on which jobs should be kept to maintain appropriate test coverage without the experimental 3.13t builds. * ci: run in free-threading mode a bit more on 3.14 * Revert "ci: run in free-threading mode a bit more on 3.14" This reverts commit 91189c9. Reason: #5971 (comment) * Reapply "ci: run in free-threading mode a bit more on 3.14" This reverts commit f3197de. After #5972 is/was merged, tests should pass (already tested under #5980). See also #5972 (comment) --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com> Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com> Co-authored-by: Ralf W. Grosse-Kunstleve <rwgkio@gmail.com>
Description
This PR fixes a segfault that can occur during Python interpreter shutdown when
tp_traverse/tp_clearcallbacks callpy::castafter pybind11 internals have been destroyed.Fixes #5958 (comment)
Background
PR #5958 introduced
internals_shutdown()that resets internals during interpreter finalization. However, because GC order is not guaranteed during shutdown,tp_traverse/tp_clearcan run after internals have been reset. When these callbacks callpy::cast, it recreates an empty internals struct, fails type lookup, and throwspybind11::cast_error, terminating the process.Solution
This PR takes the "leak internals" approach for correctness:
Revert internals destruction: The explicit destructors from Destruct internals during interpreter finalization #5958 are removed.
internalsandlocal_internalsrevert to the previous "leak on shutdown" behavior. The leak only occurs at process exit, so it's benign in practice.Re-entrancy detection: A new
create_pp_content_once()method tracks which interpreters have already created internals. If code attempts to recreate internals after destruction (e.g., late shutdown code callingpy::cast), it triggers a hard failure with a diagnostic message rather than silently creating empty internals.New capsule helpers:
get_internals_capsule()andget_local_internals_capsule()allow tests (and potentially user code) to verify capsule lifetime.Changes
include/pybind11/detail/internals.h:internals::~internals()->= default(no explicit destruction)local_internalsno longer storesistateor defines a destructordtor=nullptrto intentionally leak at shutdownPYBIND11_DTOR_USE_DELETEsentinel foratomic_get_or_create_in_state_dict()to distinguish "use delete" from "leak"create_pp_content_once()with re-entrancy detectionget_internals_capsule(),get_local_internals_capsule(), andget_local_internals_key()helpersTests:
OwnsPythonObjects->ContainerOwnsPythonObjectswith a vector-based containeradd_gc_checkers_with_weakrefs()to verify capsule lifetime during GCtest_py_cast_useable_on_shutdownthat exercises the shutdown GC path in a subprocesscheck_script_success_in_subprocess()totests/env.pyfor reuseDocumentation:
docs/advanced/classes.rstexample to match the new test codeNotes
PYBIND11_INTERNALS_VERSIONremains 11 for ABI compatibility with v3.0.1pybind11_fail()which terminates the process. This is intentional to surface problematic code paths rather than silently failing.Suggested changelog entry:
py::castis called during interpreter shutdown viatp_traverse/tp_clearcallbacks.📚 Documentation preview 📚: https://pybind11--5972.org.readthedocs.build/