Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 201 additions & 0 deletions IMPLEMENTATION_SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
# Implementation Summary: LibevWrapper Cleanup Fix

## Overview

This document summarizes the implementation of fixes for the libevwrapper atexit cleanup bug that caused Python shutdown crashes.

## Problem Statement

The libev reactor was using `atexit.register(partial(_cleanup, _global_loop))` where `_global_loop` was `None` at module import time. This caused the cleanup function to receive `None` at shutdown instead of the actual loop instance, preventing proper cleanup and leading to crashes when callbacks fired during Python interpreter shutdown.

## Solutions Implemented

### Solution 1: Fix atexit Registration (Commit: 8c90f05)

**Minimal change approach - fixes the root cause**

#### Changes Made:
1. Replaced `atexit.register(partial(_cleanup, _global_loop))` with a wrapper function
2. Created `_atexit_cleanup()` that looks up `_global_loop` when called, not when registered
3. Removed unused `partial` import from functools
4. Updated tests to verify the fix works

#### Files Modified:
- `cassandra/io/libevreactor.py` - Added `_atexit_cleanup()` wrapper function
- `tests/unit/io/test_libevreactor_shutdown.py` - Updated tests to verify fix

#### Code Changes:
```python
# OLD (buggy):
_global_loop = None
atexit.register(partial(_cleanup, _global_loop)) # Captures None!

# NEW (fixed):
def _atexit_cleanup():
"""Cleanup function that looks up _global_loop at shutdown time."""
global _global_loop
if _global_loop is not None:
_cleanup(_global_loop)

_global_loop = None
atexit.register(_atexit_cleanup) # Looks up current value when called
```

#### Benefits:
- Minimal change (14 lines modified)
- Fixes the immediate bug
- No C extension changes required
- No API changes
- Easy to understand and maintain

### Solution 2: Add loop.stop() Method (Commit: 466e4cb)

**Enhanced robustness - adds explicit loop stopping mechanism**

#### Changes Made:
1. Added `ev_async async_watcher` field to `libevwrapper_Loop` struct
2. Implemented `async_stop_cb()` callback that calls `ev_break(EVBREAK_ALL)`
3. Added `Loop_stop()` Python-callable method
4. Initialize and start async_watcher in `Loop_init()`
5. Clean up async_watcher in `Loop_dealloc()`
6. Updated `_atexit_cleanup()` to call `loop.stop()` before cleanup

#### Files Modified:
- `cassandra/io/libevwrapper.c` - Added stop method to C extension
- `cassandra/io/libevreactor.py` - Call loop.stop() in cleanup

#### Code Changes:

**C Extension (libevwrapper.c):**
```c
typedef struct libevwrapper_Loop {
PyObject_HEAD
struct ev_loop *loop;
ev_async async_watcher; // NEW: for thread-safe stopping
} libevwrapper_Loop;

static void async_stop_cb(EV_P_ ev_async *w, int revents) {
ev_break(EV_A_ EVBREAK_ALL); // Break the event loop
}

static PyObject *
Loop_stop(libevwrapper_Loop *self, PyObject *args) {
ev_async_send(self->loop, &self->async_watcher);
Py_RETURN_NONE;
}

// In Loop_init:
ev_async_init(&self->async_watcher, async_stop_cb);
ev_async_start(self->loop, &self->async_watcher);
```

**Python (libevreactor.py):**
```python
def _atexit_cleanup():
global _global_loop
if _global_loop is not None:
# Stop the event loop before cleanup (thread-safe)
if _global_loop._loop:
try:
_global_loop._loop.stop()
except Exception:
pass # Continue cleanup even if stop fails
_cleanup(_global_loop)
```

#### Benefits:
- Thread-safe loop stopping via libev's async mechanism
- Explicitly breaks event loop before cleanup
- Prevents callbacks from firing during cleanup
- Works with Solution 1 for defense in depth
- Implements the approach suggested in the original issue

## Testing

### Tests Updated:
1. `test_atexit_callback_uses_current_global_loop()` - Verifies atexit handler is the wrapper function, not a partial
2. `test_shutdown_cleanup_works_with_fix()` - Subprocess test verifying proper cleanup
3. `test_cleanup_with_fix_properly_shuts_down()` - Verifies cleanup actually shuts down the loop

### Test Results:
Tests verify that:
- The atexit handler is `_atexit_cleanup`, not a `partial` object
- `_global_loop` is properly looked up at shutdown time
- Cleanup receives the actual loop instance, not `None`
- The loop is properly shut down after cleanup runs

## Impact Analysis

### Before the Fix:
- atexit cleanup received `None` and did nothing
- Event loop kept running during Python shutdown
- Callbacks could fire after Python started deallocating modules
- Resulted in segmentation faults and crashes

### After Solution 1:
- atexit cleanup receives actual loop instance
- Proper shutdown sequence executes:
- Sets `_shutdown` flag
- Closes connections
- Stops watchers
- Joins event loop thread (with 1s timeout)

### After Solution 2 (Additional):
- Event loop explicitly breaks before cleanup starts
- Prevents race conditions where loop is processing events during cleanup
- Thread-safe stopping via libev's async watcher mechanism
- More robust cleanup even if loop is in the middle of event processing

## Deployment Considerations

### Building:
Solution 2 requires rebuilding the C extension:
```bash
python setup.py build_ext --inplace
```

### Compatibility:
- Both solutions maintain backward compatibility
- No API changes for users
- The `stop()` method is a new addition (doesn't break existing code)

### Testing Recommendations:
1. Test with active connections during shutdown
2. Test with pending I/O operations
3. Test in forked processes
4. Stress test with many connections
5. Test rapid creation/destruction cycles

## Crash Scenarios Addressed

1. ✅ **Root cause fixed**: Cleanup now receives actual loop instance
2. ✅ **Thread join timeout**: Loop is stopped before trying to join thread
3. ✅ **Object lifecycle**: Watchers are properly stopped before Python teardown
4. ✅ **Connection cleanup**: Cleanup actually runs, closing connections and stopping watchers
5. ✅ **Race conditions**: Explicit loop.stop() prevents callbacks during cleanup
6. ⚠️ **GIL state issues**: Still possible but less likely (could be addressed with Solution 3)
7. ⚠️ **Module deallocation order**: Python still controls this, but shorter cleanup window reduces risk
8. ⚠️ **Fork handling**: Existing fork detection should work with the fix

## Future Enhancements (Optional)

### Solution 3: Callback Safety Guards
Could add additional safety in C callbacks:
```c
static void io_callback(...) {
if (Py_IsInitialized() == 0) {
return; // Don't execute during shutdown
}
// ... rest of callback
}
```

This would provide defense in depth but isn't strictly necessary with Solutions 1 and 2 in place.

## Conclusion

Both solutions have been implemented:
- **Solution 1** fixes the root cause with minimal changes
- **Solution 2** adds robustness with explicit loop stopping

Together, they provide a comprehensive fix for the libev shutdown crash issue.
Loading
Loading