
Conversation

Contributor

Vasuk12 commented Dec 12, 2025

Issue #397

  • Add ServerShutdownError exception for graceful shutdown handling
  • Detect server shutdown via health check and set _server_shutting_down flag
  • Convert ServerDisconnectedError to ServerShutdownError during shutdown
  • Catch ServerShutdownError in runner and log at debug level (no traceback)

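For orientation, here is a minimal sketch of the two ends of this change, assuming only the names listed in the bullets above (ServerShutdownError and the runner's debug-level handling); it is illustrative, not the actual implementation:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class ServerShutdownError(RuntimeError):
    """Raised when a request fails because the store server is shutting down."""


async def run_one_step(step) -> None:
    # Hypothetical runner helper: expected shutdown/cancellation errors are
    # logged at debug level without a traceback; everything else propagates.
    try:
        await step()
    except (ServerShutdownError, asyncio.CancelledError):
        logger.debug("Server is shutting down; stopping this step gracefully.")
```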

Signed-off-by: Vasu <vasukhare88@gmail.com>
Copilot AI review requested due to automatic review settings December 12, 2025 03:30
Contributor

Copilot AI left a comment


Pull request overview

This PR implements graceful handling of server shutdown scenarios to eliminate traceback spam in logs. It introduces a new ServerShutdownError exception that is raised when the server is detected to be shutting down, allowing the runner to handle these situations gracefully with debug-level logging instead of full exception traces.

Key Changes:

  • Introduced ServerShutdownError exception for distinguishing shutdown scenarios from other errors
  • Added server health monitoring via _server_shutting_down flag that is set when health checks fail
  • Implemented conversion from ServerDisconnectedError to ServerShutdownError during detected shutdowns
  • Added exception handlers in the runner to catch ServerShutdownError and log at debug level without tracebacks

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File: agentlightning/store/client_server.py
Description: Defines the ServerShutdownError exception; adds the _server_shutting_down flag to track server state; updates _wait_until_healthy to set the shutdown flag and log a warning instead of an error; modifies _request_json to convert ServerDisconnectedError to ServerShutdownError when shutdown is detected

File: agentlightning/runner/agent.py
Description: Imports ServerShutdownError; adds exception handling in _post_process_rollout_result, _step_impl, and iter to catch ServerShutdownError and log it at debug level without tracebacks
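A rough, hedged sketch of the store-client behavior summarized above. Only the names _server_shutting_down, _wait_until_healthy, _request_json, ServerShutdownError, and aiohttp's ServerDisconnectedError come from this PR; the method bodies, the /health path, and the class name are illustrative assumptions:

```python
import aiohttp


class ServerShutdownError(RuntimeError):
    """Raised when a request fails because the store server is shutting down."""


class _StoreClientSketch:
    def __init__(self, base_url: str, session: aiohttp.ClientSession) -> None:
        self._base_url = base_url
        self._session = session
        self._server_shutting_down = False

    async def _wait_until_healthy(self) -> bool:
        # If the health check fails, assume the server is going away:
        # set the shutdown flag and let callers log a warning, not an error.
        try:
            async with self._session.get(f"{self._base_url}/health") as resp:
                return resp.status == 200
        except aiohttp.ClientError:
            self._server_shutting_down = True
            return False

    async def _request_json(self, path: str) -> dict:
        try:
            async with self._session.get(f"{self._base_url}{path}") as resp:
                return await resp.json()
        except aiohttp.ServerDisconnectedError:
            if self._server_shutting_down:
                # Translate the disconnect into a dedicated exception so the
                # runner can treat it as an expected condition.
                raise ServerShutdownError("Store server is shutting down") from None
            raise
```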


Contributor Author

Vasuk12 commented Dec 12, 2025

This fix handles both ServerDisconnectedError and CancelledError during server shutdown...

@ultmaster
Contributor

Thanks for taking a look at this issue. I think we would need more discussion on how to correctly solve it.

Option 1: We can add a multiprocess-shared value/flag called server_online. It is set to true only after the server has launched, and set to false before the server is stopped. When the client makes a request and it fails, it always checks this value. If the value says the server should not be online now, the client should not be surprised and should silently ignore the error.
Pros: I think we can get an accurate awareness of the server status.
Cons: It only works with the client-server execution strategy. For example, the isolated store mode won't benefit from this solution.

Option 2: Cleverly handle all the exception types within the client, and tell the expected issues apart from the real issues. This requires a good understanding of the aiohttp implementation, which, unfortunately, I don't have yet.
Pros: Sounds perfect.
Cons: Not sure whether it's doable.

So what do you think?
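For reference, a minimal sketch of what Option 1's shared flag could look like with stdlib multiprocessing (the helper names here are made up; only the server_online idea comes from the comment above):

```python
import multiprocessing

ctx = multiprocessing.get_context()

# Shared single-byte flag: nonzero means the server is believed to be online.
server_online = ctx.Value("b", False)


def mark_server_started() -> None:
    with server_online.get_lock():
        server_online.value = True


def mark_server_stopping() -> None:
    with server_online.get_lock():
        server_online.value = False


def request_failure_is_expected() -> bool:
    # Called by the client after a failed request: if the flag says the server
    # is not online, the failure is expected and can be silently ignored.
    return not bool(server_online.value)
```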

await store.add_otel_span(rollout.rollout_id, rollout.attempt.attempt_id, reward_span)
except (ServerShutdownError, asyncio.CancelledError):
# Server is shutting down or request was cancelled - handle gracefully without traceback
logger.debug(
Contributor


I think this isn't the only place with such an issue. We'd better handle it in the store client.

self._dequeue_first_unsuccessful: bool = True

# Track server shutdown state to handle errors gracefully
self._server_shutting_down: bool = False
Contributor


How about having a _server_online shared flag, so we can sunset hacky flags like _dequeue_first_unsuccessful?

Contributor Author

Vasuk12 commented Dec 12, 2025

> Thanks for taking a look at this issue. I think we would need more discussion on how to correctly solve it.
>
> Option 1: We can add a multiprocess-shared value/flag called server_online. It is set to true only after the server has launched, and set to false before the server is stopped. When the client makes a request and it fails, it always checks this value. If the value says the server should not be online now, the client should not be surprised and should silently ignore the error. Pros: I think we can get an accurate awareness of the server status. Cons: It only works with the client-server execution strategy. For example, the isolated store mode won't benefit from this solution.
>
> Option 2: Cleverly handle all the exception types within the client, and tell the expected issues apart from the real issues. This requires a good understanding of the aiohttp implementation, which, unfortunately, I don't have yet. Pros: Sounds perfect. Cons: Not sure whether it's doable.
>
> So what do you think?

I think Option 1 is more feasible for client-server mode. Instead of handling all exception types, we only need to check a shared flag. I've added Option 2, which works for client-server execution and external store servers, but Option 1 would be faster and more accurate for client-server execution. I think we should keep Option 2 as the fallback for external store servers (where a shared flag isn't available) and add Option 1 as an enhancement for client-server execution mode. Is that alright?

@ultmaster
Contributor

You can try to implement Option 1. Be mindful of performance issues as all processes need to synchronously read a flag.
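One hedged way to address the performance concern (the helper and parameter names below are hypothetical): only read the shared flag after a request has already failed, so the cross-process read never sits on the hot path of successful requests:

```python
def server_is_online(server_online_flag) -> bool:
    # Hypothetical helper, consulted only after a request has failed.
    if server_online_flag is None:
        # No shared flag available (e.g. external store server): fall back to
        # exception-based handling (Option 2).
        return True
    # Reading .value for a single byte is cheap; a slightly stale value only
    # affects whether one error is suppressed or logged.
    return bool(server_online_flag.value)
```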

Contributor Author

Vasuk12 commented Dec 14, 2025

> You can try to implement Option 1. Be mindful of performance issues as all processes need to synchronously read a flag.

Okay, give me some time to implement Option 1!

Contributor Author

Vasuk12 commented Dec 21, 2025

Let me know if I am missing something or if there are any more changes...

n_workers: int = 1,
tracker: MetricsBackend | None = None,
prometheus: bool = False,
server_online_flag: Any = None,
Contributor


This should not be Any, and it should be documented in the docstring.

Contributor Author


Removed Any and documented server_online_flag in the docstring. But with object | None typing, we get linter errors when accessing .get_lock() and .value (from multiprocessing.sharedctypes.Synchronized). Should we use a Protocol import under TYPE_CHECKING, or just use type ignores?

Contributor


Maybe you should document it as whatever the type of ctx.Value("b", False) is. Use Copilot/cc/codex to help you if you are not sure.
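A hedged sketch of how the parameter could be typed without Any, following the suggestion above. It assumes ctx.Value("b", ...) is typed as multiprocessing.sharedctypes.Synchronized in the type stubs (generic only in the stubs, hence the TYPE_CHECKING guard and postponed annotations); the class name and attribute are illustrative:

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    from ctypes import c_byte
    from multiprocessing.sharedctypes import Synchronized


class ClientServerExecutionStrategySketch:
    def __init__(
        self,
        n_workers: int = 1,
        server_online_flag: Optional[Synchronized[c_byte]] = None,
    ) -> None:
        """Illustrative constructor.

        Args:
            n_workers: Number of worker processes.
            server_online_flag: Shared flag created with ctx.Value("b", False);
                set to True after the server launches and to False before it stops.
        """
        self._server_online_flag = server_online_flag
```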

# This flag is set to True after server launches and False before server stops
# Clients check this flag when requests fail - if server is not online, silently ignore errors
# Be mindful of performance: all processes need to synchronously read this flag
ctx = multiprocessing.get_context()
Contributor


Please add unit-tests for this feature!

Contributor Author


In which file do you want the unit tests? Or should I create a new one?

Contributor


in tests/execution/test_client_server.py
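A minimal sketch of a test that could live in tests/execution/test_client_server.py; it only exercises the shared-flag semantics with the stdlib, since the exact strategy API in this PR is not shown here:

```python
import multiprocessing


def _mark_online(flag) -> None:
    # Simulates the server process flipping the flag after it has launched.
    with flag.get_lock():
        flag.value = True


def test_server_online_flag_is_shared_across_processes() -> None:
    ctx = multiprocessing.get_context()
    flag = ctx.Value("b", False)
    assert not flag.value

    proc = ctx.Process(target=_mark_online, args=(flag,))
    proc.start()
    proc.join(timeout=10)

    assert proc.exitcode == 0
    assert flag.value, "the parent should see the update made by the child"
```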

@ultmaster
Contributor

Please resolve the conflicts so that I can rerun the CI.

Contributor Author

Vasuk12 commented Jan 8, 2026

Sorry for the delay. I will look into the changes today.

