
Conversation

Contributor

Vasuk12 commented Dec 12, 2025

Issue #397

  • Add ServerShutdownError exception for graceful shutdown handling
  • Detect server shutdown via health check and set _server_shutting_down flag
  • Convert ServerDisconnectedError to ServerShutdownError during shutdown
  • Catch ServerShutdownError in runner and log at debug level (no traceback)

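For orientation, here is a minimal sketch of the two ends of this change, assuming only the names listed in the bullets above (ServerShutdownError and the runner's debug-level handling); it is illustrative, not the actual implementation:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class ServerShutdownError(RuntimeError):
    """Raised when a request fails because the store server is shutting down."""


async def run_one_step(step) -> None:
    # Hypothetical runner helper: expected shutdown/cancellation errors are
    # logged at debug level without a traceback; everything else propagates.
    try:
        await step()
    except (ServerShutdownError, asyncio.CancelledError):
        logger.debug("Server is shutting down; stopping this step gracefully.")
```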

Signed-off-by: Vasu <vasukhare88@gmail.com>
Copilot AI review requested due to automatic review settings December 12, 2025 03:30
Contributor

Copilot AI left a comment


Pull request overview

This PR implements graceful handling of server shutdown scenarios to eliminate traceback spam in logs. It introduces a new ServerShutdownError exception that is raised when the server is detected to be shutting down, allowing the runner to handle these situations gracefully with debug-level logging instead of full exception traces.

Key Changes:

  • Introduced ServerShutdownError exception for distinguishing shutdown scenarios from other errors
  • Added server health monitoring via _server_shutting_down flag that is set when health checks fail
  • Implemented conversion from ServerDisconnectedError to ServerShutdownError during detected shutdowns
  • Added exception handlers in the runner to catch ServerShutdownError and log at debug level without tracebacks

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File: agentlightning/store/client_server.py
Description: Defines the ServerShutdownError exception; adds the _server_shutting_down flag to track server state; updates _wait_until_healthy to set the shutdown flag and log a warning instead of an error; modifies _request_json to convert ServerDisconnectedError to ServerShutdownError when shutdown is detected

File: agentlightning/runner/agent.py
Description: Imports ServerShutdownError; adds exception handling in _post_process_rollout_result, _step_impl, and iter to catch ServerShutdownError and log it at debug level without tracebacks
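A rough, hedged sketch of the store-client behavior summarized above. Only the names _server_shutting_down, _wait_until_healthy, _request_json, ServerShutdownError, and aiohttp's ServerDisconnectedError come from this PR; the method bodies, the /health path, and the class name are illustrative assumptions:

```python
import aiohttp


class ServerShutdownError(RuntimeError):
    """Raised when a request fails because the store server is shutting down."""


class _StoreClientSketch:
    def __init__(self, base_url: str, session: aiohttp.ClientSession) -> None:
        self._base_url = base_url
        self._session = session
        self._server_shutting_down = False

    async def _wait_until_healthy(self) -> bool:
        # If the health check fails, assume the server is going away:
        # set the shutdown flag and let callers log a warning, not an error.
        try:
            async with self._session.get(f"{self._base_url}/health") as resp:
                return resp.status == 200
        except aiohttp.ClientError:
            self._server_shutting_down = True
            return False

    async def _request_json(self, path: str) -> dict:
        try:
            async with self._session.get(f"{self._base_url}{path}") as resp:
                return await resp.json()
        except aiohttp.ServerDisconnectedError:
            if self._server_shutting_down:
                # Translate the disconnect into a dedicated exception so the
                # runner can treat it as an expected condition.
                raise ServerShutdownError("Store server is shutting down") from None
            raise
```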


Contributor Author

Vasuk12 commented Dec 12, 2025

This fix handles both ServerDisconnectedError and CancelledError during server shutdown...

@ultmaster
Contributor

Thanks for taking a look at this issue. I think we would need more discussion on how to correctly solve it.

Option 1: We can add a multiprocess-shared value/flag called server_online. It is set to true only after the server has launched, and set to false before the server is stopped. When the client makes a request and it fails, it always checks this value. If the value says the server should not be online now, the client should not be surprised and should silently ignore the error.
Pros: I think we can get an accurate awareness of the server status.
Cons: It only works with the client-server execution strategy. For example, the isolated store mode won't benefit from this solution.

Option 2: Cleverly handle all the exception types within the client, and tell the expected issues apart from the real issues. This requires a good understanding of the aiohttp implementation, which, unfortunately, I don't have yet.
Pros: Sounds perfect.
Cons: Not sure whether it's doable.

So what do you think?
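For reference, a minimal sketch of what Option 1's shared flag could look like with stdlib multiprocessing (the helper names here are made up; only the server_online idea comes from the comment above):

```python
import multiprocessing

ctx = multiprocessing.get_context()

# Shared single-byte flag: nonzero means the server is believed to be online.
server_online = ctx.Value("b", False)


def mark_server_started() -> None:
    with server_online.get_lock():
        server_online.value = True


def mark_server_stopping() -> None:
    with server_online.get_lock():
        server_online.value = False


def request_failure_is_expected() -> bool:
    # Called by the client after a failed request: if the flag says the server
    # is not online, the failure is expected and can be silently ignored.
    return not bool(server_online.value)
```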

await store.add_otel_span(rollout.rollout_id, rollout.attempt.attempt_id, reward_span)
except (ServerShutdownError, asyncio.CancelledError):
# Server is shutting down or request was cancelled - handle gracefully without traceback
logger.debug(
Contributor


I think this isn't the only place with such an issue. We'd better handle it in the store client.

self._dequeue_first_unsuccessful: bool = True

# Track server shutdown state to handle errors gracefully
self._server_shutting_down: bool = False
Contributor


How about having a _server_online shared flag, so we can sunset hacky flags like _dequeue_first_unsuccessful?

Contributor Author

Vasuk12 commented Dec 12, 2025

> Thanks for taking a look at this issue. I think we would need more discussion on how to correctly solve it.
>
> Option 1: We can add a multiprocess-shared value/flag called server_online. It is set to true only after the server has launched, and set to false before the server is stopped. When the client makes a request and it fails, it always checks this value. If the value says the server should not be online now, the client should not be surprised and should silently ignore the error. Pros: I think we can get an accurate awareness of the server status. Cons: It only works with the client-server execution strategy. For example, the isolated store mode won't benefit from this solution.
>
> Option 2: Cleverly handle all the exception types within the client, and tell the expected issues apart from the real issues. This requires a good understanding of the aiohttp implementation, which, unfortunately, I don't have yet. Pros: Sounds perfect. Cons: Not sure whether it's doable.
>
> So what do you think?

I think Option 1 is more feasible for client-server mode. Instead of handling all exception types, we only need to check a shared flag. I've added Option 2, which works for client-server execution and external store servers, but Option 1 would be faster and more accurate for client-server execution. I think we should keep Option 2 as the fallback for external store servers (where a shared flag isn't available) and add Option 1 as an enhancement for client-server execution mode. Is that alright?

@ultmaster
Contributor

You can try to implement Option 1. Be mindful of performance issues as all processes need to synchronously read a flag.
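One hedged way to address the performance concern (the helper and parameter names below are hypothetical): only read the shared flag after a request has already failed, so the cross-process read never sits on the hot path of successful requests:

```python
def server_is_online(server_online_flag) -> bool:
    # Hypothetical helper, consulted only after a request has failed.
    if server_online_flag is None:
        # No shared flag available (e.g. external store server): fall back to
        # exception-based handling (Option 2).
        return True
    # Reading .value for a single byte is cheap; a slightly stale value only
    # affects whether one error is suppressed or logged.
    return bool(server_online_flag.value)
```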

Contributor Author

Vasuk12 commented Dec 14, 2025

> You can try to implement Option 1. Be mindful of performance issues as all processes need to synchronously read a flag.

Okay, give me some time to implement Option 1!

Contributor Author

Vasuk12 commented Dec 21, 2025

Let me know if I am missing something or if there are any more changes...

n_workers: int = 1,
tracker: MetricsBackend | None = None,
prometheus: bool = False,
server_online_flag: Any = None,
Contributor


This should not be Any, and it should be documented in the docstring.

Contributor Author


Removed Any and documented server_online_flag in the docstring. But with object | None typing, we get linter errors when accessing .get_lock() and .value (from multiprocessing.sharedctypes.Synchronized). Should we use a Protocol import under TYPE_CHECKING, or just use type ignores?

Contributor


Maybe you should document it as whatever the type of ctx.Value("b", False) is. Use Copilot/cc/codex to help you if you are not sure.
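A hedged sketch of how the parameter could be typed without Any, following the suggestion above. It assumes ctx.Value("b", ...) is typed as multiprocessing.sharedctypes.Synchronized in the type stubs (generic only in the stubs, hence the TYPE_CHECKING guard and postponed annotations); the class name and attribute are illustrative:

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    from ctypes import c_byte
    from multiprocessing.sharedctypes import Synchronized


class ClientServerExecutionStrategySketch:
    def __init__(
        self,
        n_workers: int = 1,
        server_online_flag: Optional[Synchronized[c_byte]] = None,
    ) -> None:
        """Illustrative constructor.

        Args:
            n_workers: Number of worker processes.
            server_online_flag: Shared flag created with ctx.Value("b", False);
                set to True after the server launches and to False before it stops.
        """
        self._server_online_flag = server_online_flag
```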

# This flag is set to True after server launches and False before server stops
# Clients check this flag when requests fail - if server is not online, silently ignore errors
# Be mindful of performance: all processes need to synchronously read this flag
ctx = multiprocessing.get_context()
Contributor


Please add unit-tests for this feature!

Contributor Author


In which file do you want the unit tests? Or should I create a new one?

Contributor


in tests/execution/test_client_server.py
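A minimal sketch of a test that could live in tests/execution/test_client_server.py; it only exercises the shared-flag semantics with the stdlib, since the exact strategy API in this PR is not shown here:

```python
import multiprocessing


def _mark_online(flag) -> None:
    # Simulates the server process flipping the flag after it has launched.
    with flag.get_lock():
        flag.value = True


def test_server_online_flag_is_shared_across_processes() -> None:
    ctx = multiprocessing.get_context()
    flag = ctx.Value("b", False)
    assert not flag.value

    proc = ctx.Process(target=_mark_online, args=(flag,))
    proc.start()
    proc.join(timeout=10)

    assert proc.exitcode == 0
    assert flag.value, "the parent should see the update made by the child"
```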

@ultmaster
Contributor

Please resolve the conflicts so that I can rerun the CI.

Contributor Author

Vasuk12 commented Jan 8, 2026

Sorry for the delay. I will look into the changes today.

