Add support for durable execution with DBOS #3526
Open
Summary
Based on the discussion with @tonykipkemboi, here is a prototype that adds durable execution support to agents.
This PR integrates DBOS with CrewAgentExecutor.invoke and related methods to provide out-of-the-box durable execution and checkpointing. I would love to hear early feedback before adding more tests.
Changes
File Structure:
- durable_execution/dbos: folder with all the relevant files.
- dbos_agent.py: the main entrypoint for using DBOS agents.
- dbos_agent_executor.py: executor for managing the agent main loop.
- dbos_llm.py: wraps LLM calls as DBOS steps.
- dbos_util.py: defines StepConfig for configurable step retries.
- test_dbos_agent.py: test.
Workflows:
- DBOSAgentExecutor.invoke is automatically decorated as a DBOS workflow.
Steps:
Tooling:
Example
To use the integration, users only need to add a few lines of DBOS code on top of their existing agent code. Here is the code from the test:
Discussion
DBOS requires workflows to be defined statically so that the recovery thread can find the workflow definition. Currently, DBOSAgentExecutor objects are created dynamically in Agent.execute_task. This means that if the server crashes, recovery cannot find the workflow definition by name. For durable execution, we might need to require the executor to be defined statically at agent creation time and disallow dynamic creation.
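To make the concern concrete, here is a minimal, library-free sketch of the lookup-by-name problem. It does not use the DBOS API: WorkflowRegistry, Executor, and boot_process are hypothetical stand-ins that only mimic the idea of a recovery thread resolving a workflow by its registered name after a restart.

```python
from typing import Callable, Dict


class WorkflowRegistry:
    """Hypothetical stand-in for a runtime's name -> workflow table (not DBOS)."""

    def __init__(self) -> None:
        self._workflows: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._workflows[name] = fn

    def recover(self, name: str, task: str) -> str:
        # Recovery can only resume workflows that are registered in this process.
        if name not in self._workflows:
            raise LookupError(f"workflow {name!r} is not registered")
        return self._workflows[name](task)


class Executor:
    """Hypothetical executor that registers its invoke method on creation."""

    def __init__(self, registry: WorkflowRegistry, name: str) -> None:
        self.name = name
        registry.register(f"{name}.invoke", self.invoke)

    def invoke(self, task: str) -> str:
        return f"{self.name} finished {task}"


def boot_process(static: bool) -> WorkflowRegistry:
    """Simulate a fresh process start; only startup code runs here."""
    registry = WorkflowRegistry()
    if static:
        # Static case: the executor is created together with the agent at startup.
        Executor(registry, "researcher")
    # Dynamic case: the executor would only be created later, inside a task
    # execution call, which has not happened yet when recovery runs.
    return registry


def try_recover(static: bool) -> str:
    registry = boot_process(static)
    try:
        return registry.recover("researcher.invoke", "task-1")
    except LookupError as exc:
        return f"recovery failed: {exc}"


static_result = try_recover(static=True)
dynamic_result = try_recover(static=False)
print(static_result)
print(dynamic_result)
```

In this sketch, recovery succeeds only when the executor is registered during startup. That is the motivation for considering a static executor definition at agent creation time.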