SQLAlchemy DetachedInstanceError in WorkflowCycleManage due to Session Management Issue with Long-Running Tasks #11910

leozanee · 2024-12-20T10:54:57Z

Self Checks

This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues search for existing issues, including closed ones.
I confirm that I am using English to submit this report (我已阅读并同意 Language Policy).
[FOR CHINESE USERS] 请务必使用英文提交 Issue，否则会被关闭。谢谢！:）
Please do not modify this template :) and fill in all the required fields.

Dify version

0v0.14.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

1. Create a workflow with parallel nodes:

Create a new workflow in the application
Add multiple parallel nodes that each make Anthropic API calls:
Configure the workflow to process these parallel nodes concurrently

2. Execute the workflow:

When running this workflow, all parallel nodes attempt to call Anthropic API simultaneously
Due to Anthropic's rate limits (concurrent request limits), some requests receive HTTP 429 responses
System automatically initiates retries with backoff:

HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 429 Too Many Requests"
Retrying request to /v1/messages in 51.000000 seconds

3. During retry period:

Some parallel nodes complete successfully
Others are waiting for retry due to rate limits
The long wait during retry (50+ seconds) causes the database session to detach
When retried requests complete, the workflow attempts to finalize
At this point, accessing workflow_run.created_by_role fails because the session is detached

This scenario specifically occurs because:

Parallel execution triggers rate limits
Rate limits cause long retry delays
During these delays, database sessions become invalid
Session detachment causes SQLAlchemy errors when accessing object attributes

✔️ Expected Behavior

Hi Dify team! First of all, I want to express my sincere gratitude for your amazing project. I've been regularly using Dify to build my productivity tools, and it has been invaluable to my work.

I've encountered a bug and would like to contribute to its resolution. While my backend database expertise is limited, I've worked with Claude to analyze the issue as thoroughly as possible to provide a comprehensive report.

Issue Overview

The bug occurs during parallel workflow execution when multiple Anthropic API calls trigger rate limiting. Here's what happens:

Multiple parallel nodes attempt concurrent Anthropic API calls
Rate limiting (HTTP 429) triggers retry mechanisms with ~50s delays
During these delays, SQLAlchemy sessions become detached
When the workflow tries to finalize, accessing workflow_run.created_by_role fails

sqlalchemy.orm.exc.DetachedInstanceError: Instance <WorkflowRun> is not bound to a Session; attribute refresh operation cannot proceed

I've analyzed the code and logs, and it appears to be a session management issue with long-running operations in parallel workflows.

I'm happy to help resolve this bug, whether by implementing a fix myself with your guidance or providing any additional information you need. If my analysis is helpful, I'd love to collaborate on a solution. If not, I welcome your professional team's insights on how to best address this.

Thank you again for creating such an excellent project! Let me know if you need any clarification or additional details about the issue.

Best regards

❌ Actual Behavior

An SQLAlchemy DetachedInstanceError occurs during workflow completion handling when the system attempts to access created_by_role on a detached WorkflowRun instance. This issue appears to be triggered by long-running operations, particularly when API rate limiting causes retries.

Error Details

The core error:

sqlalchemy.orm.exc.DetachedInstanceError: Instance <WorkflowRun at 0x7fbb786d7dd0> is not bound to a Session; attribute refresh operation cannot proceed

Stack Trace

The error occurs in the following call chain:

_workflow_finish_to_stream_response() in workflow_cycle_manage.py
Attempts to access workflow_run.created_by_role
SQLAlchemy tries to refresh the attribute
Fails because the instance is detached from its session

Reproduction Scenario

Start a workflow execution
During execution, API calls to Anthropic hit rate limits (HTTP 429)
System initiates retry with significant delay (50+ seconds)
When workflow completion handling occurs, the database session has expired
Accessing created_by_role triggers the error

Technical Analysis

The issue stems from session management in long-running operations
The database session expires/detaches while waiting for API retry
The code attempts to access lazy-loaded attributes after session detachment
Current implementation doesn't properly handle session lifecycle for extended operations

Impact

Workflow execution fails to complete properly
User experience is disrupted
Error responses are sent to the client
Potential inconsistency in workflow state

Additional Context

[Logs will be attached showing the sequence of events, including rate limiting and retry attempts]

Environment

SQLAlchemy version: [version]
Python version: 3.12
Deployment: Docker containerized environment

Related Issues

Rate limiting handling in workflow execution
Session management in distributed operations
Long-running task handling

I added the screenshot of the logs when the error happened as below:

The text was updated successfully, but these errors were encountered:

dosubot · 2024-12-20T10:56:32Z

Hey @leozanee! I'm here to help you with bug fixes, questions, and contributions. Let's work together to resolve the issue you've encountered.

To resolve the SQLAlchemy DetachedInstanceError in the WorkflowCycleManage module caused by session management issues during long-running tasks with parallel nodes and API rate limiting, you can follow these steps:

Close Sessions Immediately: After performing any database operation, ensure that the session is closed to release the connection. This prevents holding onto connections during long-running tasks.

Example for creating a new record:
```
app = App(id=1)
db.session.add(app)
db.session.commit()
db.session.refresh(app)
db.session.close()
return app.id
```
Pass IDs Instead of Model Objects: When working with data across different tasks or nodes, pass the ID of the record instead of the model object. This avoids detachment issues when the session is closed.

Example for fetching a record:
```
app = db.session.query(App).filter(App.id == app_id).first()
created_at = app.created_at
db.session.close()
```
Retrieve Data Within Each Task: For long-running tasks, retrieve the necessary data within the task itself using the ID, ensuring that each task manages its own session.

By implementing these practices, you can mitigate the DetachedInstanceError and manage database connections more effectively during long-running and concurrent tasks ^[1].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other}

fdb02983rhy · 2024-12-20T11:46:27Z

Could you share a DSL file that can reproduce the error?

dosubot bot added 🌊 feat:workflow Workflow related stuff. 🐞 bug Something isn't working labels Dec 20, 2024

fdb02983rhy mentioned this issue Dec 20, 2024

fix: ensure WorkflowRun attributes are refreshed in WorkflowCycleMana… #11913

Merged

5 tasks

laipz8200 closed this as completed in #11913 Dec 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQLAlchemy DetachedInstanceError in WorkflowCycleManage due to Session Management Issue with Long-Running Tasks #11910

SQLAlchemy DetachedInstanceError in WorkflowCycleManage due to Session Management Issue with Long-Running Tasks #11910

leozanee commented Dec 20, 2024

dosubot bot commented Dec 20, 2024

fdb02983rhy commented Dec 20, 2024

SQLAlchemy DetachedInstanceError in WorkflowCycleManage due to Session Management Issue with Long-Running Tasks #11910

SQLAlchemy DetachedInstanceError in WorkflowCycleManage due to Session Management Issue with Long-Running Tasks #11910

Comments

leozanee commented Dec 20, 2024

Self Checks

Dify version

Cloud or Self Hosted

Steps to reproduce

1. Create a workflow with parallel nodes:

2. Execute the workflow:

3. During retry period:

✔️ Expected Behavior

Issue Overview

❌ Actual Behavior

Error Details

Stack Trace

Reproduction Scenario

Technical Analysis

Impact

Suggested Solutions

Short-term Fix

Long-term Solutions

Additional Context

Environment

Related Issues

dosubot bot commented Dec 20, 2024

fdb02983rhy commented Dec 20, 2024