Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLAlchemy DetachedInstanceError in WorkflowCycleManage due to Session Management Issue with Long-Running Tasks #11910

Closed
5 tasks done
leozanee opened this issue Dec 20, 2024 · 2 comments · Fixed by #11913
Closed
5 tasks done
Labels
🐞 bug Something isn't working 🌊 feat:workflow Workflow related stuff.

Comments

@leozanee
Copy link

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues search for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (我已阅读并同意 Language Policy).
  • [FOR CHINESE USERS] 请务必使用英文提交 Issue,否则会被关闭。谢谢!:)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0v0.14.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

1. Create a workflow with parallel nodes:

  • Create a new workflow in the application
  • Add multiple parallel nodes that each make Anthropic API calls:
  • Configure the workflow to process these parallel nodes concurrently

2. Execute the workflow:

  • When running this workflow, all parallel nodes attempt to call Anthropic API simultaneously
  • Due to Anthropic's rate limits (concurrent request limits), some requests receive HTTP 429 responses
  • System automatically initiates retries with backoff:
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 429 Too Many Requests"
Retrying request to /v1/messages in 51.000000 seconds

3. During retry period:

  • Some parallel nodes complete successfully
  • Others are waiting for retry due to rate limits
  • The long wait during retry (50+ seconds) causes the database session to detach
  • When retried requests complete, the workflow attempts to finalize
  • At this point, accessing workflow_run.created_by_role fails because the session is detached

This scenario specifically occurs because:

  1. Parallel execution triggers rate limits
  2. Rate limits cause long retry delays
  3. During these delays, database sessions become invalid
  4. Session detachment causes SQLAlchemy errors when accessing object attributes

✔️ Expected Behavior

Hi Dify team! First of all, I want to express my sincere gratitude for your amazing project. I've been regularly using Dify to build my productivity tools, and it has been invaluable to my work.

I've encountered a bug and would like to contribute to its resolution. While my backend database expertise is limited, I've worked with Claude to analyze the issue as thoroughly as possible to provide a comprehensive report.

Issue Overview

The bug occurs during parallel workflow execution when multiple Anthropic API calls trigger rate limiting. Here's what happens:

  1. Multiple parallel nodes attempt concurrent Anthropic API calls
  2. Rate limiting (HTTP 429) triggers retry mechanisms with ~50s delays
  3. During these delays, SQLAlchemy sessions become detached
  4. When the workflow tries to finalize, accessing workflow_run.created_by_role fails
sqlalchemy.orm.exc.DetachedInstanceError: Instance <WorkflowRun> is not bound to a Session; attribute refresh operation cannot proceed

I've analyzed the code and logs, and it appears to be a session management issue with long-running operations in parallel workflows.

I'm happy to help resolve this bug, whether by implementing a fix myself with your guidance or providing any additional information you need. If my analysis is helpful, I'd love to collaborate on a solution. If not, I welcome your professional team's insights on how to best address this.

Thank you again for creating such an excellent project! Let me know if you need any clarification or additional details about the issue.

Best regards

❌ Actual Behavior

An SQLAlchemy DetachedInstanceError occurs during workflow completion handling when the system attempts to access created_by_role on a detached WorkflowRun instance. This issue appears to be triggered by long-running operations, particularly when API rate limiting causes retries.

Error Details

The core error:

sqlalchemy.orm.exc.DetachedInstanceError: Instance <WorkflowRun at 0x7fbb786d7dd0> is not bound to a Session; attribute refresh operation cannot proceed

Stack Trace

The error occurs in the following call chain:

  1. _workflow_finish_to_stream_response() in workflow_cycle_manage.py
  2. Attempts to access workflow_run.created_by_role
  3. SQLAlchemy tries to refresh the attribute
  4. Fails because the instance is detached from its session

Reproduction Scenario

  1. Start a workflow execution
  2. During execution, API calls to Anthropic hit rate limits (HTTP 429)
  3. System initiates retry with significant delay (50+ seconds)
  4. When workflow completion handling occurs, the database session has expired
  5. Accessing created_by_role triggers the error

Technical Analysis

  • The issue stems from session management in long-running operations
  • The database session expires/detaches while waiting for API retry
  • The code attempts to access lazy-loaded attributes after session detachment
  • Current implementation doesn't properly handle session lifecycle for extended operations

Impact

  • Workflow execution fails to complete properly
  • User experience is disrupted
  • Error responses are sent to the client
  • Potential inconsistency in workflow state

Suggested Solutions

Short-term Fix

  1. Implement proper session handling in _workflow_finish_to_stream_response:
with Session(db.engine, expire_on_commit=False) as session:
    workflow_run = session.merge(workflow_run)
    if workflow_run.created_by_role == CreatedByRole.ACCOUNT.value:
        # ... subsequent processing
  1. Or eagerly load required attributes:
db.session.refresh(workflow_run, ['created_by_role'])

Long-term Solutions

  1. Implement a session management middleware for long-running tasks
  2. Consider using scoped_session for better session lifecycle management
  3. Review and refactor session handling in async operations
  4. Add proper session handling for retry scenarios
  5. Consider implementing a session pool for long-running operations

Additional Context

[Logs will be attached showing the sequence of events, including rate limiting and retry attempts]

Environment

  • SQLAlchemy version: [version]
  • Python version: 3.12
  • Deployment: Docker containerized environment

Related Issues

  • Rate limiting handling in workflow execution
  • Session management in distributed operations
  • Long-running task handling

I added the screenshot of the logs when the error happened as below:

screenshot-20241220-184839

Copy link

dosubot bot commented Dec 20, 2024

Hey @leozanee! I'm here to help you with bug fixes, questions, and contributions. Let's work together to resolve the issue you've encountered.

To resolve the SQLAlchemy DetachedInstanceError in the WorkflowCycleManage module caused by session management issues during long-running tasks with parallel nodes and API rate limiting, you can follow these steps:

  1. Close Sessions Immediately: After performing any database operation, ensure that the session is closed to release the connection. This prevents holding onto connections during long-running tasks.

    Example for creating a new record:

    app = App(id=1)
    db.session.add(app)
    db.session.commit()
    db.session.refresh(app)
    db.session.close()
    return app.id
  2. Pass IDs Instead of Model Objects: When working with data across different tasks or nodes, pass the ID of the record instead of the model object. This avoids detachment issues when the session is closed.

    Example for fetching a record:

    app = db.session.query(App).filter(App.id == app_id).first()
    created_at = app.created_at
    db.session.close()
  3. Retrieve Data Within Each Task: For long-running tasks, retrieve the necessary data within the task itself using the ID, ensuring that each task manages its own session.

By implementing these practices, you can mitigate the DetachedInstanceError and manage database connections more effectively during long-running and concurrent tasks [1].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@dosubot dosubot bot added 🌊 feat:workflow Workflow related stuff. 🐞 bug Something isn't working labels Dec 20, 2024
@fdb02983rhy
Copy link
Contributor

Could you share a DSL file that can reproduce the error?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug Something isn't working 🌊 feat:workflow Workflow related stuff.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants