
A system for agentic LLM-powered data processing and ETL


laurentcarcagno/docetl

 
 


DocETL: Powering Complex Document Processing Pipelines


DocETL Figure

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define LLM-powered operations on complex data.
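As an illustration of what a declarative pipeline might look like, the sketch below builds a small configuration as a Python dictionary and prints it as JSON, which YAML 1.2 parsers also accept. The key layout (datasets, operations, pipeline steps, output) reflects one reading of DocETL's documented format and should be checked against the official documentation. The dataset name, file paths, prompt, and model name are made up for the example.

```python
import json

# Assumption: the key layout mirrors DocETL's YAML format as described in its
# documentation. Every concrete value (dataset name, paths, prompt, model) is
# hypothetical and only serves the illustration.
pipeline_config = {
    "default_model": "gpt-4o-mini",
    "datasets": {
        "reports": {"type": "file", "path": "reports.json"},
    },
    "operations": [
        {
            "name": "summarize_report",
            "type": "map",
            "prompt": "Summarize this report in one sentence: {{ input.text }}",
            "output": {"schema": {"summary": "string"}},
        }
    ],
    "pipeline": {
        "steps": [
            {
                "name": "summarize",
                "input": "reports",
                "operations": ["summarize_report"],
            }
        ],
        "output": {"type": "file", "path": "summaries.json"},
    },
}


def to_yaml_compatible_text(config: dict) -> str:
    """Serialize the config as JSON, which YAML 1.2 parsers also accept."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(to_yaml_compatible_text(pipeline_config))
```

The point of the declarative style is that the pipeline is data: operations are named once and then referenced by name from the steps that use them.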

When to Use DocETL

DocETL is a good fit when you want to maximize correctness and output quality on complex tasks over a collection of documents or unstructured datasets. Consider using DocETL if:

  • You want to perform semantic processing on a collection of data
  • You have complex tasks that you want to represent via map-reduce
  • You're unsure how to best express your task to maximize LLM accuracy
  • You're working with long documents that don't fit into a single prompt
  • You have validation criteria and want tasks to automatically retry when validation fails
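Three of the ideas in this list (map-reduce, documents too long for one prompt, and retry on failed validation) can be sketched in plain Python. This is a conceptual illustration only, not DocETL's implementation or API: all function names are hypothetical, and `fake_llm_count_words` is a stand-in for a real LLM call that deliberately fails validation on its first attempt.

```python
from typing import Callable


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Split a long document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_with_retry(
    chunk: str,
    operation: Callable[[str, int], dict],
    is_valid: Callable[[dict], bool],
    max_attempts: int = 3,
) -> dict:
    """Run the operation on one chunk, retrying while validation fails."""
    for attempt in range(1, max_attempts + 1):
        result = operation(chunk, attempt)
        if is_valid(result):
            return result
    raise ValueError(f"validation failed after {max_attempts} attempts")


def map_reduce(
    document: str,
    operation: Callable[[str, int], dict],
    is_valid: Callable[[dict], bool],
    reduce_fn: Callable[[list[dict]], int],
    max_words: int = 50,
) -> int:
    """Chunk the document, map each chunk with retries, then reduce."""
    chunks = split_into_chunks(document, max_words)
    mapped = [map_with_retry(chunk, operation, is_valid) for chunk in chunks]
    return reduce_fn(mapped)


def fake_llm_count_words(chunk: str, attempt: int) -> dict:
    """Stand-in for an LLM call; the first attempt returns invalid output."""
    if attempt == 1:
        return {}
    return {"word_count": len(chunk.split())}


if __name__ == "__main__":
    document = "word " * 120
    total = map_reduce(
        document,
        fake_llm_count_words,
        is_valid=lambda result: "word_count" in result,
        reduce_fn=lambda results: sum(r["word_count"] for r in results),
    )
    print(total)  # 120
```

In a real pipeline the map step would call a model and the validation would check the model's structured output; the control flow stays the same.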

Community Projects

Educational Resources

Installation

Prerequisites

  • Python 3.10 or later
  • OpenAI API key
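A quick way to confirm both prerequisites is a small script like the one below. It is a minimal sketch: the function name is hypothetical, and it assumes the key is supplied through the `OPENAI_API_KEY` environment variable, the same name used in the `.env` example later in this document.

```python
import os
import sys


def check_prerequisites(version_info=sys.version_info, environ=os.environ) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # Prerequisite 1: Python 3.10 or later.
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10 or later is required")
    # Prerequisite 2: an OpenAI API key (assumed to live in OPENAI_API_KEY).
    if not environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("\n".join(issues) if issues else "All prerequisites satisfied")
```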

Quick Start

  1. Install from PyPI:
pip install docetl

To see examples of how to use DocETL, check out the tutorial.

Running the UI Locally

DocETL includes a simple UI for building pipelines. We recommend building complex pipelines one operation at a time, so you can inspect the results of each operation and iterate as you go. To run the UI locally, follow these steps:

Playground Screenshot

  1. Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
  2. Install dependencies:
make install      # Install Python package
make install-ui   # Install UI dependencies
  3. Set up environment variables in .env:
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
  4. Start the development server:
make run-ui-dev
  5. Visit http://localhost:3000/playground
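If the UI fails to start, a missing entry in `.env` is a plausible culprit. The sketch below parses a `.env`-style file and reports which of the variables listed in the steps above are absent. It is a deliberately minimal parser written for this illustration (no quoting or `export` handling), not the loader the project itself uses, and the function names are hypothetical.

```python
from pathlib import Path

# Variable names copied from the .env example in the steps above.
EXPECTED_KEYS = [
    "OPENAI_API_KEY",
    "BACKEND_ALLOW_ORIGINS",
    "BACKEND_HOST",
    "BACKEND_PORT",
    "BACKEND_RELOAD",
    "FRONTEND_HOST",
    "FRONTEND_PORT",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str], expected: list[str] = EXPECTED_KEYS) -> list[str]:
    """Return expected keys that do not appear at all (empty values count as present)."""
    return [key for key in expected if key not in values]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_keys(parse_env(env_path.read_text()))
        print("Missing keys: " + ", ".join(missing) if missing else "All expected keys present")
    else:
        print(".env not found in the current directory")
```

Empty values are treated as present because the example above leaves `BACKEND_ALLOW_ORIGINS` blank.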

Development Setup

If you're planning to contribute to or modify DocETL, you can verify your setup by running the test suite:

make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)

For detailed documentation and tutorials, visit our documentation.

