🌟 Build Multimodal Language Agents with Ease 🌟


📖 Introduction

OmAgent is a Python library for building multimodal language agents with ease. We aim to keep the library simple, without the heavy overhead common in other agent frameworks.

  • We wrap the complex engineering (worker orchestration, task queues, node optimization, etc.) behind the scenes and leave you with a super-easy-to-use interface for defining your agent.
  • We provide useful abstractions for reusable agent components, so you can build complex agents by composing these basic components.
  • We also provide the features multimodal agents need, such as native support for VLM models, video processing, and mobile device connection, making it easy for developers and researchers to build agents that reason over not only text but also image, video, and audio inputs.
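
To give a sense of the interface, here is a rough sketch of how a reusable agent component (a "worker") is defined, loosely modeled on the simple VQA example in this repo. The module paths, the registry decorator, and the _run signature are assumptions and may differ from the current omagent_core API.

    # Sketch only: import paths and method names are assumptions based on the examples.
    from omagent_core.engine.worker.base import BaseWorker   # assumed import path
    from omagent_core.utils.registry import registry         # assumed import path

    @registry.register_worker()
    class EchoWorker(BaseWorker):
        """A trivial worker that echoes the user's instruction back."""

        def _run(self, user_instruction: str, *args, **kwargs):
            # A real worker would call a (V)LM here; the framework takes care of
            # orchestration, task queues, and message passing behind the scenes.
            return {"answer": f"You said: {user_instruction}"}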

🔑 Key Features

  • A flexible agent architecture that provides a graph-based workflow orchestration engine and various memory types for contextual reasoning.
  • Native multimodal interaction support, including VLM models, real-time APIs, computer vision models, mobile device connection, and more.
  • A suite of state-of-the-art unimodal and multimodal agent algorithms that go beyond simple LLM reasoning, e.g. ReAct, CoT, SC-CoT, etc.
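
As a conceptual illustration of what graph-based workflow orchestration means (plain standard-library Python, not OmAgent's actual engine): nodes are processing steps, edges are data dependencies, and the engine executes nodes in dependency order.

    # Conceptual sketch, not OmAgent's API: run a tiny two-node workflow in
    # dependency order using the standard library's topological sorter.
    from graphlib import TopologicalSorter

    def caption_image(ctx):
        ctx["caption"] = f"a caption of {ctx['image']}"
        return ctx

    def answer_question(ctx):
        ctx["answer"] = f"an answer to '{ctx['question']}' based on '{ctx['caption']}'"
        return ctx

    steps = {"caption_image": caption_image, "answer_question": answer_question}
    graph = {
        "caption_image": set(),                # no dependencies
        "answer_question": {"caption_image"},  # runs after captioning
    }

    ctx = {"image": "demo.jpg", "question": "What is shown?"}
    for node in TopologicalSorter(graph).static_order():
        ctx = steps[node](ctx)
    print(ctx["answer"])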

🛠️ How To Install

  • python >= 3.10
  • Install omagent_core
    Use pip to install the latest omagent_core release:
    pip install omagent-core
    Or install the latest version from the source code:
    pip install -e omagent-core
  • Set up the Conductor server (Docker Compose). The Docker Compose stack includes conductor-server, Elasticsearch, and Redis.
    cd docker
    docker-compose up -d
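
A quick, optional way to confirm the installation succeeded (this assumes the distribution name omagent-core, as used with pip above):

    # Sanity check: import the package and report the installed version.
    import importlib.metadata

    import omagent_core  # noqa: F401  (fails here if the install did not succeed)

    print("omagent-core version:", importlib.metadata.version("omagent-core"))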

🚀 Quick Start

Configuration

The container.yaml file manages dependencies and settings for the different components of the system. To set up your configuration:

  1. Generate the container.yaml file:

    cd examples/step1_simpleVQA
    python compile_container.py

    This will create a container.yaml file with default settings under examples/step1_simpleVQA.

  2. Configure your LLM settings in configs/llms/gpt.yml:

    • Set your OpenAI API key or compatible endpoint through environment variables or by directly modifying the yml file:
    export custom_openai_key="your_openai_api_key"
    export custom_openai_endpoint="your_openai_endpoint"

    You can also use a locally deployed Ollama to serve your own language model; the tutorial is here. A quick way to sanity-check these settings is sketched at the end of this section.

  3. Update settings in the generated container.yaml:

    • Configure Redis connection settings, including host, port, and credentials, in both the redis_stream_client and redis_stm_client sections.
    • Update the Conductor server URL under the conductor_config section.
    • Adjust any other component settings as needed.

For more information about the container.yaml configuration, please refer to the container module.
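
An OmAgent-independent way to sanity-check the API key and endpoint from step 2 is to call the endpoint directly with the standalone openai client (pip install openai). The model name below is an assumption; use whatever model your endpoint (OpenAI, Ollama, etc.) actually serves.

    # Minimal credentials check using the `openai` client, not OmAgent itself.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["custom_openai_key"],
        base_url=os.environ.get("custom_openai_endpoint") or None,  # None -> default OpenAI endpoint
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; adjust to your deployment
        messages=[{"role": "user", "content": "Say hello in one word."}],
    )
    print(reply.choices[0].message.content)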

Run the demo

  1. Run the simple VQA demo with the webpage GUI:

    For WebpageClient usage, input and output are handled in the webpage.

    cd examples/step1_simpleVQA
    python run_webpage.py

    Open http://127.0.0.1:7860 in your browser to interact with the agent through the web interface.
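
Optionally, you can confirm the demo page is reachable before opening a browser (plain standard library; 7860 is the port given above):

    # Check that the demo webpage at port 7860 responds.
    import urllib.request

    with urllib.request.urlopen("http://127.0.0.1:7860", timeout=5) as resp:
        print("Webpage is up, HTTP status:", resp.status)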

🤖 Example Projects

1. Video QA Agents

Build a system that can answer any question about uploaded videos using video understanding agents. See details here.
More about the video understanding agent can be found in the paper.

2. Mobile Personal Assistant

Build your personal multimodal assistant, just like Google Astra, in 2 minutes. See details here.

3. Agentic Operators

We define reusable agentic workflows, e.g. CoT, ReAct, etc., as agent operators. This project compares various recently proposed reasoning agent operators using the same LLM and test datasets. How do they perform? See details here. A minimal illustration of the prompting styles being compared follows the table below.

Algorithm   LLM            Average   gsm8k-score   gsm8k-cost($)   AQuA-score   AQuA-cost($)
SC-COT      gpt-3.5-turbo  73.69     80.06         5.0227          67.32        0.6491
COT         gpt-3.5-turbo  69.86     78.70         0.6788          61.02        0.0957
ReAct-Pro   gpt-3.5-turbo  69.74     74.91         3.4633          64.57        0.4928
POT         gpt-3.5-turbo  64.42     76.88         0.6902          51.97        0.1557
IO*         gpt-3.5-turbo  38.40     37.83         0.3328          38.98        0.0380

*IO: Input-Output Direct Prompting (Baseline)

More details can be found in our new repo open-agent-leaderboard and Hugging Face space.
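
As a minimal, generic illustration of what the table compares (this is not the benchmark code from open-agent-leaderboard): the IO baseline asks the model for an answer directly, while CoT asks it to reason step by step first, and SC-COT additionally samples several CoT chains and majority-votes the final answer.

    # Generic sketch of the prompting difference between the IO baseline and CoT.
    def build_prompt(question: str, style: str) -> str:
        if style == "io":
            # Input-Output direct prompting: ask for the answer immediately.
            return f"{question}\nAnswer:"
        if style == "cot":
            # Chain-of-Thought: ask the model to reason before answering.
            return f"{question}\nLet's think step by step, then give the final answer."
        raise ValueError(f"unknown prompting style: {style}")

    question = "A farmer has 12 cows and buys 7 more. How many cows does he have now?"
    print(build_prompt(question, "io"))
    print(build_prompt(question, "cot"))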

💻 Documentation

More detailed documentation is available here.

🤝 Contributing

For more information on how to contribute, see here.
We value and appreciate the contributions of our community. Special thanks to our contributors for helping us improve OmAgent.

🔔 Follow us

You can follow us on X, Discord, and our WeChat group for more updates and discussions.

🔗 Related works

If you are intrigued by multimodal large language models and agent technologies, we invite you to delve deeper into our research endeavors:
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 GitHub Repository

🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 GitHub Repository

⭐️ Citation

If you find our repository beneficial, please cite our paper:

@article{zhang2024omagent,
  title={OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer},
  author={Zhang, Lu and Zhao, Tiancheng and Ying, Heting and Ma, Yibo and Lee, Kyusong},
  journal={arXiv preprint arXiv:2406.16620},
  year={2024}
}