Official implementation of 'SPHINX: A Mixer of Tasks, Domains, and Embeddings Advances Multi-modal Large Language Models'.
Try out our web demo 🚀 here!
- [2024-1-12] We release SPHINX-Tiny built on the compact 1.1B TinyLlama that everyone can play with! 🔥🔥🔥
- [2024-1-5] We release SPHINX-MoE, supercharged with the powerful Mixtral-8x7B backbone! 🔥🔥🔥
- [2023-11-17] We release SPHINX-V2, with the same architecture but enhanced capabilities! 🔥🔥
- [2023-11-09] We release the technical report of SPHINX 🔥.
- [2023-10-17] We release the demo, code, and model of SPHINX 🎉.
We present SPHINX, a multi-modal large language model (MLLM) built on three types of mixing:
- Task Mix. For all-purpose capabilities, we mix a variety of vision-language tasks for mutual improvement: VQA, REC, REG, OCR, DET, POSE, REL DET, T2I, etc.
- Embedding Mix. We capture robust visual representations by fusing distinct visual architectures, pretraining, and granularity.
- Domain Mix. For data from real-world and synthetic domains, we mix the weights of two domain-specific models for complementarity.
On top of SPHINX, we propose to further mix visual scales and sub-images to better capture fine-grained semantics in high-resolution images.
- SPHINX is built upon LLaMA2-Accessory; please follow the instructions here for environment setup.
- Important 🔦: For flexible instantiation of SPHINX models, please install the LLaMA2-Accessory repo into your Python environment:
# go to the root directory of LLaMA2-Accessory
cd LLaMA2-Accessory
# install LLaMA2-Accessory
pip install -e .
After this, you will be able to invoke `import accessory` or `import SPHINX` without the restriction of the working directory.
- For SPHINX-MoE, `megablocks` and `stk` should additionally be installed according to their official guides.
- To enable the segmentation ability shown in our official demo, SAM is also needed:
pip install git+https://github.com/facebookresearch/segment-anything.git
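Once these steps are done, a quick way to confirm the environment is ready is to try the imports from an arbitrary directory (a minimal check; the last import is only needed if you installed SAM):

```python
# minimal environment check; each import corresponds to a setup step above
import accessory          # provided by the editable install of LLaMA2-Accessory
import SPHINX             # importable after the same editable install
import segment_anything   # only needed for the segmentation ability / demo
print("environment looks good")
```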
We release the following checkpoints:
Name | Architecture | Checkpoint |
---|---|---|
SPHINX | llama_ens | Hugging Face / Baidu (extraction code: 46s7) |
SPHINX-1k | llama_ens5 | Hugging Face / Baidu (extraction code: pua9) |
SPHINX-v2-1k | llama_ens5 | Hugging Face / Baidu (extraction code: 88z0) |
SPHINX-MoE | mixtral_sparse_ens | Hugging Face |
SPHINX-MoE-1k | mixtral_sparse_ens5 | Hugging Face |
SPHINX-Tiny | llama_ens_light | Hugging Face |
SPHINX-Tiny-1k | llama_ens5_light | Hugging Face |
Note that SPHINX-1k was previously called Long-SPHINX.
Please download them to your own machine. The file structure should appear as follows:
path/to/checkpoint
├── consolidated.00-of-02.model.pth
├── consolidated.01-of-02.model.pth
├── tokenizer.model
├── config.json
└── meta.json
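For the Hugging Face entries, one convenient way to fetch a checkpoint is via huggingface_hub (a hedged sketch; the repo_id below is a placeholder, so substitute the repository actually linked in the table above):

```python
# hedged sketch: download a checkpoint with huggingface_hub (pip install huggingface_hub)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORGANIZATION/SPHINX-checkpoint",  # placeholder, not a real repo id
    local_dir="path/to/checkpoint",            # will contain the files shown in the tree above
)
```

With the files in place, the example below loads the model and runs single-turn and multi-turn inference without model parallelism (multi-GPU inference follows further below).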
from SPHINX import SPHINXModel
from PIL import Image
import torch
# Besides loading the `consolidated.*.pth` model weights, from_pretrained will also try to
# use `tokenizer.model`, `meta.json`, and `config.json` under `pretrained_path` to configure
# the `tokenizer_path`, `llama_type`, and `llama_config` of the model. You may also override
# these configurations by explicitly specifying the arguments.
model = SPHINXModel.from_pretrained(pretrained_path="path/to/checkpoint", with_visual=True)
image = Image.open("examples/1.jpg")
qas = [["What's in the image?", None]]
response = model.generate_response(qas, image, max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0)
print(response)
# if you want to continue the conversation
qas[-1][-1] = response
qas.append(["Then how does it look like?", None])
response2 = model.generate_response(qas, image, max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0)
print(response2)
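The comment above notes that the probed settings can be overridden by passing them explicitly. Below is a minimal hedged sketch of such an override; the argument names follow that comment, and the values are placeholders (the llama_ens5 value is taken from the checkpoint table, and `llama_config` can be overridden in the same way):

```python
# hedged sketch: override the auto-probed settings explicitly
# (argument names follow the comment in the example above; values are placeholders)
model = SPHINXModel.from_pretrained(
    pretrained_path="path/to/checkpoint",
    llama_type="llama_ens5",                              # architecture name from the checkpoint table
    tokenizer_path="path/to/checkpoint/tokenizer.model",  # explicit tokenizer location
    with_visual=True,
)
```

For multi-GPU inference, initialize torch.distributed and pass an `mp_group` when loading the model, as in the following example: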
from SPHINX import SPHINXModel
from PIL import Image
import torch
import torch.distributed as dist
import multiprocessing as mp
def main(world_size, rank) -> None:
    dist.init_process_group(
        backend="nccl", rank=rank, world_size=world_size,
        init_method="tcp://127.0.0.1:23560",
    )
    torch.cuda.set_device(rank)

    # mp_group tells the model which ranks will work together
    # through model parallel to compose a complete model.
    # When mp_group is None, a single-rank process group will
    # be created and used, which means model parallel size = 1 (not enabled)
    model = SPHINXModel.from_pretrained(
        pretrained_path="path/to/checkpoint", with_visual=True,
        mp_group=dist.new_group(ranks=list(range(world_size)))
    )

    # it is important that ranks within the same model parallel group
    # always receive the same input simultaneously
    image = Image.open("examples/1.jpg")
    qas = [["What's in the image?", None]]
    response = model.generate_response(qas, image, max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0)
    print(response)

if __name__ == "__main__":
    N_GPU = 2
    assert N_GPU in [1, 2, 4, 8]
    if N_GPU == 1:
        main(world_size=1, rank=0)
    else:
        # You can use whatever method, e.g. torchrun, slurm, etc. for distributed launch.
        # Just be sure to initialize torch distributed (by invoking dist.init_process_group)
        # before creating the SPHINX model if model parallel size > 1 is used.
        mp.set_start_method("spawn")
        for rank in range(N_GPU):
            process = mp.Process(target=main, args=(N_GPU, rank))
            process.start()
If torchrun is preferred, an example is provided in inference.py:
torchrun --master_port=1112 --nproc_per_node=2 inference.py
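inference.py in the repository is the reference for this launch mode; the block below is only a hedged sketch of the same pattern, showing how a torchrun-launched script can pick up its rank and world size from the environment variables torchrun sets (model and generation arguments mirror the example above):

```python
# hedged sketch of a torchrun-launched inference script (see inference.py for the real one)
import os

import torch
import torch.distributed as dist
from PIL import Image
from SPHINX import SPHINXModel

# torchrun supplies MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so no init_method is needed
dist.init_process_group(backend="nccl")
world_size, rank = dist.get_world_size(), dist.get_rank()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# all ranks in the model parallel group load the model and receive the same input
model = SPHINXModel.from_pretrained(
    pretrained_path="path/to/checkpoint", with_visual=True,
    mp_group=dist.new_group(ranks=list(range(world_size))),
)
qas = [["What's in the image?", None]]
response = model.generate_response(
    qas, Image.open("examples/1.jpg"),
    max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0,
)
if rank == 0:
    print(response)
```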
For those who want to host a demo like our official one locally, this section provides a step-by-step guide.
- SAM should be installed to enable segmentation.
- If you're already familiar with the LLaMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLaMA2-Accessory.
Execute the following command for demo hosting:
cd LLaMA2-Accessory/accessory
python demos/multi_turn_mm_box.py --n_gpus=2 \
--pretrained_path /path/to/checkpoint/
Explanation of each argument:
- `--n_gpus`: Number of GPUs to use. More GPUs alleviate the memory and computation load on each GPU through model parallelism. 1, 2, 4, and 8 are supported.
- `--pretrained_path`: The path to the pretrained checkpoint.
Note: In the past we required users to manually specify the `llama_type`, `llama_config`, and `tokenizer_path` arguments. However, LLaMA2-Accessory now automatically inspects the files under `pretrained_path` to infer this information. If your program raises an error, please make sure that your `pretrained_path` contains all the files mentioned here.
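If the demo fails to start, a quick hedged check is to print the metadata files that LLaMA2-Accessory probes; this sketch only reads and displays them (file names are those listed in the checkpoint tree above):

```python
# hedged convenience check: confirm the probed metadata files exist and look sane
import json
import os

pretrained_path = "path/to/checkpoint"
for name in ("meta.json", "config.json"):
    full = os.path.join(pretrained_path, name)
    print(name, "->", json.load(open(full)) if os.path.exists(full) else "MISSING")
print("tokenizer.model present:", os.path.exists(os.path.join(pretrained_path, "tokenizer.model")))
```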
Here we show an example of using LLaMA2-Accessory to finetune SPHINX on ImageNet-1k.
We transform the image classification problem into a single-turn conversation, with "Classify the image." as the instruction and "This is a [CLASS]" as the response. We provide the preprocessed training data at 🤗accessory_imagenet_train.json. Note that you still need to prepare the ImageNet-1k images yourself.
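For reference, below is a hedged sketch of how such single-turn records could be generated from your own (image path, class name) pairs. The conversations/from/value schema is an assumption modelled on common LLaVA-style annotation files, so please treat the released accessory_imagenet_train.json as the authoritative format:

```python
# hedged sketch: build single-turn classification conversations for finetuning.
# The record schema below is an assumption (LLaVA-style); check it against the
# released accessory_imagenet_train.json before using it for real training.
import json

samples = [("n01440764/ILSVRC2012_val_00000293.JPEG", "tench")]  # hypothetical (path, class) pairs

records = [
    {
        "image": rel_path,  # resolved against the `root` configured in data_config.yaml
        "conversations": [
            {"from": "human", "value": "Classify the image."},
            {"from": "gpt", "value": f"This is a {cls}"},
        ],
    }
    for rel_path, cls in samples
]

with open("my_imagenet_train.json", "w") as f:
    json.dump(records, f)
```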
Since LLaMA2-Accessory is designed to support joint finetuning on multiple datasets, you additionally need to prepare a `data_config.yaml` file, which specifies the collection of datasets used for finetuning. The following shows the contents of `data_config.yaml`:
META:
- path: 'path/to/accessory_imagenet_train.json'
  type: 'text'
  root: 'path/to/imagenet/images' # optional
  ratio: 1.0 # optional
Since we only use one dataset for this example, the `META` field in `data_config.yaml` contains only one item. For this item, the four keys have the following meanings:
- `path`: specifies the path to the data annotation file.
- `type`: when multiple datasets are used for finetuning, LLaMA2-Accessory guarantees that in each global batch (batch size per GPU * data parallel size * accumulate grad iterations), all data samples come from datasets of the same `type`. For example, when the training set consists of both text-only and image-text datasets, the two kinds of datasets should have different `type` values.
- `root`: optional; when specified, the image paths in the dataset are treated as relative to `root`.
- `ratio`: optional; when specified, the dataset is randomly subsampled by this ratio before training.
If you are interested, please refer to dataset.py for the underlying implementation.
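Before launching finetuning, you can optionally sanity-check the config with a few lines of Python (a minimal sketch, assuming PyYAML is installed):

```python
# minimal sanity check for data_config.yaml (assumes: pip install pyyaml)
import os
import yaml

with open("path/to/data_config.yaml") as f:
    cfg = yaml.safe_load(f)

for item in cfg["META"]:
    assert os.path.exists(item["path"]), f"annotation file not found: {item['path']}"
    root = item.get("root")
    if root is not None:
        assert os.path.isdir(root), f"image root not found: {root}"
    print(item["path"], "| type =", item.get("type"), "| ratio =", item.get("ratio", 1.0))
```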
Suppose you have prepared SPHINX-v2-1k at `/path/to/sphinx-v2-1k` and `data_config.yaml` at `path/to/data_config.yaml`; you can then start finetuning with the following script:
#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH -n 16
#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-task=16
llama_type=llama_ens5 # llama_ens5 for sphinx-v2-1k and sphinx-1k, llama_ens for sphinx
pretrained_path=/path/to/sphinx-v2-1k
pretrained_type=consolidated # SPHINX checkpoints are released in LLaMA2-Accessory's consolidated format
llama_config=/path/to/sphinx-v2-1k/params.json
tokenizer_path=/path/to/sphinx-v2-1k/tokenizer.model
data_config=path/to/data_config.yaml
data_parallel=sdp
model_parallel=2
lr=0.00002 # We recommend 5e-6 for SPHINX-MoE and SPHINX-MoE-1k, and 2e-5 for others
exp_name=finetune/imagenet/sphinx-v2-1k/
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"
srun python -u main_finetune.py \
--output_dir output/"$exp_name" --epochs 1 --warmup_epochs 0.03 \
--batch_size 4 --accum_iter 4 --num_workers 2 \
--max_words 512 \
--lr "$lr" --min_lr 0 --clip_grad 8 --weight_decay 0 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type "$llama_type" --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--data_config "$data_config" --dialog \
--image_transform padded_resize \
2>&1 | tee -a output/"$exp_name"/output.log
echo "exp name: $exp_name"
Note that the working directory for running the script should be `LLaMA2-Accessory/accessory`.