CLUSTERLLM: Large Language Models as a Guide for Text Clustering

This is the official PyTorch implementation of the paper CLUSTERLLM: Large Language Models as a Guide for Text Clustering (EMNLP 2023).

Install

pip install -r requirements.txt

Datasets

Download the zip file here and unzip it.

Steps to run perspective experiments

1. Original embeddings

cd perspective/2_finetune
bash scripts/get_embedding.sh

The embeddings are written to each dataset's folder. The script also saves the clustering measures. See the bash script for detailed instructions. E5 embeddings are produced with scripts/get_embedding_e5.sh.
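The source does not say which clustering measures are saved. As an illustration only, the sketch below assumes normalized mutual information (NMI), a common choice for comparing predicted clusters against gold labels. The function name and its arithmetic-mean normalization are assumptions, not taken from the repository.

```python
import math
from collections import Counter


def normalized_mutual_info(labels_true, labels_pred):
    """NMI between two labelings, normalized by the mean of the two entropies."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(T; P) = sum over joint cells of p(t, p) * log(p(t, p) / (p(t) * p(p)))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    h_true, h_pred = entropy(count_true), entropy(count_pred)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both labelings put everything in one cluster
    return mutual_info / ((h_true + h_pred) / 2)
```

Because NMI ignores the names of the cluster ids, a prediction that merely permutes the gold labels still scores 1.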

2. Sample triplets

cd perspective/1_predict_triplet
bash scripts/triplet_sampling.sh

Sampled triplets will be written to perspective/1_predict_triplet/sampled_triplet_results. See the bash script for detailed instructions.
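This README does not describe how scripts/triplet_sampling.sh picks triplets. As a rough illustration of what a triplet looks like, the sketch below assumes a simplified strategy: each anchor is paired with one candidate from each of its two nearest clusters. The function name, its arguments, and the strategy itself are hypothetical and may differ from the repository's sampler.

```python
import math
import random


def sample_triplets(embeddings, labels, centroids, num_triplets, seed=0):
    """Return (anchor, choice_1, choice_2) index triplets.

    embeddings: list of vectors; labels: cluster id per vector;
    centroids: dict mapping cluster id -> centroid vector (all hypothetical inputs).
    """
    rng = random.Random(seed)
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    triplets = []
    for anchor in rng.sample(range(len(embeddings)), num_triplets):
        # Rank clusters by the distance from the anchor to each centroid.
        ranked = sorted(
            centroids, key=lambda c: math.dist(embeddings[anchor], centroids[c])
        )
        # One candidate pool per nearest cluster, never containing the anchor.
        pools = [[i for i in members.get(c, []) if i != anchor] for c in ranked[:2]]
        if len(pools) < 2 or not all(pools):
            continue  # not enough candidates for this anchor
        choices = [rng.choice(pool) for pool in pools]
        rng.shuffle(choices)  # avoid always listing the nearest cluster first
        triplets.append((anchor, choices[0], choices[1]))
    return triplets
```

Shuffling the two choices keeps their position from revealing which cluster was nearer to the anchor.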

3. Predict triplets

First, replace the OpenAI keys in perspective/1_predict_triplet/scripts/predict_triplet.sh.

cd perspective/1_predict_triplet
bash scripts/predict_triplet.sh

Predicted triplets will be written to perspective/1_predict_triplet/predicted_triplet_results. See the bash script for detailed instructions.
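To make the prediction step concrete, here is a minimal sketch of how a triplet could be turned into a prompt and how a model's reply could be mapped back to a choice. The prompt wording, the function names, and the "Choice 1" / "Choice 2" answer format are assumptions for illustration; the prompts actually used are defined in the repository's scripts. No API call is made here.

```python
def build_triplet_prompt(instruction, query, choice_1, choice_2):
    """Format one triplet as a text prompt (hypothetical wording)."""
    return (
        f"{instruction}\n\n"
        f"Query: {query}\n"
        f"Choice 1: {choice_1}\n"
        f"Choice 2: {choice_2}\n\n"
        "Answer with 'Choice 1' or 'Choice 2'."
    )


def parse_triplet_answer(reply):
    """Return 1 or 2, or None when the reply is missing or ambiguous."""
    lowered = reply.lower()
    has_1 = "choice 1" in lowered
    has_2 = "choice 2" in lowered
    if has_1 and not has_2:
        return 1
    if has_2 and not has_1:
        return 2
    return None
```

Returning None for ambiguous replies lets a later step drop those triplets instead of guessing.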

4. Convert triplets

This step only converts the format.

cd perspective/2_finetune
bash scripts/convert_triplet.sh
bash scripts/convert_triplet_self.sh

Converted triplets will be written to perspective/2_finetune/converted_triplet_results. See the bash script for detailed instructions.
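The on-disk formats are not documented in this README. As an illustration of what a format-only conversion can look like, the sketch below assumes each predicted record holds a query, two choices, and a prediction of 1 or 2, and rewrites it as a query with one positive and one negative. All field names are hypothetical.

```python
def convert_triplets(predicted):
    """Turn prediction records into (query, pos, neg) training records."""
    converted = []
    for record in predicted:
        prediction = record.get("prediction")
        if prediction == 1:
            pos, neg = record["choice_1"], record["choice_2"]
        elif prediction == 2:
            pos, neg = record["choice_2"], record["choice_1"]
        else:
            continue  # skip records without a usable prediction
        converted.append({"query": record["query"], "pos": pos, "neg": neg})
    return converted
```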

5. Finetune

cd perspective/2_finetune
bash scripts/finetune.sh

The finetuned model will be saved in perspective/2_finetune/checkpoints. See the bash script for detailed instructions.
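The training objective is set in the finetuning code and is not spelled out here. To show the general idea of learning from triplets, the sketch below assumes a softmax-style contrastive loss over cosine similarities, which pulls the query toward the positive and away from the negative. The function names and the temperature value are illustrative assumptions, and plain Python lists stand in for embedding tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def triplet_contrastive_loss(query, pos, neg, temperature=0.1):
    """-log( exp(s_pos) / (exp(s_pos) + exp(s_neg)) ) with scaled cosine scores."""
    s_pos = cosine(query, pos) / temperature
    s_neg = cosine(query, neg) / temperature
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(s_pos, s_neg)
    log_denominator = top + math.log(math.exp(s_pos - top) + math.exp(s_neg - top))
    return log_denominator - s_pos
```

The loss approaches 0 when the positive is much more similar to the query than the negative, and equals log 2 when the two are equally similar.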

6. Embeddings from the finetuned model

cd perspective/2_finetune
bash scripts/get_embedding.sh

This time, point the script at the finetuned checkpoints. Clustering measures will be saved in the checkpoint folder.

Steps to run granularity experiments

1. Sample pairs

cd granularity
bash scripts/sample_pairs.sh

Sampled pairs will be saved in sampled_pair_results.

[optional] Sample pairs for prompt

4 pairs will be sampled as in-context examples.

cd granularity
bash scripts/sample_pairs_for_prompt.sh

2. Predict pairs

First, replace the OpenAI keys in granularity/scripts/predict_pairs.sh.

cd granularity
bash scripts/predict_pairs.sh

Predicted pairs will be written to granularity/predicted_pair_results. Also set prompt_file to the sampled prompt file.
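As a sketch of how sampled pairs could serve as in-context examples, the function below places labeled demonstration pairs ahead of the pair to be judged. The wording, the "Yes" / "No" answer format, and the function name are assumptions for illustration; the actual prompt comes from the repository's prompt file.

```python
def build_pair_prompt(examples, sentence_a, sentence_b):
    """Build a few-shot prompt.

    examples: list of (text_a, text_b, same_cluster) demonstrations,
    where same_cluster is a bool (hypothetical structure).
    """
    lines = ["Decide whether the two sentences belong to the same category.", ""]
    for text_a, text_b, same_cluster in examples:
        lines += [
            f"Sentence 1: {text_a}",
            f"Sentence 2: {text_b}",
            f"Answer: {'Yes' if same_cluster else 'No'}",
            "",
        ]
    # The final pair is left unanswered for the model to complete.
    lines += [f"Sentence 1: {sentence_a}", f"Sentence 2: {sentence_b}", "Answer:"]
    return "\n".join(lines)
```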

3. Predict cluster num

cd granularity
bash scripts/predict_num_clusters.sh

See the bash script for detailed instructions.
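One way to turn pair predictions into a cluster count is to score several candidate clusterings by how well they agree with the predicted same/different labels and keep the best one. The sketch below assumes an F-beta agreement score and candidate labelings supplied by the caller; the function names, inputs, and scoring choice are illustrative assumptions rather than a description of scripts/predict_num_clusters.sh.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from true positives, false positives, and false negatives."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def choose_num_clusters(candidate_labelings, pairs, predicted_same, beta=1.0):
    """Pick the cluster count whose labeling best matches the pair predictions.

    candidate_labelings: dict mapping k -> list of cluster ids per item;
    pairs: list of (i, j) item indices; predicted_same: list of bools.
    """
    best_k, best_score = None, -1.0
    for k in sorted(candidate_labelings):
        labels = candidate_labelings[k]
        tp = fp = fn = 0
        for (i, j), same in zip(pairs, predicted_same):
            together = labels[i] == labels[j]
            if together and same:
                tp += 1
            elif together:
                fp += 1  # clustered together, but predicted different
            elif same:
                fn += 1  # split apart, but predicted same
        score = f_beta(tp, fp, fn, beta)
        if score > best_score:  # ties keep the smaller k
            best_k, best_score = k, score
    return best_k, best_score
```

A small usage example with four items and three candidate cluster counts:

```python
labelings = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 4: [0, 1, 2, 3]}
pairs = [(0, 1), (2, 3), (0, 2), (1, 3)]
predicted_same = [True, True, False, False]
print(choose_num_clusters(labelings, pairs, predicted_same))  # (2, 1.0)
```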

Citation

@misc{zhang2023clusterllm,
      title={ClusterLLM: Large Language Models as a Guide for Text Clustering}, 
      author={Yuwei Zhang and Zihan Wang and Jingbo Shang},
      year={2023},
      eprint={2305.14871},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Thanks

Some of the code was adapted from:

https://github.com/xlang-ai/instructor-embedding

Contact

Yuwei Zhang yuz163@ucsd.edu
