'Merge_Models' with new topic_model from outliers #2222

Open
OnAnd0n opened this issue Nov 22, 2024 · 3 comments

Comments

@OnAnd0n

OnAnd0n commented Nov 22, 2024

I would like to use 'merge_models' in BERTopic to re-cluster the outliers with HDBSCAN and merge them with the existing topics.

However, I am running into some problems with the merge_models functionality:

  1. When merging the Topic_model (fitted on all data, including outliers) and the Out_Topic_model (fitted only on the outliers), the 'Count' for topic -1 in the Topic_model increases by the number of outliers, instead of the two models being effectively concatenated.

  2. The Representative_docs are displayed as NaN.
    => Is this the only way?

My BERTopic version is 0.16.3.

How can these issues be resolved?

@MaartenGr
Owner

> When merging the Topic_model (fitted on all data, including outliers) and the Out_Topic_model (fitted only on the outliers), the 'Count' for topic -1 in the Topic_model increases by the number of outliers, instead of the two models being effectively concatenated.

I have a hard time understanding exactly what you mean here. Could you give an example? Perhaps show what is happening and what you would expect to happen instead?

> The Representative_docs are displayed as NaN.
> => Is this the only way?

The representative documents are indeed displayed as NaN since merge_models is also meant for federated learning. If you want representative documents re-calculated, I would advise checking the issues page. I believe there are a number of issues that describe in detail how you can do this.
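As a rough illustration of what re-calculating representative documents could look like, here is a minimal sketch. It is not BERTopic's own selection method (which is based on c-TF-IDF similarity); it simply keeps, per topic, the documents with the highest score. The helper name `pick_representative_docs` and the per-document `scores` list are assumptions made for this example; the scores could, for instance, be assignment probabilities obtained by running the merged model over the documents.

```python
from collections import defaultdict


def pick_representative_docs(docs, topics, scores, n=3):
    """Return up to `n` highest-scoring documents per topic.

    Hypothetical helper: `scores` is assumed to hold one confidence
    value per document (for example an assignment probability).
    """
    by_topic = defaultdict(list)
    for doc, topic, score in zip(docs, topics, scores):
        by_topic[topic].append((score, doc))

    representative = {}
    for topic, pairs in by_topic.items():
        # Sort by score only, highest first, and keep the top n documents.
        ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
        representative[topic] = [doc for _, doc in ranked[:n]]
    return representative


example = pick_representative_docs(
    docs=["a", "b", "c", "d"],
    topics=[0, 0, 1, 0],
    scores=[0.2, 0.9, 0.5, 0.4],
    n=2,
)
print(example)  # {0: ['b', 'd'], 1: ['c']}
```

The result is a plain dictionary keyed by topic id, so it can be inspected or joined onto the topic overview separately from the merged model.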

@bfisseler

bfisseler commented Nov 27, 2024

Hi @MaartenGr and @OnAnd0n, let me share my code as an example here, as I am also trying to extract the outliers from a topic model, fit a new topic model on the outliers only, and then merge both models. What I have tried so far:

from bertopic import BERTopic

# Assume a fitted topic model `topic_model` with roughly 50% outliers (topic -1),
# where `docs` and `topics` are the documents and topic assignments from that fit.
# First, keep only the documents that were assigned to the outlier topic.
outlier_docs = [doc for doc, topic in zip(docs, topics) if topic == -1]

# Fit a new topic model on the outlier documents only.
outlier_topic_model = BERTopic()
outlier_topics, outlier_probs = outlier_topic_model.fit_transform(outlier_docs)

# Merge the two models.
merged_model = BERTopic.merge_models([topic_model, outlier_topic_model])

Is this a viable approach? And why can I not get, e.g., the topic frequencies for the merged model? Calling merged_model.get_topic_freq() does not work.
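For comparison, the "concatenate" behaviour asked about in this thread can be sketched at the level of topic assignments, without relying on merge_models at all. This is only a sketch under assumptions: the helper name `combine_assignments` is hypothetical, the labels of the outlier model are shifted past the largest existing topic id so they do not collide, and `outlier_topics` is assumed to be in the same order as the outlier documents in `topics`.

```python
from collections import Counter


def combine_assignments(topics, outlier_topics, outlier_label=-1):
    """Replace each outlier in `topics` with its label from the second model.

    Hypothetical helper: labels from the outlier model are offset so that
    they follow the existing topic ids; documents that are still outliers
    in the second model keep `outlier_label`.
    """
    n_outliers = sum(1 for topic in topics if topic == outlier_label)
    if n_outliers != len(outlier_topics):
        raise ValueError("outlier_topics must have one entry per outlier in topics")

    offset = max(topics, default=outlier_label) + 1
    remaining = iter(outlier_topics)
    combined = []
    for topic in topics:
        if topic != outlier_label:
            combined.append(topic)
            continue
        new_topic = next(remaining)
        combined.append(new_topic if new_topic == outlier_label else new_topic + offset)
    return combined


topics = [0, -1, 1, -1, -1, 0]
outlier_topics = [0, -1, 1]
combined = combine_assignments(topics, outlier_topics)
print(combined)           # [0, 2, 1, -1, 3, 0]
print(Counter(combined))  # per-topic document counts after combining
```

The combined list has one label per original document, so per-topic counts follow directly from it. It may also be possible to pass such a list to the original model through the `topics` argument of `update_topics`, but whether that fits a given workflow is something to verify against the BERTopic documentation.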

@MaartenGr
Owner

> Is this a viable approach? And why can I not get, e.g., the topic frequencies for the merged model? Calling merged_model.get_topic_freq() does not work.

Could you share a bit more information? Which version of BERTopic are you using? Can you share the output? You mention that it does not work, but what exactly does that mean? Do you get an error or perhaps unexpected results?
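One way to collect the version details asked for here is sketched below. The helper name `describe_environment` and the default package list are assumptions for illustration; only the standard library is used, so it runs even when a package is missing.

```python
import platform
from importlib import metadata


def describe_environment(packages=("bertopic", "hdbscan", "umap-learn")):
    """Return the Python version and the installed version of each package.

    Hypothetical helper: the default package names are an assumption about
    which dependencies are relevant to the report.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for name, version in describe_environment().items():
    print(f"{name}: {version}")
```

Pasting this output together with the full traceback, or the unexpected result of the failing call, usually makes a report much easier to reproduce.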
