'Merge_Models' with new topic_model from outliers #2222

OnAnd0n · 2024-11-22T00:05:33Z

I would like to use 'Merge_Models' in BERTopic to re-cluster the outliers with HDBSCAN and merge them with the existing topics.

However, there are currently some challenges with the Merge_Models functionality:

When merging the Topic_model (including all data, with outliers) and the Out_Topic_model (consisting only of outliers), the 'Count' of the Topic_model for -1 increases by the number of outliers, instead of effectively concatenating them.
The Representative_docs are displayed as NaN.
=> is the only way?

My BERTopic Version is 0.16.3

How can these issues be resolved?

MaartenGr · 2024-11-25T11:04:38Z

> When merging the Topic_model (including all data, with outliers) and the Out_Topic_model (consisting only of outliers), the 'Count' of the Topic_model for -1 increases by the number of outliers, instead of effectively concatenating them.

I have a hard time understanding what exactly you mean here. Could you give an example? Perhaps showcase what is happening and what you would expect to happen?

> The Representative_docs are displayed as NaN.
> => is the only way?

The representative documents are indeed displayed as NaN since merge_models is also meant for federated learning. If you want representative documents re-calculated, I would advise checking the issues page. I believe there are a number of issues that describe in detail how you can do this.
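One way to fill in representative documents after a merge can be sketched as follows. This is an illustration under stated assumptions, not something taken from BERTopic: it assumes the document embeddings and the merged topic assignments are available, and it picks, for each topic, the documents closest to that topic's mean embedding. The helper `representative_docs` and its parameters are hypothetical, and BERTopic's built-in selection may work differently.

```python
import math
from collections import defaultdict


def representative_docs(docs, embeddings, topics, top_n=3):
    """Hypothetical helper: return {topic: docs closest to the topic's mean embedding}."""
    # Group document indices by their assigned topic.
    by_topic = defaultdict(list)
    for idx, topic in enumerate(topics):
        by_topic[topic].append(idx)

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    result = {}
    for topic, indices in by_topic.items():
        dim = len(embeddings[indices[0]])
        # Mean embedding of all documents assigned to this topic.
        centroid = [
            sum(embeddings[i][d] for i in indices) / len(indices) for d in range(dim)
        ]
        # Rank the topic's documents by similarity to the centroid, best first.
        ranked = sorted(
            indices, key=lambda i: cosine(embeddings[i], centroid), reverse=True
        )
        result[topic] = [docs[i] for i in ranked[:top_n]]
    return result
```

The result is a plain dictionary keyed by topic ID, so it can be joined onto the topic overview table by topic number if that is where the NaN values show up.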

bfisseler · 2024-11-27T19:51:02Z

Hi @MaartenGr and @OnAnd0n, let me share my code as an example here, as I am also trying to extract outliers from a topic model, fit a new topic model on the outliers only, and then merge both models. What I have tried so far:

from bertopic import BERTopic

# Assume `topic_model` is a fitted BERTopic model with roughly 50% outliers (topic -1),
# and `docs` / `topics` are the documents and topic assignments from its fit_transform.
# First, select the outlier documents from the initial topic model.
outlier_docs = [doc for doc, topic in zip(docs, topics) if topic == -1]

# Fit a new topic model on the outlier docs only.
outlier_topic_model = BERTopic()
outlier_topics, outlier_probs = outlier_topic_model.fit_transform(outlier_docs)

# Merge the two models.
merged_model = BERTopic.merge_models([topic_model, outlier_topic_model])

Is this a viable approach? And why do I not get, for example, the topic frequencies for the merged model? Calling merged_model.get_topic_freq() does not work.

MaartenGr · 2024-11-29T06:55:15Z

> Is this a viable approach? And why do I not get, for example, the topic frequencies for the merged model? Calling merged_model.get_topic_freq() does not work.

Could you share a bit more information? Which version of BERTopic are you using? Can you share the output? You mention that it does not work, but what exactly does that mean? Do you get an error or perhaps unexpected results?
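For reference, the outcome both posters appear to be after (outliers re-clustered and appended as new topics, with counts that add up) can be sketched on plain topic lists without `merge_models`. This is an illustration under stated assumptions, not the library's merge logic: `combine_assignments` is a hypothetical helper, `topics` holds the original assignments for all documents, and `outlier_topics` holds the assignments from a second model fitted only on the documents that were in topic -1, in the same order. The toy values are made up for the example.

```python
from collections import Counter


def combine_assignments(topics, outlier_topics):
    """Hypothetical helper: replace each -1 in `topics` with a shifted outlier-model topic.

    Documents that the outlier model also labels -1 stay outliers.
    """
    n_outliers = sum(1 for t in topics if t == -1)
    if n_outliers != len(outlier_topics):
        raise ValueError("outlier_topics must have one entry per -1 in topics")

    offset = max(topics) + 1  # first free topic id after the original topics
    remaining = iter(outlier_topics)
    combined = []
    for topic in topics:
        if topic == -1:
            new_topic = next(remaining)
            combined.append(-1 if new_topic == -1 else new_topic + offset)
        else:
            combined.append(topic)
    return combined


topics = [0, -1, 1, -1, -1, 0]  # toy assignments from the first model
outlier_topics = [0, -1, 0]     # toy assignments from the outlier-only model
combined = combine_assignments(topics, outlier_topics)
counts = Counter(combined)      # per-topic document counts after combining
```

The combined list could then be passed to `topic_model.update_topics(docs, topics=combined)` to recompute the topic representations on the first model. Whether that fits a given pipeline is worth checking against the BERTopic documentation, since it bypasses `merge_models` entirely.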
