How can I get kmeans clustered features? #22
Comments
Clustering is a separate step. You need to use the code in the fairseq framework to do that, just as you did above.

"I'm wondering if this is because your model was trained for n=100, so even though the output is continuous features, it only presents the best performance at n=100?"
I see! Thank you very much!
That's the number of clusters used for the teacher labels.
Thank you! |
@Remaxic Hi. Have you obtained good clustering results? Could you share your script?
@huangf79 Hi. I extracted features from my dataset with the contentvec model and trained k-means models with k=50 and k=100 by calling the learn_kmeans.py and dump_km_label.py scripts in the fairseq framework. I found that the k=50 model performs nowhere near as well as the k=100 one, and does not even meet the basic needs of my downstream task.

I read the HuBERT paper hoping to find a particular method for training a better clustering model, but there doesn't seem to be one: the authors did not perform dimensionality reduction or any other special operations, except that their dataset (100 h) is much larger than mine (about 44 h). Since the model performs well at k=100, I suspect this has to do with contentvec's feature-extraction behaviour: perhaps it is not suited to small-codebook tasks, or perhaps a better discretisation approach is needed. If you have a better clustering idea and would like to let me know, I would be very grateful!
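For what it's worth, the clustering step described above can be sketched in a few lines, assuming the contentvec features have already been dumped to a NumPy array. The function names here are illustrative, not fairseq's; to my knowledge the fairseq simple_kmeans scripts are built on sklearn's MiniBatchKMeans in essentially this way:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_km(features: np.ndarray, n_clusters: int, seed: int = 0) -> MiniBatchKMeans:
    """Fit a k-means codebook on frame-level features of shape (T, D)."""
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10000,
                         n_init=20, random_state=seed)
    km.fit(features)
    return km

def dump_labels(km: MiniBatchKMeans, features: np.ndarray) -> np.ndarray:
    """Assign each frame to its nearest centroid, i.e. its discrete unit."""
    return km.predict(features)

# Toy example: 1000 frames of 768-dim features (random stand-in data).
feats = np.random.RandomState(0).randn(1000, 768).astype(np.float32)
km50 = train_km(feats, n_clusters=50)
labels = dump_labels(km50, feats)  # one integer unit per frame
```

The resulting integer sequence is what dump_km_label.py writes out; the continuous features themselves are never altered by clustering.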
Hi,
I called the checkpoint_best_legacy_100.pt model using the inference code in the fairseq framework, and I found that the generated features were unclustered. Your paper says that clustering the output is optional, so I would like to know how I can choose to output the clustered features.
Meanwhile, I have clustered the output using learn_kmeans.py and dump_km_label.py in the fairseq framework. I chose n=50 and then decoded it with a trained decoder, and found the results to be very poor. I'm wondering whether this is because your model was trained with n=100, so even though the output is continuous features, it only performs best at n=100?
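One quick sanity check on the k=50 vs k=100 gap is to compare per-frame quantization error directly: a larger codebook fits the feature space more tightly, so k=50 necessarily discards more information. A sketch with synthetic features standing in for the dumped contentvec outputs (the array here is random and purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
feats = rng.randn(2000, 64).astype(np.float32)  # stand-in for frame features

errors = {}
for k in (50, 100):
    km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(feats)
    # inertia_ = total squared distance of frames to their assigned centroids
    errors[k] = km.inertia_ / len(feats)
```

A lower quantization error is necessary but not sufficient for good downstream decoding, so this only shows how much information the smaller codebook loses; it does not by itself explain poor decoder output.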