How can I get kmeans clustered features? #22
Comments
Clustering is a separate step. You need to use the code in the fairseq framework to do that, just as you did above.

"I'm wondering if this is because your model was trained for n=100, so even though the output is continuous features, it only presents the best performance at n=100?"
I see! Thank you very much!
That's the number of clusters used for the teacher labels.
Thank you! |
@Remaxic Hi. Have you obtained good clustering results? Could you share your script?
@huangf79 Hi. I extracted features from my dataset with the contentvec model and trained k-means models with k=50 and k=100 by calling the learn_kmeans.py and dump_km_label.py scripts in the fairseq framework. I found that the k=50 model performs nowhere near as well as the k=100 one, and does not even meet the basic needs of my downstream task.

I read the HuBERT paper hoping to find a particular method for training a better clustering model, but there doesn't seem to be one: the authors did not perform dimensionality reduction or any other special operations, except that their dataset (100 h) is much larger than mine (about 44 h). Since the model performs well at k=100, I suspect this has to do with contentvec's feature-extraction behaviour: perhaps it is not suited to small-codebook tasks, or perhaps a better discretisation approach is needed. If you have a better clustering idea and would like to let me know, I would be very grateful!
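For what it's worth, the clustering step described above can be sketched in a few lines, assuming the contentvec features have already been dumped to a NumPy array. The function names here are illustrative, not fairseq's; to my knowledge the fairseq simple_kmeans scripts are built on sklearn's MiniBatchKMeans in essentially this way:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_km(features: np.ndarray, n_clusters: int, seed: int = 0) -> MiniBatchKMeans:
    """Fit a k-means codebook on frame-level features of shape (T, D)."""
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10000,
                         n_init=20, random_state=seed)
    km.fit(features)
    return km

def dump_labels(km: MiniBatchKMeans, features: np.ndarray) -> np.ndarray:
    """Assign each frame to its nearest centroid, i.e. its discrete unit."""
    return km.predict(features)

# Toy example: 1000 frames of 768-dim features (random stand-in data).
feats = np.random.RandomState(0).randn(1000, 768).astype(np.float32)
km50 = train_km(feats, n_clusters=50)
labels = dump_labels(km50, feats)  # one integer unit per frame
```

The resulting integer sequence is what dump_km_label.py writes out; the continuous features themselves are never altered by clustering.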
Hi,
I called the checkpoint_best_legacy_100.pt model using the inference code in the fairseq framework, and I found that the generated features were unclustered. Your paper says that clustering the output is optional, so I would like to know how I can choose to output the clustered features.
Meanwhile, I have clustered the output using learn_kmeans.py and dump_km_label.py in the fairseq framework. I chose n=50 and then decoded it with a trained decoder, and found the results to be very poor. I'm wondering whether this is because your model was trained with n=100, so even though the output is continuous features, it only performs best at n=100?
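One quick sanity check on the k=50 vs k=100 gap is to compare per-frame quantization error directly: a larger codebook fits the feature space more tightly, so k=50 necessarily discards more information. A sketch with synthetic features standing in for the dumped contentvec outputs (the array here is random and purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
feats = rng.randn(2000, 64).astype(np.float32)  # stand-in for frame features

errors = {}
for k in (50, 100):
    km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(feats)
    # inertia_ = total squared distance of frames to their assigned centroids
    errors[k] = km.inertia_ / len(feats)
```

A lower quantization error is necessary but not sufficient for good downstream decoding, so this only shows how much information the smaller codebook loses; it does not by itself explain poor decoder output.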