Why BIKE has fewer parameters than the original CLIP ViT-L/14 #4
The reason why the parameters of BIKE are smaller than the original CLIP ViT-L/14 is that in the BIKE model, we only utilize the vision encoder from CLIP and do not include the parameters of CLIP's text encoder.
Originally posted by @whwu95 in #3 (comment)
In fact, the visual encoder alone, excluding the text encoder, has 303M parameters for ViT-L/16.
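For anyone who wants to sanity-check the vision-only count, here is a rough back-of-the-envelope sketch. It assumes a CLIP-style ViT-L/14 vision tower (width 1024, 24 blocks, 224x224 input, bias-free patch convolution, 768-dimensional output projection). The function name `vit_visual_params` and its defaults are my own illustration, not code from BIKE or CLIP, so the exact total may differ slightly from what either repository reports.

```python
def vit_visual_params(width=1024, layers=24, patch=14, image=224,
                      mlp_ratio=4, output_dim=768):
    """Rough parameter count for a CLIP-style ViT vision tower.

    All defaults are assumptions about the ViT-L/14 layout, not values
    read from the BIKE code base.
    """
    tokens = (image // patch) ** 2 + 1           # patch tokens plus one class token
    patch_embed = 3 * patch * patch * width      # conv stem, assumed to have no bias
    embeddings = width + tokens * width          # class token + positional embedding
    attn = 4 * width * width + 4 * width         # q, k, v, out projections with biases
    mlp = 2 * mlp_ratio * width * width + mlp_ratio * width + width
    norms = 2 * 2 * width                        # two LayerNorms per block
    block = attn + mlp + norms
    final = 2 * 2 * width + width * output_dim   # pre/post LayerNorm + output projection
    return patch_embed + embeddings + layers * block + final


if __name__ == "__main__":
    total = vit_visual_params()
    print(f"vision tower: {total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands at roughly 0.3 billion parameters, in the same range as the figure quoted above, and no text-encoder weights enter the sum.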