Abstract: The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results
Feedback