TY - GEN
T1 - CLIPAG: Towards Generator-Free Text-to-Image Generation
T2 - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
AU - Ganz, Roy
AU - Elad, Michael
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/1/3
Y1 - 2024/1/3
N2 - Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in robust image classification models, wherein their input gradients align with human perception and possess semantic meaning. While this phenomenon has gained significant research attention, it has been studied solely in the context of unimodal vision-only architectures. In this work, we extend the study of PAG to Vision-Language architectures, which form the foundations for diverse image-text tasks and applications. Through an adversarial robustification finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit PAG, in contrast to their vanilla counterparts. This work reveals the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we show that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to substantial improvements in vision-language generative applications. Furthermore, leveraging its PAG property, CLIPAG enables text-to-image generation without any generative model, a task that typically requires huge generators.
KW - Adversarial learning
KW - Algorithms
KW - Vision + language and/or other modalities
KW - adversarial attack and defense methods
UR - http://www.scopus.com/inward/record.url?scp=85191946003&partnerID=8YFLogxK
U2 - 10.1109/WACV57701.2024.00380
DO - 10.1109/WACV57701.2024.00380
M3 - Conference contribution
AN - SCOPUS:85191946003
T3 - Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
SP - 3831
EP - 3841
BT - Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
Y2 - 4 January 2024 through 8 January 2024
ER -