My research focuses on perception in spatial (3D) and temporal (video) contexts, a cornerstone of embodied agents and digital assistants.
Although I am a so-called "perception" person, my ambition is to unify generative modeling (LLMs, diffusion models, NeRF, etc.) with perception tasks.
My goal is to leverage the generative capabilities of these models to unlock better scaling, richer interactivity with humans,
and the self-improvement and self-exploration of perception models.
Text-to-image diffusion models can ingest additional "graph" conditions describing the relationships among entities, supporting more nuanced generation for recommendation systems, virtual art, etc.
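To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of graph conditioning: scene-graph triplets (subject, relation, object) are embedded into extra conditioning tokens that a diffusion model could cross-attend to alongside its text tokens. The class name, vocabulary sizes, and dimensions are illustrative assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class GraphConditionEncoder(nn.Module):
    """Hypothetical: turn (subject, relation, object) triplets into tokens."""

    def __init__(self, num_entities, num_relations, dim=768):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.relation_emb = nn.Embedding(num_relations, dim)
        # Fuse each triplet into a single conditioning token.
        self.triplet_mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, triplets):
        # triplets: (num_triplets, 3) integer tensor of (subj, rel, obj) ids
        subj = self.entity_emb(triplets[:, 0])
        rel = self.relation_emb(triplets[:, 1])
        obj = self.entity_emb(triplets[:, 2])
        return self.triplet_mlp(torch.cat([subj, rel, obj], dim=-1))

# Usage sketch: append graph tokens to the text tokens that the diffusion
# model's cross-attention layers already consume.
encoder = GraphConditionEncoder(num_entities=100, num_relations=20)
triplets = torch.tensor([[3, 5, 7], [7, 2, 1]])  # e.g. "cat on table", ...
graph_tokens = encoder(triplets)                  # (2, 768)
text_tokens = torch.randn(77, 768)                # placeholder text encoder output
cond = torch.cat([text_tokens, graph_tokens], dim=0)
```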
Simply bounding the size of memory banks improves video object segmentation (VOS) on challenging state changes and long videos, highlighting the importance of selecting relevant information from long contexts.
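As an illustration of the principle (not the paper's exact eviction policy), a bounded memory bank can be sketched as a fixed-capacity key-value store with oldest-first eviction and attention-style readout; `BoundedMemoryBank` and its capacity are hypothetical placeholders.

```python
from collections import deque
import torch

class BoundedMemoryBank:
    """Fixed-size VOS memory: old frames are evicted, keeping context bounded."""

    def __init__(self, capacity=8):
        self.keys = deque(maxlen=capacity)    # maxlen drops the oldest entry
        self.values = deque(maxlen=capacity)

    def write(self, key, value):
        # key, value: (M, C) per-frame feature tokens
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        # query: (N, C); attend over all stored tokens and read out values.
        keys = torch.cat(list(self.keys), dim=0)      # (M_total, C)
        values = torch.cat(list(self.values), dim=0)  # (M_total, C)
        attn = torch.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return attn @ values                          # (N, C)

# Usage sketch: only the most recent `capacity` frames survive.
bank = BoundedMemoryBank(capacity=4)
for _ in range(10):
    feat = torch.randn(100, 64)
    bank.write(feat, feat)
readout = bank.read(torch.randn(100, 64))  # (100, 64)
```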
Frozen Transformers from Language Models are Effective Visual Encoder Layers Ziqi Pang,
Ziyang Xie*,
Yunze Man*,
Yu-Xiong Wang
ICLR, 2024 (Spotlight)
Code / arXiv
Frozen transformers from language models, though trained solely on textual data, can effectively improve diverse visual tasks by directly encoding visual tokens.
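Below is a minimal sketch of this recipe: a frozen transformer block from a pretrained LM is inserted after a visual backbone, with trainable linear layers to match feature dimensions. `FrozenLMLayer` and the stand-in `nn.TransformerEncoderLayer` are illustrative placeholders, not the released implementation (a real LLaMA block, for instance, returns a tuple and would need unwrapping).

```python
import torch
import torch.nn as nn

class FrozenLMLayer(nn.Module):
    """Wrap a frozen LM transformer block so it can encode visual tokens."""

    def __init__(self, llm_block, visual_dim, llm_dim):
        super().__init__()
        self.proj_in = nn.Linear(visual_dim, llm_dim)   # trainable
        self.proj_out = nn.Linear(llm_dim, visual_dim)  # trainable
        self.llm_block = llm_block                      # kept frozen
        for p in self.llm_block.parameters():
            p.requires_grad = False

    def forward(self, visual_tokens):
        # visual_tokens: (B, N, visual_dim), e.g. patch tokens from a ViT
        x = self.proj_in(visual_tokens)
        x = self.llm_block(x)
        return self.proj_out(x)

# Usage sketch with a generic transformer layer standing in for an LM block.
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
layer = FrozenLMLayer(block, visual_dim=768, llm_dim=512)
tokens = torch.randn(2, 196, 768)
out = layer(tokens)  # (2, 196, 768)
```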