My research concentrates on perception in spatial (3D) and temporal (video) contexts, which is a cornerstone of embodied agents and real-world visual assistants.
Although I am a so-called "perception" person, my ambition lies in unifying generative modeling (LLMs, diffusion models, NeRFs, etc.) with perception tasks.
My goal is to unlock better scaling, better interactivity with humans, and the self-improvement and self-exploration of perception models by enabling them to generate distributions.
Text-to-image diffusion models can digest additional "graph" conditions describing the relationships among entities, supporting more nuanced generation for recommendation systems, virtual art, etc.
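For intuition, here is a hypothetical sketch of how such a graph condition could enter a diffusion model: encode (subject, relation, object) triplets into extra context tokens that the denoiser's cross-attention attends to alongside the usual text tokens. All class names and dimensions below are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: turn scene-graph triplets (subject, relation, object)
# into conditioning tokens with the same width as the text tokens, so a
# diffusion U-Net's cross-attention can attend to both. Illustrative only.

class GraphConditionEncoder(nn.Module):
    def __init__(self, num_entities: int, num_relations: int, dim: int = 768):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.relation_emb = nn.Embedding(num_relations, dim)
        self.proj = nn.Linear(3 * dim, dim)  # fuse (subject, relation, object)

    def forward(self, triplets: torch.Tensor) -> torch.Tensor:
        # triplets: (batch, num_edges, 3) integer ids [subject, relation, object]
        subj = self.entity_emb(triplets[..., 0])
        rel = self.relation_emb(triplets[..., 1])
        obj = self.entity_emb(triplets[..., 2])
        # One conditioning token per graph edge.
        return self.proj(torch.cat([subj, rel, obj], dim=-1))

# Usage: append graph tokens to the text tokens before cross-attention.
encoder = GraphConditionEncoder(num_entities=100, num_relations=20)
triplets = torch.randint(0, 20, (2, 4, 3))           # 2 images, 4 edges each
graph_tokens = encoder(triplets)                     # (2, 4, 768)
text_tokens = torch.randn(2, 77, 768)                # e.g., CLIP text features
context = torch.cat([text_tokens, graph_tokens], 1)  # extended conditioning
```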
Simply bounding the size of memory banks improves video object segmentation (VOS) on challenging state changes and long videos, indicating the importance of selecting relevant information from long contexts.
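A minimal sketch of the bounded-memory idea, assuming a fixed budget and a cosine-similarity eviction rule (both are illustrative choices, not the paper's exact policy):

```python
import torch

# Illustrative sketch of a bounded memory bank for video object segmentation:
# when the bank exceeds its budget, evict the stored feature least similar to
# the current query, keeping only the most relevant context.

class BoundedMemoryBank:
    def __init__(self, max_size: int = 8):
        self.max_size = max_size
        self.features: list[torch.Tensor] = []  # one feature vector per frame

    def insert(self, feat: torch.Tensor, query: torch.Tensor) -> None:
        self.features.append(feat)
        if len(self.features) > self.max_size:
            # Score each memory entry against the query and drop the
            # least relevant one.
            bank = torch.stack(self.features)                        # (n, d)
            scores = torch.cosine_similarity(bank, query.unsqueeze(0), dim=-1)
            self.features.pop(int(scores.argmin()))

    def read(self, query: torch.Tensor) -> torch.Tensor:
        # Attention-style readout over the bounded bank.
        bank = torch.stack(self.features)        # (n, d)
        weights = (bank @ query).softmax(dim=0)  # (n,)
        return weights @ bank                    # (d,)

bank = BoundedMemoryBank(max_size=4)
for t in range(10):
    frame_feat = torch.randn(256)
    bank.insert(frame_feat, query=frame_feat)
readout = bank.read(torch.randn(256))  # aggregated memory for the current frame
```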
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang, Ziyang Xie*, Yunze Man*, Yu-Xiong Wang
ICLR, 2024 (Spotlight)
Code / arXiv
Frozen transformers from language models, though trained solely on textual data, can effectively improve diverse visual tasks by directly encoding visual tokens.
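A minimal sketch of this design, assuming a frozen GPT-2 block from Hugging Face transformers standing in for the paper's LLaMA block, sandwiched between two trainable linear projections over ViT patch tokens:

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

# Sketch of the core idea with an assumed GPT-2 block (the paper uses LLaMA):
# insert a *frozen* pretrained transformer block from a language model on top
# of visual tokens, between two trainable linear projections. Note that the
# block keeps its original causal attention.

class FrozenLMLayer(nn.Module):
    def __init__(self, visual_dim: int = 384, lm_dim: int = 768):
        super().__init__()
        lm = GPT2Model.from_pretrained("gpt2")
        self.lm_block = lm.h[-1]           # one pretrained transformer block
        for p in self.lm_block.parameters():
            p.requires_grad = False        # keep the LM block frozen
        self.proj_in = nn.Linear(visual_dim, lm_dim)   # trainable
        self.proj_out = nn.Linear(lm_dim, visual_dim)  # trainable

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, visual_dim), e.g. ViT patch tokens
        x = self.proj_in(visual_tokens)
        x = self.lm_block(x)[0]            # GPT2Block returns a tuple
        return self.proj_out(x)

layer = FrozenLMLayer()
patch_tokens = torch.randn(2, 196, 384)    # 14x14 patches from a small ViT
out = layer(patch_tokens)                  # same shape, LM-refined tokens
```

Only the two projections (and the visual backbone around them) would be trained; the language-model block stays frozen throughout.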