My research concentrates on perception in 3D and long contexts, which is the cornerstone for embodied agents.
The core principle of my research is understanding the flow of information across datasets, models, and inference-time contexts.
Therefore, I mostly adopt the approach of Paul Erdős and investigate a broad range of tasks, topics, models, and techniques:
Representation learning driven by foundation models (e.g., LLMs).
Propagation of temporal information in 3D tracking for autonomous driving.
Selection and fusion of long contexts in AR/VR-oriented video understanding and autonomous driving.
Sparsifying and condensing representations for 3D perception.
Restricted Memory Banks Improve Video Object Segmentation: A Revisit (Alias: RMem)
Junbao Zhou*, Ziqi Pang*, Yu-Xiong Wang
CVPR, 2024
Simply bounding the size of memory banks improves VOS on challenging state changes and long videos, indicating the importance of selecting relevant information from long contexts.
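To make the idea concrete, here is a minimal sketch of a size-restricted memory bank with attention-style readout. This is an illustration of the general principle, not the paper's implementation; the class name, the 8-frame cap, and the feature dimensions are illustrative assumptions.

```python
# Minimal sketch of the "restricted memory bank" idea: cap the number of
# stored frame features and evict old entries, instead of letting the memory
# grow with video length. Illustrative only, not the paper's code.
from collections import deque

import torch


class RestrictedMemoryBank:
    """Keeps at most `max_size` (key, value) feature pairs from past frames."""

    def __init__(self, max_size: int = 8):
        # deque with maxlen evicts the oldest frame once the cap is reached
        self.keys = deque(maxlen=max_size)
        self.values = deque(maxlen=max_size)

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Store features of the newest frame; the oldest entry is dropped if full."""
        self.keys.append(key)
        self.values.append(value)

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Attention-style readout of the query frame against the memory."""
        keys = torch.cat(list(self.keys), dim=0)      # (M, C)
        values = torch.cat(list(self.values), dim=0)  # (M, C)
        attn = torch.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        return attn @ values                          # (Q, C)


# Usage: write one feature vector per frame, then read with the current frame.
bank = RestrictedMemoryBank(max_size=8)
for _ in range(100):                     # a long video no longer bloats memory
    feat = torch.randn(1, 256)
    bank.write(feat, feat)
readout = bank.read(torch.randn(1, 256))
```

The point of the bound is that readout cost and memory stay constant over arbitrarily long videos, which forces the bank to hold only the most relevant recent context.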
Frozen Transformers from Language Models are Effective Visual Encoder Layers
Ziqi Pang, Ziyang Xie*, Yunze Man*, Yu-Xiong Wang
ICLR, 2024 (Spotlight)
Code / arXiv
Frozen transformers from language models, though trained solely on textual data, can effectively improve diverse visual tasks by directly encoding visual tokens.
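A minimal sketch of this recipe: a frozen transformer block taken from a pretrained language model, wrapped between two trainable linear layers, applied to visual tokens. GPT-2's final block stands in here for a larger LM, and the module names are my own; the paper's exact configuration may differ.

```python
# Minimal sketch: insert a frozen LM transformer block into a visual pipeline,
# with trainable linear adapters mapping visual tokens to/from the LM width.
# GPT-2 is a stand-in for a larger language model; illustrative assumptions only.
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenLMEncoderLayer(nn.Module):
    def __init__(self, visual_dim: int = 768):
        super().__init__()
        lm = GPT2Model.from_pretrained("gpt2")
        self.lm_block = lm.h[-1]          # one pretrained transformer block
        for p in self.lm_block.parameters():
            p.requires_grad = False       # keep the LM block frozen
        # trainable adapters bridge the visual and LM feature widths
        self.proj_in = nn.Linear(visual_dim, lm.config.n_embd)
        self.proj_out = nn.Linear(lm.config.n_embd, visual_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_visual_tokens, visual_dim), e.g. ViT patch tokens
        hidden = self.proj_in(tokens)
        hidden = self.lm_block(hidden)[0]  # GPT2Block returns a tuple
        return self.proj_out(hidden)


layer = FrozenLMEncoderLayer()
visual_tokens = torch.randn(2, 197, 768)   # dummy ViT-B/16 token sequence
out = layer(visual_tokens)                 # (2, 197, 768)
```

Only the two linear adapters train; the LM block itself never sees a gradient, which is what makes the result surprising: purely text-trained attention weights still help encode visual tokens.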