Culture & Research Philosophy
Our interdisciplinary team works on foundation models for computer vision, natural language processing, and multimodal learning:
- Experimental: Conduct reproducible experiments that advance fundamental understanding.
- Computational: Leverage algorithms, models, and coding expertise to tackle challenging questions.
Lab Entry
We welcome students from all disciplines with strong curiosity and a passion for rigorous, original research. We commonly submit our work to conferences including ACL, NeurIPS, CVPR, and ACM CHI.
Research Areas
- Document Intelligence
OCR is no longer just about reading text; it is about understanding the structure of a page. A human understands a document by looking at the layout, charts, and text together. We are building multimodal models that can parse complex, unstructured documents (like invoices or scientific papers) end-to-end. We are especially interested in low-resource languages where training data is scarce.
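To make the end-to-end idea concrete, here is a minimal sketch that runs a publicly released Donut checkpoint from Hugging Face Transformers over a document image and decodes it straight into structured fields. The checkpoint name and the file `invoice.png` are placeholders for illustration; this is not our own pipeline.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Public demo checkpoint and file name are placeholders, not our own model or data.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("invoice.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt tells the decoder which structured-output schema to generate.
prompt_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    output_ids = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_length=512)

sequence = processor.batch_decode(output_ids)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task-prompt tag
print(processor.token2json(sequence))  # nested fields parsed directly from the page image
```

The same recipe (image encoder plus autoregressive decoder emitting a schema) is what we mean by parsing a document without a separate OCR-then-extract stage.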
- Small Language Models
Everyone is competing to build bigger models, but we are going the other direction. We focus on Small Language Models (SLMs) and Knowledge Distillation. The core research question here is: How much reasoning capability can we retain if we reduce the model size by 10x or 100x? We investigate quantization and parameter-efficient fine-tuning to bring LLM-level intelligence to edge devices.
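For readers new to distillation, the sketch below shows the standard objective: the student matches the teacher's temperature-softened output distribution while still fitting the hard labels. The temperature and mixing weight are illustrative values, not tuned settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge distillation: soft targets from the teacher plus hard labels."""
    # Soft-target term: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 examples over 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
```

Quantization and parameter-efficient fine-tuning then shrink and adapt the distilled student further for deployment on edge hardware.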
- Speech2Text
Labelling audio data is expensive and slow. We focus on Self-Supervised Learning (SSL), where models learn the structure of speech just by listening to massive amounts of unlabelled audio. We apply this to difficult problems like "Code-Switching" (when speakers mix languages in one sentence) and speech recognition in highly noisy environments where standard models collapse.
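A minimal sketch of how an SSL backbone is used downstream: load a pretrained wav2vec 2.0-style model, decode an utterance with greedy CTC, and compute a CTC fine-tuning loss against a transcript. The checkpoint name, the random waveform, and the transcript are illustrative stand-ins, not our data.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Illustrative public checkpoint; any wav2vec 2.0-style SSL backbone is used the same way.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = np.random.randn(16000).astype(np.float32)  # 1 s of 16 kHz audio, stand-in for a real clip
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Decoding with the pretrained model (greedy CTC).
with torch.no_grad():
    logits = model(inputs.input_values).logits        # (batch, time, vocab)
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])

# One fine-tuning step on a labelled utterance: CTC loss against the transcript.
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids
loss = model(inputs.input_values, labels=labels).loss
loss.backward()
```

Because the backbone has already learned speech structure from unlabelled audio, only a small labelled set is needed for this fine-tuning stage, which is what makes code-switched and noisy-domain ASR tractable.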
- Text2Speech
We work on both sides of generative audio. On the creative side, we are building Controllable TTS systems that allow fine-grained control over emotion, speed, and prosody without needing hours of studio recording. On the safety side, we are developing forensic tools to detect Deepfakes. We want to find the subtle statistical artifacts that generative models leave behind, so we can distinguish AI voices from human ones.
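The detection side can be illustrated with a toy spectrogram classifier: extract log-mel features from a clip and score it as human or generated. The architecture and hyperparameters below are a sketch for illustration only, not our forensic model.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel front end followed by a tiny CNN binary classifier (illustrative sizes).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 2),  # two classes: human vs. generated
)

waveform = torch.randn(1, 16000)               # stand-in for a 1 s, 16 kHz clip
features = mel(waveform).log1p().unsqueeze(0)  # (batch, channel, mel, time)
logits = classifier(features)
print(logits.softmax(dim=-1))                  # probabilities for [human, generated]
```

The research question is which features and training regimes make such a classifier pick up the generator's statistical artifacts rather than superficial recording conditions.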
- Medical VQA
Visual Question Answering (VQA) in the medical domain is a challenging task that requires a deep understanding of both visual data and medical knowledge. We are developing advanced VQA systems that can assist healthcare professionals by providing accurate answers to complex medical questions based on medical images such as X-rays, MRIs, and CT scans. Our research focuses on integrating multimodal data and leveraging domain-specific knowledge to improve the accuracy and reliability of these systems.
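A minimal late-fusion baseline makes the setup concrete: encode the image, encode the question, concatenate the two representations, and classify over a closed set of candidate answers. All module sizes below are illustrative; a real system would use pretrained medical image and text encoders.

```python
import torch
import torch.nn as nn

class SimpleMedVQA(nn.Module):
    """Toy late-fusion VQA model: encode the scan and the question, fuse, classify an answer."""
    def __init__(self, vocab_size=10000, num_answers=500, hidden=256):
        super().__init__()
        # Image branch: a small CNN standing in for a pretrained radiology image encoder.
        self.image_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, hidden),
        )
        # Question branch: embedding + GRU standing in for a clinical-domain text encoder.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        # Fusion + answer classifier over a closed answer set.
        self.classifier = nn.Linear(hidden * 2, num_answers)

    def forward(self, image, question_ids):
        img = self.image_enc(image)                        # (batch, hidden)
        _, txt = self.text_enc(self.embed(question_ids))   # txt: (1, batch, hidden)
        fused = torch.cat([img, txt.squeeze(0)], dim=-1)
        return self.classifier(fused)                      # logits over candidate answers

# Toy usage: one grayscale chest X-ray-sized image and a 12-token question.
model = SimpleMedVQA()
logits = model(torch.randn(1, 1, 224, 224), torch.randint(0, 10000, (1, 12)))
```

Our work goes beyond this baseline by grounding both branches in domain-specific knowledge so answers remain accurate and reliable in clinical use.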
- Human-AI Interaction
As AI systems become more prevalent, understanding how humans interact with them is crucial. We study Human-AI Interaction to design systems that are intuitive and user-friendly. Our research focuses on creating interfaces that facilitate seamless collaboration between humans and AI, ensuring that the technology enhances the user experience without overwhelming users. Example applications include AI assistants, mental health chatbots, human-robot interaction, and interactive learning platforms.
