Self-supervised audio-visual learning
Under the guidance of Prof. Preethi Jyothi and Prof. Ganesh Ramakrishnan
The thesis revolved around multi-modal learning and how self-supervised objectives learn better embeddings. During the thesis work, we explored video-caption retrieval tasks and the novel audio-visual video parsing task. We explain and critique various related works and propose new models for improvement.
Responsibilities:
- Investigating various techniques to learn joint audio-visual-linguistic embeddings for video-text retrieval
- Inspecting new losses that can help improve retrieval performance
- Exploring various heuristics to form the augmented supervision required by the new loss, for better ranking of videos given a text query and vice versa
- Implementing the retrieval task with different losses on the MSRVTT, Charades, and TFT datasets
- Technologies: Python and PyTorch
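To illustrate the kind of retrieval loss referred to above, here is a minimal sketch of a standard bidirectional max-margin ranking loss over paired video and text embeddings. This is a generic formulation, not the specific loss developed in the thesis; the function name, the use of cosine similarity, and the margin value are illustrative assumptions, and NumPy stands in for PyTorch to keep the sketch self-contained.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-normalize both sets of embeddings, then take pairwise dot products.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def bidirectional_ranking_loss(video_emb, text_emb, margin=0.2):
    """Max-margin ranking loss for a batch of matched (video, text) pairs.

    video_emb, text_emb: (B, D) arrays where row i of each is a matched pair.
    """
    sims = cosine_sim(video_emb, text_emb)   # (B, B) similarity matrix
    pos = np.diag(sims)                      # matched-pair similarities

    # Video -> text: other captions should score at least `margin` below the match.
    cost_v2t = np.maximum(0.0, margin - pos[:, None] + sims)
    # Text -> video: other videos should score at least `margin` below the match.
    cost_t2v = np.maximum(0.0, margin - pos[None, :] + sims)

    # Matched pairs incur no cost against themselves.
    np.fill_diagonal(cost_v2t, 0.0)
    np.fill_diagonal(cost_t2v, 0.0)
    return (cost_v2t.sum() + cost_t2v.sum()) / video_emb.shape[0]
```

With perfectly aligned, well-separated embeddings the loss is zero; mismatched pairs drive it up, which is what lets the loss rank videos given a text query and vice versa.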