Role overview
The Spatial AI Lab is part of the Applied Sciences Group, a Microsoft research and development organization dedicated to creating next-generation human-computer interaction technologies by leveraging the most recent AI developments and exploring new hardware capabilities and device form factors. Our team of scientists and engineers has strong expertise in computer vision, multimodal AI, and spatial and embodied AI.
Your main job will be to help create smart systems for new types of agents by training and improving multimodal AI models. In this role you will gain experience building and using AI models for Microsoft products and large-scale AI systems. You will also have the opportunity to take part in cutting-edge research with partners such as ETH Zurich, publish in top-tier venues, present at workshops, and mentor students.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
- Research novel machine learning algorithms and models.
- Work on pre-training and/or post-training of multimodal foundation models.
- Build data and learning solutions for scalability, efficiency, and performance.
- Curate training and evaluation datasets and benchmarks.
- Optimize models for CPUs, GPUs, and NPUs, and integrate them into products.
- Collaborate across Microsoft research and engineering teams.
Basic qualifications
- A PhD in Machine Learning / Computer Vision or 3+ years of relevant industry experience.
- Engineering skills in programming languages such as Python and/or C++.
- Hands-on experience with modern deep learning frameworks (e.g. PyTorch, TensorFlow, or JAX).
- Self-motivated team player and problem solver who is keen to learn.
- Ability to present complex technical concepts to a diverse audience.
Preferred qualifications
- Multimodal Models: hands-on experience in any of the following topics:
  - Pre-training and/or post-training of large vision-language models;
  - Techniques such as pruning, distillation, and fine-tuning;
  - LLMs and large vision-language models (VLMs);
  - Video generative models and diffusion algorithms; or
  - Action-based transformers and vision-language-action models (VLAs).
- Large-Scale ML Systems: experience with large-scale machine learning compute systems.
- Publications: track record of impact, either via research publications at top-tier machine learning or computer vision conferences (NeurIPS, ICML, CVPR, ECCV, ICCV) or via contributions to successful industry initiatives.