Role overview

From day one, we're looking out for your well-being, at work and at home, so you can focus on realizing your ambitions. Learn how GM supports a career that rewards you personally by visiting Total Rewards Resources.

What you'll work on

Lead the design and development of a scalable, reliable, high-performance ML framework to support model training at scale.
Lead model training performance analysis and optimization to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost.
Raise the bar on system observability, debuggability, operational excellence, and user experience.
Collaborate with cross-functional teams to integrate new features and technologies into the platform.

What we're looking for

Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent experience
7+ years of professional software engineering experience
3+ years of specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Willingness to travel to Sunnyvale, CA as needed
Comfortable working in highly ambiguous and dynamic environments

What Will Give You a Competitive Edge (preferred qualifications):

Self-motivated, with strong execution and a focus on delivering impact
Extensive knowledge of and experience with PyTorch 2.x+ and distributed training frameworks
Experience designing and developing a training framework that supports FSDP, Pipeline Parallelism, and other scalable approaches to training large foundation models
Experience profiling, analyzing, debugging, and optimizing training and data-loading performance
Excellent communication skills to resolve disagreements, build consensus, communicate risks, and give constructive feedback

Compensation: The compensation information is a good faith estimate only. It is based on what a successful applicant might be paid in accordance with applicable state laws. The compensation may not be representative for positions located outside of New York, Colorado, California, or Washington.

Tags & focus areas

Full-time, Remote, AI, Machine Learning, MLOps

Staff Machine Learning Engineer AI Platm
