Role overview
We’re looking for seasoned ML infrastructure engineers with experience designing, building, and maintaining training and serving infrastructure for ML research.
Responsibilities
- Provide infrastructure support to our ML research and product
- Build tooling to diagnose cluster issues and hardware failures
- Monitor deployments, manage experiments, and generally support our research
- Maximize GPU allocation and utilization for both serving and training
Basic qualifications
- 4+ years of experience supporting infrastructure in an ML environment
- Experience developing tools to diagnose ML infrastructure problems and failures
- Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)
- Experience working with GPUs
Preferred qualifications
- Experience with large GPU clusters and high-performance computing/networking
- Experience supporting large language model training
- Experience with ML frameworks such as PyTorch, TensorFlow, or JAX
- Experience with GPU kernel development
Tags & focus areas
Full-time, Remote, Machine Learning, AI