San Jose, California, United States
Full time

Immigration sponsorship is not available for this position

Responsibilities:

Develop the Machine Learning Platform management system.
Design and implement intuitive user interfaces and APIs for seamless interaction with the platform.
Ensure robust access control and security measures for the Machine Learning Platform.
Regularly evaluate and enhance platform performance, scalability, and reliability. Integrate tools for data versioning, experiment tracking, and workflow orchestration.
Build the toolchains, services, and pipelines for the model development workflow and the model serving architecture.
Create automated pipelines for data preprocessing, feature engineering, and dataset versioning.
Develop CI/CD pipelines for deploying models into production environments with minimal downtime.
Enable support for distributed model training and hyperparameter optimization.
Incorporate A/B testing frameworks for evaluating multiple model deployments.
Collaborate with data scientists and engineers to streamline the model development lifecycle.
Prioritize metrics for monitoring model training and inference. Implement logging and monitoring tools to track model performance, resource utilization, and throughput.
Develop dashboards to visualize key metrics such as latency, accuracy, and drift detection in real time.
Establish alerting mechanisms to detect and respond to anomalies or performance degradation.
Continuously refine metric prioritization based on stakeholder feedback and evolving business goals.
Develop and maintain the high-performance LLM training GPU infrastructure and cluster.
Optimize GPU utilization for large-scale training workloads, ensuring minimal resource wastage.
Implement fault-tolerant and distributed training strategies for handling large language models (LLMs).
Evaluate and integrate emerging hardware technologies, such as TPUs, into the training infrastructure.
Regularly update cluster configurations to support new frameworks and model architectures.
Manage scheduling and resource allocation for multi-tenant GPU clusters.
Understand autoscaling for inference services and dynamic loading of multiple models.
Design systems that dynamically allocate resources based on real-time demand for inference services.
Develop mechanisms for loading and unloading models in memory to optimize latency and resource usage.
Implement strategies for caching frequently used models to improve inference performance.
Experiment with serverless architectures to further enhance scalability and cost efficiency.
Ensure compatibility with edge devices and deploy lightweight models for edge inference.
Support, troubleshoot, and resolve any issues during training and inference.
Create detailed runbooks for common troubleshooting scenarios to reduce resolution times.
Perform root cause analysis for failures and implement long-term fixes to prevent recurrence.
Collaborate with DevOps and IT teams to ensure the stability of underlying infrastructure.
Develop self-healing systems that can automatically recover from common training or inference issues.
Provide technical support and guidance to data scientists and engineers working on the platform.

What we're looking for:

Requires a Bachelor's degree in Communications Engineering, Artificial Intelligence, Software Engineering, a related field, or a foreign degree equivalent. Must have 2 years of experience in the job offered or a related occupation. Must have 2 years of experience in:

Designing, implementing, or optimizing large-scale distributed training systems using technologies like Horovod, DeepSpeed, PyTorch Distributed, or Ray;
Tensor/model parallelism and pipeline parallelism;
Utilizing cloud-native or on-prem infrastructure (Kubernetes, Docker, Slurm) to support scalable, fault-tolerant, and resource-efficient AI workloads across multi-node GPU clusters;
Using performance profiling and optimization to diagnose and improve end-to-end training performance by optimizing data pipelines (e.g., DALI, tf.data), minimizing communication overhead (e.g., NCCL, gRPC), and tuning hardware-specific kernels (e.g., CUDA, Triton);
Systems-level programming and automation with Python, Bash, and C++ or Go;
Automating deployment and orchestration of AI workloads and monitoring using Prometheus, Grafana, Weights & Biases.
Telecommuting work arrangement permitted one day a week. Four days in office required. Position does not require domestic or international travel.

Zoom Communications, Inc.

Salary Range or On Target Earnings:

Minimum: $209,000.00

Maximum: $275,400.00

In addition to the base salary and/or OTE listed, Zoom has a Total Direct Compensation philosophy that takes into consideration base salary, bonus, and equity value.

Note: Starting pay will be based on a number of factors and commensurate with qualifications & experience.

We also have a location-based compensation structure; there may be a different range for candidates in this and other locations.

Ways of Working

Our structured hybrid approach is centered around our offices and remote work environments. The work style of each role (Hybrid, Remote, or In-Person) is indicated in the job description/posting.

Benefits

As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways.

About Us

Zoomies help people stay connected so they can get more done together. We set out to build the best collaboration platform for the enterprise, and today help people communicate better with products like Zoom Contact Center, Zoom Phone, Zoom Events, Zoom Apps, Zoom Rooms, and Zoom Webinars.

We’re problem-solvers, working at a fast pace to design solutions with our customers and users in mind. Find room to grow with opportunities to stretch your skills and advance your career in a collaborative, growth-focused environment.

Our Commitment

At Zoom, we believe great work happens when people feel supported and empowered. We’re committed to fair hiring practices that ensure every candidate is evaluated based on skills, experience, and potential. If you require an accommodation during the hiring process, let us know—we’re here to support you at every step.

We welcome people of different backgrounds, experiences, abilities and perspectives including qualified applicants with arrest and conviction records and any qualified applicants requiring reasonable accommodations in accordance with the law.

If you need assistance navigating the interview process due to a medical disability, please submit an Accommodations Request Form and someone from our team will reach out soon. This form is solely for applicants who require an accommodation due to a qualifying medical disability. Non-accommodation-related requests, such as application follow-ups or technical issues, will not be addressed.

Think of this opportunity as a marathon, not a sprint! We're building a strong team at Zoom, and we're looking for talented individuals to join us for the long haul. No need to rush your application – take your time to ensure it's a good fit for your career goals. We continuously review applications, so submit yours whenever you're ready to take the next step.

Our interviews are supported by BrightHire, a tool that helps us create a consistent and thoughtful interview experience and may include recordings. Please refer to our candidate privacy statement for more information on how we use your data.


Senior AI Engineer
