IT Search Corp
AI

Certified NVIDIA AI Infrastructure Kubernetes Platform Engineer

IT Search Corp · Miami, FL, US · $208k - $270k

Actively hiring · Posted about 4 hours ago

Role overview

NVIDIA AI Infrastructure & Kubernetes Platform Engineer (DGX Systems), Remote

Related certifications required

6 months to 1+ yrs

$open

USC or GC required

Alternate titles depending on context:

  • AI Platform Architect – DGX & SuperPOD
  • AI Infrastructure DevOps Engineer – NVIDIA DGX Stack
  • Senior AI Systems Engineer – DGX | Kubernetes | InfiniBand

Job Description:

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations.

This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

Core Responsibilities:

AI Infrastructure Operations

  • Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
  • Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
  • Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
  • Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.

What we're looking for

  • Kubernetes, Helm, GPU Operator, Kubeflow
  • DevOps tools: Ansible, Terraform, GitOps, CI/CD pipelines
  • Storage: NFS, BeeGFS, Lustre
  • Networking: RoCE, InfiniBand, DPU offload, gRPC, RDMA
  • Programming/scripting: Python, YAML, Bash

Tags & focus areas

Full-time · Remote · AI