MLOps Engineer (W2)

TalentXM (formerly BlockTXM Inc)

Actively hiring. Posted 6 months ago.

Role overview

Our client is seeking an experienced MLOps Engineer to develop and operationalize a comprehensive AI cost tracking and observability framework across multiple cloud platforms. In this role, you will be instrumental in ensuring visibility into AI/ML model performance, usage, and cost metrics across Azure, Google Cloud, and Snowflake environments. You will collaborate closely with cross-functional teams (DevOps, FinOps, and others) to optimize model deployments for both performance and cost efficiency.

Responsibilities

  • Cost & Observability Framework: Build a common AI cost tracking and observability framework spanning Azure ML, Google Vertex AI (Gemini), and Snowflake platforms.
  • Cloud Billing Integration: Integrate cloud billing and usage APIs (Azure ML, OpenAI, Google Vertex AI/Gemini) to aggregate and monitor AI service costs.
  • Metadata Tagging: Develop model-level metadata tagging processes for cost attribution and trend analysis, enabling granular visibility into costs per model or project.
  • Monitoring & Alerting: Implement and manage Datadog dashboards (or similar observability tools) with alerts for model performance issues – including latency spikes, model drift, and anomaly detection in predictions or usage.
  • Collaboration for Optimization: Work closely with DevOps and FinOps teams to visualize model costs and identify optimization opportunities (e.g. rightsizing resources, adjusting usage patterns).
  • Documentation & Knowledge Transfer: Deliver comprehensive documentation and conduct knowledge transfer sessions to internal teams at project closure, ensuring they can maintain and extend the cost tracking framework.

Required qualifications

  • MLOps/DevOps Experience: 5+ years of hands-on experience in MLOps, DevOps, or Cloud Engineering roles focused on AI/ML systems deployment and operations.
  • Cloud AI Platforms: Strong experience working with Azure ML, Google Vertex AI (Gemini), and OpenAI platforms/services, including deploying and managing models on these services.
  • Observability Tools: Expertise in Datadog (or equivalent monitoring/observability tools) for tracking application performance, logs, and metrics.
  • Programming & Automation: Advanced proficiency in Python and SQL for building automation scripts, data analysis, and integration of monitoring pipelines.
  • CI/CD & Monitoring Integration: Proven experience integrating cost and performance monitoring steps into CI/CD pipelines, ensuring that model deployments are coupled with automated observability and cost checks.
  • FinOps & Cost Management: Solid understanding of FinOps principles, cloud billing APIs, and strategies for cloud cost optimization in an engineering context (e.g. optimizing compute/storage for AI workloads).

Preferred qualifications

  • Generative AI Frameworks: Experience with GenAI/agentic AI frameworks such as LangChain, or with building RAG (Retrieval-Augmented Generation) pipelines, especially in production environments.
  • Regulated Environment Experience: Familiarity with implementing cost tracking and ML monitoring in regulated environments (e.g. ensuring compliance with ISO, SOC 2, HITRUST, or similar standards).

Tags & focus areas

Contract, Remote, AI, Machine Learning, Data Science, MLOps, Generative AI