AI / ML Developer (Senior) with skills AI/ML Development, TensorFlow for location Menlo Park, US
ROLES & RESPONSIBILITIES

We are seeking an experienced AI/ML Infrastructure & Ops Engineer to build, scale, and maintain the critical infrastructure that powers our AI models and autonomous agents. In this role, you will act as the bridge between our AI research/development teams and our production environments. You will not just deploy models; you will design the high-performance distributed systems required to serve Large Language Models (LLMs), orchestrate multi-agent workflows, and optimize GPU compute at scale.

If you are passionate about turning complex AI capabilities into highly reliable, scalable, and cost-efficient production systems, this is the role for you.

Key Responsibilities

1. Machine Learning Infrastructure & Serving

  • Design, build, and manage scalable infrastructure for training, fine-tuning, and serving LLMs and multimodal models.

  • Optimize inference latency, throughput, and cost using modern serving frameworks (e.g., vLLM, Triton Inference Server, Ray Serve) [2].

  • Manage and orchestrate GPU/TPU clusters, ensuring high utilization and efficient resource allocation.

2. Building and Scaling Agentic Operations (AgentOps)

  • Architect and deploy infrastructure to support autonomous AI agents and multi-agent systems.

  • Integrate and maintain agent orchestration frameworks (e.g., LangGraph, CrewAI) within production environments [3].

  • Build robust state management and memory systems (vector databases, graph databases) required for agentic workflows.

3. Observability, Evaluation, and Reliability

  • Implement comprehensive observability stacks tailored for LLMs and agents (tracing, prompt logging, cost tracking) using tools like Langfuse, Arize, or Datadog [4].

  • Design automated evaluation pipelines to monitor agent performance, safety, and reliability in real time (LLMOps/AgentOps).

  • Act as the first line of defense for production AI systems, diagnosing and resolving issues related to memory limits, inference queues, and cluster failures.

4. Developer Platform & CI/CD for AI

  • Build internal developer platforms and tooling that allow AI engineers and data scientists to easily deploy models and agents to production.

  • Adapt traditional CI/CD pipelines to accommodate model versioning, prompt management, and continuous evaluation.

Qualifications

Required Skills:

  • Systems Engineering: Strong background in distributed systems, backend engineering, or DevOps/SRE.

  • Programming: Proficiency in Python (essential for the AI ecosystem) and a systems language such as Go or Rust.

  • Containerization & Orchestration: Deep expertise in Kubernetes (K8s), Docker, and infrastructure-as-code (Terraform, Pulumi).

  • AI/ML Tooling: Hands-on experience with LLM serving engines (vLLM, TGI, Triton) and distributed computing frameworks (Ray) [2].

  • Agent Frameworks: Familiarity with modern agentic development frameworks like LangChain, LangGraph, or CrewAI [3].

  • Cloud & Hardware: Experience managing high-performance compute (GPUs/TPUs) on major cloud providers (AWS, GCP, Azure) or bare-metal clusters.

Preferred Skills:

  • Experience with vector databases (Pinecone, Milvus, Qdrant) and retrieval-augmented generation (RAG) pipelines.

  • Understanding of model optimization techniques (quantization, LoRA, KV caching).

  • Previous experience building platforms from the ground up in a high-growth environment.

EXPERIENCE
  • 6-8 Years
SKILLS
  • Primary Skill: AI/ML Development
  • Sub Skill(s): AI/ML Development
  • Additional Skill(s): TensorFlow
ABOUT THE COMPANY

Infogain is a human-centered digital platform and software engineering company based in Silicon Valley. We engineer business outcomes for Fortune 500 companies and digital natives in the technology, healthcare, insurance, travel, telecom, and retail & CPG industries using technologies such as cloud, microservices, automation, IoT, and artificial intelligence. We accelerate experience-led transformation in the delivery of digital platforms. Infogain is also a Microsoft (NASDAQ: MSFT) Gold Partner and Azure Expert Managed Services Provider (MSP).

Infogain, an Apax Funds portfolio company, has offices in California, Washington, Texas, the UK, the UAE, and Singapore, with delivery centers in Seattle, Houston, Austin, Kraków, Noida, Gurgaon, Mumbai, Pune, and Bengaluru.
