Machine Learning Engineer (hybrid or remote)
Valiance Solutions
About Valiance
Valiance is a deeptech AI company building sovereign and mission-critical AI solutions for enterprises, the public sector, and government institutions. From predictive maintenance and demand planning to sovereign AI for citizen services, we design systems that thrive in high-stakes environments. Recognized with the NASSCOM AI Game Changers Award and the Aegis Graham Bell Award, and a certified Google Cloud Partner, our 200+ engineers and data scientists are shaping the future of industries and societies through responsible AI.

About the role
We are looking for a senior LLMOps Engineer who has taken LLM inference optimization from idea to production, not just proof of concept. You will own the end-to-end efficiency of our LLM inference infrastructure running on H200 GPUs, driving down cost and latency while maintaining the reliability our enterprise and government clients demand. This is a high-ownership, high-impact role on a team building some of India's most consequential AI systems.

Responsibilities
- Design and operate production-grade LLM inference pipelines on H200 GPU clusters, optimizing for throughput, latency, and cost per token.
- Evaluate and deploy smaller open-source models (Mistral, Llama, Phi, Gemma) as cost-efficient alternatives to large models without sacrificing output quality.
- Tune and manage vLLM deployments in production environments, including continuous batching, paged attention, tensor parallelism, and quantization (GPTQ, AWQ, FP8).
- Architect Kubernetes-based autoscaling strategies for inference workloads, balancing cold-start penalties against cost at scale.
- Collaborate with applied ML engineers and solution architects to identify latency and cost bottlenecks across the model serving stack.

Requirements
- 3+ years of hands-on experience operating LLM inference in production, with demonstrable cost and latency improvements, not POC results.
- Strong Python engineering skills: clean, testable, production-ready code.
- Proficiency with Docker and Kubernetes for deploying and scaling GPU inference workloads.
- Experience building and maintaining REST/gRPC APIs for model serving at scale.
- Hands-on experience with open-source LLMs and the ability to evaluate quality-versus-cost trade-offs.
Nice to have
- Experience with GPU memory profiling and optimization (CUDA-level awareness a plus).
- Familiarity with model distillation, speculative decoding, or flash attention implementations.
- Experience with inference frameworks beyond vLLM: TGI, TensorRT-LLM, Triton Inference Server.
- Familiarity with sovereign AI or air-gapped deployment constraints.

Why join
- You will work on AI systems that are actually deployed at scale, used by government institutions and large enterprises, not just demoed.
- Competitive compensation with performance-linked incentives.
- Opportunity to define how Valiance builds its AI platform as we scale.

How to apply
Upload your resume and a brief note on a specific inference optimization you shipped in production: the problem, your approach, and the measurable outcome.