Machine Learning Engineer (hybrid or remote)
Valiance Solutions
About Valiance
Valiance is a deeptech AI company building sovereign and mission-critical AI solutions for enterprises, the public sector, and government institutions. From predictive maintenance and demand planning to sovereign AI for citizen services, we design systems that thrive in high-stakes environments. Recognized with the NASSCOM AI Game Changers Award and the Aegis Graham Bell Award, and a certified Google Cloud Partner, our 200+ engineers and data scientists are shaping the future of industries and societies through responsible AI.

About the role
We are looking for a senior LLMOps Engineer who has taken LLM inference optimization from idea to production, not just proof of concept. You will own the end-to-end efficiency of our LLM inference infrastructure running on H200 GPUs, driving down cost and latency while maintaining the reliability our enterprise and government clients demand. This is a high-ownership, high-impact role on a team building some of India's most consequential AI systems.

Responsibilities
- Design and operate production-grade LLM inference pipelines on H200 GPU clusters, optimizing for throughput, latency, and cost per token.
- Evaluate and deploy smaller open-source models (Mistral, Llama, Phi, Gemma) as cost-efficient alternatives to large models without sacrificing output quality.
- Tune and manage vLLM deployments in production environments, including continuous batching, paged attention, tensor parallelism, and quantization (GPTQ, AWQ, FP8).
- Architect Kubernetes-based autoscaling strategies for inference workloads, balancing cold-start penalties against cost at scale.
- Collaborate with applied ML engineers and solution architects to identify latency and cost bottlenecks across the model serving stack.

Requirements
- 3+ years of hands-on experience operating LLM inference in production, with demonstrable cost and latency improvements, not POC results.
- Strong Python engineering skills: clean, testable, production-ready code.
- Proficiency with Docker and Kubernetes for deploying and scaling GPU inference workloads.
- Experience building and maintaining REST/gRPC APIs for model serving at scale.
- Hands-on experience with open-source LLMs and the ability to evaluate quality-versus-cost trade-offs.
Nice to have
- Experience with GPU memory profiling and optimization (CUDA-level awareness a plus).
- Familiarity with model distillation, speculative decoding, or flash attention implementations.
- Experience with inference frameworks beyond vLLM: TGI, TensorRT-LLM, Triton Inference Server.
- Familiarity with sovereign AI or air-gapped deployment constraints.

Why join
- You will work on AI systems that are actually deployed at scale, used by government institutions and large enterprises, not just demoed.
- Competitive compensation with performance-linked incentives.
- Opportunity to define how Valiance builds its AI platform as we scale.

How to apply
Upload your resume and a brief note on a specific inference optimization you shipped in production: the problem, your approach, and the measurable outcome.