On-prem Platform Engineer
Realtech Services
Charlotte · On-site · Contract · 2w ago
About the role
Key Skills
Must-Have Skills (Mandatory Keywords)
- LLM Inference & Optimization
  - vLLM, TensorRT-LLM, Triton Inference Server, SGLang
  - Inference optimization techniques:
    - Continuous batching
    - Speculative decoding
    - KV cache / Prefix caching
  - Model optimization:
    - FP8, AWQ, GPTQ
- Distributed & GPU Systems
  - Tensor parallelism and large model scaling
  - CUDA, NCCL, GPU architecture
  - GPU partitioning & optimization (MIG)
- Kubernetes & ML Serving
  - Kubernetes-based ML serving platforms
  - KServe, OpenShift AI
  - Helm charts, Operators, platform automation
- GPU Orchestration
  - Run:AI or similar GPU scheduling/orchestration platforms
  - Multi-tenant GPU workload management
- Platform Engineering
  - Experience building internal AI/ML platforms (on-prem or hybrid)
  - Strong automation and system design mindset
- Observability & Performance
  - Prometheus, Grafana
  - ML observability (model latency, throughput, drift, resource utilization)
  - Performance benchmarking and tuning
Good to Have / Preferred Skills:
- Experience with LLMOps / Gen-AI pipelines
- Exposure to hybrid cloud (on-prem + GCP/Azure integration)
- Familiarity with Inferentia / alternative accelerators
- Knowledge of service mesh / networking in GPU clusters
Responsibilities
- Build, configure, and operate on‑prem Kubernetes/OpenShift AI platforms for deploying and serving Gen-AI models and LLM inference workloads.
- Design and optimize high‑performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
- Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
- Deploy and operate Kubernetes ML serving frameworks such as KServe, using Helm charts and Operators, for scalable, reliable model serving.
- Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.
- Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for Gen-AI services.
- Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize Gen-AI use cases.
Skills
Arize AI, AWQ, Continuous batching, CUDA, FP8, GPTQ, Grafana, GuideLLM, Helm, KServe, Kubernetes, KV caching, LLM Inference, LLMOps, Locust, MIG, NCCL, OpenShift AI, Operators, Prefix caching, Prometheus, Run:AI, SGLang, Speculative decoding, Tensor parallelism, TensorRT-LLM, Triton Inference Server, vLLM