On-prem Platform Engineer

Realtech Services

Charlotte · On-site · Contract · 2mo ago

About the role

Key Skills

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization
- vLLM, TensorRT-LLM, Triton Inference Server, SGLang
- Inference optimization techniques:
  - Continuous batching
  - Speculative decoding
  - KV cache / Prefix caching
- Model optimization:
  - FP8, AWQ, GPTQ

Distributed & GPU Systems
- Tensor parallelism and large model scaling
- CUDA, NCCL, GPU architecture
- GPU partitioning & optimization (MIG)

Kubernetes & ML Serving
- Kubernetes-based ML serving platforms
- KServe, OpenShift AI
- Helm charts, Operators, platform automation

GPU Orchestration
- Run:AI or similar GPU scheduling/orchestration platforms
- Multi-tenant GPU workload management

Platform Engineering
- Experience building internal AI/ML platforms (on-prem or hybrid)
- Strong automation and system design mindset

Observability & Performance
- Prometheus, Grafana
- ML observability (model latency, throughput, drift, resource utilization)
- Performance benchmarking and tuning

Good to Have / Preferred Skills:

- Experience with LLMOps / Gen-AI pipelines
- Exposure to hybrid cloud (on-prem + GCP/Azure integration)
- Familiarity with Inferentia / alternative accelerators
- Knowledge of service mesh / networking in GPU clusters

Responsibilities

Build, configure, and operate on‑prem Kubernetes/OpenShift AI platforms for deploying and serving Gen-AI models and LLM inference workloads.
Design and optimize high‑performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.
Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for Gen-AI services.
Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize Gen-AI use cases.

Skills

Arize AI, AWQ, Continuous batching, CUDA, FP8, GPTQ, Grafana, GuideLLM, Helm, KServe, Kubernetes, KV caching, LLM Inference, LLMOps, Locust, MIG, NCCL, OpenShift AI, Operators, Prefix caching, Prometheus, Run:AI, SGLang, Speculative decoding, Tensor parallelism, TensorRT-LLM, Triton Inference Server, vLLM
