On-prem Platform Engineer

Realtech Services

Charlotte · On-site · Contract · 2w ago

About the role

Key Skills

Must-Have Skills (Mandatory Keywords)

  • LLM Inference & Optimization
    • vLLM, TensorRT-LLM, Triton Inference Server, SGLang
    • Inference optimization techniques:
      • Continuous batching
      • Speculative decoding
      • KV cache / Prefix caching
    • Model optimization:
      • FP8, AWQ, GPTQ
  • Distributed & GPU Systems
    • Tensor parallelism and large model scaling
    • CUDA, NCCL, GPU architecture
    • GPU partitioning & optimization (MIG)
  • Kubernetes & ML Serving
    • Kubernetes-based ML serving platforms
    • KServe, OpenShift AI
    • Helm charts, Operators, platform automation
  • GPU Orchestration
    • Run:AI or similar GPU scheduling/orchestration platforms
    • Multi-tenant GPU workload management
  • Platform Engineering
    • Experience building internal AI/ML platforms (on-prem or hybrid)
    • Strong automation and system design mindset
  • Observability & Performance
    • Prometheus, Grafana
    • ML observability (model latency, throughput, drift, resource utilization)
    • Performance benchmarking and tuning

Good to Have / Preferred Skills:

  • Experience with LLMOps / Gen-AI pipelines
  • Exposure to hybrid cloud (on-prem + GCP/Azure integration)
  • Familiarity with Inferentia / alternative accelerators
  • Knowledge of service mesh / networking in GPU clusters

Responsibilities

  • Build, configure, and operate on‑prem Kubernetes/OpenShift AI platforms for deploying and serving Gen-AI models and LLM inference workloads.
  • Design and optimize high‑performance inference stacks using vLLM, TensorRT‑LLM, Triton Inference Server, and SGLang, applying advanced techniques such as continuous batching, speculative decoding, and KV caching.
  • Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
  • Deploy and operate Kubernetes-based ML serving with KServe, packaged and automated through Helm charts and Operators, for scalable, reliable model serving.
  • Drive inference optimization and benchmarking, using FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust.
  • Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for Gen-AI services.
  • Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize Gen-AI use cases.

Skills

Arize AI, AWQ, Continuous batching, CUDA, FP8, GPTQ, Grafana, GuideLLM, Helm, KServe, Kubernetes, KV caching, LLM Inference, LLMOps, Locust, MIG, NCCL, OpenShift AI, Operators, Prefix caching, Prometheus, Run:AI, SGLang, Speculative decoding, Tensor parallelism, TensorRT-LLM, Triton Inference Server, vLLM
