Member of Technical Staff, ML Systems / Inference

Acceler8 Talent

Fremont · On-site · Full-time · Mid Level

About Us

We are building next-generation cloud infrastructure for AI workloads. As AI systems scale, the industry is running into fundamental limits in power, capacity, and cost with today’s vertically integrated infrastructure. We are addressing that challenge by decoupling AI workloads from the underlying hardware. Our platform intelligently partitions workloads and orchestrates each component onto the hardware best suited for its performance and efficiency needs. This enables heterogeneous systems across multi-vendor and multi-generation hardware, including emerging accelerators, unlocking major improvements in performance and cost efficiency at scale.

On top of this foundation, we are building a production-grade cloud platform for agentic workloads. Customers deploy and manage workloads through stable, production-ready APIs without needing to reason about hardware selection, placement, or low-level performance optimization.

We are working with leading AI labs, hyperscalers, and AI-native organizations to power real production workloads designed to scale to the next generation of AI datacenters.

Role Overview

We are seeking a Member of Technical Staff focused on ML systems and inference. In this role, you will design and build inference systems that execute full models end to end under real production constraints, working at the intersection of model architecture, runtime behavior, and system performance to keep inference fast, predictable, and scalable.

This role is ideal for engineers who deeply understand how modern models execute in practice and who care about latency, throughput, and memory behavior across the full inference lifecycle.

Responsibilities

  • Design and optimize end-to-end inference pipelines from request ingestion through execution and response
  • Build and evolve inference runtimes that balance latency, throughput, and concurrency under real-world load
  • Reason about batching, queuing, and scheduling tradeoffs, including their impact on tail latency and fairness
  • Manage KV cache allocation, placement, reuse, and eviction across models and requests
  • Optimize prefill and decode paths, including attention mechanisms and memory usage
  • Profile and debug inference performance issues across model, runtime, and system boundaries
  • Work closely with compiler, kernel, networking, and distributed systems teams to deliver end-to-end performance improvements

Qualifications

  • Strong software engineering fundamentals
  • Experience building or operating ML inference or model serving systems
  • Comfort reasoning about performance, memory usage, and system behavior under load

Preferred Qualifications

  • Experience with inference runtimes such as TensorRT-LLM, vLLM, or custom serving systems
  • Deep understanding of modern model architectures and attention mechanisms
  • Experience with batching, scheduling, and concurrency control in inference systems
  • Familiarity with KV cache management and memory placement strategies
  • Experience profiling and tuning latency- and throughput-critical systems
  • Software development experience in Python and C++

Skills

C++ · Python
