Senior Machine Learning Performance Architect

Fintal Partners

New York · On-site · Full-time · Senior · 2w ago

About the role

A market-leading high-frequency trading firm is building a next-generation AI infrastructure platform to support large-scale model training and ultra-low-latency inference workloads. They are seeking a Senior Machine Learning Performance Architect to operate at the intersection of ML research, systems engineering, and cutting-edge hardware.

This role is designed for engineers who bridge the gap between research teams developing state-of-the-art models and the hardware/platform teams responsible for maximizing performance across GPU infrastructure. The focus is on end-to-end optimization of ML workloads across compute, networking, memory, and distributed systems layers.

You will work closely with researchers to understand workload characteristics and partner with hardware and infrastructure engineers to ensure models are fully optimized for modern accelerator architectures. The work spans profiling, benchmarking, systems tuning, distributed training performance, and hardware-aware optimization.

Responsibilities

  • Optimize large-scale training and inference workloads across GPU clusters
  • Partner with ML researchers to improve model efficiency and hardware utilization
  • Profile and analyze bottlenecks across compute, memory, networking, and storage layers
  • Drive performance improvements across distributed training systems and inference pipelines
  • Work closely with hardware teams on accelerator performance, topology optimization, and scaling efficiency
  • Build tooling and benchmarks to evaluate system-level ML performance
  • Improve throughput, latency, reliability, and cluster efficiency for production AI workloads
  • Contribute to low-level optimization work across CUDA, NCCL, PyTorch, and distributed systems infrastructure

Requirements

  • Strong background in machine learning systems and performance engineering
  • Deep understanding of GPU architecture, distributed systems, and hardware-aware optimization
  • Experience with CUDA, PyTorch, NCCL, Triton, or similar ML infrastructure technologies
  • Strong systems programming skills in Python and/or C++
  • Experience profiling large-scale ML workloads and optimizing GPU utilization
  • Understanding of networking technologies such as InfiniBand, RDMA, or high-performance interconnects
  • Experience working closely with research teams on productionizing and scaling models
  • Computer Science, Engineering, Physics, Mathematics, or related technical degree preferred

The environment is highly technical, collaborative, and performance-driven, offering the opportunity to work on some of the most advanced AI infrastructure challenges in industry alongside leading researchers and engineers.

Skills

C++, CUDA, InfiniBand, Machine Learning, NCCL, Networking, Nvidia Triton, Performance Engineering, Python, PyTorch, RDMA
