ML Infrastructure Engineer
Jobgether
About the role
Join a cutting-edge AI infrastructure environment focused on powering the next generation of machine learning and large-scale AI workloads. This role offers the opportunity to work at the intersection of GPU performance engineering, deep learning optimization, and cloud-scale infrastructure development. You will contribute directly to benchmarking and optimizing advanced GPU platforms that support training and inference for complex neural networks and AI systems. Working alongside highly skilled engineering and hardware teams, you will help drive performance improvements across compute architectures, software stacks, and distributed AI environments.

The position is ideal for engineers passionate about ML systems, large-scale model performance, and infrastructure innovation. With exposure to modern AI frameworks, high-performance GPU ecosystems, and international collaboration, this role provides a strong platform for technical growth and meaningful impact within the AI industry.
Accountabilities
- Benchmark and evaluate GPU platform performance for machine learning and AI workloads across various architectures, frameworks, and software environments.
- Collaborate closely with hardware and engineering teams to profile GPU performance at system and kernel levels and identify optimization opportunities.
- Analyze, debug, and optimize training and inference workloads to improve efficiency, scalability, and overall hardware utilization.
- Conduct acceptance testing for new GPU clusters to validate performance, stability, compatibility, and operational readiness for AI workloads.
- Perform experiments across multiple GPU configurations and interconnect strategies to assess system-level scalability and performance trade-offs.
- Develop internal tools, dashboards, and reporting frameworks to visualize performance metrics, bottlenecks, and infrastructure trends.
- Contribute to infrastructure best practices, internal tooling enhancements, and benchmarking methodologies for AI and ML environments.
- Support ongoing platform optimization efforts related to distributed training, inference acceleration, parallelism strategies, and hardware-aware performance tuning.
Requirements
- Strong theoretical foundation in machine learning, deep learning architectures, and AI system optimization principles.
- Deep understanding of performance optimization techniques for large neural network training and inference, including parallelism strategies, kernel optimization, batching, and hardware acceleration.
- Extensive experience with modern deep learning frameworks such as PyTorch, JAX, Megatron-LM, TensorRT-LLM, or equivalent technologies.
- Solid expertise with GPU technologies and software stacks including CUDA, NCCL, GPU drivers, and performance-related libraries.
- Experience profiling and debugging GPU workloads using tools such as Nsight, nvprof, perf, or similar performance analysis platforms.
- Familiarity with containerized and distributed environments including Docker and Kubernetes.
- Strong programming and scripting skills, particularly in Python and performance-oriented development workflows.
- Excellent problem-solving, analytical thinking, and communication skills with the ability to work independently in highly technical environments.
- Experience with LLM inference frameworks such as vLLM, SGLang, or TensorRT-LLM is considered a strong advantage.
- Familiarity with cloud-based ML ecosystems such as AWS, Google Cloud Platform, or Azure ML is beneficial.
- Contributions to open-source ML tooling, benchmarking frameworks, or infrastructure projects are highly valued.
Benefits
- Competitive compensation package aligned with experience and technical expertise.
- Flexible remote work environment supporting strong work-life balance.
- Access to continuous learning, career development, and growth opportunities within the AI infrastructure space.
- Opportunity to work on impactful AI projects shaping the future of machine learning infrastructure and cloud computing.
- Collaborative and innovation-driven engineering culture with strong technical ownership and autonomy.
- International work environment with exposure to globally distributed teams and advanced AI technologies.
- Fast-paced setting focused on bold thinking, experimentation, and continuous technical evolution.
- Opportunity to contribute to high-performance AI systems used by developers and enterprises worldwide.
How Jobgether Works
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.