
Senior Software Engineer, HPC / Distributed Systems

GTN Technical Staffing

Dallas · On-site · Full-time · Senior · $170k–$250k/yr · Posted yesterday

About the role

We are seeking a Senior Software Engineer, HPC Scheduling Platform to help design, build, and scale a high-performance compute platform supporting large-scale research, machine learning, and batch workload execution.

This role sits on an HPC Scheduling team responsible for developing and operating distributed compute systems that enable complex research workloads to run efficiently across Kubernetes-based environments. The team is focused on pushing the boundaries of batch scheduling, multi-cluster orchestration, and scalable infrastructure for advanced ML and compute-intensive workloads.

A major focus of this role will be working on an open-source CNCF project used to support multi-cluster Kubernetes batch job scheduling at scale. This is a hands-on engineering position for someone who enjoys building production-grade software, working deeply with Kubernetes, and solving complex infrastructure challenges in high-scale environments.

The ideal candidate brings strong software engineering experience, a deep interest in Kubernetes and batch computing, and the ability to operate across infrastructure, distributed systems, and platform engineering.

Key Responsibilities

Software Engineering & Platform Development

  • Design, develop, and maintain high-quality software solutions with a strong focus on Go/Golang
  • Build scalable, reliable, and globally distributed systems that support large-scale research and ML workloads
  • Contribute to the development and enhancement of Kubernetes-based scheduling platforms, including Armada
  • Develop and maintain Kubernetes components such as controllers, operators, and custom platform services
  • Apply strong software architecture, computer science fundamentals, and data structure knowledge to guide technical design and code quality

Kubernetes, Scheduling & Distributed Systems

  • Build and operate containerized applications within Kubernetes environments
  • Support advanced workload orchestration, scheduling, and batch processing across multi-cluster environments
  • Work with HPC, Kubernetes, DAG-based workflows, and job scheduling systems such as Slurm
  • Help improve scheduling efficiency, workload placement, resource utilization, and platform reliability
  • Partner with engineering and research teams to support complex compute and ML workload requirements

Infrastructure, Data & Operations

  • Manage and optimize data interactions across relational and non-relational systems, particularly PostgreSQL
  • Support Linux-based systems as part of the core compute and scheduling platform
  • Apply networking fundamentals to troubleshoot, optimize, and improve platform connectivity and performance
  • Diagnose and resolve complex issues across software, infrastructure, Kubernetes, and distributed systems layers
  • Operate systems at scale in cloud environments, ideally AWS

Observability, Automation & Best Practices

  • Build and improve CI/CD pipelines, release processes, and platform engineering workflows
  • Implement and support observability practices using tools such as Prometheus, Grafana, and logging platforms
  • Work with event-driven systems and message queues such as Apache Kafka or Pulsar
  • Drive continuous improvement across reliability, scalability, automation, and engineering standards
  • Stay current with emerging technologies in Kubernetes, HPC, scheduling, and distributed systems

Required Qualifications

  • Strong software engineering background with hands-on experience developing production systems in Go/Golang
  • Experience developing Kubernetes components such as controllers, operators, or custom resources
  • Experience building, operating, or supporting distributed systems at scale
  • Strong working knowledge of Kubernetes, containers, Linux, and cloud infrastructure
  • Experience with batch computing, workload scheduling, HPC, or DAG-based workflow systems
  • Experience with PostgreSQL or similar relational database technologies
  • Familiarity with message queues or event-driven platforms such as Kafka, Pulsar, or similar tools
  • Experience with observability tools such as Prometheus, Grafana, logging systems, and operational dashboards
  • Ability to independently troubleshoot complex technical issues across infrastructure and application layers
  • Strong understanding of software design principles, data structures, and computer science fundamentals

Preferred Qualifications

  • Experience with Armada, Slurm, Volcano, Kueue, or similar scheduling technologies
  • Experience supporting ML, AI, research, or high-throughput compute workloads
  • Experience operating large-scale Kubernetes environments across multiple clusters
  • Experience with AWS or another major cloud provider
  • Background contributing to open-source infrastructure, platform, or CNCF projects
  • Experience with performance tuning, reliability engineering, and large-scale systems optimization

Ideal Profile

The ideal candidate is a hands-on software engineer with deep Kubernetes experience and a strong interest in batch scheduling, HPC, and distributed systems. This person should be comfortable writing production-grade Go, operating Linux and Kubernetes environments at scale, and solving complex scheduling and infrastructure challenges for high-performance research and ML workloads.

Skills

AWS · Armada · Apache Kafka · Cloud · CNCF · Containers · DAG · Docker · Go · Golang · Grafana · HPC · Infrastructure · Kueue · Kubernetes · Linux · Machine Learning · ML · Observability · PostgreSQL · Prometheus · Pulsar · Python · Research · Slurm · SQL · Systems Engineering · Volcano
