GPU Systems Engineer
Base-2 Solutions
Bethesda · On-site · Full-time · Senior · $152k – $172k/yr · 2d ago
About the role
Position Summary
Support enterprise AI mission systems by designing, developing, and optimizing GPU clusters, with deep focus on operating systems, hardware, GPU platforms, and high-speed networking in a secure customer environment.
Essential Duties and Responsibilities
- Design, configure, and maintain GPU clusters.
- Collaborate with a multidisciplinary team to define and optimize architectures for performance, power efficiency, and required features.
- Work closely with AI/ML engineers to integrate GPUs with Linux-based systems.
- Optimize GPU drivers for compatibility, reliability, and performance.
- Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers.
- Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments.
- Leverage Bash, Python, Ansible, Puppet, and Salt for tooling and automation.
- Maintain technical documentation, architectural specifications, and Linux best practices.
- Support ATO activities and ensure compliance with federal security standards.
Required Qualifications
- Active TS/SCI with ability to obtain a CI Polygraph.
- Bachelor's degree with a minimum of six years of experience in the category field. Three additional years of experience may be substituted for the bachelor's degree.
- Experience managing NVIDIA GPU data center platforms, including DGX, HGX, H200, H100, and L4s.
- Knowledge of enterprise server components, including storage/network controllers, HBAs, and SSDs.
- Strong expertise with Linux distributions, including RHEL, Ubuntu, Oracle Linux, and Rocky Linux.
- Excellent problem-solving skills and the ability to collaborate within a team.
- Meet DoD 8570.01-M IAT Level II certification requirements at a minimum; IAT Level III is also acceptable.
- U.S. citizenship is required due to the nature of the government contracts supported.
Preferred Qualifications
- Experience with Kubernetes cluster management and AI/ML workflow orchestration, including Argo, Airflow, and Kubeflow.
- Familiarity with GPU virtualization and cloud computing.
- Experience with Prometheus and Grafana for monitoring.
- Knowledge of distributed resource scheduling systems such as Slurm, LSF, or similar tools.
Required Education and Experience Equivalency
| Education | Years of Experience |
|---|---|
| High School Diploma/GED | 9 |
| Associate's Degree | 9 |
| Bachelor's Degree | 6 |
| Master's Degree | 6 |
| PhD | 6 |
Required Certifications
- DoD 8570.01-M IAT Level II certification: Security+ CE, CCNA Security, GICSP, GSEC, or SSCP.
Required Security Clearance
- Active TS/SCI with ability to obtain a CI Polygraph.
Pay & Benefit Highlights
Compensation
- Competitive fixed salary or hourly pay (based on experience, skills, location, and internal equity).
- Employee referral bonuses up to $10,000 per hired referral.
- Additional bonus opportunities for exceptional performance and contributions to business development and company growth (role-dependent).
Health
- 100% company-paid medical premiums for employees and eligible dependents.
- Choose from multiple plan options with CareFirst, Kaiser, and UnitedHealthcare, including PPO, POS, HMO, and HSA-compatible plans.
- 100% company-paid dental premiums for employees and eligible dependents.
- 100% company-paid vision premiums for employees and eligible dependents.
Income Protection
- 100% company-paid premiums for short-term disability.
- 100% company-paid premiums for long-term disability.
- 100% company-paid premiums for accidental death & dismemberment (AD&D).
- 100% company-paid premiums for life insurance up to $200,000.
Retirement
- 401(k) with immediate vesting: 4% company match plus a 4% non-elective company contribution (8% total).
- 401(k) pre-tax and Roth options.
Leave
- Up to 20 days of flexible paid time off (PTO).
- 11 paid floating holidays.
Work-Life Balance
- Flexible work schedules, including flex time and compressed work periods (contract and project-dependent).
Skills
Ansible, Bash, Docker, Grafana, H100, H200, Kubernetes, L4s, Linux, NVIDIA, Puppet, Python, Salt, Slurm