HPC Scientific Software Engineer - Research Computing

Johns Hopkins University

Baltimore · On-site · Full-time · Senior · $100k – $175k/yr · 3mo ago

About the role

IT@JH Research Computing is looking for an HPC Sr. Scientific Software Engineer to join our team and contribute to the design, construction, and support of Johns Hopkins University's high-performance computing and AI research infrastructure. We provide a dynamic environment that brings together systems and software engineering to deliver scalable and reproducible solutions for data-intensive research. Our team thrives on collaboration, continuous learning, and supporting innovative research initiatives. This full-time position offers a starting salary range of $99,800 to $175,000 annually, commensurate with experience, and is based at the Johns Hopkins Bayview campus, with working hours Monday to Friday from 8:30 AM to 5:00 PM.

Responsibilities

Design and implement strategies for deploying scientific software on HPC and AI systems.
Create computational workflows and select the most effective software configurations, using tools such as Ansible for automation.
Assist teams in tuning and optimizing AI models and gateway applications like XDMoD, ColdFront, Open OnDemand, CryoSPARC Live, SBGrid, and AI Agents.
Analyze and enhance the performance of AI models and HPC applications, prioritizing GPU-enabled computing.
Establish parallel processing, distributed computing, and resource management methods for efficient job execution.
Develop, debug, and maintain software tools, libraries, and frameworks essential for HPC and AI tasks.
Collaborate with system teams and software vendors including NVIDIA, Intel, and MathWorks (MATLAB) to optimize performance.
Utilize CUDA, DNN, TensorRT, and Intel Compilers to boost system efficiency.
Oversee scientific software deployment across HPC, cloud, and colocation facilities.
Manage the installation, configuration, and upkeep of HPC packages using tools like CMake, Make, EasyBuild, Spack, and Lua module files.
Engage closely with cross-functional teams, including researchers and software developers, to tackle complex HPC/AI problems.
Mentor junior engineers and promote a culture of continuous learning.
Resolve technical challenges and conduct root cause analyses for HPC/AI software issues.
Implement solutions to enhance system reliability and prevent issues from recurring.
Conduct training workshops for researchers and students on troubleshooting, workflow optimization, and utilizing HPC systems effectively.
Remain updated on advancements in HPC and AI technologies and methodologies.
Integrate new research into current systems to enhance performance and capabilities.
Develop and oversee container orchestration strategies ensuring application scalability, reliability, and security.
Create thorough documentation for system architectures, performance metrics, and project progress.
Ensure adherence to security and regulatory requirements for all HPC and AI platforms.

Requirements

PhD in a quantitative discipline.
Five years of experience in HPC user support, software deployment, and performance optimization within an academic or research environment.
Additional education may substitute for required experience, and additional related experience may substitute for required education beyond a high school diploma, as permitted by the JHU equivalency formula.
Eight or more years of professional experience in high-performance computing, large-scale systems, or research software engineering (preferred).
Deep proficiency in Linux systems administration, performance tuning, and automation tools such as Ansible, Terraform, Jenkins, or similar (preferred).
Experience with cluster management, workload schedulers (e.g., Slurm), and distributed or parallel file systems (e.g., GPFS, Lustre, WekaFS, Ceph) (preferred).
Strong programming or scripting skills in languages such as Python, Bash, C/C++, Go, or Rust (preferred).
Familiarity with containerization and orchestration technologies used in HPC (e.g., Singularity, Apptainer, Docker, Kubernetes) (preferred).
Understanding of high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and storage/data access patterns for AI and analytics (preferred).
Experience in developing or maintaining CI/CD pipelines and module environments (Lmod/Spack) for research software (preferred).
Knowledge of GPU computing (CUDA, ROCm), MPI/OpenMP, and AI/ML frameworks (preferred).
Demonstrated ability to collaborate with researchers on performance optimization, workflow design, and reproducible computing (preferred).

Tech Stack

AI
Ansible
Bash
CI/CD
Cloud
Ceph
CUDA
Docker
ELK
Ethernet
Grafana
InfiniBand
Support
Jenkins
Kubernetes
Linux
Matlab
Prometheus
Python
Rust
Security
Terraform
Machine-Learning

Location

Keswick Road 3910, Baltimore, United States

Salary

$99,800 - $175,000 per year

Benefits

None explicitly listed.

Skills

AI
Ansible
Bash
C++
Ceph
CI/CD
Cloud
CMake
Containerization
CUDA
Docker
EasyBuild
ELK
Ethernet
Grafana
InfiniBand
Intel Compilers
Jenkins
Kubernetes
Linux
Lmod
Lua
Make
Matlab
MPI
OpenMP
Prometheus
Python
ROCm
Rust
Security
Slurm
Spack
Terraform
TensorRT
XDMoD
