HPC Scientific Software Engineer - Research Computing
Johns Hopkins University
About the role
IT@JH Research Computing is looking for an HPC Sr. Scientific Software Engineer to join our team and contribute to the design, construction, and support of Johns Hopkins University's high-performance computing and AI research infrastructure. We provide a dynamic environment that brings together systems and software engineering to deliver scalable and reproducible solutions for data-intensive research. Our team thrives on collaboration, continuous learning, and supporting innovative research initiatives. This full-time position offers a starting salary range of $99,800 to $175,000 annually, commensurate with experience, and is based at the Johns Hopkins Bayview campus, operating Monday to Friday from 8:30 AM to 5:00 PM.
Responsibilities
- Design and implement strategies for deploying scientific software on HPC and AI systems.
- Create computational workflows, select the most effective software configurations, and automate deployment with tools such as Ansible.
- Assist teams in tuning and optimizing AI models and gateway applications such as XDMoD, ColdFront, Open OnDemand, CryoSPARC Live, SBGrid, and AI agents.
- Analyze and enhance the performance of AI models and HPC applications, prioritizing GPU-enabled computing.
- Establish parallel processing, distributed computing, and resource management methods for efficient job execution.
- Develop, debug, and maintain software tools, libraries, and frameworks essential for HPC and AI tasks.
- Collaborate with system teams and software vendors, including NVIDIA, Intel, and MATLAB, to optimize performance.
- Utilize CUDA, DNN, TensorRT, and Intel Compilers to boost system efficiency.
- Oversee scientific software deployment across HPC, cloud, and colocation facilities.
- Manage the installation, configuration, and upkeep of HPC packages using tools like CMake, Make, EasyBuild, Spack, and Lua module files.
- Engage closely with cross-functional teams, including researchers and software developers, to tackle complex HPC/AI problems.
- Mentor junior engineers and promote a culture of continuous learning.
- Resolve technical challenges and conduct root cause analyses for HPC/AI software issues.
- Implement solutions to enhance system reliability and prevent issues from recurring.
- Conduct training workshops for researchers and students on troubleshooting, workflow optimization, and utilizing HPC systems effectively.
- Remain updated on advancements in HPC and AI technologies and methodologies.
- Integrate new research into current systems to enhance performance and capabilities.
- Develop and oversee container orchestration strategies ensuring application scalability, reliability, and security.
- Create thorough documentation for system architectures, performance metrics, and project progress.
- Ensure adherence to security and regulatory requirements for all HPC and AI platforms.
Requirements
- PhD in a quantitative discipline.
- Five years of experience in HPC user support, software deployment, and performance optimization within an academic or research environment.
- Additional education may substitute for required experience, and additional related experience may substitute for required education beyond a high school diploma, as permitted by the JHU equivalency formula.
- Eight or more years of professional experience in high-performance computing, large-scale systems, or research software engineering (preferred).
- Deep proficiency in Linux systems administration, performance tuning, and automation tools such as Ansible, Terraform, Jenkins, or similar (preferred).
- Experience with cluster management, workload schedulers (e.g., Slurm), and distributed or parallel file systems (e.g., GPFS, Lustre, WekaFS, Ceph) (preferred).
- Strong programming or scripting skills in languages such as Python, Bash, C/C++, Go, or Rust (preferred).
- Familiarity with containerization and orchestration technologies used in HPC (e.g., Singularity, Apptainer, Docker, Kubernetes) (preferred).
- Understanding of high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and storage/data access patterns for AI and analytics (preferred).
- Experience in developing or maintaining CI/CD pipelines and module environments (Lmod/Spack) for research software (preferred).
- Knowledge of GPU computing (CUDA, ROCm), MPI/OpenMP, and AI/ML frameworks (preferred).
- Demonstrated ability to collaborate with researchers on performance optimization, workflow design, and reproducible computing (preferred).
Tech Stack
- AI
- Ansible
- Bash
- CI/CD
- Cloud
- Ceph
- CUDA
- Docker
- ELK
- Ethernet
- Grafana
- InfiniBand
- Support
- Jenkins
- Kubernetes
- Linux
- Matlab
- Prometheus
- Python
- Rust
- Security
- Terraform
- Machine-Learning
Location
3910 Keswick Road, Baltimore, United States
Salary
$99,800 - $175,000 per year
Benefits
- None explicitly listed.