Principal Architect
Arganteal Corporation
About the role
Overview
The Principal Architect plays a pivotal role in leading professional services engagements focused on High-Performance Computing (HPC) and AI technologies. You will coordinate cross-functional technical teams and liaise with engineers and clients on cutting-edge AI-driven projects. The position involves managing multiple concurrent customer projects and integrating AI solutions with enterprise IT systems.
Role Summary
As a Principal Architect, you will be positioned at the forefront of the AI revolution, working with some of the most advanced hardware available. Your contributions will significantly impact research and enterprise environments as they transition into the future of technology. This role requires a senior-level individual skilled in the architecture, implementation, and optimization of large-scale HPC and AI platforms, primarily utilizing NVIDIA's data center ecosystem. The position combines hands-on technical work with advisory responsibilities to clients.
You will serve as a technical authority in areas such as GPU-accelerated computing, high-performance networking, and parallel storage platforms. Your influence will shape architectural standards and delivery outcomes, ensuring that customer deployments are delivered successfully, on time, and within budget.
This is a remote position with approximately 10% travel expected; travel may increase during critical project phases or customer engagements.
Key Responsibilities
Architecture and Design:
- Lead the comprehensive architecture development of GPU-accelerated HPC and AI platforms, including both new AI factory designs and the optimization of existing environments.
- Architect integrated solutions encompassing Compute, Networking, and Storage using NVIDIA's HGX and DGX platforms, Grace CPU architectures, Spectrum-X networking, and high-performance parallel storage solutions.
- Devise storage architectures that are optimized for AI training, inference, and HPC workloads while balancing performance, scalability, resiliency, and cost.
- Establish reference architectures, design patterns, and best practices to facilitate repeatable, supportable customer deployments.
Platform Implementation and Optimization:
- Provide technical leadership during implementation phases, which includes cluster initialization, performance tuning, and workload optimization.
- Design and integrate workload orchestration and scheduling platforms utilizing NVIDIA Base Command Manager, Slurm, Kubernetes, and Run:AI.
- Enhance end-to-end data pipelines focusing on GPU utilization, storage throughput, metadata performance, and job scheduling efficiency.
- Identify and remedy performance bottlenecks across Compute, Networking, and Storage domains.
Storage Architecture & Data Performance:
- Design and validate high-performance storage solutions leveraging modern parallel and scale-out platforms.
- Bring hands-on experience with storage technologies such as VAST Data, WEKA, DDN, Lustre, or NetApp.
- Create storage solutions that cater to demanding AI and HPC workloads including high-throughput training pipelines and large-scale shared datasets.
- Collaborate with compute and networking designs to ensure seamless, high-performance architectures.
Technical Authority and Advisory:
- Act as a senior technical expert for HPC and AI architecture both internally and in customer interactions.
- Engage in customer discussions to validate architectural and delivery plans with an emphasis on design integrity and meticulous execution.
- Influence platform standards and architectural direction through expertise and demonstrated success.
Delivery Excellence:
- Proactively identify technical risks across compute, networking, storage, and orchestration layers, and develop mitigation strategies.
- Collaborate with project management to address risks and issues, ensuring production-ready and supportable platforms.
- Ensure adherence to best practices and templates for AI solution delivery by staff, contractors, and partners.
- Review technical documents, assessments, and outputs to maintain consistency and accuracy in line with company standards.
Required Technical Expertise
Core Mastery Areas:
- Expertise in NVIDIA data center platforms, particularly HGX and DGX platforms.
- Architectural knowledge of GPU-accelerated compute for AI and HPC workloads.
- Proficient in high-performance networking architectures, notably with Spectrum-X.
- Experience in large-scale designs for AI factories and HPC platforms.
Storage Expertise:
- Architectural experience with high-performance parallel or scale-out storage solutions.
- In-depth understanding of storage performance metrics associated with AI and HPC workloads, including bandwidth, IOPS, latency, and metadata scaling.
- Experience integrating storage platforms like VAST Data, NetApp, WEKA, DDN, or Lustre within GPU-accelerated environments.
Working Proficiency:
- Familiarity with NVIDIA Base Command Manager for cluster management.
- Experience with Slurm for scheduling and resource management in HPC.
- Knowledge of Run:AI for optimizing GPU workloads.
- Proficiency in Kubernetes for managing GPU-accelerated workloads.
- Linux system administration skills in large-scale, performance-critical environments.
- Understanding of containerized AI workflows and their interactions.
Additional Experience
- Track record of optimizing HPC or AI platforms for enhanced performance and cost-effectiveness.
- Experience with regulated or multi-site environments is an added advantage.
- Familiarity with liquid cooling, power/cooling design, and data center integration is highly preferred.
Leadership & Influence
- Senior role distinguished by technical authority rather than direct personnel management.
- Ability to mentor peers through design reviews and technical guidance.
- Comfortable functioning independently in complex, influential technical settings.
Documentation & Repeatability Expectations
- Maintain high-quality documentation, including design blueprints, configuration guides, and operational runbooks.
- Ensure technical artifacts align with company standards for clarity and accuracy.
- Develop reusable templates and reference architectures to streamline future projects.
- Promote a culture of documentation discipline for reproducible and supportable deployments.
Educational/Experience Requirements
- Bachelor's degree in a technical field or equivalent experience in architecting large-scale HPC or AI systems.
- An advanced degree (MS/PhD) in relevant areas is a plus.
- Experience: 10+ years in HPC, Data Center Architecture, or Systems Engineering.
- Bare Metal Focus: Strong understanding of on-premises hardware challenges.
- Demonstrated experience as a Senior or Lead Architect in AI projects.