Director, Engineering, AI Accelerator Cluster Systems

Google

Seattle · flexible · Full-time · Executive · $307k – $427k/yr · 2mo ago

About The Job

As the Director, Engineering, AI Accelerator Cluster Systems, you will be responsible for driving the provisioning architecture, cluster operations experience, physical-to-software integration, and overall operator experience for Google Cloud’s bare metal accelerator infrastructure. In this role, you will sit at the intersection of physical systems engineering and the underlying software stack, leading strategic efforts to deliver AI infrastructure on premises.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

The US base salary range for this full-time position is $307,000-$427,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities

Build physical infrastructure and custom networking, while designing data centers focused on custom cooling, rack density, and CDU placement.
Oversee the core bare-metal software stack, including drivers, OS, firmware, and accelerator release management.
Engineer systems with direct GPU/TPU access, ensuring high compute density and low latency for training foundation models.
Partner cross-functionally to optimize the end-to-end AI accelerator stack from large-scale networking to Kubernetes.
Collaborate with product and GTM leaders to shape the multi-year bare-metal AI infrastructure strategy.

Minimum Qualifications

Master's degree in Computer Science, Computer Engineering, a related technical field, or equivalent practical experience.
15 years of professional engineering experience.
5 years of experience in a senior leadership role managing large-scale infrastructure or systems teams.
Experience in engineering leadership managing both physical systems design (electrical/mechanical) and the software stack (operating systems, firmware, and drivers) for networking or computing products.

Preferred Qualifications

Experience building and scaling cloud platforms or high-performance computing infrastructure in a fast-paced, dynamic environment.
Exposure to, or experience with, physical data center engineering, infrastructure certification, and physical constraints (e.g., custom cooling schemes, power delivery, rack density).
Strong executive communication skills, with the ability to simplify complex silicon-to-software co-design topics to influence senior leadership and company strategy.
Technical expertise in bare metal provisioning, advanced accelerators (TPUs/GPUs), and high-performance cluster networking.
Deep technical expertise in ML workloads, AI infrastructure scaling, and the unique performance requirements of foundation models.

Benefits

In accordance with Washington state law, we are highlighting our comprehensive benefits package, which is available to all eligible US-based employees. Benefits for this role include:

Health, dental, vision, life, and disability insurance
Retirement Benefits: 401(k) with company match
Paid Time Off: 20 days of vacation per year, accruing at a rate of 6.15 hours per pay period for the first five years of employment
Sick Time: 40 hours/year (increased to 69 hours/year for Seattle) including 5 discretionary sick days per instance
Maternity Leave (Short-Term Disability + Baby Bonding): 28-30 weeks
Baby Bonding Leave: 18 weeks
Holidays: 13 paid days per year

Note: By applying to this position you will have an opportunity to share your preferred working location from the following: Kirkland, WA, USA; Seattle, WA, USA.

Google is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or Veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. See also Google's EEO Policy and EEO is the Law. If you have a disability or special need that requires accommodation, please let us know by completing our Accommodations for Applicants form.

Skills

AI, Kubernetes, TPU, GPU
