Sr. Infrastructure Reliability Engineer, Infrastructure Reliability

Amazon

Herndon · On-site · Full-time · Senior · Yesterday

About the role

Role Overview

As an Infrastructure Reliability Engineer, you will proactively drive reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment (for example, LV generators, MV transformers, LV SWGR, breakers, UPS, HV transformers, and in‑rack power shelves). You will also be responsible for root cause analysis of critical equipment failures and will drive continuous improvements that increase datacenter availability for AWS customers. You will work closely with internal and external partners, including suppliers, to drive key aspects of product specification, risk identification plans, and execution. To succeed in an open, collaborative environment, you must be ownership‑minded, independent, and action‑ and results‑oriented.

Responsibilities

  • Identify, assess, and mitigate reliability risks for datacenter infrastructure equipment.
  • Perform root cause analysis of critical equipment failures.
  • Drive continuous improvements to enhance datacenter availability for AWS customers.
  • Collaborate with internal teams and external suppliers on product specification, risk identification plans, and execution.
  • Use Physics‑of‑Failure based approaches to develop and implement analytical and empirical methods for product quality/reliability risk identification and assessment during design, manufacture, and deployment stages.
  • Conduct lifecycle environmental and operational stress‑driven risk analysis (thermal, electrical, chemical, mechanical) to identify overstress and fatigue‑related product weaknesses.
  • Evaluate product design quality/reliability risks and assess electronics manufacturing process‑related quality/reliability issues.
  • Apply statistical techniques and models to analyze test and field data.
  • Drive critical component identification and associated vendor selection and qualification requirements.
  • Utilize knowledge of process capability for electronic component production and system‑level performance requirements to establish critical‑to‑quality and reliability metrics.
  • Develop datacenter system‑level reliability models and perform reliability quantification and risk analysis for datacenter configuration optimization.
  • Use system reliability engineering tools such as reliability block diagrams, statistical modeling, and data analytics.
  • Monitor product performance in the field during the sustaining stage.
  • Lead root cause analysis of critical failures and implement corrective and preventive actions.
  • Conduct effective vendor auditing and quarterly review processes to improve datacenter availability.

Requirements

  • Experience using Physics‑of‑Failure based approaches for reliability risk identification and assessment.
  • Ability to drive AWS application‑specific requirements for lifecycle environmental and operational stress analysis.
  • Knowledge of statistical techniques and models for test and field data analysis.
  • Capability to evaluate both product design quality/reliability risks and electronics manufacturing process quality/reliability issues.
  • Understanding of process capability for electronic component production and system‑level performance requirements.
  • Familiarity with system reliability engineering tools (reliability block diagram, statistical modeling, data analytics).
  • Strong problem‑analysis and problem‑solving skills.
  • Excellent communication and vendor management abilities.
  • Proven track record in product reliability leadership, business negotiations, and program management.
  • Willingness and ability to travel within the US and internationally.

About the Team

AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure. In other words, we’re the people who keep the cloud running. We support all AWS data centers and all of the servers, storage, networking, power, and cooling equipment that ensure our customers have continual access to the innovation they rely on. We work on the most challenging problems, with thousands of variables impacting the supply chain — and we’re looking for talented people who want to help.

You’ll join a diverse team of software, hardware, and network engineers, supply chain specialists, security experts, operations managers, and other vital roles. You’ll collaborate with people across AWS to help us deliver the highest standards for safety and security while providing seemingly infinite capacity at the lowest possible cost for our customers. And you’ll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.

Why AWS

Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses.

Diverse Experiences

AWS values diverse experiences. Even if you do not meet all of the preferred qualifications and skills listed in the job description, we encourage you to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.

Requirements

  • Expertise in reliability engineering with a proven track record
  • Experience using Physics‑of‑Failure methodologies
  • Strong knowledge of statistical techniques and modeling for test and field data analysis
  • Familiarity with reliability engineering tools such as reliability block diagrams and data analytics
  • Ability to assess electronics manufacturing processes and component reliability
  • Experience in vendor management, auditing, and qualification
  • Strong problem‑analysis, communication, and program management skills
  • Willingness and ability to travel domestically and internationally

Responsibilities

  • Drive reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment
  • Perform root cause analysis of critical equipment failures
  • Implement continuous improvements to increase datacenter availability for AWS customers
  • Collaborate with internal teams and external suppliers on product specification, risk identification plans, and execution
  • Apply Physics‑of‑Failure approaches to develop analytical and empirical reliability assessments during design, manufacture, and deployment
  • Conduct lifecycle environmental and operational stress analysis (thermal, electrical, chemical, mechanical) to identify overstress and fatigue issues
  • Evaluate electronics manufacturing process quality and reliability issues
  • Identify critical components and manage vendor selection and qualification requirements
  • Develop system‑level reliability models and reliability block diagrams, and perform reliability quantification for datacenter configuration optimization
  • Monitor field product performance and drive corrective and preventive actions
  • Lead vendor auditing and quarterly review processes
  • Provide program management and business negotiation support
  • Travel within the US and internationally as required

Skills

  • Physics‑of‑Failure
  • Reliability engineering
  • Statistical modeling
  • Reliability block diagram
  • Data analytics
  • Root cause analysis
  • Vendor management
  • Program management
  • Risk analysis
  • Thermal/electrical/chemical/mechanical stress analysis
  • Process capability assessment
  • Critical to quality metrics
  • Communication
