Site Reliability Engineering (SRE) and AIOps Lead Architect

Charles Schwab

Bridgewater · On-site · Full-time · Senior · $210k – $240k/yr · 2mo ago

About the role

Your Opportunity

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together.

We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in Austin, Texas.

In this role, you'll lead the technical vision and architecture for our Site Reliability Engineering (SRE) and AIOps function, shaping how reliability, automation, and intelligent operations scale across the enterprise. This is not a traditional production support role—it requires engineering and coding experience. You'll work at the intersection of cloud-native platforms, distributed systems, and AI-driven operations, partnering closely with Engineering, Product, Security, and Infrastructure leaders to build resilient, self-healing systems that support millions of clients. This is a highly visible leadership role where your expertise influences both technology strategy and how teams operate day to day.

What You'll Do

Define and own the end-to-end reliability architecture, including SLO/SLI frameworks, error budget policies, observability standards, and resilience patterns across distributed microservices environments.
Design and architect the AIOps platform encompassing ML-driven anomaly detection, predictive alerting, automated root cause analysis, event correlation, and intelligent remediation workflows.
Lead infrastructure and platform design decisions.
Guide architecture choices for cloud-native infrastructure spanning GCP, AWS, and Azure; Kubernetes orchestration; service mesh technologies such as Istio and Envoy; infrastructure-as-code using Terraform or Pulumi; and multi-region disaster recovery strategies.
Architect a unified observability stack that integrates metrics, logs, traces, and events using OpenTelemetry, Grafana, Datadog, and custom ML pipelines for intelligent alerting.
Drive the architecture of automated remediation frameworks, self-healing infrastructure, chaos engineering pipelines, and progressive deployment strategies—including canary, blue-green, and feature flag approaches—to achieve zero-touch operations.
Establish architecture review boards, technical standards, design patterns, and reference architectures; lead technical due diligence and drive consistency across SRE and platform teams.
Build and mentor a team of senior SRE architects and engineers.
Foster a culture of engineering excellence, continuous learning, and innovation in reliability and AI-driven operations.
Align reliability and AIOps investments with business priorities and present technical strategies to executive stakeholders.

What You Bring

12+ years of experience in software development and engineering, infrastructure, or SRE, with at least 5 years in a senior architecture or technical leadership role.
Deep expertise in distributed systems, cloud-native architectures, and large-scale production environments.
Hands-on experience with Kubernetes, Docker, service mesh, CI/CD pipelines, and infrastructure-as-code tools.
Strong understanding of ML/AI concepts and their application to operational intelligence—including anomaly detection, predictive scaling, log analysis, and automated diagnostics.
Proven experience designing observability platforms using OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, or equivalent technologies.
Expertise in incident management frameworks, chaos engineering, and SLO-driven reliability practices.
Experience with major cloud platforms (AWS, GCP, Azure) at scale.
Strong communication skills and executive presence, with the ability to translate complex technical concepts for non-technical stakeholders.

Compensation and Benefits

This position offers a salary range of $210,000 to $240,000 per year. In addition to the base salary, this role is eligible for bonus or incentive opportunities.

As a full-time employee at Schwab, you'll receive a competitive benefits package that takes care of the whole you—both today and in the future. This includes a 401(k) with company match and an Employee Stock Purchase Plan, paid time for vacation and volunteering, and a 28-day sabbatical after every 5 years of service for eligible positions. You'll also have access to paid parental leave and family building benefits, tuition reimbursement, and comprehensive health, dental, and vision insurance.

The application deadline for this role is March 20, 2026.

Skills

AWS, Azure, Canary deployments, Chaos engineering, CI/CD, Datadog, Docker, Envoy, Feature flags, GCP, Grafana, Infrastructure as Code, Istio, Kubernetes, ML, OpenTelemetry, Prometheus, Pulumi, Service Mesh, Splunk, Terraform
