Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems
Dutech Systems
About the role
Location: Austin, TX
Date Posted: 4/2/2026 2:48:14 PM
Job Number: DTS1017187676
Job Type: Contract
Skills: SRE, DevOps, AWS, GCP, Kubernetes, Docker, Python, Go, Linux, Distributed Systems, Monitoring, Logging, SLIs, SLOs, CI/CD, Observability
Job Description
We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and operate highly scalable and reliable cloud-based systems. The ideal candidate will have a strong background in DevOps, distributed systems, and cloud infrastructure, with a focus on automation, observability, and system reliability.
This role involves working in a fast-paced environment to ensure system uptime, performance, and operational excellence.
Key Responsibilities
- Design, implement, and manage highly available, distributed systems
- Maintain and optimize cloud infrastructure (AWS/GCP)
- Develop automation scripts using Python, Go, Java, or Bash
- Manage containerized environments using Docker and Kubernetes
- Define and monitor SLIs, SLOs, and error budgets
- Implement monitoring, logging, and alerting solutions
- Lead incident management, root cause analysis (RCA), and postmortems
- Ensure system security and compliance within operational workflows
- Improve system reliability through performance tuning and optimization
- Collaborate with engineering teams to enhance deployment and release processes
- Create and maintain runbooks, dashboards, and operational documentation
Required Qualifications
- 8+ years of experience in SRE, DevOps, or Systems Engineering
- Strong expertise in Linux/Unix systems and system internals
- Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
- Experience designing and operating distributed systems
- Hands‑on experience with cloud platforms (AWS or GCP)
- Experience with Docker and Kubernetes
- Strong understanding of monitoring, alerting, and logging concepts
- Experience managing SLIs, SLOs, and error budgets
- Experience with incident management and RCA processes
Preferred Qualifications
- Experience with observability tools (Prometheus, Grafana, Datadog, Splunk, Application Insights)
- Experience supporting 24x7 production environments and on‑call rotations
- Knowledge of chaos engineering and resiliency testing
- Experience with canary deployments, feature flags, and progressive delivery
- Strong documentation and communication skills