Skip to content
mimi

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems NEW!

Dutech Systems

Austin · On-site Full-time Senior 5d ago

About the role

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

Location: Austin, TX
Date Posted: 4/2/2026 2:48:14 PM
Job Number: DTS1017187676
Job Type: Contract

Skills: SRE, DevOps, AWS, GCP, Kubernetes, Docker, Python, Go, Linux, Distributed Systems, Monitoring, Logging, SLIs, SLOs, CI/CD, Observability

Job Description

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and operate highly scalable and reliable cloud-based systems. The ideal candidate will have a strong background in DevOps, distributed systems, and cloud infrastructure, with a focus on automation, observability, and system reliability.

This role involves working in a fast-paced environment to ensure system uptime, performance, and operational excellence.

Key Responsibilities

  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes
  • Create and maintain runbooks, dashboards, and operational documentation

Required Qualifications

  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands‑on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes

Preferred Qualifications

  • Experience with observability tools (Prometheus, Grafana, Datadog, Splunk, Application Insights)
  • Experience supporting 24x7 production environments and on‑call rotations
  • Knowledge of chaos engineering and resiliency testing
  • Experience with canary deployments, feature flags, and progressive delivery
  • Strong documentation and communication skills

Requirements

  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands-on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes

Responsibilities

  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes
  • Create and maintain runbooks, dashboards, and operational documentation

Skills

AWSCI/CDDevOpsDockerGCPGoKubernetesLinuxLoggingMonitoringObservabilityPythonSRESLISLODistributed Systems

Don't send a generic resume

Paste this job description into Mimi and get a resume tailored to exactly what the hiring team is looking for.

Get started free