Site Reliability Engineer/SRE

Innosoul inc

Bee Cave · Hybrid · Full-time · Senior · 6d ago

About the role

Job ID

TX-529601671

Title

Hybrid/Local TX Govt Site Reliability Engineer/SRE (15+) with DevOps/System Engineering, Linux/Unix, Python/Go/Java/Bash, AWS/Google Cloud Platform, Docker/Kubernetes, SLIs/SLOs, Prometheus/Grafana/Datadog/Splunk/Application Insights experience

Location

Austin, TX (HHSC)

Duration

3 Months

Work Arrangement

  • 3 days remote
  • 2 days onsite (Mondays and Thursdays) at the location listed above
  • Program will accept local candidates only for this position

Required Skills

  • Required, 8 years: Experience in systems engineering, DevOps, or site reliability engineering roles
  • Required, 8 years: Strong experience with Linux/Unix systems and system internals
  • Required, 8 years: Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)
  • Required, 8 years: Experience designing and operating highly available, distributed systems
  • Required, 8 years: Strong knowledge of cloud platforms (AWS or Google Cloud Platform) and cloud-native services
  • Required, 8 years: Experience with containerization and orchestration (Docker, Kubernetes)
  • Required, 8 years: Strong understanding of monitoring, alerting, and logging concepts
  • Required, 8 years: Experience defining and managing SLIs, SLOs, and error budgets
  • Required, 8 years: Familiarity with incident management, root cause analysis (RCA), and postmortems
  • Required, 8 years: Experience integrating security and compliance into operational workflows

Preferred Skills

  • Preferred, 4 years: Familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk)
  • Preferred, 4 years: Experience operating 24×7 production environments with on-call rotations
  • Preferred, 4 years: Experience with chaos engineering and resiliency testing
  • Preferred, 4 years: Experience with feature flags, canary deployments, and progressive delivery
  • Preferred, 4 years: Strong documentation skills for runbooks, dashboards, and operational standards

Description

Requires 8 or more years of experience. Relies on experience and judgment to plan and accomplish goals, and independently performs a variety of complicated tasks. A wide degree of creativity and latitude is expected.

Understands business objectives and problems, identifies alternative solutions, and performs studies and cost/benefit analyses of alternatives. Analyzes user requirements, procedures, and problems to automate processing or to improve existing computer systems. Confers with personnel of the organizational units involved to analyze current operational procedures, identify problems, and learn specific input and output requirements, such as forms of data input, how data is to be summarized, and formats for reports. Writes detailed descriptions of user needs, program functions, and the steps required to develop or modify computer programs. Reviews computer system capabilities, specifications, and scheduling limitations to determine whether a requested program or program change is possible within the existing system.

The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations. The role partners with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).

Requirements

  • 8 or more years of experience in systems engineering, DevOps, or site reliability engineering roles
  • Strong experience with Linux/Unix systems and system internals
  • Proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)
  • Experience designing and operating highly available, distributed systems
  • Strong knowledge of cloud platforms (AWS or Google Cloud Platform) and cloud-native services
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience defining and managing SLIs, SLOs, and error budgets
  • Familiarity with incident management, root cause analysis (RCA), and postmortems
  • Experience integrating security and compliance into operational workflows

Responsibilities

  • Ensure the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations.
  • Partner with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).

Skills

AWS, Bash, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Linux, Python, Splunk, Unix
