Senior Site Reliability Engineer (SRE)

Jobs via Dice

New York · On-site · Full-time · Senior · 1mo ago

About the role

Role Overview

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to design, build, and scale observability and reliability solutions for enterprise-grade distributed systems. This role focuses on enhancing system reliability, performance, and operational excellence through advanced telemetry, automation, and cloud-native best practices, with a strong emphasis on AWS environments.

Key Responsibilities

Design, implement, and maintain end-to-end observability solutions, including metrics, logging, and distributed tracing.
Build and manage real-time monitoring dashboards and alerting systems using tools like Datadog, Splunk, Prometheus, Grafana, or the ELK stack.
Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure system reliability.
Lead incident response efforts, including troubleshooting, root cause analysis, and rapid resolution of production issues.
Drive reliability engineering practices, including post-mortems and continuous improvement initiatives.
Automate operational and monitoring workflows using Python, Bash, or Go.
Develop self-healing systems and auto-remediation capabilities to reduce manual intervention.
Collaborate closely with DevOps, Cloud, and Security teams to enhance CI/CD pipelines and infrastructure resilience.
Optimize application and infrastructure performance, scalability, and availability in cloud environments.
Advocate for reliability, observability, and operational best practices within engineering teams.

Core Qualifications

10+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
Extensive expertise in observability and monitoring tools (Datadog, Splunk, Prometheus, Grafana, ELK stack).
Hands-on experience managing incidents and participating in on-call rotations in production environments.
Proficiency in Linux system administration, networking fundamentals, and performance tuning.
Strong programming/scripting skills in Python, Bash, Go, or similar languages.
Experience with containerization and orchestration tools like Docker and Kubernetes.
Extensive experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, etc.).
Solid understanding of distributed systems, high availability, and scalability patterns.

AWS and Cloud Expertise

Hands-on experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, DynamoDB, and VPC networking.
Familiarity with AWS native observability tools, including CloudWatch, X-Ray, and CloudTrail.
Familiarity with automating infrastructure using AWS CloudFormation or Terraform.
Experience implementing scalable, fault-tolerant architectures in AWS environments.
Understanding of cost optimization and performance tuning in cloud-native systems.

Bonus Skills

Experience with AIOps, anomaly detection, and predictive monitoring solutions.
Knowledge of Infrastructure as Code (Terraform, Ansible, Pulumi).
Exposure to security monitoring and compliance, and to integrating both with observability platforms.
Experience with event-driven architectures and streaming platforms (e.g., Kafka).
Familiarity with chaos engineering and resilience testing practices.

Skills

AWS, Bash, CloudFormation, CI/CD, Docker, ELK stack, Go, Grafana, Infrastructure as Code, Jenkins, Kafka, Kubernetes, Lambda, Linux, Monitoring, Observability, Prometheus, Python, Reliability Engineering, SRE, Splunk, Terraform, X-Ray
