Senior Site Reliability Engineer (SRE)

Jobs via Dice

New York · On-site · Full-time · Senior · 6d ago

About the role

Role Overview

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to design, build, and scale observability and reliability solutions for enterprise-grade distributed systems. This role focuses on enhancing system reliability, performance, and operational excellence through advanced telemetry, automation, and cloud-native best practices, with a strong emphasis on AWS environments.

Key Responsibilities

  • Design, implement, and maintain end-to-end observability solutions, including metrics, logging, and distributed tracing.
  • Build and manage real-time monitoring dashboards and alerting systems using tools like Datadog, Splunk, Prometheus, Grafana, or the ELK stack.
  • Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure system reliability.
  • Lead incident response efforts, including troubleshooting, root cause analysis, and rapid resolution of production issues.
  • Drive reliability engineering practices, including post-mortems and continuous improvement initiatives.
  • Automate operational and monitoring workflows using Python, Bash, or Go.
  • Develop self-healing systems and auto-remediation capabilities to reduce manual intervention.
  • Collaborate closely with DevOps, Cloud, and Security teams to enhance CI/CD pipelines and infrastructure resilience.
  • Optimize application and infrastructure performance, scalability, and availability in cloud environments.
  • Advocate for reliability, observability, and operational best practices within engineering teams.

Core Qualifications

  • 10+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
  • Extensive expertise in observability and monitoring tools (Datadog, Splunk, Prometheus, Grafana, ELK stack).
  • Hands-on experience managing incidents and participating in on-call rotations in production environments.
  • Proficiency in Linux system administration, networking fundamentals, and performance tuning.
  • Strong programming/scripting skills in Python, Bash, Go, or similar languages.
  • Experience with containerization and orchestration tools like Docker and Kubernetes.
  • Extensive experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, etc.).
  • Solid understanding of distributed systems, high availability, and scalability patterns.

AWS and Cloud Expertise

  • Hands-on experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, DynamoDB, and VPC networking.
  • Familiarity with AWS native observability tools, including CloudWatch, X-Ray, and CloudTrail.
  • Familiarity with automating infrastructure using AWS CloudFormation or Terraform.
  • Experience implementing scalable, fault-tolerant architectures in AWS environments.
  • Understanding of cost optimization and performance tuning in cloud-native systems.

Bonus Skills

  • Experience with AIOps, anomaly detection, and predictive monitoring solutions.
  • Knowledge of Infrastructure as Code (Terraform, Ansible, Pulumi).
  • Exposure to security monitoring and compliance, including their integration with observability platforms.
  • Experience with event-driven architectures and streaming platforms (e.g., Kafka).
  • Familiarity with chaos engineering and resilience testing practices.

Skills

AWS, Bash, CloudFormation, CI/CD, Docker, ELK stack, Go, Grafana, Infrastructure as Code, Jenkins, Kafka, Kubernetes, Lambda, Linux, Monitoring, Observability, Prometheus, Python, Reliability Engineering, SRE, Splunk, Terraform, X-Ray
