Senior Site Reliability Engineer (SRE)

Jobs via Dice

Berkeley Heights · On-site · Full-time · Senior · Posted 2mo ago

About the role

Role Overview

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to design, build, and scale observability and reliability solutions across enterprise-grade distributed systems. This role focuses on improving system reliability, performance, and operational excellence through advanced telemetry, automation, and cloud-native best practices, with a strong emphasis on AWS environments.

Key Responsibilities

Design, implement, and maintain end-to-end observability solutions including metrics, logging, and distributed tracing.
Build and manage real-time monitoring dashboards and alerting systems using tools such as Datadog, Splunk, Prometheus, Grafana, or ELK.
Develop and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure system reliability.
Lead incident response efforts, including troubleshooting, root cause analysis, and rapid resolution of production issues.
Drive reliability engineering practices including post-incident reviews and continuous improvement initiatives.
Automate operational and monitoring workflows using Python, Bash, or Go.
Develop self-healing systems and auto-remediation capabilities to reduce manual intervention.
Collaborate closely with DevOps, Cloud, and Security teams to enhance CI/CD pipelines and infrastructure resilience.
Optimize application and infrastructure performance, scalability, and availability in cloud environments.
Champion reliability, observability, and operational best practices across engineering teams.

Required Qualifications

10+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
Strong expertise in observability and monitoring tools (Datadog, Splunk, Prometheus, Grafana, ELK stack).
Hands-on experience managing incidents and participating in on-call rotations in production environments.
Proficiency in Linux system administration, networking fundamentals, and performance tuning.
Strong programming/scripting skills in Python, Bash, Go, or similar languages.
Experience with containerization and orchestration tools such as Docker and Kubernetes.
Proven experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, etc.).
Solid understanding of distributed systems, high availability, and scalability patterns.

AWS & Cloud Expertise

Hands-on experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, DynamoDB, and VPC networking.
Experience with AWS-native observability tools including CloudWatch, X-Ray, and CloudTrail.
Familiarity with infrastructure automation using AWS CloudFormation or Terraform.
Experience implementing scalable, fault-tolerant architectures in AWS environments.
Understanding of cost optimization and performance tuning in cloud-native systems.

Nice-to-Have Skills

Experience with AIOps, anomaly detection, and predictive monitoring solutions.
Knowledge of Infrastructure as Code (Terraform, Ansible, Pulumi).
Exposure to security monitoring and compliance, and to integrating them with observability platforms.
Experience with event-driven architectures and streaming platforms (e.g., Kafka).
Familiarity with chaos engineering and resilience testing practices.

Skills

AWS CloudFormation, Bash, CI/CD, Datadog, Docker, ELK stack, EC2, ECS, EKS, Grafana, Go, GitHub Actions, GitLab CI, Jenkins, Kubernetes, Lambda, Linux, Prometheus, Python, RDS, S3, Splunk, Terraform, VPC, X-Ray
