Senior Site Reliability Engineer (SRE)
Jobs via Dice
Berkeley Heights · On-site · Full-time · Senior · 1mo ago
About the role
Role Overview
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to design, build, and scale observability and reliability solutions across enterprise-grade distributed systems. This role focuses on improving system reliability, performance, and operational excellence through advanced telemetry, automation, and cloud-native best practices, with a strong emphasis on AWS environments.
Key Responsibilities
- Design, implement, and maintain end-to-end observability solutions including metrics, logging, and distributed tracing.
- Build and manage real-time monitoring dashboards and alerting systems using tools such as Datadog, Splunk, Prometheus, Grafana, or ELK.
- Develop and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure system reliability.
- Lead incident response efforts, including troubleshooting, root cause analysis, and rapid resolution of production issues.
- Drive reliability engineering practices including post-incident reviews and continuous improvement initiatives.
- Automate operational and monitoring workflows using Python, Bash, or Go.
- Develop self-healing systems and auto-remediation capabilities to reduce manual intervention.
- Collaborate closely with DevOps, Cloud, and Security teams to enhance CI/CD pipelines and infrastructure resilience.
- Optimize application and infrastructure performance, scalability, and availability in cloud environments.
- Champion reliability, observability, and operational best practices across engineering teams.
Required Qualifications
- 10+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
- Strong expertise in observability and monitoring tools (Datadog, Splunk, Prometheus, Grafana, ELK stack).
- Hands-on experience managing incidents and participating in on-call rotations in production environments.
- Proficiency in Linux system administration, networking fundamentals, and performance tuning.
- Strong programming/scripting skills in Python, Bash, Go, or similar languages.
- Experience with containerization and orchestration tools such as Docker and Kubernetes.
- Proven experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, etc.).
- Solid understanding of distributed systems, high availability, and scalability patterns.
AWS & Cloud Expertise
- Hands-on experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, DynamoDB, and VPC networking.
- Experience with AWS-native observability tools including CloudWatch, X-Ray, and CloudTrail.
- Familiarity with infrastructure automation using AWS CloudFormation or Terraform.
- Experience implementing scalable, fault-tolerant architectures in AWS environments.
- Understanding of cost optimization and performance tuning in cloud-native systems.
Nice-to-Have Skills
- Experience with AIOps, anomaly detection, and predictive monitoring solutions.
- Knowledge of Infrastructure as Code (Terraform, Ansible, Pulumi).
- Exposure to security monitoring and compliance, and to integrating them with observability platforms.
- Experience with event-driven architectures and streaming platforms (e.g., Kafka).
- Familiarity with chaos engineering and resilience testing practices.
Skills
AWS CloudFormation, Bash, CI/CD, Datadog, Docker, ELK stack, EC2, ECS, EKS, ELK, Grafana, Go, GitHub Actions, GitLab CI, Jenkins, Kubernetes, Lambda, Linux, Prometheus, Python, RDS, S3, Splunk, Terraform, VPC, X-Ray