Senior Site Reliability Engineer (SRE)
Jobs via Dice
New York · On-site · Full-time · Senior · 6d ago
About the role
Role Overview
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to design, build, and scale observability and reliability solutions for enterprise-grade distributed systems. This role focuses on enhancing system reliability, performance, and operational excellence through advanced telemetry, automation, and cloud-native best practices, with a strong emphasis on AWS environments.
Key Responsibilities
- Design, implement, and maintain end-to-end observability solutions, including metrics, logging, and distributed tracing.
- Build and manage real-time monitoring dashboards and alerting systems using tools like Datadog, Splunk, Prometheus, Grafana, or the ELK stack.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to ensure system reliability.
- Lead incident response efforts, including troubleshooting, root cause analysis, and rapid resolution of production issues.
- Drive reliability engineering practices, including post-mortems and continuous improvement initiatives.
- Automate operational and monitoring workflows using Python, Bash, or Go.
- Develop self-healing systems and auto-remediation capabilities to reduce manual intervention.
- Collaborate closely with DevOps, Cloud, and Security teams to enhance CI/CD pipelines and infrastructure resilience.
- Optimize application and infrastructure performance, scalability, and availability in cloud environments.
- Advocate for reliability, observability, and operational best practices within engineering teams.
Core Qualifications
- 10+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
- Extensive expertise in observability and monitoring tools (Datadog, Splunk, Prometheus, Grafana, ELK stack).
- Hands-on experience managing incidents and participating in on-call rotations in production environments.
- Proficiency in Linux system administration, networking fundamentals, and performance tuning.
- Strong programming/scripting skills in Python, Bash, Go, or similar languages.
- Experience with containerization and orchestration tools like Docker and Kubernetes.
- Extensive experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, etc.).
- Solid understanding of distributed systems, high availability, and scalability patterns.
AWS and Cloud Expertise
- Hands-on experience with AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, DynamoDB, and VPC networking.
- Familiarity with AWS native observability tools, including CloudWatch, X-Ray, and CloudTrail.
- Familiarity with automating infrastructure using AWS CloudFormation or Terraform.
- Experience implementing scalable, fault-tolerant architectures in AWS environments.
- Understanding of cost optimization and performance tuning in cloud-native systems.
Bonus Skills
- Experience with AIOps, anomaly detection, and predictive monitoring solutions.
- Knowledge of Infrastructure as Code (Terraform, Ansible, Pulumi).
- Exposure to security monitoring and compliance, and to integrating them with observability platforms.
- Experience with event-driven architectures and streaming platforms (e.g., Kafka).
- Familiarity with chaos engineering and resilience testing practices.
Skills
AWS, Bash, CloudFormation, CI/CD, Docker, ELK stack, Go, Grafana, Infrastructure as Code, Jenkins, Kafka, Kubernetes, Lambda, Linux, Monitoring, Observability, Prometheus, Python, Reliability Engineering, SRE, Splunk, Terraform, X-Ray