Senior Site Reliability Engineer - SRE
tsworks
About the role
About tsworks
tsworks is a leading technology innovator, providing transformative products and services designed for the digital-first world. Our mission is to provide domain expertise, innovative solutions and thought leadership to drive exceptional user and customer experiences. Demonstrating this commitment, we have a proven track record of championing digital transformation for industries such as Banking, Travel and Hospitality, and Retail (including e-commerce and omnichannel), as well as Distribution and Supply Chain, delivering impactful solutions that drive efficiency and growth. We take pride in fostering a workplace where your skills, ideas, and attitude shape meaningful customer engagements.
About the Team
We are looking for an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our team and play a key role in ensuring the high availability, scalability, and reliability of our infrastructure. The ideal candidate will have 7+ years of experience in site reliability engineering, cloud computing, infrastructure automation, and monitoring, with a deep understanding of modern DevOps and SRE practices.
Responsibilities
- Architect, design, and maintain highly available, scalable, and resilient infrastructure to support business-critical applications.
- Lead the implementation and management of Infrastructure as Code (IaC) using AWS CDK, ensuring infrastructure is automated, repeatable, and secure.
- Develop and optimize automation for deployments, configuration management, and infrastructure provisioning across cloud (AWS) and container orchestration platforms (Kubernetes, EKS, ECS).
- Enhance and maintain CI/CD pipelines, ensuring smooth and automated application and infrastructure deployments.
- Design and implement monitoring and observability solutions using tools such as Datadog, Prometheus, and Grafana to proactively identify and resolve performance bottlenecks and failures.
- Lead incident response and root cause analysis efforts, ensuring high levels of service availability and quick resolution of infrastructure issues.
- Continuously improve infrastructure performance, scalability, and reliability through best practices, automation, and innovation.
- Mentor and coach junior engineers, sharing knowledge, best practices, and expertise in site reliability engineering.
Requirements
Key Attributes and Qualifications
- 7-10+ years of experience in Site Reliability Engineering, DevOps, or a related field.
- Expertise in cloud computing, particularly AWS, with deep knowledge of infrastructure design and best practices.
- Experience with multi-cloud environments, including Azure and GCP, is highly desirable.
- Proficiency with AWS CDK is essential, with additional experience in Terraform and Ansible considered a strong advantage.
- Strong experience with Kubernetes and container orchestration platforms (EKS, ECS), including deploying, scaling, and managing workloads.
- Advanced scripting and programming skills (Python, Bash, or similar) for automation and infrastructure management.
- In-depth knowledge of monitoring, logging, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
- Preferred knowledge of Content Delivery Networks (CDNs) for optimizing application performance and scalability.
- Excellent communication and leadership skills, with experience mentoring junior engineers and driving technical excellence.
Mandatory Work Experience in Project
- Kubernetes/Docker
- CI/CD Pipeline
- Scripting - Terraform, Helm
- Monitoring
Good to Have
- Application Knowledge (Java/Maven/Angular)