Senior DevOps Engineer – SRE
Triune Infomatics Inc
About the role
Overview
We are seeking a highly skilled Senior DevOps Engineer – Site Reliability Engineering (SRE) to lead the design, implementation, and reliability of scalable cloud infrastructure. This role focuses on ensuring high availability, performance optimization, and automation across AWS environments.
The ideal candidate will bring deep expertise in AWS, monitoring, and automation, with a strong SRE mindset to support mission‑critical applications in a 24/7 production environment. You will work closely with engineering and operations teams to build resilient systems, improve observability, and drive operational excellence.
Required Skills
- Strong hands‑on experience with AWS cloud services and infrastructure management
- Experience implementing alerts, alarms, and notifications using CloudWatch and/or Dynatrace
- Experience working with Kafka (for example, Amazon MSK) and AWS container services such as ECS and EKS
- Expertise in Infrastructure as Code (IaC) using Terraform or AWS CDK
- Strong background in automation and configuration management
- Experience with CI/CD pipelines (Jenkins, Azure DevOps, or similar tools)
- Proven Site Reliability Engineering (SRE) experience in production environments
- Strong Linux system administration and OS‑level troubleshooting skills
- Experience supporting 24/7 production environments, including incident response and RCA
- Solid understanding of monitoring, observability, and performance tuning
- Experience with networking fundamentals (TCP/IP, DNS, load balancing)
Preferred Skills
- AWS certifications (DevOps Engineer or Solutions Architect)
- Experience with Ansible, Python scripting, or other automation tools
- Familiarity with high availability (HA) and disaster recovery (DR) architectures
- Experience with container orchestration and microservices architecture
- Knowledge of security best practices and vulnerability management tools
- Experience working in enterprise‑scale environments
- Exposure to Java/.NET application deployments
- Understanding of databases (SQL Server, Oracle)
- Strong troubleshooting and problem‑solving skills across infrastructure and applications
- Experience with multi‑region / multi‑AZ AWS deployments
Responsibilities
- Lead the design, implementation, and reliability of scalable cloud infrastructure.
- Ensure high availability, performance optimization, and automation across AWS environments.
- Support mission-critical applications in a 24/7 production environment.
- Work closely with engineering and operations teams to build resilient systems, improve observability, and drive operational excellence.