Senior Site Reliability Engineer / DevOps
Tangentia
Toronto · Hybrid · Full-time · Senior · 1w ago
About the role
Role: Senior DevOps & Site Reliability Engineer
Location: Toronto, ON
Interview Mode: Virtual
Key Responsibilities
- Oversee the reliability, availability, and performance of Apigee Hybrid and Google Distributed Cloud environments, ensuring robust SRE practices.
- Manage and automate the certificate lifecycle, including renewals, deployments, and compliance checks.
- Plan and execute upgrades and maintenance activities for Apigee Hybrid and distributed cloud infrastructure, minimizing downtime and ensuring seamless transitions.
- Implement and maintain monitoring solutions using Dynatrace and Splunk, proactively identifying and resolving issues to ensure system health and performance.
- Troubleshoot complex production incidents, perform root cause analysis, and drive incident resolution to restore service quickly and prevent recurrence.
- Develop and maintain automation scripts and Ansible playbooks for operational efficiency, including tasks such as Kubernetes context retrieval, proxy configuration, and container management.
- Collaborate with cross-functional teams to ensure security, compliance, and best practices are followed across all SRE activities.
- Mentor and guide team members in SRE methodologies, fostering a culture of continuous improvement and operational excellence.
Required Skills
- 3 years of experience in Site Reliability Engineering or related roles.
- Experience with Apigee Hybrid, Google Distributed Cloud, Azure, GCP, and Kubernetes.
- Advanced DevOps and SRE skills: CI/CD, automation, monitoring, infrastructure as code.
- Certificate management scripting and automation.
- Proficiency with Ansible for configuration management and orchestration.
- Experience with APM tools such as Dynatrace and Splunk.
- Programming experience with Python.
Technologies
- Ansible (Software)
- Apigee Hybrid
- API Management
- Azure Kubernetes Service (AKS)
- CI/CD
- Dynatrace APM
- Google Anthos
- Kubernetes
- Public Key Infrastructure
- Python (Programming Language)
- Red Hat Enterprise Linux (RHEL)
- Site Reliability Engineering
- Splunk
- Terraform
- VMware
Skills
- Ansible
- API Management
- Apigee Hybrid
- Azure
- Azure Kubernetes Service (AKS)
- CI/CD
- Docker
- Dynatrace APM
- GCP
- Google Anthos
- Google Distributed Cloud
- Infrastructure as Code
- Kubernetes
- Monitoring
- Python
- Public Key Infrastructure
- Red Hat Enterprise Linux (RHEL)
- Site Reliability Engineering
- Splunk
- Terraform
- VMware