Senior Site Reliability Engineer / DevOps

Tangentia

Toronto · Hybrid · Full-time · Senior · Posted 1w ago

About the role

Role

Senior DevOps & Site Reliability Engineer

Location

Toronto, ON

Interview Mode

Virtual

Key Responsibilities

  • Oversee the reliability, availability, and performance of Apigee Hybrid and Google Distributed Cloud environments, ensuring robust SRE practices.
  • Manage and automate certificate processes, including renewals, deployments, and compliance checks.
  • Plan and execute upgrades and maintenance activities for Apigee Hybrid and distributed cloud infrastructure, minimizing downtime and ensuring seamless transitions.
  • Implement and maintain monitoring solutions using Dynatrace and Splunk, proactively identifying and resolving issues to ensure system health and performance.
  • Troubleshoot complex production incidents, perform root cause analysis, and drive incident resolution to restore service quickly and prevent recurrence.
  • Develop and maintain automation scripts and Ansible playbooks for operational efficiency, including tasks such as Kubernetes context retrieval, proxy configuration, and container management.
  • Collaborate with cross-functional teams to ensure security, compliance, and best practices are followed across all SRE activities.
  • Mentor and guide team members in SRE methodologies, fostering a culture of continuous improvement and operational excellence.

Required Skills

  • 3 years of experience in Site Reliability Engineering or related roles.
  • Experience with Apigee Hybrid, Google Distributed Cloud, Azure, GCP, and Kubernetes.
  • Advanced DevOps and SRE skills: CI/CD, automation, monitoring, infrastructure as code.
  • Certificate management scripting and automation.
  • Proficiency with Ansible for configuration management and orchestration.
  • Experience with APM tools such as Dynatrace and Splunk.
  • Programming experience with Python.

Technologies

  • Ansible (Software)
  • Apigee Hybrid
  • API Management
  • Azure Kubernetes Service (AKS)
  • CI/CD
  • Dynatrace APM
  • Google Anthos
  • Kubernetes
  • Public Key Infrastructure
  • Python (Programming Language)
  • Red Hat Enterprise Linux (RHEL)
  • Site Reliability Engineering
  • Splunk
  • Terraform
  • VMware

Skills

  • Ansible
  • API Management
  • Apigee Hybrid
  • Azure
  • Azure Kubernetes Service (AKS)
  • CI/CD
  • Docker
  • Dynatrace APM
  • GCP
  • Google Anthos
  • Google Distributed Cloud
  • Infrastructure as Code
  • Kubernetes
  • Monitoring
  • Python
  • Public Key Infrastructure
  • Red Hat Enterprise Linux (RHEL)
  • Site Reliability Engineering
  • Splunk
  • Terraform
  • VMware
