Site Reliability Engineer (OpenShift & Infrastructure)
Accion Labs
Winnipeg · On-site · Contract · Mid Level · Posted today
About the role
Responsibilities & Skills
- Install, configure, upgrade, and administer OpenShift Container Platform (OCP) clusters in on-premises and cloud environments.
- Manage OCP internal networking, ingress, egress, and cluster services.
- Configure and integrate LDAP authentication and access management.
- Implement TLS and mTLS encryption and manage the certificate lifecycle for secure communications.
- Implement GitOps workflows using ArgoCD for continuous delivery and environment consistency.
- Automate platform and application provisioning using Terraform and Ansible.
- Configure and maintain F5 LTM load balancers.
- Configure and manage DNS, networking, and subnets.
- Build and manage monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, ELK).
- Define and enforce SLIs/SLOs and error budgets for services running on OCP.
- Lead incident response, root cause analysis (RCA), and postmortems.
- Build automation for self‑healing, scaling, and zero-touch operations.
- Implement and maintain high-availability, disaster-recovery, and failover strategies.
- Secure the platform and its workloads in line with enterprise security standards.
- Support application deployments and CI/CD pipelines on OpenShift.
- Troubleshoot networking, cluster, and deployment issues end-to-end.
- Apply SRE best practices to improve reliability, scalability, and performance.
- Collaborate with development and platform teams to optimize system operations.
Skills
Ansible, ArgoCD, AWS Lambda, Certificate Management, CI/CD, Docker, ELK, F5 LTM, GitOps, Grafana, LDAP, mTLS, Networking, OpenShift, Prometheus, RCA, SRE, Terraform, TLS