
Site Reliability Engineer (OpenShift & Infrastructure)

Accion Labs

Winnipeg · On-site · Contract · Mid Level · Posted today

About the role

Responsibilities & Skills

  • Install, configure, upgrade, and administer OpenShift Container Platform (OCP) clusters in on-premises and cloud environments.
  • Manage OCP internal networking, ingress, egress, and cluster services.
  • Configure and integrate LDAP authentication and access management.
  • Implement TLS and mTLS encryption, and manage the certificate lifecycle for secure communications.
  • Implement GitOps workflows using ArgoCD for continuous delivery and environment consistency.
  • Automate platform and application provisioning using Terraform and Ansible.
  • Configure and maintain F5 LTM load balancers.
  • Configure and manage DNS, networking, and subnets.
  • Build and manage monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, ELK).
  • Define and enforce SLIs/SLOs and error budgets for services running on OCP.
  • Lead incident response, root cause analysis (RCA), and postmortems.
  • Build automation for self-healing, scaling, and zero-touch operations.
  • Ensure high availability, disaster recovery, and failover strategies are implemented.
  • Secure the platform and its workloads in line with enterprise security standards.
  • Support application deployments and CI/CD pipelines on OpenShift.
  • Troubleshoot networking, cluster, and deployment issues end-to-end.
  • Apply SRE best practices to improve reliability, scalability, and performance.
  • Collaborate with development and platform teams to optimize system operations.

Skills

Ansible, ArgoCD, AWS Lambda, Certificate Management, CI/CD, Docker, ELK, F5 LTM, GitOps, Grafana, LDAP, mTLS, Networking, OpenShift, Prometheus, RCA, SRE, Terraform, TLS