Skip to content
mimi

AI Site Reliability Engineer

Astra North Infoteck Inc.

Montreal · Hybrid Full-time Senior 2w ago

About the role

Role

SRE +AI

Hybrid

3 days in office-

Location

Montreal

Experience

8+ years of experience as a Site Reliability Engineer or in a similar role, with hands-on experience in supporting IaaS platforms with networking and system engineer-ing knowledge.

Roles and Responsibilities

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
  • Optimize cost vs. performance tradeoffs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, secu-rity) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, fault toler-ance mechanisms
  • Maintain runbooks, operational playbooks, documentation, and training materials
  • Participate in on-call rotations and respond to production incidents 24/7 as needed
  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills

  • Production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load bal-ancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improve-ments
  • Experience in regulated environments (financial services, compliance, audit, se-curity) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via

Requirements

  • Production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation

Responsibilities

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
  • Optimize cost vs. performance tradeoffs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
  • Maintain runbooks, operational playbooks, documentation, and training materials
  • Participate in on-call rotations and respond to production incidents 24/7 as needed
  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills

AnsibleCloudFormationDockerDatadogELKEFKGoGrafanaHelmJavaKubernetesPrometheusPythonTerraform

Don't send a generic resume

Paste this job description into Mimi and get a resume tailored to exactly what the hiring team is looking for.

Get started free