Systems Reliability Engineer

Galactic Minds INC

Montreal · On-site · Full-time · 5d ago

About the role

Systems Reliability Engineer (SRE)
Montreal, QC, Canada (Onsite)
Long-Term Contract (C2C)

We are looking for a skilled Systems Reliability Engineer (SRE) to join our Reliability & Production Engineering team. This role focuses on enhancing system availability, scalability, performance, and resilience by applying strong software engineering practices.

Key Responsibilities

  • Design, build, and maintain scalable and reliable distributed systems
  • Troubleshoot issues across infrastructure, application, and network layers
  • Improve automation for deployment, monitoring, and system management
  • Collaborate with engineering teams on system design and architecture
  • Identify and mitigate system reliability risks proactively
  • Participate in design reviews and operational readiness processes
  • Work in a global, follow-the-sun support model

Required Skills & Experience

  • Strong troubleshooting and root cause analysis skills
  • Experience with monitoring tools: AppDynamics, Grafana, Splunk, or Dynatrace
  • Hands-on experience with automation/configuration tools (Ansible, GitHub, etc.)
  • Scripting experience in Python, Shell, or similar languages
  • Understanding of distributed systems, microservices, cloud, and system architecture
  • Knowledge of databases, load balancing, caching, and system performance
  • Experience managing or supporting large‑scale systems (preferred)

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or related field

What We’re Looking For

  • Problem solver with a passion for reliability engineering
  • Team player with strong ownership and accountability
  • Comfortable in fast‑paced, evolving environments

Skills

Ansible, AppDynamics, Cloud, Databases, Distributed systems, Dynatrace, Grafana, GitHub, Load balancing, Microservices, Monitoring, Network, Python, Shell, Splunk, System architecture, System performance
