Systems Reliability Engineer
Galactic Minds INC
Montreal · On-site · Full-time · 5d ago
About the role
Systems Reliability Engineer (SRE)
Montreal, QC, Canada (Onsite)
Long-Term Contract (C2C)
We are looking for a skilled Systems Reliability Engineer (SRE) to join our Reliability & Production Engineering team. This role focuses on enhancing system availability, scalability, performance, and resilience by applying strong software engineering practices.
Key Responsibilities
- Design, build, and maintain scalable and reliable distributed systems
- Troubleshoot issues across infrastructure, application, and network layers
- Improve automation for deployment, monitoring, and system management
- Collaborate with engineering teams on system design and architecture
- Identify and mitigate system reliability risks proactively
- Participate in design reviews and operational readiness processes
- Work in a global, follow-the-sun support model
Required Skills & Experience
- Strong troubleshooting and root cause analysis skills
- Experience with monitoring tools: AppDynamics, Grafana, Splunk, or Dynatrace
- Hands-on experience with automation and configuration tools (Ansible, GitHub, etc.)
- Scripting experience in Python, Shell, or similar languages
- Understanding of distributed systems, microservices, cloud, and system architecture
- Knowledge of databases, load balancing, caching, and system performance
- Experience managing or supporting large‑scale systems (preferred)
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or related field
What We’re Looking For
- Problem solver with a passion for reliability engineering
- Team player with strong ownership and accountability
- Comfortable in fast‑paced, evolving environments
Skills
Ansible, AppDynamics, Cloud, Databases, Distributed systems, Dynatrace, Grafana, GitHub, Load balancing, Microservices, Monitoring, Network, Python, Shell, Splunk, System architecture, System performance