SRE/DevOps Engineer
Jobgether
About the role
Location: Contrecoeur
We are currently looking for an SRE/DevOps Engineer in Canada. This role sits at the frontline of enterprise platform reliability, ensuring the stability, availability, and performance of large‑scale cloud and hybrid systems. You will act as the first line of response for incidents across modern infrastructure environments, including Kubernetes, APIs, databases, and cloud‑native services. Working in a highly operational and collaborative setting, you will monitor systems, execute runbooks, and support rapid incident resolution to minimize downtime.
The position combines hands‑on technical troubleshooting with structured operational processes, where precision and communication are critical. You will contribute directly to service reliability by identifying issues, escalating intelligently, and improving documentation and automation opportunities. This is a high‑impact role ideal for professionals who thrive in fast‑paced, incident‑driven environments and enjoy keeping complex systems running smoothly.
Accountabilities
• Monitor system health across cloud and on‑prem environments using observability tools such as dashboards, logs, and alerting systems.
• Perform first‑line incident triage, identify system anomalies, and execute standardized runbooks for resolution or escalation.
• Troubleshoot application and infrastructure issues across Kubernetes, APIs, databases, and cloud services to isolate root causes.
• Communicate incident status clearly and effectively to stakeholders, ensuring timely updates and accurate reporting.
• Support deployment operations and routine tasks by following predefined operational procedures and workflows.
• Document incidents, identify gaps in runbooks, and contribute to continuous improvement of operational knowledge bases.
• Assist in onboarding new applications into operational monitoring and support frameworks.
• Collaborate with engineering and L2/L3 teams to ensure smooth escalation and resolution of complex issues.
Requirements
• 2–5 years of experience in IT operations, NOC, SRE, or DevOps‑related roles.
• Strong understanding of Linux, Kubernetes basics, and networking fundamentals.
• Experience working with observability tools such as Prometheus, Grafana, Splunk, ELK, or similar platforms.
• Ability to follow structured operational workflows, including runbooks and incident management procedures.
• Basic scripting knowledge in Python, Bash, or PowerShell for minor automation or script adjustments.
• Familiarity with cloud platforms such as AWS, Azure, or GCP is a strong plus.
• Understanding of troubleshooting techniques (DNS, logs, connectivity checks, networking tools).
• Strong analytical and problem‑solving mindset with a focus on incident resolution and root cause identification.
• Effective communication skills for incident reporting and stakeholder updates.
• Nice to have: exposure to ServiceNow, Jira, xMatters, SQL/NoSQL basics, or AI‑assisted operational tools.
Benefits
• Competitive compensation aligned with experience and technical expertise.
• Flexible working arrangements depending on role and location.
• Comprehensive health and wellness support programs.
• Opportunities for continuous learning, upskilling, and career development.
• Exposure to large‑scale cloud‑native and enterprise systems.
• Inclusive and diverse work environment focused on collaboration and innovation.
• Strong emphasis on work‑life balance and employee well‑being.
• Access to modern tools, platforms, and automation‑driven operations practices.