Observability Engineer – Production Support & Monitoring (SRE)

Mississauga · Hybrid · Contract · Mid Level · 3mo ago

About the role

Contract: 6 months (high likelihood of extension)
Location: Core downtown Toronto
Schedule: Hybrid – 2 days onsite
Rate: Market rate (looking for the best experience/rate ratio)

Main Deliverables

Ensure reliability, performance, and capacity of enterprise production platforms
Own and operate observability and monitoring tooling across infrastructure and applications
Execute automation and operational hygiene to support roadmap-driven growth

Technical Stack

Monitoring & Observability: ITRS Geneos (primary), ISINGA / Insignia, Faddom, Corvil, Dynatrace
Infrastructure: Linux / Unix, VMware, AWS (CloudWatch)
Scripting & Automation: Perl, Bash / Shell, Python
Messaging / Middleware: IBM MQ, Market Data Monitoring
Databases: SQL-based relational databases (operational support)
ITSM & Collaboration: ServiceNow, Microsoft Teams
Legacy / Transition: SCOM (planned decommissioning)

Must‑Haves

5+ years of experience in Production Support, SRE, or Operations Engineering
Strong, hands-on ITRS Geneos experience in enterprise production environments
Advanced scripting skills in Perl, Bash/Shell, and Python
Experience supporting large-scale production environments (hundreds to thousands of servers)
Strong Linux / Unix systems knowledge
Experience with enterprise monitoring platforms (Geneos, Dynatrace, Corvil, Faddom)
Experience with incident and event management using ServiceNow
Operational SQL skills for troubleshooting and validation
Willingness to participate in a defined on-call rotation

Other Requirements

Experience monitoring infrastructure and applications (CPU, memory, disk, network, processes)
Experience with capacity planning, trend analysis, and platform scaling
Familiarity with monitoring integrations: AWS CloudWatch, VMware, IBM MQ, synthetic monitoring, and market data monitoring
Experience integrating alerts with ServiceNow, Microsoft Teams, email, and webhook-based notifications
Exposure to hybrid environments (on‑prem + cloud)

Responsibilities

Provide L2/L3 production support for business‑critical platforms
Operate and enhance enterprise monitoring platforms, with Geneos as the core solution
Perform capacity planning and infrastructure performance analysis
Develop automation to:
- Execute hygiene routines (log cleanup, validation, health checks)
- Reduce alert noise and manual operational effort
- Support reporting and alert validation
Configure monitoring for infrastructure, applications, APIs, logs, batch jobs, FIX, file watches, and databases
Maintain runbooks, SOPs, and monitoring configuration lifecycle
Participate in incident response, RCA, and post‑incident remediation
Support monitoring platform rollouts, onboarding, and gateway scaling
Improve on-call effectiveness through tuning, automation, and proactive monitoring

Skills

AWS CloudWatch, Bash, Corvil, Dynatrace, Faddom, IBM MQ, ITRS Geneos, ISINGA / Insignia, Linux, Market Data Monitoring, Microsoft Teams, Perl, Python, ServiceNow, Shell, SQL, SCOM, Unix, VMware
