Observability Engineer – Production Support & Monitoring (SRE)
Mississauga · Hybrid · Contract · Mid Level · 4d ago
About the role
Contract: 6 months (high likelihood of extension)
Location: Core downtown Toronto
Schedule: Hybrid – 2 days onsite
Rate: Market rate (looking for the best experience/rate ratio)
Main Deliverables
- Ensure reliability, performance, and capacity of enterprise production platforms
- Own and operate observability and monitoring tooling across infrastructure and applications
- Execute automation and operational hygiene to support roadmap-driven growth
Technical Stack
- Monitoring & Observability: ITRS Geneos (primary), ISINGA / Insignia, Faddom, Corvil, Dynatrace
- Infrastructure: Linux / Unix, VMware, AWS (CloudWatch)
- Scripting & Automation: Perl, Bash / Shell, Python
- Messaging / Middleware: IBM MQ, Market Data Monitoring
- Databases: SQL-based relational databases (operational support)
- ITSM & Collaboration: ServiceNow, Microsoft Teams
- Legacy / Transition: SCOM (planned decommissioning)
Must‑Haves
- 5+ years of experience in Production Support, SRE, or Operations Engineering
- Strong, hands-on ITRS Geneos experience in enterprise production environments
- Advanced scripting skills in Perl, Bash/Shell, and Python
- Experience supporting large-scale production environments (hundreds to thousands of servers)
- Strong Linux / Unix systems knowledge
- Experience with enterprise monitoring platforms (Geneos, Dynatrace, Corvil, Faddom)
- Experience with incident and event management using ServiceNow
- Operational SQL skills for troubleshooting and validation
- Willingness to participate in a defined on-call rotation
Other Requirements
- Experience monitoring infrastructure and applications (CPU, memory, disk, network, processes)
- Experience with capacity planning, trend analysis, and platform scaling
- Familiarity with monitoring integrations: AWS CloudWatch, VMware, IBM MQ, synthetic monitoring, and market data monitoring
- Experience integrating alerts with ServiceNow, Microsoft Teams, email, and webhook-based notifications
- Exposure to hybrid environments (on‑prem + cloud)
Responsibilities
- Provide L2/L3 production support for business‑critical platforms
- Operate and enhance enterprise monitoring platforms, with Geneos as the core solution
- Perform capacity planning and infrastructure performance analysis
- Develop automation to:
  - Execute hygiene routines (log cleanup, validation, health checks)
  - Reduce alert noise and manual operational effort
  - Support reporting and alert validation
- Configure monitoring for infrastructure, applications, APIs, logs, batch jobs, FIX, file watches, and databases
- Maintain runbooks, SOPs, and monitoring configuration lifecycle
- Participate in incident response, RCA, and post‑incident remediation
- Support monitoring platform rollouts, onboarding, and gateway scaling
- Improve on-call effectiveness through tuning, automation, and proactive monitoring
Skills
AWS CloudWatch, Bash, Corvil, Dynatrace, Faddom, IBM MQ, ITRS Geneos, ISINGA / Insignia, Linux, Market Data Monitoring, Microsoft Teams, Perl, Python, ServiceNow, Shell, SQL, SCOM, Unix, VMware