IT Operations Reliability Engineer

CU Direct Corporation

Canada · On-site · Full-time · Mid Level · $99k – $124k/yr · 1mo ago

About the role

The IT Operations Reliability Engineer is responsible for ensuring the stability, reliability, and operational readiness of enterprise systems. This role owns core IT operational functions, including incident response, change management, release readiness, and recurring operational reporting. Operating in a DevOps-focused environment, this position requires strong independent execution, proactive risk identification, disciplined documentation, and clear, concise communication. Success in this role is measured by consistency, follow-through, and the ability to surface and address risks before they impact the business. This is not a software development role, but it requires sound technical judgment, system-level thinking, and the ability to work closely with engineers to diagnose issues, mitigate risk, and improve overall system resilience. The role exists to establish operational reliability as a measurable, scalable discipline, reducing reactive incidents, improving resilience, and increasing organizational confidence as the platform grows.

About you

You are a self-driven, conscientious, fiscally responsible, self-aware, passionate, and compassionate professional. You are comfortable with ambiguity, eternally curious, and love problem solving. You operate as an owner and work with a growth mindset. You are extremely productive on your own and act as a multiplier when collaborating with others. You are tireless in questioning the status quo and pursue the best answers to the hardest problems for the benefit of the business. Your focus is strong, yet you are capable of context switching and pivoting with the business. In a leadership vacuum, you step up and lead.

What You'll Be Doing:

  • Independently own recurring operational deliverables and reports, ensuring they are completed accurately and on schedule
  • Monitor system performance, availability, and reliability to maintain high uptime and service quality
  • Use observability tools (e.g., Datadog, Grafana) to identify trends, risks, and potential failure modes before they result in business impact
  • Define and evolve operational standards across IT Operations
  • Influence engineering roadmaps through data-driven operational insights
  • Establish, monitor, and refine service level objectives (SLOs) and error budgets aligned with business priorities and customer impact
  • Conduct trend analysis and systematic risk reviews to reduce repeat incidents and operational noise
  • Partner with engineering to prioritize reliability improvements based on incident patterns and performance data

Process Discipline and Continuous Improvement:

  • Maintain accurate shift notes, dashboards, and operational documentation that reflect current system health
  • Track and analyze KPIs related to uptime, performance, scalability, SLAs/SLOs, MTTA, and MTTR
  • Use operational metrics and observability data to identify systemic issues, recurring failure patterns, and opportunities for automation or resilience improvements
  • Define, measure, and report on reliability metrics including error budgets, availability targets, and service health indicators
  • Use operational data to guide trade-offs between feature velocity and long-term stability

Incident, Change and Release Management:

  • Lead blameless post-incident reviews focused on systemic remediation, not individual fault
  • Ensure operational readiness for changes and releases through documented reviews, validation, and clear readiness criteria
  • Partner with engineering, infrastructure, and security teams to ensure systems can support evolving transaction volumes and business needs
  • Validate that production changes meet defined reliability and observability standards before release

Communication and Accountability:

  • Proactively communicate status, risks, and blockers
  • Escalate issues early with clear context, impact assessment, and recommended next steps
  • Translate technical details into clear, actionable information for stakeholders
  • Participate in a shared on-call rotation to respond to production incidents, with clear escalation paths, documented runbooks, and sustainable on-call practices

Systems and Environment:

  • Support a modern, cloud-based enterprise platform (e.g., AWS/Azure), including containerized services (e.g., Kubernetes), CI/CD deployment pipelines, infrastructure as code, and third-party integrations
  • Work closely with engineering teams operating in distributed systems and environments with high availability and scalability requirements
  • Experience with scripting (e.g., Python, PowerShell, Bash) to support automation and operational tooling
  • Familiarity with infrastructure as code tools (e.g., Terraform, CloudFormation)

What Success Looks Like:

  • Recurring operational deliverables are completed consistently and on schedule
  • Risks are identified and escalated before incidents occur
  • On-call responsibilities are handled with ownership, clarity, and follow-through
  • Operational documentation and reporting are accurate, timely, and trusted
  • Stakeholders have clear visibility into system health and operational posture
  • Measurable reduction in repeat incidents and operational noise
  • Clear reliability standards adopted across engineering and operations

Education:

  • Bachelor's degree in information systems, computer science, or a related field. Relevant work experience may be considered in lieu of educational requirements.

Experience:

  • 5+ years of progressive experience in IT Operations, Reliability Engineering, Production Support, or related operational engineering roles
  • Demonstrated ownership of incident, change, and operational processes
  • Hands-on experience with monitoring and observability tools
  • Strong written and verbal communication skills
  • Demonstrated experience owning production systems with customer impact

Preferred Qualifications:

  • Experience supporting enterprise or regulated environments
  • Familiarity with DevOps principles and cross-functional collaboration
  • Experience working in Agile or Scrum-based environments
  • Strong experience with ITSM frameworks (e.g., ITIL)
  • Strong operational mindset with a focus on reliability and continuous improvement
  • Experience creating or improving operational automation (e.g., alerting logic, runbooks, self-healing tasks, or workflow automation)

Who Thrives in This Role:

  • Professionals who value accountability and follow-through
  • Individuals comfortable operating with visibility and clear expectations
  • Engineers who proactively identify and communicate risk
  • Professionals motivated by building calm, predictable operations in complex systems

Benefits:

  • Paid Time Off
  • 401(k) (8% match)
  • College Tuition Benefits / Tuition Reimbursement
  • Good benefits options
  • Company Culture! Cultural and Holiday celebrations, Theme days like Star Wars Day & Bring Your Kids to Work Day, Monthly Townhalls and Quarterly Company Meetings that ensure awareness, inclusion, and transparency.

The starting salary range for this full-time position in Irvine, CA is $99,400 - $124,300 per year. Base pay takes into consideration internal equity, the candidate's geographic region, and job-related knowledge and experience, among other factors. Origence maintains a highly competitive compensation program. Under company guidelines, this position is eligible for an annual bonus that provides an incentive to achieve targeted goals. Bonuses are awarded at the company's discretion on an individual basis.

Origence is an equal opportunity employer. All recruitment, hiring, training, compensation, benefits, discipline, and other terms and conditions of employment will be based upon an individual's qualifications regardless of race, religion, color, sex, gender identity, sexual orientation, national origin, ancestry, military service, marital status, pregnancy, age, protected medical condition, genetic information, disability or any other category protected by federal, state or local law.

Skills

AWS, Azure, Bash, CloudFormation, CI/CD, Datadog, DevOps, Grafana, Infrastructure as Code, Kubernetes, Observability, PowerShell, Python, Terraform
