Senior Site Reliability Engineer (Systems Operations Engineer)

Openkyber

US · Hybrid · Contract · Senior · $61 – $66/hr · 3mo ago

About the Role

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to support key Shared Services Operations Technology platforms, including Payment Evaluations, Regulatory Operations, Financial Crimes, and Business & Real Estate Evaluation. You will be part of a team responsible for maintaining availability, performance, and reliability across ~85 applications that support KYC, AML, and other critical financial-crimes workloads. This role blends software engineering, systems operations, and cloud-native reliability practices to drive automation, enhance resilience, and support modernization across a large enterprise ecosystem. You will also help evolve AIOps capabilities, including predictive alerting, self-healing workflows, and AI/ML-driven incident analysis. Occasional weekend work or overtime may be required for critical system support.

What You'll Do

Site Reliability & Operations

Lead SRE practices that enhance system availability, performance, and scalability across multi-cloud environments.
Support and improve critical applications and customer journeys; lead incident response and blameless postmortems.
Conduct root-cause analysis and drive long-term remediation of recurring issues.
Define and enforce operational readiness and Non-Functional Requirements (NFRs) during platform modernization.

Automation & Tooling

Design and implement automation to eliminate operational toil and improve service reliability.
Build frameworks for automated SLO/SLI tracking, availability metrics, error budgeting, and customer impact analysis.
Implement self-healing and autonomic systems using AI/ML, RPA, and intelligent monitoring.

Monitoring, Observability & AIOps

Develop and enhance monitoring, alerting, and observability capabilities.
Drive adoption of AIOps platforms to support anomaly detection, predictive alerting, and automated incident resolution.

Collaboration & Leadership

Collaborate with platform teams, product owners, and technology partners across the COO Technology organization.
Mentor peers and champion SRE best practices across engineering teams.
Identify process gaps across domains and recommend scalable, long-term improvements.

Required Qualifications

5+ years in Systems Engineering, Site Reliability Engineering, Technology Architecture, or related fields (or equivalent military/training/education experience).
2+ years working as part of an SRE team.
Strong written and verbal communication skills.

Technical Skills

Software Development

Proficiency in Python and/or Java/J2EE.
Experience with REST APIs, microservices, Kafka/MQ, and modern integration patterns.
Familiarity with front-end frameworks and libraries (React, Bootstrap).
Strong SQL skills and database schema design experience.

Infrastructure & Cloud

Expertise with Linux and container orchestration (Kubernetes, OpenShift/OCP strongly preferred).
Experience with PCF, AWS, Google Cloud Platform, or Azure environments.

CI/CD & Automation

Tools: Jenkins, GitLab, SonarQube, Artifactory, Ansible.

Observability & AIOps

Tools: Grafana, Prometheus, Splunk/ELK, AppDynamics, Elastic, ThousandEyes, Aternity, Google Cloud Logging.
AIOps Platforms: Moogsoft, AI/ML-based analytics frameworks.

Operations & Data

ITSM Tools: ServiceNow, Remedy, IBM Netcool.
Databases: Oracle, DB2, SQL Server, MongoDB, Hadoop/Cloudera, Spark, Teradata.

Foundational AI Knowledge

Understanding of common AI/ML concepts (classification, regression, clustering, anomaly detection).
Ability to work with structured/unstructured data for model evaluation.
Awareness of ethical/operational considerations in AI systems.
Experience integrating AI into automation workflows is a plus.

Preferred Qualifications

Experience with AutoSys.
Prior experience in corporate banking or financial services.
Strong interest in AI-driven operations and AIOps.

Skills

Ansible, AppDynamics, AWS, Azure, Bootstrap, DB2, Docker, Elastic, ELK, Google Cloud Platform, Grafana, Hadoop, IBM Netcool, Jenkins, JIRA, Kafka, Kubernetes, Linux, Microservices, MongoDB, Oracle, OpenShift, PCF, Prometheus, Python, RPA, React, Remedy, ServiceNow, Spark, Splunk, SQL, Teradata, ThousandEyes, Unix, VMware, XML
