Lead Platform Reliability Engineer
The Manufacturers Life Insurance Company
About the Role
The Lead Platform Reliability Engineer (PRE) ensures the stability, performance, and scalability of the shared platform that supports internal AI solution development. The role combines software engineering, SRE practices, and operations to keep the platform reliable and developer-friendly.
Position Responsibilities
- Reliability and performance: Define SLOs/SLIs, track operations budgets, reduce MTTR, plan capacity, and tune autoscaling.
- Observability: Build and maintain logging, metrics, tracing, and alerting; instrument platform components; create runbooks and dashboards.
- Incident response: Serve on-call for platform incidents; triage, mitigate, and root-cause issues; drive postmortems and corrective actions.
- Automation and tooling: Develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations (provisioning, upgrades, backups).
- Infrastructure as code: Manage clusters, networks, storage, and policies via Terraform/Ansible; prevent configuration drift.
- Security and compliance: Enforce identity/RBAC, secrets management, supply chain security, and regulatory controls; collaborate with risk and audit.
- Scalability and cost: Optimize resource usage, plan capacity, and control spend (rightsizing, autoscaling, reservations/spot).
- Change management: Deliver safe rollouts, progressive delivery, and policy-as-code guardrails.
- Platform productization: Treat the platform as a product; define operational SLAs aligned with the product roadmap, service catalog, and developer experience.
- Collaborate with global engineering, security, and AI governance teams to ensure compliance with cross-geo regulations and Asia’s data residency requirements.
- Operate scalable backend services supporting high-traffic agent interactions, retrieval operations, and real-time execution flows.
- Maintain AI services runbooks, playbooks, and enablement materials for GOCC.
Required Qualifications
- Bachelor’s degree in Computer Science/Engineering or equivalent experience (not strictly required if skills are demonstrated).
- 5-8 years of experience in DevOps, Platform Engineering, or Production Operations.
- Proven track record operating large-scale distributed systems and running on-call.
- Operational experience with cloud-native development: Azure, Kubernetes, containers, CI/CD, and observability stacks.
- Knowledge of Python and/or Java/Scala/TypeScript for building backend services and automation.
- Understanding of AI solutions, LLM systems, retrieval architectures, embeddings, vector stores, prompt/tool orchestration, and agent workflow fundamentals.
- Knowledge of API design, asynchronous workflows, concurrency, reliability engineering (SLOs, error budgets), and performance tuning.
- Familiarity with security, governance, and compliance for AI/data systems (authN/authZ, data protection, audit logging, model governance).
- Ability to collaborate across global teams and translate business requirements into platform capabilities and operational SLAs.
Preferred Qualifications
- ITIL & ITSM certification
- Azure Administrator/DevOps certificate (nice to have)
- Kubernetes: CKA/CKS certificate (nice to have)
- HashiCorp Terraform Associate certificate (nice to have)
What We Offer
- We’ll empower you to learn and grow the career you want.
- We’ll recognize and support you in a flexible environment where well-being and inclusion are more than just words.
- As part of our global team, we’ll support you in shaping the future you want to see.
About Manulife and John Hancock
Manulife Financial Corporation is a leading international financial services provider, helping people make their decisions easier and lives better.
Manulife is an Equal Opportunity Employer
At Manulife/John Hancock, we embrace our diversity. We strive to attract, develop and retain a workforce that is as diverse as the customers we serve and to foster an inclusive work environment that embraces the strength of cultures and individuals. We are committed to fair recruitment, retention, advancement and compensation, and we administer all of our practices and programs without discrimination on the basis of race, ancestry, place of origin, colour, ethnic origin, citizenship, religion or religious beliefs, creed, sex (including pregnancy and pregnancy-related conditions), sexual orientation, genetic characteristics, veteran status, gender identity, gender expression, age, marital status, family status, disability, or any other ground protected by applicable law.
It is our priority to remove barriers to provide equal access to employment. A Human Resources representative will work with applic