Senior Director, Customer Reliability & Technical Operations (Pharmacy)
Remotejobs
About the role
Overview
The Senior Director, Customer Reliability & Technical Operations is accountable for the reliability, availability, performance, and operational health of Inovalon customer pharmacies using the ScriptMed SaaS platform. This leader directs a global operations organization (United States and India) responsible for ITIL-aligned Incident, Problem, and Change Management, as well as the technical functions that keep the platform stable and scalable, including Cloud Infrastructure Engineering, Database Administration, DevOps, and Site Reliability Engineering (SRE). The role partners closely with Product, Engineering, Security, and Customer Success to proactively detect and remediate issues using Datadog observability and ServiceNow ITSM workflows, so that customers experience dependable service and predictable outcomes.
Scope and Impact
- Owns day-to-day and sustained operational performance for ScriptMed, including uptime, performance, incident response, and service restoration across customer pharmacies and tenant environments.
- Leads a blended onshore/offshore operating model, ensuring 24x7 coverage, clear escalation paths, and consistent execution of operational processes.
- Establishes and matures a Network Operations Center (NOC) and evolves it into an AI-enabled Intelligent Operations Management Center, improving detection, triage, and automation.
- Drives operational discipline across Incident, Problem, and Change Management, reducing customer-impacting events, shortening MTTR, and preventing recurrence.
- Provides executive-level visibility into platform health and risk, enabling informed decisions on investment, capacity, and reliability improvements.
Key Responsibilities
Leadership & Operating Model
- Define and execute the customer reliability and technical operations strategy aligned to ScriptMed business objectives, SLAs, and customer expectations.
- Build and lead high-performing teams across the U.S. and India, including staffing, performance management, coaching, and career development.
- Establish clear on-call and escalation models, operational playbooks, and governance routines (daily ops review, incident review, weekly change review, reliability council).
- Partner with Engineering, Product, Security, and Customer Success to align priorities, manage operational risk, and drive continuous improvement.
Incident Management (Service Restoration)
- Own the incident management lifecycle, including detection, triage, escalation, customer-impact assessment, communications, and service restoration.
- Ensure strong runbooks, incident roles, and standards for severity classification, timelines, and stakeholder updates.
- Use Datadog monitoring and alerting to proactively identify issues and reduce customer impact through early detection and fast response.
- Lead post-incident reviews, ensuring corrective actions are assigned, tracked, and validated.
Problem Management (Prevention and Root Cause)
- Establish and mature a problem management program that drives root cause analysis, corrective and preventive actions, and measurable reduction in repeat incidents.
- Create a consistent approach for trend analysis, known error management, and prevention backlog creation.
- Partner with Engineering and Architecture to prioritize reliability improvements and reduce technical debt that drives operational instability.
Change Management (Risk Reduction)
- Own change management processes to ensure reliable deployments, infrastructure changes, and operational updates with appropriate approvals and controls.
- Define change classification standards, risk scoring, blackout windows, validation steps, and rollback plans.
- Partner with DevOps and Engineering to implement change automation and quality gates that reduce change-related incidents.
NOC / Intelligent Operations Management Center
- Stand up and operationalize a Network Operations Center responsible for real-time monitoring, initial triage, and coordinated response.
- Mature NOC capabilities into an AI-enabled Intelligent Operations Management Center that uses automation, correlation, noise reduction, predictive insights, and self-healing where appropriate.
- Define and track operational KPIs (availability, MTTR, MTTD, incident volume by cause, change success rate, alert noise, customer-impact minutes).
Cloud Infrastructure Engineering (GCP)
- Lead teams responsible for cloud infrastructure reliability, capacity planning, scaling, patching, cost optimization, and resiliency improvements within Google Cloud Platform (GCP).
- Ensure platform availability and disaster recovery posture through tested backups, failover processes, RPO/RTO alignment, and resilience engineering.
- Drive standardization of infrastructure as code, configuration management, and secure-by-design controls in partnership with Security.
Database Administration (Oracle)
- Lead database administration teams responsible for availability, performance, backup/recovery, patching, access controls, and operational support of