Principal Site Reliability Engineer (SRE)

Oracle

Harrisburg · On-site · Full-time · Lead · $86k – $200k/yr · Posted today

About the role

Our Team

Building on our Cloud momentum, Oracle has formed a new organization: Oracle Health Data & Analytics Platform. This team focuses on product development and strategy for Oracle Health, building a complete platform that supports modernized, automated healthcare. It is a net‑new line of business with an entrepreneurial spirit, aiming to become a world‑class engineering center.

About the Job

The Principal Site Reliability Engineer (SRE) provides technical leadership for the core data platforms behind Oracle Health’s Data & Analytics Platform. You will own shared, mission‑critical systems used by multiple products and teams, leading the design and operation of large‑scale, stateful distributed platforms (Hadoop ecosystem components, Kafka, Storm) deployed on Oracle Big Data Service and managed via Ansible‑ and Terraform‑based automation.

What You'll Do

Platform Ownership & Technical Leadership

  • Own end‑to‑end reliability, scalability, and operability of shared data platforms
  • Define platform standards, architectural direction, and operational guardrails
  • Influence cross‑team technical decisions and long‑term platform strategy
  • Drive platform evolution and reliability strategy across the data ecosystem

Architecture & Design

  • Lead platform architecture and design reviews
  • Clearly articulate system behavior, dependencies, and failure modes
  • Make principled trade‑offs between reliability, performance, cost, and complexity
  • Provide guidance and guardrails enabling downstream teams to use platforms safely

Operations Engineering

  • Establish capacity models, scaling strategies, and operational best practices
  • Design platforms that behave predictably under load, failure, and change
  • Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery

Distributed Systems Expertise

  • Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
  • Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades

Security

  • Operate and maintain Kerberized platforms, including authentication, authorization, and secure service‑to‑service communication
  • Treat security as a first‑class architectural concern

Automation

  • Design and evolve an Ansible‑ and Terraform‑driven automation framework
  • Treat automation as production software: versioned, reviewed, tested, and improved
  • Eliminate operational toil by encoding reliability and safety into the platform

Incident Leadership & Prevention

  • Serve as the ultimate escalation point for complex or ambiguous incidents
  • Focus on eliminating entire classes of failure, not just resolving individual issues

Representation

  • Represent SRE and platform engineering in high‑visibility and sensitive forums
  • Communicate clearly with engineering leadership and partner teams

Responsibilities

The team operates within the Oracle Health Data & Analytics Platform, supporting the core product HealtheIntent. You will manage the big‑data and streaming infrastructure that enables downstream teams to deliver reliable customer‑facing solutions at scale while continuously improving operability and efficiency.

Required Experience

  • 8+ years operating large‑scale, customer‑facing distributed platforms
  • Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems
  • Strong background in Linux, networking, and distributed‑system troubleshooting
  • Infrastructure‑as‑Code using Ansible and Terraform
  • Scripting and automation using Python, Ruby, and Bash
  • Hands‑on experience operating Kerberized environments
  • Proven ability to define and document technical architecture for complex systems
  • Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers
  • Experience designing observability and capacity models for distributed platforms

Required Qualifications

  • U.S. Citizenship and eligibility for a Federal Security Clearance
  • 10+ years of technical experience relevant to this position
  • Ability to communicate effectively and build rapport with team members
  • BS or MS in Computer Science, or equivalent

Benefits

  1. Medical, dental, and vision insurance, including expert medical opinion
  2. Short‑term disability and long‑term disability
  3. Life insurance and AD&D
  4. Supplemental life insurance (Employee/Spouse/Child)
  5. Health care and dependent care Flexible Spending Accounts
  6. Pre‑tax commuter and parking benefits
  7. 401(k) Savings and Investment Plan with company match
  8. Paid time off: Flexible Vacation for salaried employees; accrued vacation for others (13 days/year first 3 years, 18 days thereafter)
  9. 11 paid holidays
  10. Paid sick leave: 72 hours upon hire, refreshed each calendar year (carry‑over up to 112 hours)
  11. Paid parental leave
  12. Adoption assistance
  13. Employee Stock Purchase Plan
  14. Financial planning and group legal services
  15. Voluntary benefits including auto, homeowner, and pet insurance

Disclaimer

Certain US customer or client‑facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.

Range and benefit information provided in this posting are specific to the stated locations only.
US hiring range: $86,400 – $199,500 per annum. May be eligible for bonus and equity.


Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, or any other characteristic protected by law.

Skills

Ansible, Bash, Hadoop, HBase, HDFS, Kafka, Kerberos, Linux, networking, Python, Ruby, Storm, Terraform, YARN
