Senior Principal Infrastructure & Reliability Engineer
Oracle
About the role
Join Oracle's Health Data Intelligence (HDI) team as a Senior Principal Infrastructure & Reliability Engineer, specializing in Site Reliability Engineering for sophisticated healthcare analytics platforms. In this pivotal role, you will design, build, and manage robust and scalable infrastructure as well as data pipelines that drive mission-critical analytics on a global scale. You will play a key role in shaping the future of cloud operations by enhancing automation, observability, and AI-driven reliability practices. This includes exploring the implementation of Generative AI and intelligent automation to elevate incident response, system resilience, and operational efficiency. Collaborate with a dedicated team to deliver effective solutions that manage extensive datasets with accuracy and speed while continually advancing system reliability and operational excellence. U.S. citizenship is required for this position, as the successful candidate will need to obtain (and maintain) a U.S. government security clearance after hire.
Responsibilities
- Collaborate with the Site Reliability Engineering (SRE) team to share ownership of services and platform components. Gain a deep understanding of the entire system architecture, dependencies, and production performance.
- Design, build, and maintain reliable, scalable, and secure infrastructure to support large-scale analytics workloads.
- Enhance system reliability through automation, monitoring, and performance improvements.
- Promote the adoption of AI-assisted methodologies for operations, including:
  - Improving observability and alerting systems.
  - Supporting automated incident detection and remediation efforts.
  - Exploring intelligent automation for managing infrastructure lifecycles.
- Partner with development teams to enhance service architecture, scalability, and operability.
- Participate in on-call rotations and serve as an escalation point for complex production issues.
- Conduct root cause analysis and develop long-term fixes.
- Utilize your expertise in distributed systems to troubleshoot and optimize system performance.
- Drive ongoing improvements in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and large-scale automation.
Develop & Maintain
- Implement and enhance infrastructure for the Oracle HDI Analytics Platform.
- Ensure system uptime, reliability, and scalability.
AI-Driven Automation (NEW)
- Design and implement solutions powered by GenAI or autonomous agents for:
  - Enhanced observability and anomaly detection.
  - Automated incident triage and remediation.
  - Infrastructure provisioning and lifecycle management.
- Create tools and frameworks to support self-service and autonomous operations.
Data Pipeline Execution
- Construct and optimize scalable data pipelines using Vertica and ETL frameworks.
Operational Excellence
- Apply DevOps/SRE best practices to automate deployments and streamline operations.
- Enhance observability utilizing Prometheus/Grafana along with AI-driven insights.
Cloud Integration
- Support multi-cloud initiatives spanning OCI, AWS, and Azure.
- Optimize cost, performance, and compliance across different environments.
Incident Response
- Engage in on-call rotations.
- Implement preventative measures and automated solutions for remediation.
Collaboration
- Collaborate closely with engineers to achieve technical roadmaps.
- Engage in code reviews and contribute to infrastructure enhancements.
What You Bring
- Over 10 years of experience in software engineering, with more than 8 years in cloud infrastructure, SRE, or DevOps.
- Proven record of overseeing production system reliability in cloud environments.
Core Expertise
- Cloud infrastructure design and automation.
- Understanding of distributed systems and performance optimization.
- Knowledge of data warehousing and ETL frameworks.
AI-Native Experience
- Demonstrated experience in applying GenAI, LLMs, and agentic frameworks in infrastructure or operations settings.
- Experience in building or integrating AI-driven automation within DevOps/SRE workflows.
- Familiarity with tools such as LangChain, AutoGPT, or custom AI agents.
Technical Skills
- Expertise in Terraform, Docker, and Kubernetes.
- Proficiency in observability stacks (Prometheus, Grafana).
- Strong coding skills in Python, Java, or Go.
Additional Strengths
- A problem-solving mindset focused on automation and scalability.
- Experience in enhancing system reliability through intelligent automation techniques.
Preferred Qualifications
- Experience in the healthcare sector or regulated environments (HIPAA, compliance frameworks).
- Familiarity with Oracle HDI or other large-scale analytics platforms.
- Prior experience in environments requiring a security clearance.
- Experience in developing self-healing or autonomous infrastructure systems.
Why Join Oracle HDI?
- Take ownership of and influence AI-native SRE and automation strategies for a mission-critical platform.
- Engage with large-scale, data-intensive healthcare systems.
- Be part of Oracle's commitment to AI-driven infrastructure and healthcare innovation.
- Help build the future of autonomous, self-healing cloud platforms.
- Collaborate with top-tier engineers to address complex, real-world challenges.
About Us
Oracle unites data, infrastructure, applications, and expertise to drive innovations across industries, enhancing care and outcomes globally. With AI embedded in our products and services, we empower customers to translate that promise into a better future for all.
True innovation flourishes when everyone can contribute. We are committed to cultivating a diverse workforce that provides opportunities for all, supported by competitive benefits that include flexible medical, life insurance, and retirement options. We encourage employees to engage with their communities through our volunteer programs.
We prioritize inclusivity for individuals with disabilities at all phases of the employment process. If you require assistance or accommodations due to a disability, please contact us.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, and protected veteran status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records in accordance with applicable law.