Senior Infrastructure and Reliability Engineer

Oracle

Dover · On-site · Full-time · Senior · $79k – $158k/yr · 2mo ago

About the role

Job Overview

Become a vital member of Oracle's Health Data Intelligence (HDI) team as a Senior Infrastructure and Reliability Engineer, where your expertise in Site Reliability Engineering will help optimize large-scale healthcare analytics platforms. In this role, you will be responsible for designing, building, and operating resilient and scalable infrastructure and data pipelines that support essential global analytics.

Join us in transforming cloud operations through the advancement of automation, observability, and AI-driven reliability practices. You will have the opportunity to leverage Generative AI and intelligent automation to enhance incident response, increase system resilience, and boost operational efficiency.

Collaborate with a dynamic team to create reliable solutions capable of handling massive datasets with efficiency, while focusing on continuous improvement of system reliability and operational excellence.

U.S. citizenship is required for this position, as the successful candidate will need to obtain (and maintain) a U.S. government security clearance after hire.

Key Skills Required

Infrastructure & Reliability
- Proven experience creating and managing high-availability, fault-tolerant systems.
- Understanding of distributed systems, performance monitoring, and resiliency patterns.
- Expertise in incident response, root-cause analysis, and troubleshooting production issues.
AI-Native Engineering
- Hands-on experience with Generative AI or Agentic AI applications for:
  - Infrastructure lifecycle management.
  - Observability and anomaly detection.
  - Automated incident response and remediation.
- Experience designing AI-driven workflows to enhance operational efficiency and reliability.
- Experience building or integrating autonomous agents for SRE and DevOps solutions.
Cloud & Multi-Cloud Ecosystems
- Extensive experience with multi-cloud environments including OCI, AWS, and Azure.
- Strong proficiency in cloud infrastructure design, deployment, and resource optimization.
- Experience managing hybrid or cross-cloud architectures.
DevOps/SRE Practices
- Expertise in CI/CD pipelines (Jenkins, Kubernetes).
- Familiarity with Infrastructure as Code (Terraform).
- Experience with observability tools (Prometheus, Grafana).
- A focus on automation-first operations.
Data Technologies
- Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake).
- Experience with ETL frameworks for large-scale data processing.
- Understanding of columnar storage systems.
BI & Reporting
- Experience with BI tools like Tableau, Power BI, or Oracle Analytics.
Programming & Tools
- Strong programming skills in Python, Java, or Go.
- Experience with Docker, Kubernetes, and shell scripting.
Problem-Solving
- Excellent troubleshooting skills with an ability to conduct root-cause analysis.
- Experience in resolving complex production challenges in distributed systems.

Responsibilities

Collaborate with the Site Reliability Engineering (SRE) team in managing platform components and delivering services. Develop a comprehensive understanding of system architecture and performance.
Design, construct, and maintain dependable, scalable, and secure infrastructure for large-scale analytics workloads.
Enhance system reliability via automation, performance optimization, and monitoring.
Drive the incorporation of AI-assisted methods for operational tasks, including improving observability and alerting, automating incident detection, and exploring AI for infrastructure management.
Work alongside development teams to refine service architecture and improve scalability.
Participate in on-call rotations to support complex production issues.
Conduct root cause analysis and implement long-term solutions to minimize recurrences.
Use distributed systems knowledge to troubleshoot issues and enhance system performance.
Promote continuous improvement in DevOps/SRE practices, including CI/CD and automation at scale.

Development & Maintenance

Implement and optimize infrastructure for the Oracle HDI Analytics Platform, ensuring uptime and scalability.

AI-Driven Automation

Design and apply GenAI-powered or agent-based solutions for observability, incident management, and infrastructure lifecycle activities.
Create tools and frameworks that facilitate self-service and autonomous operations.

Data Pipeline Utilization

Build and enhance scalable data pipelines using Vertica and ETL technologies.

Operational Excellence

Utilize DevOps/SRE practices to automate operations and deployments.
Improve observability using Prometheus/Grafana along with AI insights.

Cloud Integration

Support multi-cloud strategies across OCI, AWS, and Azure, optimizing costs and performance.

Incident Management

Engage in on-call rotations and develop automated remediation solutions.

Collaboration

Work closely with engineers to fulfill technical objectives.
Contribute to code reviews and enhance infrastructure practices.

Your Qualifications

8+ years of software engineering experience, including 5+ years in cloud infrastructure, SRE, or DevOps.
Demonstrated accountability for maintaining production system reliability in cloud environments.

Your Expertise

Cloud infrastructure design and automation.
Distributed systems and performance optimization.
Data warehousing and ETL frameworks.

AI Experience

Evidence of applying GenAI / LLMs / agent frameworks in operations.
Experience creating AI-powered automation for DevOps/SRE.
Familiarity with tools such as LangChain, AutoGPT, or custom AI agents.

Preferred Background

Experience in the healthcare sector or regulated environments (HIPAA, compliance frameworks).
Knowledge of Oracle HDI or large-scale analytics systems.
Background in environments requiring security clearance.
Experience developing self-healing or autonomous infrastructure solutions.

Why Work with Oracle HDI?

Shape the AI-native SRE and automation strategy for a crucial platform.
Engage with extensive, data-driven healthcare systems.
Contribute to Oracle's focus on AI-enhanced infrastructure and healthcare advancements.
Help build the future of self-healing cloud frameworks.
Collaborate with leading engineers on significant, real-world challenges.

Career Level - IC3

Disclaimer:

Certain U.S.-based roles may be subject to immunization or occupational health mandates, or to drug testing requirements.

Range and benefit information specified here is for designated locations only.

US: The salary range is $79,100 to $158,200 per annum. The role may be eligible for bonuses and equity.

Oracle has broad salary ranges to account for differences in experience, skills, market conditions, and locations and to reflect various products, industries, and lines of business.

Candidates are placed within salary ranges based on these factors and internal equity.

Oracle US offers extensive benefits, including medical, dental, and vision insurance; short-term and long-term disability; life insurance; flexible spending accounts; 401(k) with matching; flexible vacation; paid holidays; sick leave; parental leave; employee stock purchase plans; and more.

The role typically accepts applications for at least three calendar days from posting or until the position is filled.

About Us

Oracle combines data, infrastructure, applications, and expertise to drive innovations across industries, providing vital support to healthcare systems. With embedded AI across our offerings, we enable users to transform their visions into impactful realities. Join a company leading AI and cloud solutions for better outcomes.

We empower a diverse workforce to thrive with competitive benefits, flexible options, and opportunities to give back through community service initiatives.

Our commitment extends to inclusivity for individuals with disabilities throughout the hiring process. Please reach out to request accommodations if needed.

Oracle is an Equal Employment Opportunity Employer, welcoming all qualified applicants regardless of race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or veteran status.

Skills

AWS, Azure, CI/CD, Docker, ETL, Grafana, Go, Infrastructure as Code, Jenkins, Java, Kubernetes, OCI, Prometheus, Python, SRE, Terraform, Vertica