Senior AI Infrastructure Reliability Specialist
Oracle
About the Role
Join Oracle's Health Data Intelligence (HDI) team as a Senior AI Infrastructure Reliability Specialist, where you'll be at the forefront of Site Reliability Engineering for cutting-edge healthcare analytics platforms. In this role, you'll design, build, and operate the robust infrastructure and data pipelines that support mission-critical analytics at global scale.
Your contributions will help shape the future of cloud operations through automation, observability, and AI-assisted reliability practices. You'll explore innovative uses of Generative AI and intelligent automation to streamline incident response, increase system resilience, and improve operational efficiency.
You'll collaborate with a dynamic team to deliver solutions that manage large datasets with high precision and performance, while continually improving system reliability and operational excellence.
Important: U.S. citizenship is required for this position, as candidates will need to obtain and maintain a U.S. government security clearance after hire.
Responsibilities
- Collaborate with the Site Reliability Engineering (SRE) team to share ownership of services and platform components, developing an in-depth understanding of end-to-end system architecture, dependencies, and production behavior.
- Design, build, and operate resilient, scalable, and secure infrastructure tailored to large-scale analytics workloads.
- Enhance system reliability through automation, effective monitoring, and performance optimization.
- Participate in AI-assisted operational improvements such as:
  - Enhancing observability and alerting systems.
  - Supporting automated incident detection and remediation techniques.
  - Exploring intelligent automation strategies for infrastructure lifecycle management.
- Partner with development teams to refine service architecture, scalability, and maintainability.
- Engage in on-call rotations and act as an escalation point for complex production issues.
- Conduct root cause analysis and implement lasting fixes to prevent recurrence.
- Leverage expertise in distributed systems to troubleshoot and optimize performance.
- Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and large-scale automation efforts.
Develop & Maintain
- Implement and optimize infrastructure specifically for the Oracle HDI Analytics Platform.
- Ensure system uptime, reliability, and scalability.
AI-Driven Automation
- Design and develop GenAI-powered or agent-based solutions for:
  - Observability and anomaly detection.
  - Incident triage and remediation.
  - Infrastructure provisioning and lifecycle management.
- Create tools and frameworks that enable self-service and autonomous operations.
Data Pipeline Execution
- Build and optimize scalable data pipelines using Vertica and ETL frameworks.
Operational Excellence
- Apply DevOps/SRE practices for automating deployments and operations.
- Enhance observability using Prometheus/Grafana along with AI-driven insights.
Cloud Integration
- Support multi-cloud initiatives across OCI, AWS, and Azure.
- Optimize cost, performance, and compliance across these cloud environments.
Incident Response
- Participate in on-call rotations.
- Implement preventative and automated remediation strategies.
Collaboration
- Work closely with engineers to execute technical roadmaps.
- Contribute to code reviews and infrastructure enhancements.
What You Bring
- 8+ years in software engineering, with a minimum of 5 years in cloud infrastructure, SRE, or DevOps.
- Proven track record of ensuring production system reliability in cloud settings.
Core Expertise
- Expertise in cloud infrastructure design and automation.
- Experience with distributed systems and performance optimization.
- Knowledge of data warehousing and ETL frameworks.
AI-Native Experience
- Demonstrated application of GenAI / LLMs / agentic frameworks in infrastructure or operations.
- Experience creating or integrating AI-driven automation in DevOps/SRE workflows.
- Familiarity with tools like LangChain, AutoGPT, or custom AI agents.
Technical Skills
- Proficient in Terraform, Docker, Kubernetes.
- Familiarity with observability stacks (Prometheus, Grafana).
- Strong programming skills in Python, Java, or Go.
Additional Strengths
- Strong problem-solving mindset focused on automation and scalability.
- Experience improving system reliability through intelligent automation.
Preferred Qualifications
- Experience in healthcare or regulated environments (HIPAA, compliance frameworks).
- Familiarity with Oracle HDI or large-scale analytics platforms.
- Previous work in environments requiring security clearance.
- Experience in creating self-healing or autonomous infrastructure systems.
Why Join Oracle HDI?
- Take ownership of and shape the AI-native SRE and automation strategy for a crucial platform.
- Engage with large-scale, data-intensive healthcare systems.
- Be part of Oracle's commitment to AI-driven infrastructure and healthcare innovation.
- Contribute to the future of self-healing cloud platforms.
- Work alongside top-notch engineers addressing complex, real-world challenges.
Career Level
IC3
This position may require compliance with applicable health mandates.
Compensation
U.S. Hiring Range: $79,100 - $158,200 per annum. The role may also be eligible for bonus and equity.
Oracle maintains broad salary ranges to consider variations in skills, experience, and market conditions.
Benefits
Oracle US provides a comprehensive benefits package, including:
- Medical, dental, and vision plans.
- Short- and long-term disability coverage.
- Life insurance and AD&D.
- Flexible Spending Accounts.
- 401(k) savings plan with company match.
- Paid vacation and holidays.
- Paid parental leave.
- Employee Stock Purchase Plan.
Applications are generally accepted for at least three calendar days or until the position is filled.
About Us
Oracle integrates data, infrastructure, applications, and expertise to fuel innovations, including life-saving care. With AI embedded in our offerings, we help turn potential into reality. At Oracle, we empower every employee to contribute, and we nurture a diverse workforce with competitive benefits. Oracle is also committed to supporting community involvement through volunteer initiatives.
Oracle is an Equal Employment Opportunity Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status.