Site Reliability Engineer (SRE) – AI & Incident Management

Praxis HR Solution

Gurugram · On-site Full-time 3w ago

About the role

Job Title

Site Reliability Engineer (SRE) – AI & Incident Management

Location

Pune | Gurugram | Noida (Hybrid / On-site)

Employment Type

Full-Time

Notice Period

Immediate Joiners to 30 Days

Job Summary

We are looking for a highly motivated Site Reliability Engineer (SRE) with strong expertise in AI-driven systems and incident management. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of critical production systems. This role requires hands-on experience in automation, monitoring, and incident response to maintain high system availability.

Key Responsibilities

• Ensure high availability, reliability, and performance of production systems.
• Monitor infrastructure and applications to detect and resolve issues proactively.
• Manage incident response, troubleshooting, and root cause analysis (RCA).
• Implement automation to improve operational efficiency and reduce manual effort.
• Work closely with development teams to improve system reliability and deployment processes.
• Use AI/ML tools or AI-enabled platforms to enhance monitoring and incident prediction.
• Maintain SLA, SLO, and SLI metrics for system reliability.
• Build and maintain observability solutions (logging, metrics, tracing).
• Participate in on-call rotations and handle production incidents.

Required Skills

• Strong experience in Site Reliability Engineering (SRE)
• Hands-on experience with Incident Management and Production Support
• Knowledge of AI tools / AI-driven automation / AI-based monitoring
• Experience with Cloud Platforms (AWS / Azure / GCP)
• Familiarity with Monitoring Tools (Prometheus, Grafana, Datadog, Splunk, etc.)
• Experience with Linux / scripting (Python, Bash)
• Knowledge of CI/CD pipelines and DevOps practices
• Understanding of containerization (Docker, Kubernetes)

Preferred Qualifications

• Experience with AIOps platforms
• Knowledge of Infrastructure as Code (Terraform / Ansible)
• Strong debugging and problem-solving skills
• Experience working in high-availability distributed systems

Why Join Us

• Opportunity to work on modern AI-driven infrastructure
• Exposure to large-scale production environments
• Collaborative and growth-focused work culture

How to Apply

Interested candidates with a notice period of immediate to 30 days can apply via Indeed or share their updated resume.

Job Types: Full-time, Permanent

Pay: ₹1,200,000.00 per year

Benefits:
• Cell phone reimbursement
• Food provided
• Health insurance
• Paid sick time
• Paid time off
• Provident Fund
• Work from home

Work Location: In person
