Service Reliability Engineer (SRE / Site Reliability Engineer)

The Hartford India

Hyderabad · On-site · Full-time · Senior · 2d ago

About the role

About the Company

Our client is a leader in property and casualty insurance, employee benefits and mutual funds. One of the largest insurers in the United States with many decades of expertise, this company is widely recognized for its service excellence, sustainability practices, trust and integrity.

Role Overview

We are seeking an experienced and highly motivated Sr Staff Reliability Engineer. The Sr Staff Reliability Engineer will have end-to-end accountability for the reliability of IT services within a defined application portfolio. The role requires a “build-to-manage”, problem-solving and innovative mindset, applied to the design, build, testing, deployment, change and maintenance of services and drawing on deep engineering expertise. The Sr Staff Reliability Engineer will actively contribute to the sustained advancement of the Reliability Engineering (RE) practice within and beyond a given area of responsibility.

Key Measures of Success

Service stability
Effective delivery and environment instrumentation
Deployment quality
Technical debt reduction
Asset resiliency
Risk/security compliance
Cost efficiency
Proactive and preventative maintenance mechanisms
Top quartile operating norms

Responsibilities

Guide the use of best-in-class software engineering standards and design practices for instrumenting code/application technology stack to enable the generation of relevant metrics on overall technology health – availability, performance, quality, currency and resiliency.
Serve as key liaison between the architecture and software engineering teams to influence the technical strategy for the organization, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization.
Function as the go-to technical leader for the applications supported, requiring depth and breadth of knowledge in technologies, applications, integration, interfaces and business domain.
Design, build, and maintain scalable and reliable systems for production environments.
Identify and mitigate risks to system reliability, security, and performance.
Develop effective tooling, alerts, and response mechanisms to identify and address reliability risks leveraging automation to support problem prevention, detection, mitigation, and resolution.
Enhance the delivery flow by engineering the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.

IT Ops Responsibilities

Independently drive the triaging and service restoration of all high impact incidents in order to minimize the mean time to service restoration and impact to the business.
Partner with infrastructure teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities and automated service restoration processes.
Achieve and maintain the continuity of Hartford and third-party assets that support a business function.
Keep the IT application and infrastructure metadata repositories current.

Required Skills & Experience

End-to-end systems thinking: broad understanding of enterprise architectures and complex (backend) systems, beyond the individual component.
Expert experience with Performance and Observability tools such as DynaTrace, Splunk, TrueSight, CloudWatch, CloudTrail, and related tools.
Strong solution architecture orientation to enable expedient troubleshooting, issue-resolution and root-cause removal in a hybrid cloud environment.
Experience with continuous integration and DevOps methodologies; preferred tools include GitHub, Jenkins, Nexus, Rally, and SonarQube.
Experience with cloud platforms (AWS, GCP, or Azure)
Deep understanding of Linux systems, containers (Docker), and orchestration tools (Kubernetes)
Strong hybrid cloud experience (private and public) across various service delivery models – IaaS, PaaS, SaaS.
Strong communication (verbal and written), collaboration, and negotiation skills, working in diverse teams across business units
Understanding of FinOps or cost-optimization practices in the cloud.
Experience with API gateways and network-level observability.
Experience in regulated environments (Insurance)
AWS Solutions Architect certification

Skills

AWS, Azure, CloudWatch, CloudTrail, Docker, GCP, GitHub, Jenkins, Kubernetes, Linux, Nexus, Rally, Splunk, SonarQube, TrueSight, DynaTrace
