Senior Site Reliability Engineer-III

GreyOrange Robotics Inc.

India · On-site · Full-time · Senior · 2w ago

About the role

As a Site Reliability Engineer (SRE) at Grey Orange Inc., you will play a crucial role in ensuring the reliability and scalability of our cloud-native microservices infrastructure. Your responsibilities will include:

- Defining and enforcing Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across microservices.
- Architecting an observability stack comprising metrics, logs, and traces to drive operational insights.
- Automating toil and manual operations through the development of robust tooling and runbooks.
- Owning the incident response lifecycle, including detection, triage, Root Cause Analysis (RCA), and postmortems.
- Collaborating with product teams to build fault-tolerant systems.
- Championing performance tuning, capacity planning, and scalability testing.
- Optimizing costs while maintaining the reliability of our cloud infrastructure.

To succeed in this role, you must have the following skills:

- 6+ years of experience in SRE, Infrastructure, or Backend roles using Cloud Native Technologies.
- 2+ years of specific experience in an SRE role.
- Strong familiarity with monitoring and observability tools such as Datadog, Prometheus, Grafana, and ELK.
- Experience with infrastructure-as-code tools like Terraform and Ansible.
- Proficiency in Kubernetes, service mesh technologies like Istio or Linkerd, and container orchestration.
- Deep understanding of distributed systems, networking, and failure domains.
- Expertise in automation using Python, Bash, or Go.
- Proficiency in incident management, SLAs/SLOs, and system tuning.
- Hands-on experience with GCP, AWS, or Azure, along with cloud cost optimization.
- Participation in on-call rotations and experience in running large-scale production systems.

Additionally, the following nice-to-have skills would be beneficial:

- Familiarity with chaos engineering practices and tools such as Gremlin and Litmus.
- Background in performance testing and load simulation using tools like Gatling, Locust, k6, or JMeter.

Working at Grey Orange Inc., you will have the opportunity to collaborate with a lean team of passionate and talented individuals who share a common goal of supercharging brick-and-mortar retail stores in the e-commerce era. The company values experimentation and encourages thinking outside the box. Moreover, a culture of problem-solving is deeply ingrained in our DNA, offering you the chance to make a significant impact with your work.

Grey Orange Inc. is an equal employment opportunity employer that values diversity and inclusivity. We do not discriminate against any applicant or employee based on various protected categories, ensuring a fair and inclusive work environment for all.
