Junior Site Reliability Engineer

Gruve

Pimpri-Chinchwad · On-site · Full-time · Entry Level · Posted today

About the role

About Gruve

Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help our customers use their data to make more intelligent business decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks.

Position Summary

We are seeking a technically strong and detail-oriented Junior Site Reliability Engineer – GPU & Kubernetes to join our Infrastructure and ML Operations team. This role is responsible for supporting GPU-powered inference platforms, Kubernetes clusters, and ML Ops toolchains across diverse environments. The engineer will work closely with senior team members to deploy, maintain, and troubleshoot systems ranging from bare-metal infrastructure to containerized applications.

Key Responsibilities

Assist with Kubernetes operations on platforms such as Rafay, OpenShift, Mirantis, native Kubernetes, and Rancher.
Support GPU infrastructure: CUDA, ROCm, DGX/HGX reference architectures, NIMs, operators, and exporters.
Help configure and maintain ML Ops toolchains (e.g., Kubeflow), including model upgrades and lifecycle workflows.
Participate in on-call rotations, follow runbooks, and perform first-level incident triage.
Contribute to automation and infrastructure-as-code for repeatable operations.

Basic Qualifications

1–2 years of experience in Linux, scripting (Python, Bash, or Go), or cloud operations.
Familiarity with containerization platforms (pods, services, namespaces, persistent volumes).
Foundational understanding of GPU or ML Ops concepts.
Understanding of Kubernetes fundamentals.

Preferred Qualifications

Exposure to CNI networking (Calico, SR-IOV, Isovalent).
Experience with Kubeflow or similar ML Ops systems.
Familiarity with GPU operators or exporters.

Why Gruve

At Gruve, we foster a culture of innovation, collaboration, and continuous learning. We are committed to building a diverse and inclusive workplace where everyone can thrive and contribute their best work. If you’re passionate about technology and eager to make an impact, we’d love to hear from you.

Gruve is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.

Skills

Bash, Cloud operations, Containerization, CUDA, DGX/HGX, Go, GPU, Kubernetes, Kubeflow, Linux, LLMs, ML Ops, NIMs, OpenShift, Operators, Pods, Python, ROCm, Rancher, Rafay, Services, Volumes
