Site Reliability Engineer

FarmGPU

Placerville · On-site · Full-time · Mid Level · 4d ago

About

FarmGPU is redefining the future of GPU-powered cloud computing, delivering cost-effective, scalable, high-performance GPU infrastructure tailored for AI developers, startups, and enterprises globally. Our vertically integrated platform transforms data centers into AI-optimized facilities, accelerates storage-intensive training and inference workflows, and delivers on-demand compute via strategic partnerships such as with RunPod Secure Cloud. With sustainability, performance, and innovation at our core, we challenge the status quo of traditional cloud providers.

What You'll Do

  • Monitor and maintain production systems across GPU servers, storage, and networking using Grafana dashboards, alerting pipelines, and documented runbooks; respond to incidents and escalate appropriately.
  • Troubleshoot and resolve issues on Linux-based bare-metal systems: service failures, hardware faults, network degradation, and storage anomalies.
  • Execute and improve automation using existing Ansible playbooks and Python/bash scripts; identify operational gaps and contribute improvements to reduce manual intervention.
  • Manage configuration and deployments across the server fleet using pull-based configuration management tooling, ensuring consistency and auditability.
  • Coordinate hardware maintenance: node replacements, firmware updates, drive swaps, and hands‑on rack‑level operations in our datacenter.
  • Support production reliability for customer‑facing GPU compute workloads hosted on RunPod Secure Cloud and direct enterprise deployments.
  • Develop and track SLIs and SLOs in partnership with the engineering team to measure and improve service reliability.
  • Participate in on‑call and shift rotations, including evenings, nights, and weekends as part of 24/7 operations coverage.
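
To illustrate the SLI/SLO tracking mentioned above, here is a minimal sketch of error-budget arithmetic in Python. The 99.9% target, 30-day window, and downtime figure are hypothetical numbers, not values from this posting:

```python
# Minimal error-budget sketch (hypothetical SLO and incident numbers):
# given a 99.9% availability SLO over a 30-day window, how much
# downtime budget remains after the incidents logged so far?

def remaining_error_budget(slo: float, window_minutes: int,
                           downtime_minutes: float) -> float:
    """Return the unused error budget in minutes (negative if blown)."""
    budget = (1.0 - slo) * window_minutes   # total allowed downtime
    return budget - downtime_minutes

WINDOW = 30 * 24 * 60  # 30-day window in minutes (43,200)
budget_left = remaining_error_budget(0.999, WINDOW, downtime_minutes=12.5)
print(f"{budget_left:.1f} minutes of error budget remain")  # 43.2 - 12.5 = 30.7
```

In practice these inputs would come from Prometheus availability metrics rather than hard-coded values; the arithmetic is the same.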

What You Bring

  • Strong working knowledge of Linux systems—comfortable with the command line, process/service management, log analysis, and hands‑on troubleshooting in a production environment.
  • Experience with monitoring and observability tools, particularly Grafana and Prometheus—able to navigate dashboards, interpret metric trends, and act on alerts.
  • Proficiency in scripting and automation: Python and/or bash for operational task automation; experience running Ansible playbooks in production.
  • Solid understanding of distributed system concepts and the ability to troubleshoot complex issues across multiple layers of the stack.
  • Familiarity with datacenter networking fundamentals: IP addressing, VLANs, switching, OSI layers 3/4—enough to diagnose and resolve common connectivity issues.
  • Experience with bare‑metal server environments, including hardware diagnostics, BMC/IPMI management, and routine maintenance.
  • Working knowledge of containerization: Docker and/or Kubernetes at an operational level.
  • Solid troubleshooting methodology and attention to detail; comfortable following and improving documented runbooks.
  • Willingness to work on‑site in Rancho Cordova, CA, including shift rotations covering evenings, nights, and weekends.
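
As a small illustration of the networking fundamentals listed above, the following Python sketch checks whether a host address actually belongs to the subnet its switch port's VLAN carries—a common first step when diagnosing "host unreachable" reports. The addresses are made up for the example:

```python
# Hypothetical VLAN/subnet sanity check using only the standard library.
import ipaddress

def on_subnet(host_ip: str, vlan_cidr: str) -> bool:
    """True if host_ip falls inside the VLAN's subnet."""
    return ipaddress.ip_address(host_ip) in ipaddress.ip_network(vlan_cidr)

print(on_subnet("10.20.30.17", "10.20.30.0/24"))  # True
print(on_subnet("10.20.31.17", "10.20.30.0/24"))  # False — likely a mis-tagged port
```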

Preferred Qualifications

  • 3+ years in a production SRE, DevOps, or infrastructure operations role.
  • Experience implementing and tracking SLIs and SLOs for production services.
  • Familiarity with GPU server environments (NVIDIA H100/H200/B200) or HPC infrastructure.
  • Experience with storage platforms such as NVMe, NAS, or VAST Data in a production setting.
  • Exposure to security and compliance practices: secret management, access control, Linux hardening, SOC 2 familiarity.
  • Experience with cloud platforms (AWS, GCP, or Azure) or hybrid datacenter/cloud environments.
  • Relevant certifications such as RHCSA, CKA, or AWS Certified DevOps Engineer.

Why FarmGPU?

  • Hands‑on work with cutting‑edge hardware—you'll operate some of the most advanced AI compute infrastructure available.
  • Strong technical team—our senior SREs and software engineers have deep expertise and are invested in building solid operational practices.
  • High ownership—your work directly impacts the reliability of customer AI workloads.
  • Located in Rancho Cordova, CA, in the heart of a growing AI and robotics ecosystem.

Compensation

This is a full‑time, on‑site position in Rancho Cordova, CA. Remote work is not available for this role.

Job ID: #J-18808-L

Skills

Ansible · Bash · BMC/IPMI · Docker · Grafana · Kubernetes · Linux · NVIDIA H100/H200/B200 · NVMe · OSI · Prometheus · Python · SOC 2 · VLANs
