Site Reliability Engineer III
CSC Holdings LLC
About the role
Job Summary
As a Site Reliability Engineer III, you will be a primary driver in the long-term management and stabilization of our Hybrid Cloud infrastructure. We maintain a permanent dual-hosting strategy, operating both Google Cloud Platform (GCP) and a mission-critical On-Premises Unix/Linux footprint. You will bridge the gap between physical hardware and modern cloud-native operations, applying software engineering principles to ensure our systems are scalable, secure, and predictable across all platforms.
The Mission: Hybrid Reliability & Stabilization
Your mission is to unify our GCP and On-Premises environments into a single, reliable platform. Your first 12 months will focus on Stabilization and Observability. You will lead the transition away from "toil" (manual, repetitive operations) toward high-leverage automation, aggressively addressing on-prem technical debt while implementing modern SRE practices across our global data centers and cloud projects.
Responsibilities
- Hybrid Platform Standardization: Audit, harden, and standardize Unix (Solaris/AIX) and Linux (RHEL/Ubuntu) environments across both GCP Compute Engine and physical bare-metal servers.
- Infrastructure Stewardship (DC Support): Serve as the engineering lead for our Eastern U.S. data centers; ensure hardware health, power redundancy, and physical security standards are enforced through code and automated checks.
- Storage Engineering (Specialization): Architect and manage enterprise-grade SAN/NAS environments alongside GCP Cloud Storage/Persistent Disk. Optimize for low latency and high IOPS while ensuring all data-at-rest complies with our Annual Encryption Strategy.
- Automation of Toil: Design and maintain robust automation pipelines (Ansible, Terraform, Python) to ensure configuration parity and eliminate drift between cloud and on-premises environments.
- Vulnerability Management: Transition the fleet from a "vulnerable" state to a "reliable" one by establishing a sustainable, automated monthly patching cadence.
- Unified Observability: Implement and scale a "single pane of glass" monitoring stack (Prometheus, Grafana, Loki) to provide real-time health metrics for the entire hybrid estate.
- Incident Response & Post-Mortems: Participate in a sustainable on-call rotation. Lead Blameless Post-Mortems for incidents involving cross-platform dependencies to ensure we "fix the system, not the person."
Qualifications
Technical Requirements (SRE3)
- OS Internals: Deep proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX) administration and kernel tuning
- Cloud Proficiency: Hands-on experience with GCP (IAM, VPC, Compute Engine) or equivalent public cloud providers
- Infrastructure as Code: Proven ability to manage complex environments using Terraform and Ansible
- Storage Protocols: Proficiency in Fibre Channel, iSCSI, and NFS. Experience with enterprise arrays (NetApp, Dell/EMC, or Pure Storage) is highly preferred
- Software Engineering: Strong scripting ability in Python or Go to build internal tools and automation
- Security: Strong understanding of CVE lifecycles and cryptographic standards (AES-256)
The Ideal Candidate
- Bachelor’s degree in Telecommunications, Computer Engineering, or related discipline
- 6+ years of experience in IP networking and infrastructure support, with at least 4 years in reliability-focused roles