Site Reliability Principal Specialist, IT Operations
Sherweb
About the role
Location
Remote (Canada) – Fully remote position, open to candidates based in Canada
About Sherweb
We work to simplify the cloud for IT professionals so they can focus on what really matters: making their customers’ lives better. Find out how we do that here: www.sherweb.com/about/.
Sherweb is, above all, an environment where the needs of our customers are at the heart of our actions. We are committed to living our values of passion, teamwork and integrity every day.
Overview
The Site Reliability Principal Specialist on the IT Operations team implements a proactive, resilient, and scalable approach to site reliability across all Sherweb platforms.
This is a senior technical individual contributor position responsible for shaping how reliability is designed, governed, and sustained across systems. The role elevates reliability from reactive operations to an engineered discipline that is intentional, measurable, and scalable, so that platforms operate predictably as Sherweb grows in scale, complexity, and customer impact.
Operating at a broad organizational scope, this role acts as a principal‑level technical leader across IT Operations. It sets reliability direction and drives consistency through technical authority, influence, and partnership. The role serves as a technical counterpart to senior engineering, infrastructure, and platform leaders to shape operational strategy across multiple teams.
Responsibilities
- Define and evolve reliability standards across platforms and services, including service level objectives (SLOs) and service level indicators (SLIs), to improve mission‑critical services.
- Establish a shared reliability language and expectations across IT Operations teams.
- Drive consistency in monitoring and operational practices across services, systems, and platforms.
- Influence system and operational design to improve reliability, availability, and resilience.
- Drive the reduction of operational toil through automation, AI, platform capabilities, and repeatable operational patterns.
- Improve end‑to‑end observability and system understanding, enabling teams to reason clearly about system behavior and failure modes; improve logging, metrics, tracing, and telemetry across systems.
- Enable teams to take end‑to‑end ownership of platform reliability, including deeper investigation across infrastructure and application layers.
- Partner closely with infrastructure and platform teams to ensure access, tooling, and visibility support full operational ownership and to drive reliability improvements.
- Act as a reliability advocate and technical advisor during operational reviews, incident learning, and platform evolution.
- Partner closely with DevOps teams to implement reliability and observability as code, ensuring integration with CI/CD pipelines and platform tooling.
Requirements
Education
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience.
Experience
- 10+ years of experience in Site Reliability Engineering, operating and improving large‑scale production environments.
- Demonstrated experience improving the reliability, availability, and scalability of production systems, platforms, and services.
- Hands‑on experience operating distributed systems in business‑critical and customer‑facing environments.
- Proven experience reducing manual operational work through automation and standardization.
- Experience defining and applying reliability standards (e.g., SLOs, error budgets) across multiple services or platforms.
- Demonstrated ability to influence technical direction across multiple teams without direct authority.
Core Skills
- Strong understanding of distributed systems, failure modes, and operational resilience.
- Solid experience with observability practices (metrics, logs, traces) and system diagnostics.
- Ability to analyze complex systems end‑to‑end across infrastructure, platform, and application layers.
Technical Leadership
- Strong systems thinking with a track record of addressing reliability issues through design rather than reactive intervention.
- Experience acting as a trusted technical advisor to senior engineers and leaders.
- Ability to clearly communicate complex reliability concepts to both technical and non‑technical stakeholders.
Certifications (Assets)
- Cloud platform: Microsoft Azure Solutions Architect Expert or DevOps Engineer Expert.
- Certifications related to reliability, operations, or systems engineering (e.g., Kubernetes, Linux, or observability platforms).
- Equivalent demonstrated expertise through hands‑on experience is acceptable in lieu of formal certifications.
Benefits
Culture & Environment
- A fast‑paced work environment that adapts to you.
- A friendly and diverse work culture with inclusion and equality at the heart of our actions.
- State‑of‑the‑art technology and tools.
- A results‑oriented culture where talent, action, and thinking outside the box are given due recognition.
Compensation & Perks
- Base salary ranging between $91,000 and $130,000 yearly.
- Annual salary review based on performance.
- Vacation time that considers your previous experience.
- Advanced paid hours to recharge your batteries (holidays and mobile days).
- Flexible benefits plan that adapts to your needs.
- Flexible savings fund option.
- Monthly home internet allowance.
Growth & Development
- A career path with opportunities to learn and grow.
- Proximity to your direct manager and open, honest communication to support your development.
- Multiple initial and on‑the‑job training opportunities and tools to track your progress and help you scale up in your career.
Community
- “Sherweblife” – a rich calendar of activities that allow us to gather virtually and face‑to‑face throughout the year.
Additional Information
- English Requirement: Sherweb has international customers and fluency in English is required to ensure proper service delivery. The main tasks involve written and oral communication with an English‑speaking clientele at all times.
- Pay Transparency: The salary range provided is an indication of expectations for this role. Final compensation will be tailored to reflect the specific qualifications and expertise of the chosen candidate, ensuring competitiveness and equity.