Senior or Lead SRE w/Java
WorkNovas LLC
About the role
Position Description
The Senior/Lead Site Reliability Engineer designs, enhances, and operates highly reliable, scalable, and observable production systems in an Azure-based environment. This role blends software engineering with systems administration to build resilient infrastructure, automate operations, and improve system performance. The engineer applies strong engineering principles to operational challenges with a focus on reliability, automation, observability, and continuous improvement.
Core responsibilities include engineering-led incident response, implementing permanent corrective actions, reducing operational toil, and proactively preventing failures. The role contributes code fixes, owns Dynatrace-based observability, and delivers custom reliability and operational reporting to improve system health and availability. Participation in a scheduled on-call rotation is required.
Minimum Requirements
- Four-year degree in Computer Science, Information Systems, or Engineering, or equivalent relevant experience. (Degree, university, and year must be on the resume.)
- 8+ years of site reliability engineering experience.
Advanced SRE Leadership Responsibilities
- Provide technical leadership for SRE practices across multiple services or platforms.
- Define and evolve reliability standards, operational best practices, and incident response frameworks.
- Influence system architecture and design decisions to ensure scalability, resilience, and operability.
- Serve as a subject matter expert for reliability, availability, and production risk management.
- Act as the lead escalation point for complex and business-critical production incidents.
- Lead high-severity incident response, coordinating across engineering, platform, and security teams.
- Drive blameless post-incident reviews and ensure corrective actions are prioritized and completed.
- Improve on-call processes, escalation models, and incident response effectiveness.
- Own the strategy and implementation of Dynatrace-based observability, including dashboards and alerting standards.
- Establish and monitor reliability signals (availability, latency, error rates) across critical systems.
- Identify reliability risks and lead mitigation initiatives before customer impact occurs.
- Define and maintain leadership-level reliability and operational reporting.
- Use production data to drive prioritization of reliability investments and operational improvements.
- Communicate reliability posture, risks, and recommendations to senior engineering leadership.
- Mentor and guide senior and mid-level SREs and production support engineers.
- Support hiring, onboarding, and technical evaluation of SRE talent.
- Collaborate with squad members to define iteration plans and commitments.
- Ensure compliance with HIPAA and other security regulations.
Critical Skills
- Strong experience with monitoring and observability tools (Dynatrace experience is a plus).
- Hands-on experience with GitHub Actions for CI/CD automation.
- Proficiency in Docker and Kubernetes for containerization and container orchestration.
- Familiarity with Azure cloud services.
- Experience with Ansible.
- Demonstrated experience in automation of infrastructure and operational processes using scripting or configuration management tools.
- Experience making Java application changes (fixing production bugs; adding resiliency, error handling, or safeguards).
- Experience making SQL and database changes (schema updates or migrations; indexing or query optimization; rolling changes out safely in production).
- Knowledge of SRE principles (SLIs, SLOs, error budgets).
- Automate repetitive operational work using Ansible, Python, Bash, or similar tools.