Associate Director, Application Support Engineering / SRE Lead (Site Reliability Engineer Lead)
The Depository Trust & Clearing Corporation (DTCC)
About the role
About DTCC
DTCC is at the forefront of innovation in the financial markets, committed to helping employees grow and succeed. The company fosters a thriving internal community and strives to create a workplace that reflects the diversity of the world it serves.
The Information Technology group delivers secure, reliable technology solutions that enable DTCC to be the trusted infrastructure of the global capital markets. The team develops essential infrastructure capabilities, implements data standards and governance, and provides high‑quality information to meet client needs.
Pay and Benefits
- Competitive compensation, including base pay and annual incentive
- Comprehensive health and life insurance and well‑being benefits (based on location)
- Pension / retirement benefits
- Paid Time Off, personal/family care, and other leaves of absence to support physical, financial, and emotional well‑being
- Flexible/hybrid work model: 3 days onsite (Tuesdays, Wednesdays, and a third day unique to each team or employee) and 2 days remote
Impact of the Role
The Enterprise Application Support (EAS) team provides technical application support for ITP and ECS lines of business. As the Associate Director, Application Support Engineering / SRE Lead, you will:
- Drive reliability, scalability, and performance of critical systems
- Implement standard methodologies, participate in incident response, and automate processes
- Collaborate with development, infrastructure, network, and security teams, Scrum Masters, and internal/external clients to improve observability, resiliency, and mean time to restore service
- Promote a strong Site Reliability Engineering (SRE) culture across the organization
Primary Responsibilities
- Scrum Participation: Join planning, design sessions, sprint zero, and stand‑ups for new deliveries; champion non‑functional requirements (NFRs) focused on observability and resiliency.
- System Reliability Architecture: Design and implement reliable, resilient, and scalable systems; recommend redundancy, fault tolerance, and disaster recovery strategies; create recovery runbooks.
- Monitoring and Alerting: Develop comprehensive monitoring systems, define actionable alerts, and establish Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Incident Management: Lead incident response during critical outages, conduct post‑mortem analyses, and implement preventive measures.
- Automation and Tooling: Build and maintain automation scripts for self‑healing, deployments, scaling, and infrastructure management.
- Collaboration with Development Teams: Integrate SRE practices into the software development lifecycle, promoting code quality, reliability, and observability.
- Security Integration: Work with security teams to ensure system resilience against cyber threats and supervise vulnerability management.
- Technical Expertise: Stay current on emerging technologies and industry trends related to cloud computing, distributed systems, and reliability engineering.
- Operational Readiness: Present operational readiness at project management meetings, raise operational risks, and test NFRs in UAT environments.
- Risk Management: Partner with IT Embedded Risk Managers to identify strategic solutions for risk incidents.
- Metrics and Reporting: Demonstrate operational improvements through defined Key Performance Indicators (KPIs).
- Capacity Planning: Assess system capacity needs, plan for growth, and implement scaling strategies.
- Performance Optimization: Analyze performance metrics, identify bottlenecks, and implement optimization strategies.
Qualifications
- Minimum of 8 years of related experience
- Bachelor’s degree preferred, or equivalent experience
Talents Needed for Success
- Strong Programming Skills: Proficiency in one or more languages (e.g., Python, Java, Go) for automation and monitoring tool development.
- System Administration: Expertise in Linux/Unix, network administration, and cloud platforms (AWS, Azure, GCP); mainframe experience is a plus.
- Monitoring and Observability: Deep understanding of tools such as Splunk, Dynatrace, and ITSI, plus experience designing robust monitoring systems.
- Incident Management: Proven ability to participate in incident response teams under pressure and solve complex issues.
Equal Opportunity Statement
DTCC is an equal opportunity employer that values diversity. We do not discriminate based on race, religion, color, national origin, sex, gender identity or expression, sexual orientation, age, marital status, veteran status, or disability status. Reasonable accommodations are available for individuals with disabilities throughout the application and interview process.
Requirements
- Minimum of 8 years of related experience
- Proficiency in one or more programming languages (e.g., Python, Java, Go) for automation and development of monitoring tools.
- Expertise in Linux/Unix operating systems, network administration, and cloud platforms (AWS, Azure, GCP).
- Deep understanding of monitoring tools (Splunk, Dynatrace, ITSI, etc.) and experience in designing robust monitoring systems.
- Proven track record of participating in incident response teams under pressure and effectively solving complex issues.
Responsibilities
- Join all project collaborators' planning and design sessions, sprint zero, and stand-ups for all new deliveries to champion NFRs that reflect strong observability and resiliency.
- Drive the design and help implement reliable, resilient, and scalable systems, considering redundancy, fault tolerance, and disaster recovery strategies.
- Make design recommendations that allow the application to recover without cleanup activities, or create a recovery runbook for the application support team to follow to improve application recovery times.
- Develop comprehensive monitoring systems to identify potential issues proactively, define actionable alerts, and establish SLIs (Service Level Indicators) and SLOs (Service Level Objectives).
- Lead incident response during critical system outages, facilitate timely problem diagnosis and resolution, and conduct post-mortem analyses to identify root causes and prevent recurrence.
- Develop and maintain automation scripts to streamline operational tasks, including self-healing, application deployments, scaling, and infrastructure management.
- Work closely with development teams to integrate SRE practices into the software development lifecycle, promoting code quality, reliability, and observability.
- Collaborate with security teams to ensure system resilience against cyber threats, implementing security best practices and monitoring for vulnerabilities.
- Stay updated on emerging technologies and industry trends related to cloud computing, distributed systems, and reliability engineering.
- Attend each project management meeting with application support (EAS L2) to present operational readiness and raise any operational risks and concerns.
- Test NFRs in UAT environments to validate effectiveness and completeness of operational capabilities.
- Partner with IT Embedded Risk Managers to identify strategic solutions for risk incidents.
- Demonstrate operational improvements through defined KPIs.
- Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under high load.
- Analyze system performance metrics to identify bottlenecks and implement optimization strategies to improve system responsiveness and efficiency.