SRE Engineer

Diverse Lynx

Carrollton · On-site · Full-time · Senior · 1w ago

About the role

Job Title

Senior Site Reliability Engineer

Location

West Lake, CA / Carrollton, TX (Onsite)

Job Type

Full Time

Must Have Technical/Functional Skills

  • 5-7 years of professional experience in a Site Reliability, DevOps, or Systems Engineering role.
  • 3-5 years of hands‑on experience managing production workloads in a cloud environment, preferably AWS.
  • Proven experience acting in an L2/L3 support capacity, with strong diagnostic and troubleshooting skills.
  • Technical Skills (Must Have)
    • Cloud Expertise: Deep understanding and hands‑on experience with the AWS ecosystem (EC2, S3, RDS, Lambda, VPC, IAM, CloudWatch).
    • Infrastructure as Code (IaC): Strong proficiency with tools like AWS CDK (preferred), Terraform, or CloudFormation.
    • Scripting & Automation: Proficiency in at least one scripting language such as Python, Bash, or NodeJS for automation and tooling.
    • Monitoring & Observability: Hands‑on experience with modern monitoring, logging, and tracing tools (e.g., New Relic (preferred), Datadog, Prometheus, Grafana, ELK Stack).
    • Containerization: Experience with Docker and container orchestration systems (e.g., Kubernetes, ECS).
  • General Skills
    • Excellent analytical, troubleshooting, and complex problem‑solving skills with a methodical approach.
    • A calm and focused demeanor during high‑pressure incidents.
    • Strong verbal and written communication skills, with the ability to explain complex technical concepts to diverse audiences.
    • Highly attentive to detail, organized, and capable of prioritizing effectively in a dynamic environment.
    • A collaborative mindset and the ability to work effectively both independently and as part of a team.

Preferred Skills & Qualifications

  • Domain knowledge in FinTech or the Mortgage industry.
  • Experience with the AWS Serverless stack (Lambda, API Gateway, SQS, SNS, DynamoDB).
  • Familiarity with application development environments (e.g., NodeJS, TypeScript, Python) to facilitate effective troubleshooting and collaboration with development teams.
  • Experience with relational databases (Postgres) and NoSQL databases.
  • Experience working within an Agile/Scrum development process using Jira.

Roles & Responsibilities

  • Incident Response & L2/L3 Support: Serve as a primary escalation point for complex production incidents. Lead troubleshooting efforts, perform deep‑dive root cause analysis (RCA), and work with Product Engineering teams to implement permanent solutions to prevent recurrence.
  • Monitoring & Observability: Develop and manage comprehensive monitoring and alerting solutions using tools like Datadog, CloudWatch, or similar. Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure system health.
  • Collaboration & Architectural Input: Partner closely with backend development teams, conducting Production Readiness Reviews and influencing the design of new services to ensure they meet SLOs and are built for reliability, observability, and operational excellence from the start. Advocate for SRE best practices across the engineering organization.
  • Problem Management: Analyze incident trends and system metrics to identify underlying problems. Develop and execute long‑term solutions, including toil automation, software enhancements, and architectural improvements.
  • Runbook & Documentation: Create and maintain clear, concise documentation and runbooks to enable faster incident resolution and share operational knowledge across teams.

Equal Employment Opportunity

Diverse Lynx LLC is an Equal Employment Opportunity employer. All qualified applicants will receive due consideration for employment without any discrimination. All applicants will be evaluated solely on the basis of their ability, competence and their proven capability to perform the functions outlined in the corresponding role. We promote and support a diverse workforce across all levels in the company.

Reference

#J-18808-Ljbffr

Skills

AWS CDK, Bash, CloudWatch, Docker, Datadog, EC2, ECS, ELK Stack, IAM, Kubernetes, Lambda, New Relic, NodeJS, Postgres, Prometheus, Python, RDS, S3, SLI, SLO, Terraform, VPC
