Site Reliability Engineer (SRE)

Blankfactor

Hackensack · On-site Full-time Mid Level $100k – $125k/yr 1mo ago

About the role

About

As a Site Reliability Engineer, you will ensure the reliability, availability, and performance of mission-critical platforms by building scalable systems, robust automation, and data-driven operations. You will partner closely with development, cloud, infrastructure, and security teams to deliver resilient, high-performing services that support the way people live and work today.

What You’ll Do

Design and implement solutions that enhance application reliability, performance, scalability, and resilience.
Build and maintain monitoring, alerting, observability, and telemetry to drive proactive detection and rapid incident response.
Lead incident management efforts, perform root cause analysis, and implement action-oriented post-mortem improvements.
Automate operational workflows using scripting, IaC, and configuration management tools.
Analyze capacity, performance, and usage trends to forecast demand and optimize
Collaborate with engineering teams to embed operability, resilience, and security into application and architecture designs.
Support safe, reliable deployments through CI/CD pipelines, release governance, and change control.
Maintain clear runbooks, architecture diagrams, and operational documentation that enable efficient production support.

Experience

Required:

Managing Kubernetes and containerized workloads (EKS, AKS, GKE), including scaling, networking, upgrades, and orchestration.
Experience in public cloud platforms (AWS, Azure, or GCP) across compute, storage, networking, IAM, and cost governance.
Using observability and APM tools such as Dynatrace, Splunk, Prometheus, and Grafana.
Implementing security and compliance controls in regulated environments (e.g., PCI DSS, SOC 2), including secrets management and vulnerability remediation.
Infrastructure as Code experience using Terraform, CloudFormation, Ansible, or similar tools.
Designing and maintaining CI/CD pipelines using Jenkins, GitLab CI, or GitHub Actions.
Scripting and automation using Bash, PowerShell, or Python.
Equivalent combination of education, experience, and/or military background.
Key requirement: experience on high-volume transaction projects where zero data loss is a must, primarily in banking and payments.

Good to Have

Certifications such as AWS SysOps Administrator, AWS DevOps Engineer, Google Cloud DevOps Engineer, or CKA.
Experience with Premier applications, IBM iSeries, and/or Unisys systems.
Hands-on database operations and performance tuning (Oracle, SQL Server).
Proven experience in major incident command and stakeholder communication.
Experience with ITIL and ServiceNow (change, problem, and configuration).

Skills

Ansible, AWS, Azure, Bash, CloudFormation, Dynatrace, GCP, GitHub Actions, GitLab CI, Grafana, IBM iSeries, Jenkins, Kubernetes, Oracle, PowerShell, Prometheus, Python, SQL Server, Splunk, Terraform
