Site Reliability Engineer
Beyond-ED
Remote (Global), Senior, posted today
About the role
We’re hiring a Senior Site Reliability Engineer to build and scale the reliability backbone of a leading GPU-powered platform.
Job Requirements
- Degree in Computer Science or a related discipline, or equivalent practical experience or solid proof of expertise.
- 4+ years of software development experience in one or more languages (Go ideal; Rust or Python also relevant).
- 4+ years designing, analyzing, and troubleshooting distributed systems and production services.
- Proficiency in debugging, profiling, and performance tuning of large-scale Linux systems.
- Experience with Kubernetes (or similar schedulers), containerized services, and IaC (Terraform/Pulumi/CloudFormation).
- Experience with observability (metrics, logs, traces), progressive delivery (canary/blue-green), and incident management.
- Track record of OSS contributions.
- Linux internals, networking, and kernel/perf tooling.
- Exposure to hypervisors (KVM/) or virtual machine introspection concepts.
- Knowledge of GPU architectures and CUDA programming.
- Cybersecurity experience (runtime security, hardening, secrets management).
- Building distributed systems on Kubernetes and high-throughput data pipelines (e.g., Kafka/Redpanda/Fluent Bit).
- Experience with multi-cloud operations, cost/perf optimization, and compliance-minded engineering.
Responsibilities
- Build and maintain systems that keep the platform stable, fast, and always available
- Automate repetitive operational tasks to reduce manual work and human error
- Monitor system performance and set clear reliability targets (uptime, response time, etc.)
- Detect issues early and respond quickly to incidents to minimize downtime
- Work closely with engineering teams to improve system design, scalability, and efficiency
- Optimize infrastructure performance and cost across cloud environments
- Improve deployment processes to make releases safer and smoother
- Contribute to building internal tools that help teams operate systems more efficiently
- Continuously enhance system reliability, performance, and security
Preferred
- Developers and volunteers contributing to open-source libraries related to Linux environments
Candidate Background
- Computing background only
Location
- Fully Remote
Job Level
- Senior
Talent Country
- Egypt
Technologies
- Python, GoLang, Linux, Terraform, Rust, kernel, Cloud Architecture, DevOps, Backend, Kubernetes, Security, SRE
Skills
- Cloud Architecture, CUDA, DevOps, Fluent Bit, Go, GoLang, GPU, IaC, Kafka, kernel, Kubernetes, Linux, Networking, Pulumi, Python, Redpanda, Rust, Security, SRE, Terraform