Site Reliability Engineer (f/m/d)

IONOS SE

Hybrid, Full-time, Mid Level

About Us

IONOS is the leading European digitalization partner for small and medium-sized enterprises (SMEs). IONOS has over six million customers and operates with a globally available platform in 18 markets in Europe and North America. With its Web Presence & Productivity offerings, the company acts as a "one-stop shop" for all digitalization needs – from domains and web hosting to classic website builders and do-it-yourself solutions, from e-commerce to online marketing tools. Additionally, IONOS offers cloud solutions for companies looking to move to the cloud as part of their business development.

Your Role

We are looking for a highly qualified and experienced Site Reliability Engineer to support our team in 24/7 shift operations. The L2 SRE department operates all IONOS Cloud IaaS and PaaS services. As a Site Reliability Engineer, you will be responsible for the stability, security, and performance of our complex, distributed systems. You will work closely with development teams to design, implement, and operate scalable and reliable infrastructure, and to automate and optimize processes.

Responsibilities

Technical Level 2 support with direct customer contact.
Maintenance of monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, Loki) for proactive problem detection in shift operations and participation in solving complex issues in distributed systems.
Troubleshooting networks (LAN/WAN/VPN, DNS, DHCP) and storage systems (File/Object/Block); provision and operation of highly available services on Linux and Kubernetes (Helm charts).
Setup and maintenance of Infrastructure-as-Code, automation, and playbooks with Ansible, Terraform, GitLab CI/CD, Argo CD, as well as scripting languages like Bash, Python, and Go.
Collaboration with development teams to improve processes and deployments, and to ensure smooth integration of new services and applications into our cloud and Kubernetes environment.
Ensuring stable and secure platform operation, including end-to-end incident management from initial analysis through resolution to follow-up within problem management.

Qualifications

Willingness to work in a 24/7 shift model (night, weekend, and holiday shifts), combined with a strong problem-solving and troubleshooting mindset.
Several years of experience as a Site Reliability Engineer or in a related role (Linux System Administrator, Platform Engineer, DevOps/Infrastructure Engineer, Full-Stack Developer).
In-depth knowledge of automation tools (e.g., Ansible, SaltStack), monitoring and observability tools (Prometheus, Grafana, Loki), and logging and alerting solutions (ELK Stack).
Experience with virtualized environments (QEMU/KVM, OpenStack, Proxmox) and cloud storage technologies (File, Object, Block), and proficiency with Docker and Kubernetes.
Very good knowledge of at least one programming or scripting language (Go, Python, Bash) for automation and monitoring tasks.
Experience in code management (merge conflicts, feature branches, merge requests, CI/CD) is an advantage.

Nice-to-have:

Experience with RDMA, InfiniBand, and RoCE protocols.
In-depth knowledge of Linux MD RAID (mdadm, sedadm) and LVM.
Expertise in Linux performance tuning and network stack debugging (ethtool, perf, tcpdump, ibstat, ibtop).
Practical experience with S3, Ceph, and software-defined networks.
Experience with established software development practices (code reviews, build processes, packaging, testing).

Language Skills

Fluent in German and English (at least B2 according to CEFR standard).

Location

Berlin

Note

At the end of the application process, candidates must undergo a security check. Your consent will be requested in due time during the process.

Benefits

Hybrid work model
Shift-based working hours
Subsidized canteen and various free drinks at some locations
Modern office spaces with very good public transport connections
Various employee discounts for activities and products
Employee events such as summer and winter parties, as well as workshops
Numerous further training and development opportunities
Various health offers, such as sports and health courses

Skills

Ansible, Argo CD, Bash, Ceph, CI/CD, Docker, ELK Stack, GitLab CI/CD, Go, Grafana, Helm, InfiniBand, KVM, Kubernetes, Linux, Loki, MD RAID, Monitoring, OpenStack, Prometheus, Proxmox, Python, QEMU, RoCE, SaltStack, S3, Terraform, Virtualization
