System Reliability Engineer

arculus

München · flexible · Full-time · Mid Level · Today

About the role

About us

At arculus, we design, build, and maintain cutting‑edge autonomous mobile robots and the software ecosystem around them. Our Development department brings together software, infrastructure, and product experts in a collaborative, international environment, focused on delivering reliable and high‑quality products that make a real difference in intralogistics.

Your Role

As a System Reliability Engineer, you will be responsible for ensuring the stability, performance, and scalability of our Automation Software platform. Your mission begins with a strong focus on the “Now”: building robust monitoring, automation, and operational practices that keep our systems reliable under real‑world conditions.

Operating at the intersection of software development and operations, you will proactively prevent incidents, optimize system behavior, and enable fast, reliable service delivery. By aligning reliability engineering with product and architectural goals, you will ensure our systems meet critical KPIs such as uptime, latency, and deployment velocity across the entire lifecycle.

Your Tasks & Responsibilities

  • Design and operate monitoring, alerting, and incident response systems to ensure high availability
  • Define and manage SLIs, SLOs, and SLAs; proactively mitigate reliability, performance, and capacity risks
  • Automate deployments, scaling, and operational workflows; implement infrastructure as code and self‑healing patterns
  • Optimize CI/CD pipelines for faster, safer, and more reliable releases
  • Lead or support incident response, root cause analysis, and post‑mortems; translate findings into preventive measures
  • Collaborate with architects, developers, and product teams to ensure scalable, reliable system design
  • Review system changes for operational, performance, and reliability impact
  • Support capacity planning, performance benchmarking, and scaling strategies
  • Contribute to security monitoring and ensure secure system operations
  • Drive continuous improvement in observability, reliability, and operational efficiency

Your Experience

  • 3+ years in Site Reliability Engineering, DevOps, or similar roles in production environments
  • Proven experience improving system reliability, reducing downtime, and enhancing deployment processes
  • Strong expertise in cloud platforms (AWS, GCP, Azure) and Kubernetes
  • Hands‑on experience with observability tools (Prometheus, Grafana, ELK stack)
  • Solid scripting and automation skills (e.g., Python, Bash)
  • Experience operating and scaling distributed systems in large production environments
  • Familiarity with CI/CD pipelines, infrastructure as code, and modern DevOps practices

Who You Are

  • Passionate about building reliable, scalable, and observable systems
  • Strong communicator, able to collaborate effectively across engineering, product, and operations teams
  • Proactive and solution‑oriented, with a strong sense of ownership and accountability
  • Analytical and structured thinker with a focus on continuous improvement
  • Comfortable working in fast‑paced, complex environments with evolving system landscapes
  • Motivated to ensure technical excellence translates into stable and high‑performing real‑world systems

Why arculus

  • We are a diverse, global team of 100+ creative thinkers, algorithmic brains, makers, movers, and shakers.
  • Our approach comes from a continuous cycle: assemble, weld, code, test, deploy or delete, and repeat. That is how we deliver innovative solutions to tackle the biggest intralogistics challenges.
  • Our tech space is in the east of Munich on the modern Neue Balan campus, with state‑of‑the‑art meeting rooms, a fully equipped electronics lab, a spacious robotics testing area, and various social spaces.
  • We are more than just a workplace: we are a community. Activities include hiking trips, running events, ping‑pong tournaments, and quiz nights.
  • Competitive salaries and benefits such as EGYM Wellpass, language courses, Jobrad, and flexible working hours.
  • Relocation and visa support are provided for candidates moving to join our team.

About the Company

arculus is a part of Jungheinrich and independently develops high‑end mobile robots and software products for intralogistics automation. From mechanics to electronics and code – our engineering powerhouse has it all. We combine the speed and creativity of an agile tech company with the strength of a leading global intralogistics player. Collaboration, innovation, and continuous learning: that is how we achieve an open‑minded and fast‑paced working culture.

Committed to Diversity and Inclusion

We are an equal opportunity employer and highly value diversity and inclusivity, which we see as strengths. While we are making progress, we are not yet where we want to be. Still, we believe in the power of a diverse workforce and welcome applicants of all genders, ethnicities, ages, national origins, sexual orientations, cultures, and educational backgrounds. Our goal is to create a work culture where everyone feels equally heard and included.

Benefits

EGYM Wellpass, language courses, Jobrad, flexible working hours, relocation support, visa support

Skills

AWS, Azure, Bash, CI/CD, ELK stack, GCP, Grafana, Kubernetes, Prometheus, Python
