Senior Observability Platform Engineer

ATOSS Software SE

Flexible | Full-time | Senior | Posted 2 months ago

About Us

ATOSS Software SE is one of Germany’s most successful tech growth stories. As the market leader in Workforce Management Software, we help companies work more intelligently, creatively, and humanely, optimizing the balance between profitability and people.

We’re a rare company: according to Handelsblatt (10/24), just 309 public companies worldwide achieved over 20% return on sales for ten consecutive years. Only two are based in Germany, and ATOSS is one of them.

With 19 years of record-breaking growth, a market cap of over €2 billion, and listings in the SDAX and TecDAX, we’re scaling globally.

If you’re ready to drive impact in a high-performing B2B SaaS environment, this is your chance to elevate your career.

The Person You Are

At ATOSS, we hire for both character and skill, seeking individuals who embody resilience, a pioneering spirit, and the passion to grow.

We Value Those Who

Think like entrepreneurs – taking ownership, pushing boundaries, and driving impact.
Challenge the status quo – bringing fresh ideas and bold execution to the table.
Thrive in change – seeing growth as a lifelong journey, both professionally and personally.

The Role

As a Senior Observability Platform Engineer, you are a key member of our observability team.

You design, build, and operate our observability platform (Grafana, Loki, Tempo, Prometheus/Mimir on Kubernetes) and enable product, platform, and AI teams to monitor and improve the services they own.

This is a hands-on engineering role focused on platform reliability, standards, automation, and cross-team collaboration. You will work closely with other platform functions and product lines and help shape our observability strategy and roadmap.

Key Responsibilities

Product ownership & roadmap

Analyze and capture requirements across product lines and support stakeholders.
Translate requirements into clear priorities, user stories, and acceptance criteria.
Communicate progress and upcoming roadmap items to stakeholders through demos, updates, and KPI reporting.
Collaborate with R&D leads and architecture to align the observability roadmap with broader platform and product goals.

Platform engineering & operations

Operate, scale, and upgrade the central observability stack (Grafana, Loki, Tempo, Prometheus/Mimir on Kubernetes) across multiple environments and cloud providers.
Automate routine operations (provisioning, configuration, housekeeping, capacity checks) to reduce manual work and improve reliability.
Evaluate and adopt modern observability technologies (OpenTelemetry, distributed tracing, anomaly detection, AI-assisted insights) that fit the overall platform architecture.

Standards, instrumentation & enablement

Define and evolve standards for metrics, logs, traces, events, dashboards, alerts, retention, and labeling.
Provide reusable templates, reference dashboards, and alert patterns so teams can efficiently build and maintain their own observability content.
Build and improve self-service capabilities (data sources, folder structures, onboarding flows, RBAC patterns) so teams can use the platform without heavy manual support.
Enable and advise teams on instrumentation and observability design.

Reliability, KPIs & incident support

Ensure the observability platform itself is reliable, performant, and cost-efficient.
Define and track KPIs/SLIs for the observability platform (availability, performance, cost, adoption) and continuously improve them.
Support incident detection and response by ensuring the right signals (metrics, logs, traces) and views are available to teams.
Collaborate after major incidents to identify observability gaps and feed improvements back into standards, templates, and the roadmap.
Support KPI measurement capabilities across teams (incident detection, response efficiency, observability coverage).

Security, compliance & cost management

Ensure privacy, compliance, and security requirements are met for observability data (RBAC, tenant isolation, least-privilege access, data minimization).
Define guardrails for data volume, cardinality, and retention to keep the platform performant and cost-effective.
Work with Security, Compliance, and Data Protection to align telemetry practices with regulatory and contractual requirements.

AI observability

Operate and evolve AI observability components as part of the central observability platform.
Define integration patterns, standards, and example configurations so AI teams can instrument their models, prompts, and pipelines in a consistent way.
Ensure AI observability tooling is reliable, secure, and cost-efficient, and integrates well with the rest of the observability stack.
Enable AI/R&D teams through guidance and templates, while they remain responsible for defining their own metrics, dashboards, alerts, and evaluations.

Use of AI to increase efficiency

Apply AI features on top of observability data (where appropriate and technically feasible) to reduce manual work.
Evaluate and introduce AI-assisted diagnostics as optional helpers for incident and operations teams.
Collaborate with Security, Compliance, and Data Protection when using AI on observability data, ensuring governance, access control, and data protection requirements are met.

Key Requirements

Background as an Observability / SRE / Platform / Infrastructure Engineer in cloud-native environments.
Deep understanding of metrics, logs, traces, and alerting for distributed systems.
Familiarity with Kubernetes, microservices, and modern instrumentation (including OpenTelemetry).
Strong experience with Prometheus and the Grafana stack (Grafana, Loki, Tempo, Prometheus/Mimir) in production at scale; exposure to Langfuse or ClickHouse is a strong plus.
Experience with observability on at least one major cloud hyperscaler and its native services.
Proven experience designing and rolling out observability solutions used by multiple teams, including standards, templates, and best practices.
Ability to work closely with development/engineering teams, understand their needs, and turn them into platform features, standards, and guidance.
Strong stakeholder communication skills across engineering and operations, with a pragmatic, results-focused mindset.
Experience operating a multi-tenant observability platform in a SaaS context.
Knowledge of regulatory and compliance requirements affecting telemetry and logging.

Our Benefits

Competitive Rewards: Including profit-sharing and an employee stock program.
Structured Onboarding & Continuous Leadership Development: Structured onboarding and clear career paths through Expert & Leadership Tracks, plus access to ATOSS Academy.
Flexible Work Culture: Hybrid options (remote within the EU), 30 days of vacation, and a strong commitment to diversity & inclusion.
Engaging Team Environment: Seasonal company events, team retreats, and an in-house barista.
Health & Wellbeing: Including regular check-ups, corporate wellness programs, and Wellhub membership.
Stability & Growth: Company listed on SDAX & TecDAX, with 19+ years of record-breaking revenue and a 30%+ EBIT margin. Certified Top Employer© for the 5th year in a row.

At ATOSS, great talent knows no limits. We welcome professionals from all backgrounds and empower their growth through an inclusive, skill-focused environment.

Join us and be part of a high-growth, future-focused company!

Skills

AI, AI-assisted diagnostics, ClickHouse, Grafana, Kubernetes, Langfuse, Loki, microservices, OpenTelemetry, Prometheus, SaaS, Tempo, distributed tracing
