Database Reliability Engineer - Core Team

ClickHouse

Remote (Global) | Lead | Posted yesterday

About the role

About ClickHouse

Recognized on the 2025 Forbes Cloud 100 list, ClickHouse is one of the most innovative and fast-growing private cloud companies. With more than 3,000 customers and ARR that has grown over 250 percent year over year, ClickHouse leads the market in real-time analytics, data warehousing, observability, and AI workloads.
The company’s sustained, accelerating momentum was recently validated by a $400M Series D financing round. Over the past three months, customers including Capital One, Lovable, Decagon, Polymarket, and Airwallex have adopted the platform or expanded existing deployments. These customers join an established base of AI innovators and global brands such as Meta, Cursor, Sony, and Tesla.

We’re on a mission to transform how companies use data. Come be a part of our journey!

Location

Note: This position can be based remotely in the United Kingdom, Germany, or the Netherlands.

Role Overview

At ClickHouse, we are committed to providing our customers with reliable and secure services. To continue doing so, we are building out our Site Reliability Engineering team in ClickHouse Core. As one of the first members of the Reliability Engineering team at Core, you will be responsible for building and leading processes that ensure and improve the reliability, availability, scalability, and performance of ClickHouse. You will collaborate with teams such as Control Plane, Dataplane, Security, Support, and Operations, and guide them in implementing ClickHouse in the best way for our customers. You will also own engineering escalation management and response, investigations, post-mortem analysis (including running blameless postmortems), and continuous improvement of how ClickHouse is run and optimized in the cloud. This role is a unique opportunity to make a significant impact on elastic, high-performance ClickHouse at limitless scale in ClickHouse Cloud.

Responsibilities

  • Continuously improve the reliability and performance of ClickHouse core.
  • Create and improve metrics and alerts for ClickHouse that identify and prevent production problems before they affect customers.
  • Dig deeper into the most common problems customers encounter in ClickHouse Core to identify their root causes, submit bug fixes and issue reports, and suggest improvements.
  • Enhance and refine incident response processes and post-mortem analysis for ClickHouse core-related outages, including working with Support and Cloud teams to communicate with impacted customers.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize customer impact.

About You

  • Bachelor’s or Master’s degree in Computer Science or a related field.
  • At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering.
  • Previous experience operating ClickHouse or other SQL databases in production.
  • Excellent understanding of distributed database internals and SQL; ClickHouse expertise in particular is a major plus.
  • Scripting experience with Shell or Python, and the ability to read and understand C++ code.
  • Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
  • You are a strong problem-solver and have solid production debugging skills.
  • You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward.
  • You have a high level of responsibility, ownership, and accountability.
  • Excellent communication skills.

Compensation

For roles based in the United States, the typical starting salary range for this position is listed above. In certain locations, such as the San Francisco Bay Area and the New York City Metro Area, a premium market range may apply, as listed.
These salary ranges reflect what we reasonably and in good faith believe to be the minimum and maximum pay for this role at the time of posting. The actual compensation may be higher or lower than the amounts listed, and the ranges may be subject to future adjustments.

An individual’s placement within the range will depend on various factors, including (but not limited to) education, qualifications, certifications, experience, skills, location, performance, and the needs of the business or organization.

If you have any questions or comments about compensation as a candidate, please get in touch with us at paytransparency@clickhouse.com.

Skills

AWS, Azure, C++, ClickHouse, Google Cloud Platform, Python, Shell
