
Senior Director, Engineering-AI Toolkit Team

Remotica

South Africa · On-site · Full-time · Senior · Yesterday

About the role

About Splunk

Splunk, a Cisco company, is building a safer and more resilient digital world with an end-to-end, full-stack platform made for a hybrid, multi-cloud world. Leading enterprises use our unified security and observability platform to keep their digital systems secure and reliable. Our customers love our technology, but it's our caring employees who make Splunk stand out as an amazing career destination. No matter where in the world or at what level of the organization, we approach our work with kindness. So bring your work experience, problem-solving skills and talent, of course, but also bring your joy, your passion and all the things that make you, you. Come help organizations be their best, while you reach new heights with a team that has your back.

Role:

We are looking for an experienced and visionary Senior Director of Engineering to lead our AI Toolkit Engineering team. This role is instrumental in shaping and scaling the toolchains and infrastructure that power AI capabilities for both our customers and internal teams. You will drive the strategy, development, and operations of the AI toolkits and platform, and ensure the reliability and performance of our AI service infrastructure in the cloud and on premises.

As a senior leader, you will work closely with AI product, engineering and platform teams to deliver developer-first tooling, robust infrastructure, and high-availability services. This is a critical role in our AI organization, reporting directly to the VP of AI.

On a day-to-day basis, you'll spend your time guiding engineering leads through key architectural decisions, collaborating with cross-functional partners to align on roadmap priorities, and diving into service reliability or infrastructure scalability discussions. You'll review high-impact design documents, help unblock critical technical challenges, and ensure smooth operation of our production AI services. You'll also coach and develop engineering managers, foster a culture of technical excellence, and ensure the team delivers with both speed and quality.

Responsibilities:

The responsibilities of this role include:

  • Lead the end-to-end development of AI toolkits that accelerate model development, deployment, and monitoring for internal and external users
  • Oversee the operations and reliability of AI services in production, including model serving, inference infrastructure, and pipeline orchestration
  • Drive the engineering strategy, architectural decisions, and execution roadmap for AI tooling and infrastructure
  • Collaborate with cross-functional stakeholders including AI/ML product, engineering, data engineering, security, and DevOps teams
  • Hire, grow, and mentor a high-performing team of engineering managers and technical leads
  • Set high standards for engineering excellence, observability, and operational efficiency
  • Own service-level objectives (SLOs), performance, cost-efficiency, and uptime of AI infrastructure in the cloud
  • Stay ahead of trends in AI tooling, MLOps, and infrastructure to inform strategic investments and improvements

Requirements:

The ideal candidate should meet the following requirements:

  • 12+ years of software engineering experience with 5+ years in engineering leadership roles, preferably in AI/ML or infrastructure domains
  • Deep experience leading platform or infrastructure teams building developer tools, SDKs, and/or distributed systems
  • Proven track record of operating and scaling cloud-based services (e.g., AWS, GCP, or Azure)
  • Strong familiarity with modern AI/ML lifecycle tooling and infrastructure, such as:
    • Model development: PyTorch, TensorFlow, Hugging Face, LangChain
    • Experiment tracking: MLflow, Weights & Biases
    • Model serving & inference: Triton Inference Server, TorchServe, Ray Serve, KServe
    • Pipeline orchestration: Airflow, Argo Workflows, Kubeflow, Metaflow
    • Containerization & orchestration: Docker, Kubernetes, Helm
    • Observability & monitoring: Prometheus, Grafana, OpenTelemetry
  • Strong technical foundation in distributed systems, CI/CD pipelines, and service reliability engineering
  • Excellent people leadership, communication, and cross-functional collaboration skills
  • Experience operating in fast-paced environments, with the ability to make high-impact decisions under ambiguity

These additional skills and experiences are preferred but not required:

  • Experience working on AI platform, MLOps, or developer-experience teams
  • Hands-on experience with large language model (LLM) infrastructure and serving at scale
  • Background in security, compliance, or data privacy for AI infrastructure
  • Contributions to open source AI infrastructure or tools

Skills

Airflow, Argo Workflows, AWS, Azure, CI/CD, Docker, GCP, Grafana, Helm, Hugging Face, KServe, Kubeflow, Kubernetes, LangChain, Metaflow, MLflow, OpenTelemetry, Prometheus, PyTorch, Ray Serve, TensorFlow, TorchServe, Triton Inference Server, Weights & Biases
