Cloud Engineer Observability SRE

FUSTIS LLC

San Francisco · Hybrid · Contract · Mid Level · $60 – $65/hr · Posted 2w ago

About the role

The Grade 10 Cloud Engineer within the Customer’s Cloud Collaboration Technology Group will play a key role in building and operating scalable observability and infrastructure platforms supporting Webex microservices. This role requires strong hands-on expertise in Kubernetes, cloud infrastructure, and observability systems, along with the ability to operate independently and to own components end-to-end in production environments. Candidates will demonstrate extensive use of generative AI tools for code generation and production system troubleshooting.

Key Responsibilities

  • Design, develop, and operate observability platforms (logging, metrics, and/or tracing) for Webex microservices.
  • Manage and optimize Kubernetes clusters across multi-region environments.
  • Own CI/CD pipelines using Argo CD and Helm.
  • Implement Infrastructure as Code (IaC) using Terraform on AWS.
  • Operate monitoring ecosystems, including but not limited to:
    • OpenSearch/ELK,
    • Prometheus,
    • Grafana,
    • Splunk, and
    • Kafka.
  • Build automation to detect and remediate production issues.
  • Ensure security compliance through vulnerability patching.
  • Collaborate cross-functionally to improve reliability.
  • Participate in on-call rotations and incident response.
  • Contribute to distributed system design and operations.

Required Skills

General Abilities

  • Bachelor’s degree in computer science or related field

General Technical Skills

  • At least eight (≥8) years of experience in a DevOps and/or SRE platform engineering role
  • Incident response and on-call operations: Demonstrated experience in a 24/7 production environment, including but not limited to:
    • Triaging alerts
    • Leading incident response
    • Writing post-incident reviews
    • Maintaining SLA commitments across large-scale distributed systems
  • IaC and automation: Proficiency with Terraform, Ansible, and/or equivalent IaC tooling for provisioning and managing cloud infrastructure at scale on AWS
  • Scripting and development: Working proficiency in Python, Golang, and/or Bash for building automation scripts, operational tooling, and/or CI/CD pipeline integrations (e.g., Drone, GitHub Actions, Argo CD)

Specific Technical Skills

  • Kubernetes and container orchestration: Production experience operating and troubleshooting workloads on Kubernetes at large scale (i.e., hundreds of deployments and thousands of pods), including but not limited to:
    • Helm chart management
    • Pod scheduling
    • Resource tuning
    • Multi-cluster operations
  • Observability stack expertise: Hands-on experience with pipeline design, query optimization, and/or capacity planning for high-volume environments, in at least two (≥2) of the following:
    • OpenSearch/Elasticsearch
    • Prometheus/Mimir
    • Grafana
    • Loki
    • Splunk
    • Logstash

Desired Skills

  • Apache Kafka/AWS MSK: Experience in at least one (≥1) of the following:
    • Operating or tuning Kafka clusters at scale
    • Managing the following across high-throughput streaming pipelines:
      • Topic configurations,
      • ACLs,
      • Consumer lag, and/or
      • Schema registries
  • Splunk administration: Experience deploying, managing, and/or migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures, including but not limited to:
    • Forwarder management,
    • Search optimization,
    • Index lifecycle, and/or
    • Integration
  • OpenTelemetry and distributed tracing: Experience with deploying OpenTelemetry for data collection and application performance monitoring
  • Security frameworks and container hardening: Familiarity with at least one (≥1) of the following (for vulnerability remediation at scale):
    • Government or industry security certification standards; examples:
      • FedRAMP
      • STIG
      • IL5
      • ISO 27001
      • SOC 2
    • Container image hardening practices
    • Security scanning tools (e.g., Anchore, Grype)
  • AI-augmented operations: Experience using LLMs, AI coding assistants, and/or custom AI agents (e.g., MCP servers, Copilot, Claude) to:
    • Accelerate engineering workflows,
    • Automate runbooks, and/or
    • Assist with incident triage
  • Deployment pipelines (Argo CD/Helm bundles): Experience with at least one (≥1) of the following across multi-region clusters:
    • GitOps-style deployment workflows
    • Argo CD application management
    • Helm bundle patterns
    • Blue/green or canary release strategies
  • Cost optimization and capacity planning: Experience in at least one (≥1) of the following in large-scale logging and/or metrics platforms:
    • Right-sizing cloud resources
    • Analyzing spending across AWS services
    • Optimizing data retention policies (ISM/ILM)
    • Reducing storage costs

Skills

AWS, Ansible, Argo CD, Bash, CI/CD, Cloud Collaboration Technology, Docker, ELK, Elasticsearch, Generative AI, GitHub Actions, Golang, Grafana, Helm, Infrastructure as Code, Kafka, Kubernetes, Logstash, Loki, Mimir, OpenSearch, OpenTelemetry, Prometheus, Python, Splunk, Terraform, Tracing, Vulnerability Patching
