Cloud Engineer Observability SRE

FUSTIS LLC

San Francisco · Hybrid · Contract · Mid Level · $60 – $65/hr · 2mo ago

About the role

About

The Grade 10 Cloud Engineer within the Customer’s Cloud Collaboration Technology Group will play a key role in building and operating scalable observability and infrastructure platforms supporting Webex microservices. This role requires strong hands-on expertise in Kubernetes, cloud infrastructure, and observability systems, along with the ability to operate independently and to own components end-to-end in production environments. Candidates will demonstrate extensive use of generative AI tools for code generation and production system troubleshooting.

Key Responsibilities

Design, develop, and operate observability platforms (logging, metrics, and/or tracing) for Webex microservices.
Manage and optimize Kubernetes clusters across multi-region environments.
Own CI/CD pipelines using Argo CD and Helm.
Implement infrastructure as code (IaC) using Terraform on AWS.
Operate monitoring ecosystems, including but not limited to:
- OpenSearch/ELK,
- Prometheus,
- Grafana,
- Splunk, and
- Kafka.
Build automation to detect and remediate production issues.
Ensure security compliance through vulnerability patching.
Collaborate cross-functionally to improve reliability.
Participate in on-call rotations and incident response.
Contribute to distributed system design and operations.

Required Skills

General Abilities

Bachelor’s degree in computer science or a related field

General Technical Skills

At least eight (≥8) years of experience in a DevOps and/or SRE platform engineering role
Incident response and on-call operations: Demonstrated experience in a 24/7 production environment, including but not limited to:
- Triaging alerts
- Leading incident response
- Writing post-incident reviews
- Maintaining SLA commitments across large-scale distributed systems
IaC and automation: Proficiency with Terraform, Ansible, and/or equivalent IaC tooling for provisioning and managing cloud infrastructure at scale on AWS
Scripting and development: Working proficiency in Python, Golang, and/or Bash for building automation scripts, operational tooling, and/or CI/CD pipeline integrations (e.g., Drone, GitHub Actions, Argo CD)

Specific Technical Skills

Kubernetes and container orchestration: Production experience operating and troubleshooting workloads on Kubernetes at large scale (i.e., hundreds of deployments and thousands of pods), including but not limited to:
- Helm chart management
- Pod scheduling
- Resource tuning
- Multi-cluster operations
Observability stack expertise: Hands-on experience with pipeline design, query optimization, and/or capacity planning for high-volume environments in at least two (≥2) of the following:
- OpenSearch/Elasticsearch
- Prometheus/Mimir
- Grafana
- Loki
- Splunk
- Logstash

Desired Skills

Apache Kafka/AWS MSK: Experience in at least one (≥1) of the following:
- Operating or tuning Kafka clusters at scale
- Managing the following across high-throughput streaming pipelines:
  - Topic configurations,
  - ACLs,
  - Consumer lag, and/or
  - Schema registries
Splunk administration: Experience deploying, managing, and/or migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures, including but not limited to:
- Forwarder management,
- Search optimization,
- Index lifecycle, and/or
- Integration
OpenTelemetry and distributed tracing: Experience deploying OpenTelemetry for data collection and application performance monitoring
Security frameworks and container hardening: Familiarity with at least one (≥1) of the following (for vulnerability remediation at scale):
- Government or industry security certification standards; examples:
  - FedRAMP
  - STIG
  - IL5
  - ISO 27001
  - SOC 2
- Container image hardening practices
- Security scanning tools (e.g., Anchore, Grype)
AI-augmented operations: Experience using LLMs, AI coding assistants, and/or custom AI agents (e.g., MCP servers, Copilot, Claude) to:
- Accelerate engineering workflows,
- Automate runbooks, and/or
- Assist with incident triage
Deployment pipelines (Argo CD/Helm bundles): Experience with at least one (≥1) of the following across multi-region clusters:
- GitOps-style deployment workflows
- Argo CD application management
- Helm bundle patterns
- Blue/green or canary release strategies
Cost optimization and capacity planning: Experience in at least one (≥1) of the following in large-scale logging and/or metrics platforms:
- Right-sizing cloud resources
- Analyzing spending across AWS services
- Optimizing data retention policies (ISM/ILM)
- Reducing storage costs

Skills

AWS, Ansible, Argo CD, Bash, CI/CD, Cloud Collaboration Technology, Docker, ELK, Elasticsearch, Generative AI, GitHub Actions, Golang, Grafana, Helm, Infrastructure as Code, Kafka, Kubernetes, Logstash, Loki, Mimir, OpenSearch, OpenTelemetry, Prometheus, Python, Splunk, Terraform, Tracing, Vulnerability Patching
