Cloud Engineer Observability SRE
FUSTIS LLC
San Francisco · Hybrid · Contract · Mid Level · $60 – $65/hr · 2w ago
About the role
The Grade 10 Cloud Engineer within the Customer’s Cloud Collaboration Technology Group will play a key role in building and operating scalable observability and infrastructure platforms supporting Webex microservices. This role requires strong hands-on expertise in Kubernetes, cloud infrastructure, and observability systems, along with the ability to operate independently and to own components end-to-end in production environments. Candidates will demonstrate extensive use of generative AI tools for code generation and production system troubleshooting.
Key Responsibilities
- Design, develop, and operate observability platforms (logging, metrics, and/or tracing) for Webex microservices.
- Manage and optimize Kubernetes clusters across multi-region environments.
- Own CI/CD pipelines using Argo CD and Helm.
- Implement infrastructure as code (IaC) using Terraform on AWS.
- Operate monitoring ecosystems, including but not limited to:
  - OpenSearch/ELK,
  - Prometheus,
  - Grafana,
  - Splunk, and
  - Kafka.
- Build automation to detect and remediate production issues.
- Ensure security compliance through vulnerability patching.
- Collaborate cross-functionally to improve reliability.
- Participate in on-call rotations and incident response.
- Contribute to distributed system design and operations.
Required Skills
General Abilities
- Bachelor’s degree in computer science or related field
General Technical Skills
- At least eight (≥8) years of experience in a DevOps and/or SRE platform engineering role
- Incident response and on-call operations: Demonstrated experience in a 24/7 production environment, including but not limited to:
  - Triaging alerts
  - Leading incident response
  - Writing post-incident reviews
  - Maintaining SLA commitments across large-scale distributed systems
- IaC and automation: Proficiency with Terraform, Ansible, and/or equivalent IaC tooling for provisioning and managing cloud infrastructure at scale on AWS
- Scripting and development: Working proficiency in Python, Golang, and/or Bash for building automation scripts, operational tooling, and/or CI/CD pipeline integrations (e.g., Drone, GitHub Actions, Argo CD)
Specific Technical Skills
- Kubernetes and container orchestration: Production experience operating and troubleshooting workloads on Kubernetes at large scale (i.e., hundreds of deployments and thousands of pods), including but not limited to:
  - Helm chart management
  - Pod scheduling
  - Resource tuning
  - Multi-cluster operations
- Observability stack expertise: Hands-on experience (pipeline design, query optimization, and/or capacity planning for high-volume environments) in at least two (≥2) of the following:
  - OpenSearch/Elasticsearch
  - Prometheus/Mimir
  - Grafana
  - Loki
  - Splunk
  - Logstash
Desired Skills
- Apache Kafka/AWS MSK: Experience in at least one (≥1) of the following:
  - Operating or tuning Kafka clusters at scale
  - Managing the following across high-throughput streaming pipelines:
    - Topic configurations,
    - ACLs,
    - Consumer lag, and/or
    - Schema registries
- Splunk administration: Experience deploying, managing, and/or migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures, including but not limited to:
  - Forwarder management,
  - Search optimization,
  - Index lifecycle, and/or
  - Integration
- OpenTelemetry and distributed tracing: Experience with deploying OpenTelemetry for data collection and application performance monitoring
- Security frameworks and container hardening: Familiarity with at least one (≥1) of the following (for vulnerability remediation at scale):
  - Government or industry security certification standards; examples:
    - FedRAMP
    - STIG
    - IL5
    - ISO 27001
    - SOC 2
  - Container image hardening practices
  - Security scanning tools (e.g., Anchore, Grype)
- AI-augmented operations: Experience using LLMs, AI coding assistants, and/or custom AI agents (e.g., MCP servers, Copilot, Claude) to:
  - Accelerate engineering workflows,
  - Automate runbooks, and/or
  - Assist with incident triage
- Deployment pipelines (Argo CD/Helm bundles): Experience with at least one (≥1) of the following across multi-region clusters:
  - GitOps-style deployment workflows
  - Argo CD application management
  - Helm bundle patterns
  - Blue/green or canary release strategies
- Cost optimization and capacity planning: Experience in at least one (≥1) of the following in large-scale logging and/or metrics platforms:
  - Right-sizing cloud resources
  - Analyzing spending across AWS services
  - Optimizing data retention policies (ISM/ILM)
  - Reducing storage costs
Skills
AWS, Ansible, Argo CD, Bash, CI/CD, Cloud Collaboration Technology, Docker, ELK, Elasticsearch, Generative AI, GitHub Actions, Golang, Grafana, Helm, Infrastructure as Code, Kafka, Kubernetes, Logstash, Loki, Mimir, OpenSearch, OpenTelemetry, Prometheus, Python, Splunk, Terraform, Tracing, Vulnerability Patching