Senior Software Engineer - Application Reliability, Hybrid
Cisco
About the role
The application window is expected to close on: 06/20/2026. The job posting may be removed earlier if the position is filled or if a sufficient number of applications are received.
This position is based in San Jose, CA or North Carolina and operates under a hybrid work model.
As a Senior Software Engineer in Application Reliability, you will own the reliability of our AI-powered applications and features from the user's perspective.
While our infrastructure SRE team ensures the platform is healthy, your focus will be on feature uptime, usage trends, automated issue identification, and self-healing remediation at the application layer. You will build LangGraph-based agents for automated diagnostics, Looker dashboards for observability, and evaluation harnesses for agent quality - all powered by BigQuery, BigTable, and Python. You will partner closely with application developers, data engineers, and infrastructure SREs to ensure our APIs, RAG systems, agents, and user-facing features are reliable, observable, and continuously improving.
Your Impact
• Define, implement, and enforce feature-level SLIs, SLOs, and error budgets for APIs, RAG systems, AI agents, and user-facing applications.
• Build and maintain application observability systems using Looker dashboards on BigQuery and BigTable - providing real-time visibility into feature health, error patterns, and usage trends for developers, PMs, and leadership.
• Design and build LangGraph-based agents for automated issue identification and remediation: anomaly detection on BQ logs, root cause diagnosis, auto-rollback, feature flag kill switches, and self-healing workflows.
• Develop agent evaluation harnesses to benchmark agent performance, test multi-step workflows, handle non-deterministic outputs, and run regression testing as agents evolve.
• Write complex SQL (BigQuery) for usage trend analysis, anomaly detection, and operational analytics; design BQ table schemas optimized for observability and debugging.
• Analyze application usage trends and adoption metrics to proactively identify reliability risks, capacity needs, and degraded user experiences before they become incidents.
• Partner with application development teams to embed reliability practices into the development lifecycle: deployment safety (canary, progressive rollout), structured logging standards, and distributed tracing.
• Lead application-level incident response, root cause analysis, and blameless postmortems focused on feature impact rather than infrastructure symptoms.
• Build Python-based tooling and automation to reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for application-layer issues.
• Stay current with the rapidly evolving AI landscape (new frameworks, tools, and paradigms) and apply emerging techniques to improve platform reliability and developer productivity.
Minimum Qualifications
• 10+ years of experience in software engineering with significant focus on reliability, observability, or production operations; Bachelor's or Master's Degree in Computer Science, Engineering, or a related technical discipline.
• Strong Python development skills, with experience building production tooling, automation, and agent-based systems.
• Production GCP experience - deploying and managing applications on GKE (Kubernetes), deep SQL expertise with BigQuery (complex queries, window functions, schema design, cost optimization), and hands-on experience with BigTable (or equivalent) for high-throughput operational data.
• Proven experience designing and operating application-level SLI/SLO frameworks, burn-rate alerting, and error budget policies.
• Strong debugging skills at the application layer - distributed tracing, profiling, structured log analysis, and dependency mapping.
Preferred Qualifications
• Experience building agent evaluation harnesses (benchmarking, regression testing, guardrail validation for AI agents).
• Familiarity with A2A protocols, streaming architectures, and event-driven systems.
• Experience with deployment safety patterns: feature flags, canary deployments, progressive rollouts, and automated rollback.
• Experience with GCP observability services (Cloud Logging, Cloud Trace, Cloud Monitoring).
• Exposure to AIOps concepts: ML-driven anomaly detection, automated root cause analysis, intelligent alerting.
• Experience driving reliability culture across engineering teams - SLO adoption, postmortem processes, and reliability reviews.
• Active engagement with the evolving AI ecosystem; awareness of emerging tools and frameworks.
• Hands-on experience with GenAI application development: LangGraph, agent engineering, prompt design, and agentic workflows.
• Experience building Looker dashboards and LookML models for operational observability.