SRE-Application/Platform
Ampstek
Bloomfield · On-site · Contract · Posted 3 weeks ago
About the role
Role: SRE-Application/Platform
Location: Bloomfield, CT (Onsite)
Duration: Long-term contract
Responsibilities
- Ensure 24x7 system reliability, incident response, and operational readiness for global applications.
- Lead troubleshooting efforts during outages and performance incidents; perform root cause analysis (RCA) and implement preventive actions.
- Define and maintain operational metrics and reliability goals (availability, latency, throughput, resource utilization).
- Improve system stability through proactive monitoring, alerting, and capacity planning.
Big Data & Streaming Support
- Support deployments and operations across AWS Cloud, Kubernetes, and containerized environments.
- Implement and maintain cluster reliability in Kubernetes environments, including resource quotas, access control, permissions, and namespace management.
Qualifications
- Experience with monitoring, troubleshooting, performance tuning, capacity planning, and automation, along with strong exposure to distributed data processing and streaming frameworks such as Spark, Flink, and Kafka.
- Hadoop cluster administration and operations.
Skills
AWS Cloud, Flink, Hadoop, Kafka, Kubernetes, Spark