Platform Monitoring Engineer
MCO (MyComplianceOffice)
Hyderabad · On-site · Full-time · Lead · Posted yesterday
About the role
You will be a member of MCO’s Platform Monitoring team, which manages the observability and performance of all technical infrastructure, environments, and enterprise systems. The role calls for a technical and analytical approach to ensure that our SaaS systems perform optimally during normal operation and as we introduce new features, traffic, and load.
Responsibilities
- Serve as a member of MCO’s Platform Monitoring team, providing technical leadership and subject-matter expertise in monitoring and observability.
- Proactively monitor MCO’s SaaS platform to ensure high availability, performance, and reliability of application and infrastructure services.
- Design, configure, tune, and manage monitoring and observability solutions, with a strong focus on Datadog, including metrics, logs, traces, dashboards, monitors, and alerts.
- Lead the implementation and continuous improvement of monitoring standards, processes, and procedures, aligned with SRE and ITIL best practices.
- Establish and maintain actionable alerting strategies, reducing noise while ensuring rapid detection of service degradation and failures.
- Provide real-time and scheduled reporting on system health, capacity, availability, and performance trends to engineering and leadership teams.
- Actively participate in Severity 1 and Severity 2 incident management, including detection, triage, root cause analysis, and post-incident reviews.
- Collaborate with platform, application, SRE, and infrastructure teams to identify monitoring gaps and improve end-to-end service visibility.
- Drive automation and observability enhancements, including synthetic monitoring, APM, log correlation, and infrastructure monitoring.
- Contribute as an individual technical leader to establish the Platform Monitoring team as a high-performing, metrics-driven organization.
- Mentor other team members and promote monitoring best practices across the organization.
Experience and Skills Required
- 5-8 years’ experience in IT operations and cloud infrastructure supporting 24/7 high-availability environments.
- Strong experience monitoring complex, distributed SaaS platforms, including applications, infrastructure, and network components.
- Hands-on expertise with Datadog (preferred), including dashboards, monitors, logs, APM, synthetic tests, and alerting strategies.
- Experience with other observability tools such as New Relic, Prometheus, Grafana, or similar platforms.
- Solid understanding of incident management, on-call operations, and escalation processes.
- Strong communication and stakeholder engagement skills to proactively identify risks and highlight potential incidents before customer impact.
- Working knowledge of cloud platforms such as Oracle Cloud Infrastructure (OCI), AWS, or Rackspace.
- Strong Linux experience (RedHat / Oracle Linux) and familiarity with enterprise application stacks including Apache and JBoss EAP.
- Experience analyzing performance bottlenecks, capacity issues, and availability risks using monitoring data.
- Familiarity with automation, scripting, or Infrastructure-as-Code concepts is a plus.
Skills
Apache, APM, AWS, Datadog, Grafana, Infrastructure-as-Code, JBoss EAP, Linux, New Relic, OCI, Prometheus, Rackspace, RedHat, SaaS, SRE