Platform Monitoring Engineer

MCO (MyComplianceOffice)

Hyderabad · On-site · Full-time · Lead · Posted yesterday

About the role

You will be a member of MCO’s Platform Monitoring team, which manages the observability and performance of all technical infrastructure, environments, and enterprise systems. The role calls for a technical and analytical approach to ensuring that our SaaS systems perform optimally during normal operation and as we introduce new features, traffic, and load.

Responsibilities

Serve as a member of MCO’s Platform Monitoring team, providing technical leadership and subject-matter expertise in monitoring and observability.
Proactively monitor MCO’s SaaS platform to ensure high availability, performance, and reliability of application and infrastructure services.
Design, configure, tune, and manage monitoring and observability solutions, with a strong focus on Datadog, including metrics, logs, traces, dashboards, monitors, and alerts.
Lead the implementation and continuous improvement of monitoring standards, processes, and procedures, aligned with SRE and ITIL best practices.
Establish and maintain actionable alerting strategies, reducing noise while ensuring rapid detection of service degradation and failures.
Provide real-time and scheduled reporting on system health, capacity, availability, and performance trends to engineering and leadership teams.
Actively participate in Severity 1 and Severity 2 incident management, including detection, triage, root cause analysis, and post-incident reviews.
Collaborate with platform, application, SRE, and infrastructure teams to identify monitoring gaps and improve end-to-end service visibility.
Drive automation and observability enhancements, including synthetic monitoring, APM, log correlation, and infrastructure monitoring.
Contribute as an individual technical leader to establish the Platform Monitoring team as a high-performing, metrics-driven organization.
Mentor other team members and promote monitoring best practices across the organization.

Experience and Skills Required

5-8 years’ experience in IT operations and cloud infrastructure in 24/7 high-availability environments.
Strong experience monitoring complex, distributed SaaS platforms, including applications, infrastructure, and network components.
Hands-on expertise with Datadog (preferred), including dashboards, monitors, logs, APM, synthetic tests, and alerting strategies.
Experience with other observability tools such as New Relic, Prometheus, Grafana, or similar platforms.
Solid understanding of incident management, on-call operations, and escalation processes.
Strong communication and stakeholder engagement skills to proactively identify risks and highlight potential incidents before customer impact.
Working knowledge of cloud platforms such as Oracle Cloud Infrastructure (OCI), AWS, or Rackspace.
Strong Linux experience (RedHat / Oracle Linux) and familiarity with enterprise application stacks including Apache and JBoss EAP.
Experience analyzing performance bottlenecks, capacity issues, and availability risks using monitoring data.
Familiarity with automation, scripting, or Infrastructure-as-Code concepts is a plus.

Skills

Apache, APM, AWS, Datadog, Grafana, Infrastructure-as-Code, JBoss EAP, Linux, New Relic, OCI, Prometheus, Rackspace, RedHat, SaaS, SRE
