Senior DevOps Engineer

Radwell International

On-site | Full-time | Senior | $145k – $175k/yr | Posted 3d ago

About the role

Job Summary

We’re building the next generation of platform engineering and AIOps capabilities to power Radwell’s digital ecosystem. As a Senior DevOps Engineer, you’ll be a hands‑on technical leader who lays down the standards for CI/CD, infrastructure‑as‑code, observability, and ML Ops—while partnering closely with IT Operations, Security, Data Engineering, and Product teams.

Essential Duties and Responsibilities

May be modified from time to time. Other duties may be assigned.

DevOps, Platform Engineering & AIOps

Build and lead (as a senior IC and technical mentor) a Platform Engineering & DevOps function responsible for:

  • Standardized CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) for apps, APIs, and ML workloads.
  • Infrastructure‑as‑Code for cloud resources (AWS/Azure), Kubernetes/ECS, databases, and data/AI infrastructure (repeatable, versioned, and policy‑as‑code enforced).
  • Secure, compliant, and repeatable environments for development, testing, staging, and production (secrets management, identity & access, network policies, artifact signing).
  • Design and implement an AIOps strategy that uses AI/ML to operate Radwell’s digital ecosystem:
    • Intelligent monitoring for web, ERP, CRM, AI services, and integrations.
    • Anomaly detection, proactive incident prevention, and noise‑reduced alerting.
    • Automated root‑cause analysis and self‑healing workflows for critical paths.
  • Partner with IT Operations and Security to build a unified observability stack (logs, metrics, traces, events) that feeds AIOps and SRE practices.

Data & ML Ops

  • Establish ML CI/CD patterns: automated training, validation, security gates, model/package versioning, and canary/blue‑green rollouts for batch and online serving.
  • Stand up model registry, feature store, and drift/quality guardrails (data contracts, statistical monitoring, hallucination/grounding metrics as applicable).
  • Engineer reproducible pipelines for data prep, training, evaluation, deployment, and rollback; integrate with experiment tracking and cost/usage telemetry.
  • Collaborate with Data Engineering to productionize feature delivery and ensure lineage, governance, and privacy compliance are baked into pipelines.

Site Reliability & Governance

  • Define SLIs/SLOs across critical user journeys; drive error budgets and reliability backlogs.
  • Reduce alert fatigue via correlation, deduplication, and ML‑based signal enrichment; sharpen MTTD/MTTR and change failure rate.
  • Champion FinOps (cost visibility and efficiency) across compute, storage, and data/AI workloads.
  • Document standards, publish internal runbooks/playbooks, and enable teams through training and code examples.

Qualifications

  • Experience mentoring distributed teams and partnering across time zones.
  • Prior collaboration with Security/IT Ops on incident response, tabletop exercises, and compliance audits.
  • Proven AIOps implementations: anomaly detection, correlation/RCA, forecasting, and automated remediation—using platform features and/or bespoke ML.

Knowledge & Skills Required

  • Deep experience with AWS and/or Azure, Kubernetes (AKS/EKS) or ECS, container registries, and service meshes.
  • Expert in Terraform or CloudFormation, GitHub Actions/Azure DevOps, and environment promotion strategies.
  • Hands‑on with observability stacks and OpenTelemetry (e.g., Prometheus/Grafana, ELK/OpenSearch, Datadog, Splunk, Azure Monitor, CloudWatch/X‑Ray).
  • Solid ML Ops toolkit familiarity (e.g., MLflow, SageMaker, Azure ML, Databricks), feature stores, model registries, and testing/rollback strategies.
  • Strong grasp of security & compliance in pipelines and infra: IAM, KMS, secrets, SAST/DAST/SCA, policy‑as‑code.
  • Background with event streaming (Kafka/MSK/EventBridge), API gateways, and zero trust networking (mTLS, boundary controls).
  • Familiarity with LLMOps (prompt/version control, grounding evals, token/cost telemetry) and RAG production patterns.
  • Experience hardening eCommerce and revenue‑critical flows (search, pricing, invoicing) at scale.

Education & Experience

  • Bachelor’s degree in Information Technology, Computer Science, Business, or related field preferred.
  • High school diploma or equivalent required.
  • 8+ years in DevOps/SRE/Platform Engineering, including 3+ years leading standards or mentoring engineers.

Physical Demands

  • Continuous sitting and typing for extended periods.
  • Lifting requirements include occasional lifting of up to 25 pounds.
  • Frequent walking or standing may be required at times.

Employee Evaluation Summary

  • Introductory Review – written approximately 80 days after the start of employment and used to determine whether employment will continue.
  • Annual Reviews – based on attendance, job knowledge, overall performance and timely project completion.

Work Schedule

This is an exempt position, which requires a work schedule that will achieve the results and objectives identified by the company. Generally, the schedule for this position will be 8:00 am‑5:00 pm, Monday through Friday, with one hour for lunch. Nights and weekends may be worked at the Development Team Manager’s discretion based on current project and implementation needs, deadlines, and workload. Employee is expected to come to work on time and adhere to accepted time‑off policies.

Work Environment

Dress is casual but professional in an office setting. Once the initial supply (provided at company expense) has been received, all employees are required to wear “Radwear” (a shirt with the company logo) at all times. The Radwell ID badge and access card must always be worn. Radwell Safety Policies must be adhered to at all times.

Employer’s Rights

This job description does not list all the duties of the job. You may be asked by supervisors or managers to perform other duties. You will be evaluated in part based upon your performance of the tasks listed in this job description. The employer has the right to revise this job description at any time. The job description is not a contract for employment and either you or the employer may terminate employment at any time, for any reason.

Benefits

Radwell offers a comprehensive benefits package including health, dental, and vision coverage. The Company provides company‑sponsored short‑term and long‑term disability benefits, as well as $50,000 in life insurance. These benefits, along with additional voluntary benefits, are available to all regular full‑time employees beginning on the first day of employment. All employees are automatically enrolled at 3% in the Company’s 401(k) Plan on the first of the month following 90 days of continuous employment. Employees are eligible for common paid Company Holidays and 15 days of PTO annually, which begins accruing on the first day of employment and may be used immediately upon joining the team.

Salary Information

The recruiting base salary range for this full‑time position is $145,000 – $175,000 per year. Within the range, individual pay is determined by factors including job‑related skills, experience, and relevant education or training. Additionally, this role is bonus‑eligible, with a target bonus percentage based on company performance.

Skills

AKS, AWS, Azure, Azure DevOps, Azure ML, CloudFormation, CloudWatch, Databricks, Datadog, ECS, ELK, EventBridge, GitHub Actions, Grafana, Kafka, Kubernetes, LLMOps, MLflow, MSK, OpenTelemetry, OpenSearch, Prometheus, RAG, SageMaker, SAST, Splunk, Terraform, X-Ray