ML Ops Engineer

Software Technology Inc.

Austin · Hybrid · Contract · Senior · Posted yesterday

About the role

We are looking for an experienced MLOps Engineer to design, build, and manage scalable machine learning infrastructure on AWS. This role will drive the end-to-end operationalization of ML models — from automated training pipelines and experiment tracking to production deployment, monitoring, and continuous retraining. The ideal candidate will bridge the gap between data science and engineering, establishing robust MLOps practices that ensure reliable, repeatable, and efficient delivery of ML solutions at scale using AWS-native services and industry-leading tools like SageMaker, Kubeflow, and MLflow.

Roles & Responsibilities

  • Design, implement, and manage AWS-based MLOps infrastructure to support large-scale machine learning workflows
  • Build and maintain end-to-end ML pipelines using SageMaker Pipelines, Step Functions, and Kubeflow for automated training, validation, and deployment
  • Implement model versioning, experiment tracking, and model registry practices using MLflow and SageMaker Model Registry
  • Develop and maintain CI/CD pipelines for ML models, ensuring seamless integration from development to production
  • Apply hands-on expertise in Python and frameworks such as TensorFlow or PyTorch, deploying models on SageMaker endpoints
  • Utilize Docker, Amazon EKS, and AWS-native CI/CD tools to streamline ML deployment and operations
  • Leverage core AWS services such as S3, EC2, Lambda, Glue, and Athena for building and scaling data and ML infrastructure
  • Deploy, manage, and optimize machine learning models in production using SageMaker real-time and batch inference endpoints
  • Implement automated model monitoring, drift detection, and retraining triggers to maintain model health in production
  • Set up A/B testing and canary deployment strategies for safe model rollouts
  • Collaborate with data scientists and engineering teams to standardize MLOps practices and enhance performance across the AWS ecosystem
  • Monitor system and model performance using CloudWatch, CloudTrail, and X-Ray, troubleshoot issues, and ensure high availability and reliability
  • Stay informed about the latest AWS service releases, MLOps best practices, and advancements in ML operations tooling
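
The drift-detection and retraining bullet above can be sketched in miniature. The snippet below computes a Population Stability Index (PSI) between a training baseline and live feature values and flags retraining when it crosses the common 0.2 rule-of-thumb threshold. This is an illustrative, standalone sketch — it is not SageMaker Model Monitor, and the bin count, epsilon, and threshold are assumptions.

```python
import math
import random

def psi(baseline, live, bins=10):
    """Population Stability Index between two 1-D samples.

    Bin edges come from the baseline's quantiles; a small epsilon
    avoids log(0) when a bin is empty on either side.
    """
    eps = 1e-4
    srt = sorted(baseline)
    # Quantile-based bin edges over the baseline distribution.
    edges = [srt[int(i * (len(srt) - 1) / bins)] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            # First edge the value falls below decides the bin; else last bin.
            idx = next((i for i, e in enumerate(edges) if x < e), bins - 1)
            counts[idx] += 1
        return [max(c / len(sample), eps) for c in counts]

    b, l = fractions(baseline), fractions(live)
    return sum((lb - bb) * math.log(lb / bb) for bb, lb in zip(b, l))

def needs_retraining(baseline, live, threshold=0.2):
    """Common rule of thumb: PSI > 0.2 signals meaningful drift."""
    return psi(baseline, live) > threshold

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(5000)]
stable   = [random.gauss(0.0, 1.0) for _ in range(5000)]  # same distribution
shifted  = [random.gauss(1.5, 1.0) for _ in range(5000)]  # mean shifted

print(needs_retraining(baseline, stable))   # no drift expected
print(needs_retraining(baseline, shifted))  # drift expected
```

In production this check would run on a schedule over captured inference inputs, with a positive result triggering the retraining pipeline rather than a print.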

Required Skills & Qualifications

  • Proficiency in production-grade machine learning system development, deployment, and MLOps practices on AWS
  • Strong experience with Python and ML frameworks such as TensorFlow or PyTorch
  • Familiarity with containerization and orchestration tools like Docker and Kubernetes (including Amazon EKS)
  • Hands-on experience with CI/CD pipelines using AWS-native tools such as CodePipeline, CodeBuild, and CodeDeploy
  • Advanced knowledge of AWS cloud services, particularly SageMaker, Bedrock, Lambda, Step Functions, and S3
  • Expertise in MLOps tools and platforms including Kubeflow, MLflow, and AWS SageMaker Pipelines for end-to-end model lifecycle management
  • Experience with model versioning, experiment tracking, model registry, and automated retraining workflows
  • Familiarity with AWS infrastructure-as-code tools such as CloudFormation or CDK
  • Strong understanding of model monitoring, drift detection, and A/B testing in production environments
  • Strong analytical and troubleshooting skills to maintain high system reliability using CloudWatch, X-Ray, and AWS-native observability tools
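
The canary-deployment and A/B-testing items above boil down to weighted traffic splitting, which SageMaker endpoints express as production-variant weights. The sketch below models that logic in plain Python — the variant names, step size, and `promote` helper are illustrative assumptions, not a SageMaker API.

```python
import random

def route(variant_weights, rng=random):
    """Pick a model variant by relative weight, mimicking how an
    endpoint splits traffic across production variants."""
    names = list(variant_weights)
    weights = [variant_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

def promote(variant_weights, canary, step=0.25):
    """Shift traffic toward the canary in fixed steps; a real rollout
    would gate each step on error-rate and latency metrics."""
    shifted = dict(variant_weights)
    gain = min(step, 1.0 - shifted[canary])
    if gain == 0:
        return shifted  # canary already takes all traffic
    shifted[canary] += gain
    # Take the gain proportionally from the remaining variants.
    others = [n for n in shifted if n != canary]
    total = sum(shifted[n] for n in others)
    for n in others:
        shifted[n] -= gain * shifted[n] / total
    return shifted

weights = {"stable": 0.9, "canary": 0.1}
weights = promote(weights, "canary")  # canary traffic grows from 10% to 35%
print(weights)
```

Repeated `promote` calls, each gated on healthy monitoring metrics, give the staged rollout described above; rolling back is just restoring the previous weight map.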

Skills

AWS CloudFormation · AWS CodeBuild · AWS CodePipeline · AWS EKS · AWS Lambda · AWS SageMaker · AWS SageMaker Pipelines · AWS Step Functions · AWS S3 · AWS CloudWatch · CI/CD · Docker · Drift Detection · Infrastructure-as-Code · Kubernetes · Kubeflow · MLflow · MLOps · Model Monitoring · Model Registry · Python · TensorFlow · PyTorch
