AI/ML Cloud Engineer / MLOps
1 point system
Bloomfield · On-site · Full-time · Senior · 1mo ago
About the role
This project focuses mainly on deploying AI/ML models and tracking their usage. The team uses LiteLLM to measure how the models are used. You don’t need deep expertise in model training, but you should have a basic understanding of how models are trained.
Key Skills Required:
- Strong experience in MLOps (especially for model deployment)
- Experience with LLM gateways (a plus, as the team is starting to use one)
What You Will Do:
- Design, deploy, and manage cloud infrastructure for AI/ML workloads on AWS and Azure
- Work on AI platforms such as:
  - Amazon SageMaker
  - Azure Machine Learning
- Support model training and deployment environments
- Help data scientists and ML engineers by setting up and optimizing infrastructure for:
  - Model training
  - Model deployment (inference)
Key Responsibilities
Cloud Infrastructure Management
- Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure.
- Manage compute resources such as EC2, Azure Virtual Machines, GPU instances, and Kubernetes clusters.
- Provision and configure storage, networking, and security services for AI platforms.
- Ensure high availability, scalability, and reliability of AI environments.
AI Platform Support
- Deploy and maintain AI/ML services such as:
  - Amazon SageMaker
  - Azure Machine Learning
  - AI model training and inference environments
- Support data scientists and ML engineers by providing optimized infrastructure for model training and deployment.
Automation & Infrastructure as Code
- Implement Infrastructure as Code (IaC) using tools such as:
  - Terraform
  - CloudFormation
  - ARM templates / Bicep
- Automate environment provisioning, patching, and scaling.
Containerization & Orchestration
- Deploy and manage containerized AI workloads using:
  - Docker
  - Kubernetes
  - Amazon EKS
  - Azure Kubernetes Service (AKS)
Monitoring & Performance Optimization
- Monitor system health, performance, and resource utilization using tools such as:
  - CloudWatch
  - Azure Monitor
  - Datadog / Prometheus
- Optimize infrastructure for cost, performance, and GPU utilization.
Security & Compliance
- Implement cloud security best practices, including:
  - IAM / RBAC management
  - Network security groups
  - Encryption and secrets management
- Ensure compliance with organizational and regulatory standards.
CI/CD & DevOps Integration
- Integrate AI infrastructure with CI/CD pipelines.
- Support automated deployment of models and AI services.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Systems, or a related field.
- 5+ years of experience in infrastructure administration or cloud engineering.
- Strong hands-on experience with:
  - AWS cloud services
  - Microsoft Azure cloud services
- Experience supporting AI/ML infrastructure or data platforms.
- Proficiency with Linux administration and scripting (Python, Bash, PowerShell).
- Experience with Docker and Kubernetes.
Preferred Qualifications
- Experience with GPU infrastructure for AI workloads.
- Knowledge of ML pipelines and MLOps practices.
- Experience with data platforms (Snowflake, Databricks, or Spark).
- Familiarity with AI frameworks such as TensorFlow or PyTorch.
- Cloud certifications such as:
  - AWS Certified Solutions Architect
  - Azure Administrator or Azure AI Engineer
Key Skills
- Cloud Infrastructure (AWS, Azure)
- AI/ML Platform Support
- Kubernetes / Containers
- Infrastructure Automation
- Monitoring & Performance Tuning
- Security & Compliance
- DevOps & CI/CD
Skills
ARM templates, AWS, AWS CloudFormation, AWS EKS, Azure, Azure Kubernetes Service (AKS), Azure Machine Learning, Bash, Bicep, CloudWatch, Datadog, Docker, IAM, Kubernetes, Linux, LiteLLM, Microsoft Azure, MLOps, Prometheus, Python, Terraform