AI/ML Cloud Engineer / MLOps
Jobs via Dice
Bloomfield · On-site · Full-time · Mid Level · 1mo ago
About the role
This project focuses on deploying AI/ML models and tracking their usage. The team uses a tool called LiteLLM to measure how the models are being used.
You don’t need deep expertise in model training, but you should have a basic understanding of how models are trained.
Key Skills Required
- Strong experience in MLOps (especially for model deployment)
- Experience with LLM gateways (a plus, as the team is starting to use one)
What You Will Do
- Design, deploy, and manage cloud infrastructure for AI/ML workloads on AWS and Azure
- Work on AI platforms like Amazon SageMaker and Azure Machine Learning
- Support model training and deployment environments
- Help data scientists and ML engineers by setting up and optimizing infrastructure for model training and model deployment (inference)
Key Responsibilities
Cloud Infrastructure Management
- Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure.
- Manage compute resources such as EC2, Azure Virtual Machines, GPU instances, and Kubernetes clusters.
- Provision and configure storage, networking, and security services for AI platforms.
- Ensure high availability, scalability, and reliability of AI environments.
AI Platform Support
- Deploy and maintain AI/ML services such as Amazon SageMaker and Azure Machine Learning.
- Support AI model training and inference environments.
- Support data scientists and ML engineers by providing optimized infrastructure for model training and deployment.
Automation & Infrastructure as Code
- Implement Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, ARM templates / Bicep.
- Automate environment provisioning, patching, and scaling.
Containerization & Orchestration
- Deploy and manage containerized AI workloads using Docker, Kubernetes, Amazon EKS, Azure Kubernetes Service (AKS).
Monitoring & Performance Optimization
- Monitor system health, performance, and resource utilization using tools like CloudWatch, Azure Monitor, Datadog, Prometheus.
- Optimize infrastructure for cost, performance, and GPU utilization.
Security & Compliance
- Implement cloud security best practices including IAM / RBAC management, network security groups, encryption and secrets management.
- Ensure compliance with organizational and regulatory standards.
CI/CD & DevOps Integration
- Integrate AI infrastructure with CI/CD pipelines.
- Support automated deployment of models and AI services.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Systems, or related field.
- 5+ years of experience in infrastructure administration or cloud engineering.
- Strong hands-on experience with AWS cloud services.
- Strong hands-on experience with Microsoft Azure cloud services.
- Experience supporting AI/ML infrastructure or data platforms.
- Proficiency with Linux administration and scripting (Python, Bash, PowerShell).
- Experience with Docker and Kubernetes.
Preferred Qualifications
- Experience with GPU infrastructure for AI workloads.
- Knowledge of ML pipelines and MLOps practices.
- Experience with data platforms (Snowflake, Databricks, or Spark).
- Familiarity with AI frameworks such as TensorFlow or PyTorch.
- Cloud certifications such as AWS Certified Solutions Architect, Azure Administrator or Azure AI Engineer.
Key Skills
- Cloud Infrastructure (AWS, Azure)
- AI/ML Platform Support
- Kubernetes / Containers
- Infrastructure Automation
- Monitoring & Performance Tuning
- Security & Compliance
- DevOps & CI/CD
Skills
Amazon EKS, Amazon SageMaker, Azure Kubernetes Service, Azure Machine Learning, Azure Monitor, AWS, Bash, CloudFormation, Datadog, Docker, GPU, IAM, Kubernetes, Linux, Microsoft Azure, PowerShell, Prometheus, Python, Terraform