AI/ML Cloud Engineer
Jobs via Dice
Bloomfield · Hybrid · Contract · Mid Level · Posted today
About the role
Key Responsibilities:
Cloud Infrastructure Management
- Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure.
- Manage compute and supporting cloud resources such as EC2, Azure Virtual Machines, GPU instances, EKS, ECS, Lambda, Kubernetes clusters, VPC, S3, and Route 53.
- Provision and configure storage, networking, and security services for AI platforms.
- Ensure high availability, scalability, and reliability of AI environments.
AI Platform Support
- Deploy and maintain AI/ML services such as:
  - Amazon SageMaker and Microsoft Foundry
  - Azure Machine Learning
  - AI model training and inference environments
- Support data scientists and ML engineers by providing optimized infrastructure for model training and deployment.
Automation & Infrastructure as Code
- Implement Infrastructure as Code (IaC) using tools such as:
  - Terraform
  - CloudFormation
  - ARM templates / Bicep
  - Dockerfiles
- Set up and automate environment provisioning, patching, and scaling.
Containerization & Orchestration
- Deploy and manage containerized AI workloads using:
  - Docker
  - Kubernetes
  - Amazon EKS
  - Azure Kubernetes Service (AKS)
  - Amazon ECS
Monitoring & Performance Optimization
- Monitor system health, performance, and resource utilization using tools like:
  - CloudWatch
  - Azure Monitor
  - Datadog / Prometheus
- Optimize infrastructure for cost, performance, and GPU utilization.
Security & Compliance
- Implement cloud security best practices including:
  - IAM / RBAC management
  - Network security groups
  - Encryption and secrets management
- Ensure compliance with organizational and regulatory standards.
CI/CD & DevOps Integration
- Integrate AI infrastructure with CI/CD pipelines.
- Support automated deployment of models and AI services.
Required Qualifications
- Bachelor's degree in Computer Science, Information Systems, or a related field.
- 5+ years of experience in infrastructure administration or cloud engineering.
- Strong hands-on experience with AI/ML infrastructure or data platforms.
- Proficiency with Linux administration, scripting (Python, Bash, PowerShell), and IaC tooling (Terraform, Terragrunt).
- Experience with Docker and Kubernetes.
- Experience with GitHub Actions.
- Experience with LLM infrastructure setup.
- Experience working in a centralized team with triage responsibilities.
- Experience with AWS cloud services.
- Experience with Microsoft Azure cloud services.
Skills
ARM templates, AWS, AWS CloudFormation, AWS EKS, AWS Lambda, AWS SageMaker, Bash, Bicep, CloudWatch, Datadog, Docker, Dockerfiles, EC2, ECS, GitHub Actions, GPU, IAM, Kubernetes, Linux, Microsoft Azure, Microsoft Azure AKS, Microsoft Azure Machine Learning, Microsoft Foundry, PowerShell, Prometheus, Python, RBAC, Terraform, Terragrunt, VPC