Lead Site Reliability Engineer (SRE)
Macpower Digital Assets Edge
Rockville · On-site · Full-time · Lead · Posted 3w ago
About the role
Key Focus Areas
- Manage and optimize AWS Control Tower, organization policies, and multi-account environments.
- Oversee AWS backups, SSM patching, AMI deployments, and configuration pushes across multiple accounts.
- Manage and maintain core AWS services including EC2, ECS, EKS, RDS, S3, SageMaker, CloudFront, and Lambda.
- Implement S3, SFTP, and site externalization methods.
- Develop Infrastructure as Code (IaC) using Terraform, CloudFormation, and Python.
- Manage IAM policies, access controls, and permissions.
Core Responsibilities
- Manage and maintain cloud infrastructure to ensure high availability, reliability, and performance.
- Serve as the primary escalation point for all cloud infrastructure issues.
- Monitor cloud resource performance and cost efficiency.
- Lead major incident management and communicate timely updates to stakeholders.
- Perform due diligence and impact analysis before implementing changes to cloud platforms.
- Lead and mentor a team of cloud engineers to ensure performance and collaboration.
- Manage daily operations and ensure alignment with organizational objectives.
- Develop and implement incident management processes and conduct root cause analysis.
- Identify and automate repetitive infrastructure tasks using IaC principles.
- Continuously improve operational processes and standard operating procedures.
- Implement and enforce security controls, ensuring compliance with standards such as GDPR and HIPAA.
- Monitor cloud usage and conduct capacity planning to balance efficiency and scalability.
- Develop and test disaster recovery and business continuity plans.
- Collaborate with IT, business units, and vendors to deliver scalable cloud solutions.
- Document cloud configurations, processes, and reports, ensuring accessibility and version control.
Technical Skills
- Proficiency in AWS (EC2, ECS, EKS, RDS, S3, Lambda, SageMaker, CloudFront).
- Experience with Azure and OCI cloud environments.
- Infrastructure as Code (Terraform, CloudFormation, Ansible, Puppet, Chef).
- Scripting in Python and PowerShell.
- Strong understanding of cloud architecture, monitoring, and automation tools.
- System administration experience (Windows, Linux, VMware, Active Directory, Azure AD SSO).
- Strong networking knowledge (DNS, DHCP, PKI, LAN/WAN).
Leadership and Behavioral Skills
- Demonstrated experience in leading teams and managing cloud operations.
- Strong communication and stakeholder management across technical and business functions.
- Proactive problem-solver with excellent analytical and root cause analysis skills.
- Self-motivated with a continuous improvement mindset.
- Experienced in vendor management and contract negotiations.
Basic Qualifications
- Bachelor's degree in Computer Science, Information Technology, Electrical Engineering, or equivalent.
- Experience in cloud operations and team leadership in technical environments.
Preferred Certifications and Experience
- AWS Certified Solutions Architect - Associate or Professional.
- Microsoft Certified: Azure Solutions Architect Expert.
- Familiarity with DevOps tools (CI/CD, Jenkins, Git).
- Experience with ITIL or ITSM frameworks.
Skills
Ansible, AWS, Azure, Azure AD SSO, CD, Chef, CloudFormation, CloudFront, CI, DHCP, DNS, Docker, EC2, ECS, EKS, Git, IaC, IAM, ITIL, ITSM, Jenkins, LAN/WAN, Lambda, Linux, Microsoft Azure, OCI, Puppet, PKI, PowerShell, Python, RDS, SageMaker, S3, SFTP, SSM, Terraform, VMware, Windows