Summary
Lead the Cloud Engineering and Site Reliability efforts to ensure the availability, performance, and scalability of customer-facing retail systems. Serve as a player-coach, delivering hands-on technical contributions while driving operational excellence, automation, and SRE best practices in a hybrid, Tribeca-based role.
Responsibilities
- Manage and mentor the Cloud Engineering and SRE team as a player-coach.
- Oversee system reliability and availability to meet defined SLOs, as measured by SLIs.
- Implement Infrastructure as Code using Terraform to manage cloud resources.
- Monitor system performance, respond to incidents, and conduct postmortems.
- Drive automation to improve operational efficiency and reliability.
- Collaborate with development teams to build scalable and resilient systems.
- Promote a culture of continuous learning and improvement.
- Ensure system health and security through monitoring and best practices.
- Establish and maintain monitoring and observability tooling and processes.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or Mathematics, or equivalent experience.
- 6 years of experience planning, designing, building, and implementing IT systems.
- Minimum 3 years managing major projects and supervising teams for complex system implementations.
- Proficient in scripting and programming languages such as PowerShell, Bash, or Python.
- Extensive cloud infrastructure experience primarily with AWS; Oracle Cloud Infrastructure is a plus.
- Experience with Infrastructure as Code (Terraform preferred); certifications desirable.
- Strong understanding of networking, security, and database technologies.
- Experience with monitoring and observability tools such as PagerDuty and AWS CloudWatch.