Summary
We are seeking a strategic and hands-on Senior Manager of Site Reliability Engineering to lead our SRE team in Seattle. This role is pivotal in ensuring operational excellence and customer satisfaction by delivering resilient, scalable, and high-performing systems. You will guide a team of talented engineers, champion automation, and collaborate across disciplines to support business growth and innovation.
Responsibilities
- Build and mentor a high-performing SRE team, fostering a culture of ownership, innovation, and continuous learning.
- Ensure the availability and performance of critical services through proactive monitoring, incident response, and root cause analysis.
- Implement automation across deployment, recovery, and scaling processes to reduce manual toil.
- Define and execute observability strategies using New Relic, Splunk, and other tools to detect and resolve issues before they impact users.
- Partner with engineering, product, and operations teams to align reliability goals with business priorities.
- Lead capacity planning and performance tuning for services running on AWS EKS and other cloud-native platforms.
- Establish and track SLOs, SLAs, and error budgets, continuously refining processes to improve system reliability and team efficiency.
Requirements
- 5+ years in SRE, DevOps, or infrastructure engineering, with 2+ years in a leadership role.
- Expertise in cloud platforms (especially AWS), container orchestration (Kubernetes, EKS), and CI/CD pipelines.
- Proficiency in programming languages such as Python, Go, or Java.
- Hands-on experience with New Relic, Splunk, and Kubernetes.
- Strong analytical skills and a passion for root cause analysis and continuous improvement.
- Clear, concise, and collaborative communicator who thrives in cross-functional environments.
- Bachelor’s degree in Computer Science or Engineering, or equivalent experience.