Summary
As a Junior Site Reliability Engineer at Nordstrom Technology, you will be responsible for ensuring the stability and performance of critical systems. You will focus on proactive monitoring, incident response, and root cause analysis to maintain system reliability and health.
Responsibilities
- Monitor critical systems and maintain real-time dashboards to identify and respond to anomalies.
- Participate in on-call rotations to troubleshoot issues and restore services.
- Conduct root cause analysis and document findings to prevent recurrence.
- Collaborate with teams to enhance monitoring, logging, and alerting systems.
- Automate routine operational tasks and incident remediation.
- Support the definition and tracking of SLOs and SLIs.
- Assist in optimizing CI/CD pipelines and workflows.
- Create and maintain documentation for monitoring configurations and incident procedures.
- Work closely with engineering and infrastructure teams to improve system scalability and readiness.
Requirements
- Bachelor’s degree in computer science, engineering, or a related field, or equivalent experience.
- Understanding of site reliability engineering principles with a focus on monitoring and incident management.
- Experience with observability tools such as Prometheus, Datadog, or Grafana.
- Proficiency in programming or scripting languages such as Python, Go, or Bash.
- Familiarity with cloud platforms such as AWS or Azure.
- Knowledge of container technologies such as Docker and Kubernetes.
- Strong analytical skills and attention to detail.
- Excellent communication skills for collaboration and documentation.