Site Reliability Engineer | Cloud Team | 2 Roles

Garmin
Hybrid
Regular employment
7 - 12 years of experience
Full Time
Cluj-Napoca, Romania
Responsibilities
We are a global company with offices in the US, Europe and Asia. In these centers, we carry out the various stages of product development, from initial concept to mass production of ready-to-sell units. We embrace a vertically integrated business model with strategic design, manufacturing, distribution, sales and support centers around the world to maximize our value to customers.
Garmin Private Cloud (GPC) will be our internal cloud, developed entirely using open-source technologies such as OpenStack and Kubernetes. GPC will enable Garmin to fully manage the technology, staffing, and costs associated with our evolving product platform.
The GPC team will be responsible for building and maintaining the platform that supports well-known Garmin services like Garmin Connect, ConnectIQ Appstore, Garmin Golf, and many other services.
We believe that collaboration leads to the best ideas, and we rely heavily on team interaction. As a hybrid role based in Cluj-Napoca, this position will require at least 3 days in the office each week.
Responsibilities
- Maintain the integrity of Garmin's production environment and ensure that all releases into it are well organized, communicated, and managed.
- Author and lead process improvements to the whole project lifecycle and release process.
- Establish and provide training to development teams on operational processes and automations that promote software integrity and stability.
- Lead design/definition activities for moderate- and high-complexity systems, features, and/or processes.
- Champion the shift-left culture of reliability and delivery performance within software development teams.
- Monitor and support moderate- and high-complexity software releases.
- Design and implement improvements to the software lifecycle and production pipeline through automated tools/systems that align with industry best practices.
- Coordinate and improve monitoring practices across software applications and infrastructure.
- Build and/or maintain tools to generate reports.
- Maintain accurate data to facilitate reporting on key reliability SLOs for multiple products/systems.
- Improve the team’s incident response by developing and maintaining incident playbooks.
- Through post-incident activities, proactively identify and/or implement reliability improvements and automated mitigations that prevent recurrence.
- Cultivate engagement in the SRE community to nurture standards, best practices, and training across product owners, software engineers, and other SREs.
- Participate in capacity planning to ensure software can scale sufficiently at peak times.
- Work collaboratively and professionally in a team environment with other Garmin associates to achieve goals.