Senior Site Reliability Engineering Manager
Overview
Come build and maintain the world’s computer as a member of the Microsoft Capacity Infrastructure Services team in Azure Core. The team ensures new servers are brought online (capacity buildout/provisioning) to enable Azure customers to leverage the latest offerings, see the illusion of infinite capacity, and grow the Azure business efficiently at hyperscale. You’ll also complete the cycle by safely taking old capacity offline (decommissioning/deprovisioning) and provisioning new capacity in its place, ensuring the cloud remains healthy and current.
As a Senior Site Reliability Engineering Manager, you’ll grow your team of site reliability engineers and service engineers to work with a breadth of partners across Microsoft including developers in service teams, hardware engineers, network engineers, datacenter technicians, supply chain managers, and business leaders to rapidly debug and resolve issues delaying the carefully orchestrated buildout and decommissioning sequences. You’ll drive continuous improvements with these teams to prevent repeats and address common classes of issues across the Azure software stack through design reviews and problem management.
This opportunity will enable you to gain unparalleled, wide-ranging knowledge of how the Azure cloud is built and maintained while growing your people management skillset. The contacts you make with experts will enable you to deep dive on services and new technologies and partner for improvements. You’ll be stretched to automate mitigations tactically to cloud scale and strategically analyze data to identify problem areas for driving improvements to meet business needs.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Qualifications
Required Qualifications:
Bachelor's Degree in Computer Science, Information Technology, or related field AND technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND technical experience in software engineering, network engineering, or systems administration
Other Requirements:
Preferred Qualifications:
Doctorate Degree in Computer Science, Information Technology, or related field
Experience with large-scale cloud or distributed systems
Responsibilities
- End-to-end expertise in distributed systems design, interactions between cloud technology layers and components, functions of physical network devices, and dependencies at scale. Drives efforts within an organization to identify and recommend optimal configurations of cloud technology solutions and develops or modifies the code base that defines infrastructures to improve the reliability and operability of supported products.
- End-to-end technical expertise in the architecture, code, features, and operations of specific products as required to implement improvements in product availability, reliability, efficiency, observability, and/or performance. Drives code/design reviews with the engineering teams that develop and/or manage those products and shares learnings and recommendations across engineering teams working on related products within their organization.
- Large-scale distributed systems and cloud technologies; manages efforts to research, develop, implement, and optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve the availability, reliability, efficiency, observability, and/or performance of their team's supported products. Monitors the implementation of new tools, technologies, and processes as well as their impact on reliability, efficiency, observability, and/or performance to make recommendations for broader adoption within an organization.
- Gains buy-in for their recommendations from product teams and owners.
- Participates in on-call rotations and manages teams of Site Reliability Engineers (SREs) responding to incidents during regular on-call rotations to identify the level of impact, troubleshoot issues, and deploy appropriate fixes to resolve root cause(s) and prevent recurrence across related products. Ensures that SREs within an organization have the technical knowledge and resources required to respond to incidents, that relevant engineering teams, stakeholders, and leaders are alerted to customer-impacting issues, that major issues are escalated to other teams as needed, and that key details related to incidents and their resolution are shared through post-mortem reports and during regular review meetings.
Educational resources
Discounts on products and services
Savings and investments
Maternity and paternity leave
Generous time away
Giving programs
Opportunities to network and connect