Site Reliability Engineer
Job Description
Approach operations challenges with a software engineering perspective, leveraging:
- Coding, Automation and Engineering principles.
- Monitor and appropriately address system issues.
- Create strategies to detect issues.
- Design systems to troubleshoot automatically.
- Write and review post-mortems.
- Collaborate with development teams and other stakeholders to identify potential risks.
- Once risks are identified, you will analyze and evaluate potential impact and likelihood of occurrence.
- Based on the risk assessment, you will implement mitigation strategies to reduce operational risks.
- Continuously monitor and review the effectiveness of these risk strategies.
- Study historical performance trends using metrics, charts, and graphs.
- Trace the problems with system monitoring tools.
- Monitor the log files to manage infrastructures at scale.
- Minimize MTTR to reduce downtime and keep systems reliable.
- As an SRE, you can improve this metric by resolving incidents quickly.
- Maintain internal tooling.
- Monitoring system performance, identifying bottlenecks, and executing pipeline optimization.
- Implementing comprehensive service metrics to track and report on system reliability, performance, and efficiency.
- Developing and maintaining CI/CD pipelines, enhancing the consistency and speed of software deployment.
- Automating routine tasks and creating tools to improve team efficiency and system robustness.
- Collaborating with development teams to integrate operational considerations into the software development life cycle.
- Managing incident response protocols, including on-call rotations for junior engineers and strategic planning for senior personnel.
- Conducting post-incident reviews to prevent recurrence and refine the system reliability framework.
- Contributing to disaster recovery plans and ensuring robust backup systems are in place.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Working an on-call shift to prevent incidents from ever happening.
- Running our infrastructure with Ansible, Terraform, GitLab CI/CD, and Kubernetes.
Qualifications
- Experience in using: Linux, UNIX and Windows
- DB administration & maintenance: Oracle, Cassandra, PostgreSQL, AWS DB setups, Caching DB.
- Familiar with: Git, Jira, Jenkins, Ansible
- Strong knowledge of DevOps and CI/CD pipelines (GitHub, Terraform)
- Knowledge of monitoring solutions: Grafana, Prometheus, Dynatrace
- 'Hands-on' AWS implementation experience across a broad range of AWS services.
- Must have AWS development experience (Containerization - Docker, Amazon EKS, Lambda, EC2, S3, Amazon DocumentDB, PostgreSQL)
- Experience with core AWS platform architecture, including areas such as: Organizations, Account Design, VPC, Subnet, segmentation strategies.
- Comfortable working with cloud-native infrastructure, such as AWS Lambda, Google App Engine, and Azure Cloud Services.
- Backup and Disaster Recovery approach and design
- Environment and application automation
- Proficiency in programming languages such as Python, Go, or Java
- Familiar with Encryption, Logging, and Privacy/Security Protocols (e.g., TLS 1.2, ELK stack)
- Good knowledge of REST/SOAP/JSON web service API implementation.
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Relevant industry certifications, such as the Site Reliability Engineering (SRE) Foundation certification.
- Strong understanding of cloud-based applications and infrastructure, including AWS, Azure, or Google Cloud.
- Experience with IT operations best practices such as ITIL, COBIT, or DevOps.
- Experience with IT service management tools such as ServiceNow or Remedy.
- Familiarity with banking customer acquisition applications is preferred.
Additional Information
Benefits:
- Full access to foreign language learning platform
- Personalized access to tech learning platforms
- Tailored workshops and trainings to sustain your growth
- Medical subscription
- Meal tickets
- Monthly budget to allocate on flexible benefit platform
- Access to 7 Card services
- Wellbeing activities and gatherings
Hybrid: 1-2 days/week from office (Bucharest)