Site Reliability Engineer
Job Description
Approach operations challenges with a software engineering perspective:
- Monitor and appropriately address system issues.
- Create strategies to detect issues.
- Design systems to troubleshoot automatically.
- Write and review post-mortems.
- Collaborate with development teams and other stakeholders to identify potential risks.
- Once risks are identified, analyze and evaluate their potential impact and likelihood of occurrence.
- Based on the risk assessment, implement mitigation strategies to reduce operational risk.
- Continuously monitor and review the effectiveness of these risk strategies.
- Study historical performance trends using metrics, charts, and graphs.
- Trace problems using system monitoring tools.
- Monitor log files to manage infrastructure at scale.
- Minimize mean time to recovery (MTTR) to reduce downtime and keep systems reliable. As an SRE, you improve this metric by resolving incidents quickly.
- Maintain internal tooling.
- Monitoring system performance, identifying bottlenecks, and executing pipeline optimization.
- Implementing comprehensive service metrics to track and report on system reliability, performance, and efficiency.
- Developing and maintaining CI/CD pipelines, enhancing the consistency and speed of software deployment.
- Automating routine tasks and creating tools to improve team efficiency and system robustness.
- Collaborating with development teams to integrate operational considerations into the software development life cycle.
- Managing incident response protocols, including on-call rotations for junior engineers and strategic planning for senior personnel.
- Conducting post-incident reviews to prevent recurrence and refine the system reliability framework.
- Contributing to disaster recovery plans and ensuring robust backup systems are in place.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Working on-call shifts to prevent incidents from ever happening.
- Running our infrastructure with Ansible, Terraform, GitLab CI/CD, and Kubernetes.
Qualifications
- Experience in using: Linux, UNIX and Windows.
- DB administration & maintenance: Oracle, Cassandra, PostgreSQL, AWS DB setups, Caching DB.
- Familiar with: Git, Jira, Jenkins, Ansible.
- Strong knowledge of DevOps and CI/CD pipelines (GitHub, Terraform).
- Knowledge of monitoring solutions: Grafana, Prometheus, Dynatrace.
- 'Hands-on' AWS implementation experience across a broad range of AWS services.
- Must have AWS development experience (Containerization - Docker, Amazon EKS, Lambda, EC2, S3, Amazon DocumentDB, PostgreSQL).
- Experience with core AWS platform architecture, including areas such as: Organizations, Account Design, VPC, Subnet, segmentation strategies.
- Comfortable working with cloud-native infrastructure, such as AWS Lambda, Google App Engine, and Azure Cloud Services.
- Backup and Disaster Recovery approach and design.
- Environment and application automation.
- Proficiency in programming languages such as Python, Go, or Java.
- Familiar with Encryption, Logging, and Privacy/Security Protocols (e.g., TLS 1.2, ELK stack).
- Good knowledge of REST/SOAP/JSON web service API implementation.
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Relevant industry certifications, such as the Site Reliability Engineering (SRE) Foundation certification.
- Strong understanding of cloud-based applications and infrastructure, including AWS, Azure, or Google Cloud.
- Experience with IT operations best practices such as ITIL, COBIT, or DevOps.
- Experience with IT service management tools such as ServiceNow or Remedy.
- Familiarity with banking customer acquisition applications is preferred.
Benefits
- Full access to foreign language learning platform
- Personalized access to tech learning platforms
- Tailored workshops and trainings to sustain your growth
- Medical subscription
- Meal tickets
- Monthly budget to allocate on flexible benefit platform
- Access to 7 Card services
- Wellbeing activities and gatherings
Hybrid: 1-2 days/week from office (Bucharest)
Detailed information about the job offer
Company: Inetum Romania
Location: Bucharest, Romania
Added: 16.3.2025