Site Reliability Engineer
Site Reliability Engineer (SRE) | Cloud & DevOps | Remote/Hybrid
Role Overview
We are seeking a Site Reliability Engineer (SRE) to enhance system reliability, performance, and scalability by applying software engineering principles to operations. The ideal candidate will have expertise in cloud infrastructure, DevOps practices, CI/CD pipelines, automation, monitoring, and incident response. This role involves collaborating with development teams, implementing proactive monitoring strategies, and optimizing system performance to minimize downtime and enhance operational efficiency.
Key Responsibilities
System Monitoring & Incident Management
Monitor and proactively address system issues to ensure high availability.
Develop strategies to detect, troubleshoot, and automate incident resolution.
Implement comprehensive service metrics to track and report system reliability.
Minimize Mean Time to Resolution (MTTR) by optimizing incident response and recovery.
Conduct post-mortems and root cause analyses to prevent recurrence of incidents.
Risk Assessment & Mitigation
Analyze potential risks and evaluate their impact and likelihood.
Implement and continuously improve risk mitigation strategies.
Monitor historical performance trends using metrics, charts, and logs.
Automation & Infrastructure Management
Develop and maintain CI/CD pipelines to enhance software deployment.
Automate routine tasks and build tools to improve team efficiency.
Work with Ansible, Terraform, GitLab CI/CD, and Kubernetes for infrastructure automation.
Manage AWS, Azure, and Google Cloud environments, ensuring security and compliance.
Collaboration & Reliability Engineering
Partner with development teams to integrate operational considerations into the software lifecycle.
Support disaster recovery plans and backup strategies.
Conduct capacity planning, system design consulting, and platform management.
Balance feature development speed and reliability through service-level objectives.
On-Call Support & Continuous Improvement
Participate in on-call rotations, ensuring rapid response to incidents.
Improve monitoring solutions (Grafana, Prometheus, Dynatrace).
Optimize logging (ELK Stack), security, and encryption protocols (TLS 1.2).
Required Skills & Experience
Proficiency in Linux, UNIX, and Windows environments.
Strong DevOps and CI/CD experience (GitHub, Terraform, Jenkins, Ansible).
Hands-on experience with AWS services, including:
Docker, Amazon EKS, Lambda, EC2, S3, DocumentDB, PostgreSQL.
VPC, Subnet segmentation strategies, and AWS Organizations.
Experience with databases: Oracle, Cassandra, PostgreSQL, AWS database setups, and caching databases.
Programming proficiency in Python, Go, or Java.
Knowledge of REST and SOAP web service APIs and JSON.
Familiarity with IT service management tools (ServiceNow, Remedy).
Experience with ITIL, COBIT, or DevOps best practices.
Relevant certifications (SRE Foundation, AWS, Azure, or Google Cloud).
Banking industry experience is a plus.
Why Join?
Work on high-impact, cloud-native infrastructure and automation projects.
Engage with cutting-edge DevOps, cloud, and SRE methodologies.
Remote/hybrid flexibility with a focus on reliability engineering.
Opportunity to lead system resilience strategies in a fast-paced environment.
If you are an experienced SRE with a passion for automation, cloud infrastructure, and performance optimization, we'd love to hear from you!
Detailed information about the job offer
Company: Shape Your Future with Us
Location: Bucharest, Romania
Added: 27 March 2025
Job posting status: active