May 27 | Major company | Colombia
Apply on Kit Empleo: kitempleo.com.co/empleo/1arbuw
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are looking for a skilled Senior Site Reliability Engineer to deliver advanced support and reliability engineering for critical cloud-based systems. The role focuses on ensuring reliability, performance and observability across AWS environments, with strong emphasis on Kubernetes, advanced monitoring, database expertise and distributed systems such as Kafka. The position involves incident response, proactive reliability improvements, automation and collaboration with engineering teams to strengthen system resilience.
Responsibilities
Design, implement and maintain observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena and other modern tooling
Monitor and troubleshoot EKS, Aurora RDS (Postgres) and other AWS infrastructure at an advanced level
Implement automated remediations and self-healing mechanisms
Participate in incident response, root-cause analysis and postmortems
Implement security measures impacting cluster reliability (IAM, network policies, Config)
Support and maintain current AWS infrastructure
Collaborate with L3 teams to escalate, troubleshoot and resolve operational issues
Requirements
3+ years of experience in site reliability engineering or advanced support roles
Expert-level proficiency in Grafana, Prometheus and OpenSearch
Expertise in OpenTelemetry, Fluent Bit, Cloud
📌 Senior Site Reliability Engineer (Colombia)