25 May | Somewhere | Bogotá
Apply on Kit Empleo: kitempleo.com.co/empleo/1ami9k
Site Reliability Engineer, AI Infrastructure (Philippines) - 59419399046
Remote
JOB ID - 19161
Location: Remote (LATAM, South Africa or PH)
Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.
Schedule: Full-Time, Monday-Friday, PST or PH timezone.
Reports to: Head of Infrastructure / SRE
About the Company
We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.
Role Overview
We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for Asian hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.
Key Responsibilities
I. Cluster Operations & Hardening
Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.
Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.
Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.
Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.
II. Automation & Observability
Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.
Observability Stack: Build and maintain …