Site Reliability Engineer, AI Infrastructure (Bogotá)

May 25 | Somewhere | Bogotá

Site Reliability Engineer, AI Infrastructure (Philippines) - 59419399046

Remote

JOB ID - 19161

Location: Remote (LATAM, South Africa or PH)
Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.
Schedule: Full-Time, Monday-Friday, PST or PH timezone.
Reports to: Head of Infrastructure / SRE

About the Company

We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.

Role Overview

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for Asian hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Key Responsibilities

I. Cluster Operations & Hardening

Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.

Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.

Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.

Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.

II. Automation & Observability

Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.

Observability Stack: Build and maintai
