Site Reliability Engineer, AI Infrastructure (Bogotá)

May 25

Somewhere

Bogotá

Site Reliability Engineer, AI Infrastructure (Philippines) - 59419399046

Remote

JOB ID - 19161

Location: Remote (LATAM, South Africa or PH)
Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.
Schedule: Full-Time, Monday-Friday, PST or PH timezone.
Reports to: Head of Infrastructure / SRE

About the Company

We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.

Role Overview

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for Asian hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Key Responsibilities

I. Cluster Operations & Hardening

Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.

Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.

Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.

Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.

II. Automation & Observability

Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.

Observability Stack: Build and maintai
