NVIDIA · January 3, 2026
This job has closed.

AI and ML HPC Cluster Engineer

California, United States
Full-time
Onsite
$120K/yr - $190K/yr
Entry, Mid Level
NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. As an AI/ML HPC Cluster Engineer, you will provide technical engagement and problem-solving for the management of large-scale HPC systems, ensuring efficient resource utilization and supporting researchers with their workloads.

Responsibilities

  • Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient resource utilization
  • Directly administer internal research clusters, including upgrades, incident response, and reliability improvements
  • Develop and improve our ecosystem around GPU-accelerated computing, including building scalable automation solutions
  • Maintain heterogeneous AI/ML clusters on-premises and in the cloud
  • Support our researchers in running their workloads, including performance analysis and optimization
  • Analyze and optimize cluster efficiency, job fragmentation, and GPU waste to meet internal SLA targets (see the illustrative reporting sketch after this list)
  • Support root-cause analysis and suggest corrective actions; proactively identify and address potential issues before they occur
  • Triage and support postmortems for reliability incidents affecting users or infrastructure
  • Participate in a shared on-call rotation supported by strong automation, clear paths for responding to critical issues, and well-defined incident workflows
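
As an illustration of the efficiency and GPU-waste analysis mentioned above, here is a minimal sketch, assuming a Slurm cluster with GPU accounting enabled via TRES (gres/gpu). The seven-day window and the use of unsuccessful jobs as a proxy for waste are this sketch's own assumptions, not NVIDIA's internal metrics or tooling.

    import re
    import subprocess

    def elapsed_seconds(elapsed):
        # Slurm Elapsed strings look like "02:03:04" or "1-02:03:04".
        days, _, hms = elapsed.rpartition("-")
        parts = [int(x) for x in hms.split(":")]
        while len(parts) < 3:
            parts.insert(0, 0)
        hours, minutes, seconds = parts
        return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds

    def gpu_count(alloc_tres):
        # AllocTRES looks like "billing=64,cpu=64,gres/gpu=8,mem=512G,node=1".
        match = re.search(r"gres/gpu=(\d+)", alloc_tres)
        return int(match.group(1)) if match else 0

    # Pull one allocation record per job for the last seven days.
    out = subprocess.run(
        ["sacct", "--allocations", "--noheader", "--parsable2",
         "--starttime", "now-7days",
         "--format=JobID,State,Elapsed,AllocTRES"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Sum GPU-hours held by jobs that ended unsuccessfully.
    wasted_gpu_hours = 0.0
    for line in out.splitlines():
        if not line.strip():
            continue
        _jobid, state, elapsed, tres = line.split("|")
        if state.startswith(("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY")):
            wasted_gpu_hours += gpu_count(tres) * elapsed_seconds(elapsed) / 3600.0

    print(f"GPU-hours on unsuccessful jobs, last 7 days: {wasted_gpu_hours:.1f}")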

Qualifications

Required

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Minimum 2 years of experience administering multi-node compute infrastructure
  • Background in managing AI/HPC job schedulers like Slurm, Kubernetes (K8s), PBS, RTDA, BCM (formerly known as Bright), or LSF (a small Python sketch of a routine Slurm check follows this list)
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Proven understanding of cluster configuration management tools (Ansible, Puppet, Salt, etc.), container technologies (Docker, Singularity, Podman, Shifter, Charliecloud), Python programming, and bash scripting
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields
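
As a small illustration of the Slurm administration and Python scripting listed above, the following is a minimal sketch, assuming the standard sinfo client is available on the cluster; the node states checked and the output format are illustrative assumptions rather than a prescribed workflow.

    import subprocess

    # List nodes that are down or drained, along with the recorded reason.
    out = subprocess.run(
        ["sinfo", "--Node", "--noheader",
         "--states=down,drain",
         "--Format=NodeList:20,StateLong:12,Reason:60"],
        capture_output=True, text=True, check=True,
    ).stdout

    unhealthy = [line.rstrip() for line in out.splitlines() if line.strip()]
    if unhealthy:
        print(f"{len(unhealthy)} node(s) need attention:")
        for entry in unhealthy:
            print("  " + entry)
    else:
        print("No down or drained nodes reported.")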

Preferred

  • Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Experience with AI/ML concepts, algorithms, models, and frameworks (PyTorch, TensorFlow)
  • Experience with InfiniBand, including IPoIB and RDMA
  • Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads
  • Applied knowledge in AI/HPC workflows that involve MPI

Benefits

  • Equity
  • Benefits
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
Glassdoor rating: 4.6
Founded in 1993
Santa Clara, California, USA
10001+ employees
https://www.nvidia.com