Senior HPC Engineer – IFM
Application Open:
Full-Time
MBZUAI’s Institute of Foundation Models (IFM) is seeking a Senior HPC Engineer to provide technical leadership in designing, operating, and evolving large-scale GPU infrastructure supporting frontier AI research. IFM operates one of the world’s largest AI-focused supercomputing environments, and this role contributes directly to its research and development.
Key Responsibilities
- Lead operation and optimization of large-scale GPU clusters.
- Drive reliability, scalability, and performance improvements.
- Lead troubleshooting and root cause analysis of complex issues.
- Design and validate new cluster deployments and upgrades.
- Collaborate with researchers to optimize distributed AI training.
- Lead vendor engagement and technical reviews.
- Mentor junior engineers.
- Define monitoring, operational standards, and capacity planning processes.
- Participate in major incident management and escalations.
Academic Qualification
- Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, Software Engineering, Information Technology, Applied Mathematics, Physics, or a related discipline.
- Master’s degree preferred.
Professional Experience Required
Essential:
- 5+ years of experience in HPC, Linux infrastructure, cloud infrastructure, distributed systems, or large-scale production environments.
- Experience with Slurm and Linux administration.
- Experience troubleshooting compute, storage, and networking systems.
Preferred:
- GPU cluster operations.
- NVIDIA technologies including CUDA, NCCL, NVLink, and GPUDirect.
- InfiniBand networking.
- Weka, Lustre, BeeGFS, or similar storage platforms.
- Azure, AWS, or GCP.
- Terraform, Ansible, or Infrastructure-as-Code.
- PyTorch Distributed, Megatron-LM, DeepSpeed, FSDP, or large-scale AI training environments.



