Overview

The Yale Center for Research Computing (YCRC) is looking for a versatile system administrator/engineer to help ensure that Yale’s exceptional faculty and students have the AI HPC infrastructure they need to propel discovery and scholarship to improve the world. Join our growing team of system specialists, research facilitators, and project administration experts, focusing your work especially on GPU infrastructure enhancements as part of Yale’s comprehensive campus investment in AI.

As an experienced subject matter expert, you will help lead the system design, deployment and support of YCRC’s AI-focused research cluster and storage infrastructure. This role is primarily systems-facing, but has a researcher-facing component as well. Frequent interaction with other systems team members, research support specialists, and researchers is a routine part of the job. You will be expected to stay current on developments and trends in accelerator and overall high performance computing technologies, processes, and methodologies. We will look to you for insights on evolving tradeoffs in areas such as accelerator-based memory, precision, interconnects, power consumption, and cost.

This is a hybrid position, with a minimum of two days per week on site. YCRC’s office space is on the Yale campus. As part of the systems team, you will be expected to provide on-site equipment maintenance as needed. Infrastructure is hosted at a Yale data center in West Haven, CT, and at the Massachusetts Green High Performance Computing Center (MGHPCC) in Holyoke, MA.

Required Skills and Abilities

Experience with accelerators such as GPUs for AI, including expertise with system-level tradeoffs in such areas as accelerator-based memory, precision, within-node interconnect, multi-node interconnect, cost and power consumption.
Expertise in administration of HPC Linux clusters, including managing and configuring cluster provisioning and management tools and batch schedulers.
Experience with high-speed networking such as InfiniBand and high-speed Ethernet.
Experience with large storage systems and parallel file systems such as GPFS and Lustre.
Expertise in Linux system administration, including managing the operating system, networking, storage, and security.
Expertise in automation and scripting in at least one scripting language.
Ability to work in a team environment in a fast-moving technology field. Excellent verbal and written communication skills.
Ability to interact well with team members and end users. Ability to work independently and across units.
Attention to detail. Ability to take the care necessary to be entrusted with a system that hundreds of users depend on for research computation and the storage of research data.

Preferred Skills and Abilities

Demonstrated ability to specify, install, configure, and support multi-node GPU systems, and tune them for AI applications.
Demonstrated ability to design, implement, and maintain a local, customized implementation and configuration of a core HPC system such as the HPC provisioning system, the resource-management system, account/user lifecycle management, or user authentication and authorization systems.
Experience supporting technology in a research environment.
Expertise in configuration, deployment, support, and backup of large-scale parallel storage systems.
Experience administering high-speed networking such as InfiniBand or high-speed Ethernet in a cluster environment.
Expertise in computer security, preferably in the context of large, multi-user Linux environments.
Experience in a data-center environment, installing and troubleshooting hardware.
Professional certifications related to the above.
Graduate degree in a related field.

Principal Responsibilities

Design, implement and advance core HPC systems such as the HPC provisioning system, the resource-management system, account/user lifecycle management, and user authentication and authorization systems.
Design, deploy, configure and support HPC clusters, including compute, networking, parallel storage and backup.
Install, administer and maintain hardware, system software, networking, accounts, and security measures.
Diagnose and correct system issues, whether they concern correct operation or performance.
Develop and maintain documentation.
Research developments in HPC architecture and new technologies, processes, and methodologies.
Determine specifications for new systems, and tailor these to meet research needs.

Required Education and Experience

Bachelor's degree in a related field and a minimum of six years of related work experience, or an equivalent combination of education and experience.

Salary Range

$90,000.00 - $165,750.00

Location

160 St. Ronan Street, New Haven, Connecticut

Yale University, New Haven, CT, USA

"Senior High Performance Computing System Administrator"