Job Description:
We seek a hands-on Infrastructure & Systems Team Leader to ensure robust, stable, and efficient data center operations. This role requires technical expertise and leadership to optimize capacity, manage resources, and maintain high availability.
*This role is fully onsite*
Responsibilities:
- Ensure 24/7 stability of data centers, servers, storage, and network.
- Optimize resources and manage capacity for high-performance AI workloads.
- Operate and maintain GPU-based HPC clusters for AI and deep learning workloads.
- Manage virtualization environments for efficient workload distribution and scaling.
- Implement and enforce security best practices to protect infrastructure and data.
- Implement automation and monitoring to enhance efficiency and reliability.
- Coordinate with vendors and internal teams for seamless operations.
Qualifications:
- 10+ years in IT infrastructure, with hands-on experience in data center operations.
- Experience collaborating with U.S. service providers, including colocation, networking, cloud, and hardware vendors.
- Proven ability to ensure seamless data center operations and support.
- Strong knowledge of GPUs, high-performance computing (HPC), and high-speed storage.
- Experience with virtualization technologies (VMware, Proxmox, KVM, etc.).
- Expertise in securing infrastructure and enforcing security best practices.
- Experience with capacity planning and performance optimization.