Kubernetes Batch + HPC Day North America 2022: Full Schedule

October 24, 2022 | Detroit, Michigan

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2022 - Detroit, MI + Virtual and add this Co-Located event to your registration to participate in these sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: All times in this schedule are shown in Eastern Daylight Time (EDT), UTC-4.

The schedule is subject to change.

10:10am EDT

Coordinate Workloads Colocation: QoS-Oriented Scheduling Enhancement on K8s - Zuowei Zhang & Tao Li, Alibaba Cloud

Kubernetes provides well-defined QoS classes for pods: guaranteed, burstable, and best-effort. Users can colocate workloads of different QoS classes to achieve resource overcommitment and improve cluster utilization. However, as scale increases and workloads diversify, some limitations are becoming more apparent:
· Lower-QoS workloads are easily throttled or killed once a node runs out of resources
· The noisy neighbor problem affects the performance of latency-sensitive applications
· Local hot spots affect the global

We implemented Koordinator on top of Kubernetes, with several add-ons that provide QoS-oriented scheduling enhancements:
· Defining sub-QoS classes for complex workloads in colocation scenarios, compatible with the existing Kubernetes QoS semantics
· Using dynamic node and pod metrics to provide a more reliable model for resource overcommitment, including resource usage profiles and micro metrics such as CPU scheduling and memory allocation latency
· Applying fine-grained resource orchestration and isolation mechanisms on the node to solve the noisy neighbor problem and improve the efficiency of latency-sensitive workloads and batch jobs

Speakers

Zuowei Zhang

Senior Engineer, Alibaba Cloud

Zuowei Zhang is a senior engineer at Alibaba Cloud. He works on the container service team, focusing on resource management in warehouse-scale computers with Kubernetes. He also has years of development experience in distributed systems.

Tao Li

Senior Engineer, Alibaba Cloud

Tao Li is a senior engineer at Alibaba Cloud. He works on the container service team, focusing on cost optimization and ensuring runtime quality through scheduling at warehouse scale, and has years of development experience in K8s scheduling.

Coordinate Workloads Colocation Zuowei Tao 102422 v1 pdf

Monday October 24, 2022 10:10am - 10:40am EDT
Room 252 AB

Sessions

Content Experience Level Mid-Level
Subject Efficient Software Packaging and Distribution in Kubernetes When Running Thousands or Millions of Jobs in Parallel
Talk Type Pre-record

11:20am EDT

⚡ Lightning Talk: Fluence: Approaching a Converged Computing Environment - Daniel Milroy, Lawrence Livermore National Laboratory & Claudia Misale, IBM T.J. Watson Research Center

Adoption of cloud technologies by high performance computing (HPC) is accelerating, and HPC users want their applications to perform well everywhere. While container orchestration provides resiliency, elasticity, and declarative management, it is not designed to enable app performance like HPC schedulers. In particular, Kube-scheduler is not suited to scheduling emerging HPC workflows that require pods placed advantageously. In response to interest in scheduling flexibility, the K8s community developed the Scheduling Framework to integrate new policies and schedulers. KubeFlux, a Scheduling Framework plugin based on the Fluxion open-source HPC scheduler, provides HPC scheduling capability in K8s. We detail our improvements to the MPI Operator and demonstrate its scalability to 16,384 ranks. With the improved operator we compare the performance of HPC benchmark apps scheduled by Kube-scheduler and KubeFlux. We conclude that KubeFlux makes pod placements that enable much higher app performance than Kube-scheduler. KubeFlux is an example of the rich capability that can be added to K8s and paves the way to converged computing environments with the best capabilities of HPC and cloud.

Speakers

Daniel Milroy

Computer Scientist, Lawrence Livermore National Laboratory

Daniel Milroy is a Computer Scientist at the Center for Applied Scientific Computing at the Lawrence Livermore National Laboratory. His research focuses on graph-based scheduling and resource representation and management for high performance computing (HPC) and cloud converged environments...

Claudia Misale

Staff Research Scientist, IBM T.J. Watson Research Center

Claudia Misale is a Staff Research Scientist in the Hybrid Cloud Infrastructure Software group at IBM T.J. Watson Research Center (NY). Her research is focused on Kubernetes for IBM Public Cloud, and also targets porting HPC applications to the cloud by enabling batch scheduling alternatives...

milroy misale kubecon na22 pdf

Monday October 24, 2022 11:20am - 11:30am EDT
Room 252 AB

Lightning Talks

Content Experience Level Mid-Level
Subject Solutions and Tools Built to Manage and Efficiently Run Batch Workloads on Kubernetes
Talk Type In-person

1:25pm EDT

Make Kubernetes Networking Ready for World Class AI and HPC Workloads - Sunyanan Choochotkaew, IBM & Gaurav Singh, Red Hat

While the use of Kubernetes for various services is growing rapidly, it still lags behind in the world of HPC and AI clusters. Part of the reason is the lack of support for advanced features such as the multiple 100G networks available in HPC/AI systems. The vast majority of AI systems in hyperscalers such as IBM Cloud, AWS, Azure, and Oracle Cloud come with two to eight 100G network interfaces on the A100 GPU nodes. By default, however, a Kubernetes pod has only one network interface, and attaching multiple interfaces is often a requirement in these scenarios. Multus unlocks the potential of multi-networking in Kubernetes, but challenges remain in usability, manageability, and scalability. We present Multi-NIC CNI, a new open-source project, to democratize multi-interface capability for everyone. This CNI saves users from concerns about environment heterogeneity and from having to acquire CNI-specific knowledge. This talk will introduce the architecture, use cases, and performance of the CNI, then show how beneficial it is for HPC/AI. We will demonstrate the CNI on a large-scale GPU cluster consisting of over 1400 GPUs and two 100G network interfaces that we built in IBM Cloud.

Speakers

Sunyanan Choochotkaew

Research Scientist, IBM

Sunyanan Choochotkaew is a research scientist at IBM Research - Tokyo, specializing in research on distributed computing and performance acceleration on Cloud platforms. She received her Ph.D. in information and computer sciences from Osaka University, Japan. She served as a program...

Gaurav Singh

Product Manager, Red Hat

Gaurav Singh is a product manager for Red Hat OpenShift. He is responsible for core OpenShift components such as the scheduler, kubelet, and pod autoscaling. Prior to Red Hat, Gaurav worked as a product manager for Siemens, Hitachi Vantara, and Dell.

241022 Multi NIC CNI pdf

241022 Multi NIC CNI pptx

Monday October 24, 2022 1:25pm - 1:55pm EDT
Room 252 AB

Sessions

Content Experience Level Mid-Level
Subject Best Practices and Challenges Running Batch Workloads on Kubernetes
Talk Type In-person

2:00pm EDT

Building Armada – Running Batch Jobs at Massive Scale on Kubernetes - Jamie Poole, G-Research

Thousands of GPUs. Hundreds of thousands of CPUs. Learn how (and why!) G-Research designed and built Armada - a system to enable massive throughput of batch jobs running on Kubernetes. In this session you’ll hear how we use large scale batch compute on Kubernetes to spot patterns in financial markets and predict the future. Armada enables us to schedule millions of batch jobs across many clusters and tens of thousands of nodes, getting optimum utilisation of our hardware to enable our researchers to run the latest machine-learning and advanced data science techniques across vast datasets. We’ll cover the architecture and approach of Armada, challenges and techniques for running Kubernetes at scale and some war stories and lessons learned along the way.

Speakers

Jamie Poole

Compute Platform Engineering Manager, G-Research

Jamie Poole is the manager of the Compute Platform Engineering group at G-Research. He has an academic background in mathematics and physics, and professional background in software development and platforms in the government, defence and financial sectors. He is very enthusiastic...

Building Armada Jamie Poole pdf

Monday October 24, 2022 2:00pm - 2:30pm EDT
Room 252 AB

Sessions

Content Experience Level Mid-Level
Subject Solutions and Tools Built to Manage and Efficiently Run Batch Workloads on Kubernetes
Talk Type In-person

2:50pm EDT

Managed Kubernetes — Next Gen Academic Infrastructure? - Viktória Spišaková & Lukáš Hejtmánek, Masaryk University

Academic institutions run their own e-infrastructure comprising HPC systems with batch scheduling, cloud infrastructure that allows users to run VMs, and, in the last few years, container infrastructure. Various interactive applications are frequently used across all of these environments. However, they are not well suited to running in HPC or cloud (as a VM), since setting them up requires technical knowledge of the infrastructure, which places an unnecessary load on users. In the Czech NREN, we identified opportunities, situations, and use cases where we leveraged K8s for academic use. We decided to offer a managed K8s infrastructure for research, where users focus solely on their application (container image). Integration with the rest of the infrastructure is a task for the DevOps team. We present the progress we have made in this field since the first version of our infrastructure during KCNA 2021. We discuss what a managed K8s infrastructure can offer, e.g. traditional Jupyter notebooks and RStudio servers, but also true 3D game streaming, personalised storage, and much more. Lastly, we open a discussion on the challenges we had to face and issues that are yet to be solved.

Speakers

Viktória Spišaková

IT specialist, Masaryk University

I am a 22-year-old female IT specialist in the area of container cloud computing and HPC integration, with nearly 4 years of experience as a Linux admin and in DevOps. Currently, I am pursuing a PhD degree at Masaryk University, where I research container-based solutions for problems of academic infrastructures...

Lukáš Hejtmánek

IT architect, Masaryk University

Lukáš Hejtmánek received his Ph.D. degree in Computer Science from Masaryk University, Brno, Czech Republic. He works as an IT architect at Masaryk University in the CERIT-SC project and is also a storage specialist at CESNET. His main IT interest is to improve architecture of HPC systems...

KubernetesBatchHPCDay NA pdf

Monday October 24, 2022 2:50pm - 3:20pm EDT
Room 252 AB

Sessions

Content Experience Level Mid-Level
Subject Case Studies Building Batch + HPC + Data Processing Clusters On-prem or Cloud Using Kubernetes
Talk Type Pre-record

3:25pm EDT

Beyond Experimental: Spark on Kubernetes - Weiwei Yang, Apple

Apache Spark on Kubernetes takes advantage of containers and the large, rapidly growing Kubernetes ecosystem to maximize data processing capability in the cloud. However, running a large-scale production environment on this combination is not effortless. Challenges at scale, dev-ops complexity, multi-cluster management, job scheduling, and autoscaling are all roadblocks that could quickly derail the mission. In this session, Bowen Li and Weiwei Yang will share their insights on leveraging open source technology such as Apache YuniKorn, Spark K8s operator, and Cloud primitives to evolve ML data infrastructure in the cloud, including considerations for multi-tenancy, observability, scalability, and cost-effectiveness.

Speakers

Weiwei Yang

Staff Software Engineer, AIML Data Infra, Apple

Weiwei Yang is a staff engineer at Apple AIML Data Infra, where he builds the data infra for batch processing and AI/ML workloads. He is the V.P. of Apache YuniKorn and a committer and PMC member of Apache Hadoop, and has extensive experience in optimizing the data infra for big data...

Beyond experimental Spark on Kubernetes pdf

Monday October 24, 2022 3:25pm - 3:55pm EDT
Room 252 AB

Sessions

Content Experience Level Mid-Level
Subject Best Practices and Challenges Running Batch Workloads on Kubernetes
Talk Type In-person