Registration is now open for CppCon 2023! The conference starts on October 1 and will be held in person in Aurora, CO. To whet your appetite for this year’s conference, we’re posting some upcoming talks that you will be able to attend this year. Here’s another CppCon future talk we hope you will enjoy – and register today for CppCon 2023!
Tuesday, October 3 • 16:45 - 17:45
by Dian-Lun Lin
Summary of the talk:
Task graph computing system (TGCS) plays an essential role in high-performance computing. Unlike loop-based models, TGCSs encapsulate function calls and their dependencies in a top-down task graph to implement irregular parallel decomposition strategies that scale to large numbers of processors, including manycore central processing units (CPUs) and graphics processing units (GPUs). As a result, recent years have seen a great deal amount of TGCS research, just name a few: Taskflow, oneTBB, Kokkos-DAG, and HPX. However, one common challenge faced by TGCSs is the issue of synchronization within each task. For instance, in scenarios where a task involves executing GPU operations, a CPU thread typically needs to wait until the GPU completes the operations before proceeding further. This synchronization overhead can hinder performance and limit the overall scalability of TGCSs.
The introduction of C++ coroutines in C++20 has revolutionized asynchronous programming, offering improved concurrency and expressiveness. However, integrating TGCS with C++ coroutines presents several challenges. Firstly, existing TGCS solutions are not compatible with C++ coroutines, as the coroutine paradigm deviates from traditional C++ programming. This incompatibility makes it difficult to seamlessly incorporate coroutines into existing TGCS frameworks. Secondly, C++ coroutine programming is extremely difficult and requires a solid understanding of the underlying concepts and mechanisms. The introduction of a new paradigm adds complexity and a steep learning curve for developers. Lastly, while C++ coroutines offer a powerful mechanism for managing asynchronous operations, designing and implementing an efficient scheduler to leverage their capabilities remains challenging. To fully exploit the benefits of C++ coroutines, there is a need for a specialized scheduler that can handle large numbers of coroutines and make optimal use of hardware resources.
To address these challenges, we present Taro: Task-Graph-Based Asynchronous Programming using C++ Coroutine. Taro offers a task-graph-based programming model for C++ coroutines, simplifying the expression of complex control flows and reducing development complexity. Additionally, Taro incorporates an efficient work-stealing scheduling algorithm tailored for C++ coroutines, minimizing unnecessary context switches, CPU migrations, and cache misses.
In this session, I will introduce Taro's programming model and demonstrate how Taro can enable efficient multitasking between CPU and GPU tasks, avoiding blocking wait on CPU threads for GPU tasks to finish. I will show the example code for using Taro. Finally, I will demonstrate how our solution can improve the performance of a real-world RTL simulation workload and microbenchmarks. Taro will be open-source and available on GitHub.