NVIDIA today revealed the features included in its release of NVIDIA CUDA 11.4, which includes GPU-accelerated libraries, debugging and optimization tools, programming language enhancements, and a runtime library. The new runtime library, NVIDIA explains, was created to help developers build and deploy their applications on GPUs across the major CPU architectures: x86, Arm, and POWER.
The latest CUDA 11.4 release ships with the R470 driver, a long-term support branch, and focuses on enhancing the programming model and performance of CUDA applications. “CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, graphics, CAE applications, AI and deep learning, automotive, healthcare, and data sciences.”
“This release introduced key enhancements to improve the performance of NVIDIA CUDA Graphs without requiring any modifications to the application or any other user intervention. It also improves the ease of use of Multi-Process Service (MPS). We formalized the asynchronous programming model in the CUDA Programming Guide. CUDA graphs are ideal for workloads that are executed multiple times, so a key tradeoff in choosing graphs for a workload is amortizing the cost of creating a graph over repeated launches. The higher the number of repetitions or iterations, the larger the performance improvement.”
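The amortization tradeoff described above can be illustrated with a short sketch: the work is captured into a graph once, then the instantiated graph is relaunched many times. This is a minimal example, not NVIDIA's benchmark code; the `scale` kernel and the iteration count are illustrative, and it assumes nvcc and a CUDA-capable GPU.

```cuda
// Sketch: capture a stream's work into a CUDA graph once, then replay it,
// amortizing graph-creation cost over repeated launches.
#include <cuda_runtime.h>

// Illustrative kernel (an assumption, not from the article).
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the work once into a graph...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.001f, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then relaunch it repeatedly; the more iterations,
    // the better the one-time creation cost is amortized.
    for (int iter = 0; iter < 1000; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```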
“Reducing graph launch latency is a common request from the developer community, especially in applications that have real-time constraints, such as 5G telecom workloads or AI inference workloads. CUDA 11.4 delivers performance improvements in reducing the CUDA graph launch times. In addition, we also integrated the stream-ordered memory allocation feature that was introduced in CUDA 11.2.”
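Stream-ordered memory allocation, mentioned above, lets allocation, use, and free all be ordered on the same stream via `cudaMallocAsync`/`cudaFreeAsync` (introduced in CUDA 11.2). A minimal sketch, assuming a GPU and driver that support memory pools; the `fill` kernel is illustrative:

```cuda
// Sketch: stream-ordered allocation with cudaMallocAsync / cudaFreeAsync.
#include <cuda_runtime.h>

// Illustrative kernel (an assumption, not from the article).
__global__ void fill(int *p, int v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main() {
    const int n = 1 << 16;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *p;
    // Allocation, kernel, and free are all ordered on the stream,
    // so no device-wide synchronization is needed between them.
    cudaMallocAsync(&p, n * sizeof(int), stream);
    fill<<<(n + 255) / 256, 256, 0, stream>>>(p, 42, n);
    cudaFreeAsync(p, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    return 0;
}
```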
“In NVIDIA CUDA 11.4, we made a couple of key changes to CUDA graph internals that further improve launch performance. CUDA graphs already sidestep streams to enable lower-latency runtime execution. We extended this to bypass streams even at the launch phase, submitting a graph as a single block of work directly to the hardware. We’ve seen good performance gains from these host improvements, both for single-threaded and multithreaded applications.”
Source: NVIDIA