Following on from the previous chapter in its series on machine learning frameworks, NVIDIA has today released chapter 2, the next step in the learning process, covering what you need to know about data loading, data transfer bottlenecks, and other aspects of efficient framework interoperability. In the first chapter, NVIDIA discussed the pros and cons of distinct memory layouts, as well as memory pools for asynchronous memory allocation to enable zero-copy functionality. In the second chapter, NVIDIA discusses bottlenecks that can occur during data loading and transfers, and how to mitigate them using Remote Direct Memory Access (RDMA) technology.
“Efficient pipeline design is crucial for data scientists. When composing complex end-to-end workflows, you may choose from a wide variety of building blocks, each of them specialized for a dedicated task. Unfortunately, repeatedly converting between data formats is an error-prone and performance-degrading endeavor. Let’s change that!”
“Thus far, we have worked on the assumption that the data is already loaded in memory and that a single GPU is used. This section highlights a few bottlenecks that might occur when loading your dataset from storage to device memory, or when transferring data between two GPUs in either a single-node or multi-node setting. We then discuss how to overcome them.
“In a traditional workflow (Figure 1), when a dataset is loaded from storage to GPU memory, the data will be copied from the disk to the GPU memory using the CPU and the PCIe bus. Loading the data requires at least two copies of the data. The first one happens when transferring the data from the storage to the host memory (CPU RAM). The second copy of the data occurs when transferring the data from the host memory to the device memory (GPU VRAM).”
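The two-copy path described above can be sketched as follows. This is a minimal illustration, not NVIDIA's code: NumPy arrays stand in for the host and device buffers, and the final `.copy()` is a stand-in for the host-to-device PCIe transfer that a GPU library such as CuPy would perform (e.g. with `cupy.asarray`).

```python
import os
import tempfile

import numpy as np

# Write a sample dataset to "storage" (a temporary file on disk).
data = np.arange(1_000_000, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "dataset.npy")
np.save(path, data)

# Copy 1: storage -> host memory (CPU RAM).
host_buf = np.load(path)

# Copy 2: host memory -> device memory (GPU VRAM).
# With CuPy this would be: device_buf = cupy.asarray(host_buf)
device_buf = host_buf.copy()  # stand-in for the PCIe transfer

# The full dataset has now been moved twice end to end.
assert device_buf.nbytes == data.nbytes
```

Each copy costs bandwidth and latency, which is precisely the overhead that technologies like RDMA and direct storage-to-GPU transfers aim to eliminate.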
For the complete second chapter on machine learning frameworks, jump over to the official NVIDIA blog by following the link below.
Source: NVIDIA