
Deep learning with GPUs
As the name suggests, deep learning involves learning progressively deeper representations of data, which requires large amounts of computational power. Such massive computational power is difficult to obtain from modern-day CPUs. GPUs, on the other hand, lend themselves very nicely to this task. GPUs were originally designed for rendering graphics in real time. The design of a typical GPU allows for a disproportionately large number of arithmetic logic units (ALUs), which allows them to crunch a large number of calculations in real time.
GPUs used for general-purpose computation have a highly data-parallel architecture, which means they can process a large number of data points in parallel, leading to higher computational throughput. Each GPU is composed of thousands of cores, and each of these cores consists of a number of functional units that contain a cache and ALUs, among other modules. Each of these functional units executes exactly the same instruction set, thereby allowing for massive data parallelism in GPUs. In the next section, we compare and contrast the design of a GPU with that of a CPU.
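To make this data-parallel model concrete, here is a minimal CUDA sketch in which thousands of threads execute exactly the same instruction stream, each on a different array element. The kernel name vecAdd, the array size, and the launch configuration are illustrative choices, not taken from this text:

#include <cstdio>
#include <cuda_runtime.h>

// Every thread runs this same instruction stream on a different
// element: the data-parallel model described above.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;  // one million elements
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Note that the grid size simply scales with the amount of data: the many simple GPU cores described above each take a slice of the array, with no per-thread control logic beyond a bounds check.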
The following table illustrates the differences between the design of a CPU and that of a GPU. As shown, GPUs are designed to execute a large number of threads that are optimized to run identical control logic. Hence, each GPU core is rather simple in design. CPUs, on the other hand, are designed to operate with fewer cores, but these cores are more general purpose. Their basic core design can handle highly complex control logic, which is usually not possible on GPUs. Hence, a CPU can be thought of as a commodity processing unit, as opposed to a GPU, which is a specialized unit:
Design aspect        CPU                                        GPU
Number of cores      Few                                        Thousands
Core design          Complex, with advanced control logic       Simple, identical control logic across threads
Intended use         General-purpose (commodity) processing     Specialized, data-parallel processing
In terms of relative performance, GPUs have much lower latency than CPUs when performing highly data-parallel operations. This is especially true if the GPU has enough device memory to load all the data required for the peak-load computation. However, in a head-to-head, core-for-core comparison, CPUs have much lower latency, as each CPU core is far more complex and has more advanced state control logic than a GPU core.
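Claims like these are straightforward to check empirically. The following fragment, which reuses the vecAdd kernel and the buffers from the earlier sketch, times the kernel with CUDA events, the standard runtime mechanism for recording timestamps on the GPU itself:

// Time the data-parallel kernel from the earlier sketch.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAdd<<<blocks, threads>>>(a, b, c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("vecAdd on %d elements took %.3f ms\n", n, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

A single-threaded CPU loop over the same million elements can be timed the same way with an ordinary wall-clock timer for a rough throughput comparison.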
As such, the design of an algorithm has a great bearing on the potential benefit of using a GPU over a CPU. Erik Smistad and his co-authors outline five factors that determine an algorithm's suitability for a GPU: data parallelism, thread count, branch divergence, memory usage, and synchronization. A sketch illustrating branch divergence follows.
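To make the branch divergence factor concrete, the following minimal CUDA sketch (the kernel name and the even/odd branching pattern are illustrative assumptions, not from the original text) shows threads of the same warp taking different paths. The hardware must then execute both paths one after the other, masking off the inactive threads each time, which wastes throughput:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: even and odd lanes of the same warp take
// different branches, so the warp executes both branches serially
// with half its threads masked off each time.
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        out[i] = in[i] * 2.0f;   // half the warp is idle here...
    } else {
        out[i] = in[i] + 1.0f;   // ...and the other half is idle here
    }
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    divergent<<<n / 256, 256>>>(out, in, n);
    cudaDeviceSynchronize();

    printf("out[0] = %.1f, out[1] = %.1f\n", out[0], out[1]);
    cudaFree(in); cudaFree(out);
    return 0;
}

An algorithm in which all threads follow the same path, as in the earlier vecAdd sketch, avoids this penalty entirely, which is why low branch divergence makes an algorithm a better GPU candidate.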
The table Factors affecting GPU Computing by Dutta-Roy illustrates the impact of each of these factors on the suitability of a GPU implementation. As shown below, any algorithm that fares under the High column is better suited to a GPU than others:
