The availability of large datasets has significantly contributed to recent advances in deep Convolutional Neural Network (CNN) models. However, training a large CNN model on such datasets is time-consuming, an issue commonly addressed by parallelizing and distributing the data or model during training. There are two main ways to implement distributed deep learning: data parallelism and model parallelism. Data parallelism distributes the dataset across multiple workers, allowing them to process different portions simultaneously. While increasing the number of workers reduces computation time, it also introduces additional communication time; in some cases, the added communication time can outweigh the benefit of the reduced computation time. In this paper, we focus on reducing the overall training time of the data-parallel approach through two strategies. First, we preserve the dataset's distribution across all workers, ensuring that each worker trains on representative data. Second, we localize model parameters and quantize gradients to three levels, {-1, 0, 1}, to reduce communication delays between the server and the workers, as well as among the workers themselves. Because the sampling preserves the distribution of the full dataset, each partition retains a similar mean and variance (capturing the important first- and second-order statistics), which guarantees that every worker trains its local model on identically distributed data rather than on an arbitrary random split. In addition, localizing parameters limits communication between the server and the workers to gradients only.
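The abstract does not spell out the sampling procedure, so the sketch below is only one plausible way to realize distribution-preserving partitioning: sort samples by a scalar summary (here, the per-sample norm, my own choice) and deal them round-robin, so every worker's shard spans the full value range and keeps a similar mean and variance. The function name and all details are illustrative, not the paper's actual method.

```python
import numpy as np

def stratified_partitions(X, n_workers, seed=0):
    """Split X into n_workers shards with similar mean and variance.

    Sorting by a scalar key and dealing samples round-robin is a simple
    stratification: each shard receives every n_workers-th sample of the
    sorted order, so the shards cover the same value range.
    """
    rng = np.random.default_rng(seed)
    key = np.linalg.norm(X.reshape(len(X), -1), axis=1)  # scalar summary per sample
    order = np.argsort(key, kind="stable")
    parts = [order[w::n_workers] for w in range(n_workers)]
    # Shuffle within each shard so local minibatches remain randomized.
    return [X[rng.permutation(p)] for p in parts]

# Toy data: 10,000 samples with 8 features, global mean ~5 and variance ~4.
X = np.random.default_rng(1).normal(5.0, 2.0, size=(10_000, 8))
shards = stratified_partitions(X, n_workers=4)
for s in shards:
    print(s.mean(), s.var())  # each shard tracks the global statistics
```

A purely random split would also approximate the global statistics in expectation, but the stratified deal controls shard-to-shard deviation much more tightly, which is the property the first strategy relies on.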
Furthermore, by encoding the three-level gradients in 2 bits, we reduce overall training time through faster convergence without compromising test or validation accuracy. The experimental results demonstrate that employing these strategies in distributed deep learning effectively reduces communication overhead and leads to faster convergence compared with methods that use random data sampling. These improvements were observed across multiple datasets, including MNIST, CIFAR-10, and Tiny ImageNet.
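The abstract names the three quantization levels but not the quantizer itself; as a rough illustration, the sketch below uses a well-known stochastic ternary scheme (in the spirit of TernGrad) in which each gradient entry is rounded to {-1, 0, 1} with a probability that keeps the estimate unbiased, plus one float scale per tensor. The function names and the choice of scale (the max absolute value) are assumptions, not necessarily the paper's quantizer.

```python
import numpy as np

def ternarize(grad, rng):
    """Stochastic ternary quantization of a gradient tensor.

    Each entry g_i becomes sign(g_i) with probability |g_i|/s and 0
    otherwise, where s = max|g|. Then E[s * q_i] = g_i, so the scheme is
    unbiased. The int8 entries lie in {-1, 0, 1} and need only 2 bits
    each on the wire, plus the single scalar s.
    """
    s = float(np.abs(grad).max())
    if s == 0.0:
        return np.zeros(grad.shape, dtype=np.int8), 0.0
    keep = rng.random(grad.shape) < np.abs(grad) / s
    ternary = (np.sign(grad) * keep).astype(np.int8)
    return ternary, s

def dequantize(ternary, s):
    # Reconstruct the (unbiased) gradient estimate on the receiver side.
    return ternary.astype(np.float64) * s

rng = np.random.default_rng(0)
g = rng.normal(size=100_000)
q, s = ternarize(g, rng)
g_hat = dequantize(q, s)  # noisy but unbiased estimate of g
```

Relative to sending 32-bit floats, 2 bits per entry is a 16x reduction in gradient traffic between the server and the workers, which is where the communication savings in the abstract come from.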