Research Article  Open Access
Qingliang Meng, Meiyu Huang, Yao Xu, Naijin Liu, Xueshuang Xiang, "Decentralized Distributed Deep Learning with Low-Bandwidth Consumption for Smart Constellations", Space: Science & Technology, vol. 2021, Article ID 9879246, 10 pages, 2021. https://doi.org/10.34133/2021/9879246
Decentralized Distributed Deep Learning with Low-Bandwidth Consumption for Smart Constellations
Abstract
For the space-based remote sensing system, onboard intelligent processing based on deep learning has become an inevitable trend. To adapt to the dynamic changes of observation scenes, there is an urgent need to perform distributed deep learning onboard to fully utilize the plentiful real-time sensing data of the multiple satellites in a smart constellation. However, the network bandwidth of a smart constellation is very limited. Therefore, it is of great significance to carry out distributed training research in a low-bandwidth environment. This paper proposes a Randomized Decentralized Parallel Stochastic Gradient Descent (RDPSGD) method for distributed training over a low-bandwidth network. To reduce the communication cost, each node in RDPSGD transfers only a randomly selected part of the information of its local intelligent model to its neighborhood. We further speed up the algorithm by optimizing the programming of random index generation and parameter extraction. For the first time, we theoretically analyze the convergence property of the proposed RDPSGD and validate its advantage by simulation experiments on various distributed training tasks for image classification with different benchmark datasets and deep learning network architectures. The results show that RDPSGD can effectively save the time and bandwidth cost of distributed training and reduce the complexity of parameter selection compared with the Top-K-based methods. The method proposed in this paper provides a new perspective for the study of onboard intelligent processing, especially for online learning on a smart satellite constellation.
1. Introduction
With the breakthrough development of artificial intelligence and the rapid improvement of onboard computing and storage capabilities, it has become an inevitable trend for remote sensing satellite systems to directly generate the information required by users through intelligent processing onboard [1, 2]. As earth observation scenes usually present highly dynamic characteristics, the traditional "training on the ground, prediction onboard" working mode cannot satisfy users' requirements for real-time and accurate perception. There is an urgent need to learn and update the intelligent model onboard to adapt to the dynamic changes of the scenes.
Affected by factors such as satellite orbits, payloads, physical characteristics of target objects, and imaging methods, more and more intelligent tasks, such as emergency observation of disaster areas and the search for the missing Malaysia Airlines aircraft, require the cooperation of multiple satellites. Therefore, relying only on the observation data of a single satellite makes it difficult to achieve precise learning of the global intelligent interpretation model for these cooperative tasks.
Benefiting from the development of satellite technology and the reduction of satellite development costs, the number of satellites in orbit has increased sharply and inter-satellite networks have gradually been established, which lays the foundation for multi-satellite collaboration, or a smart constellation. Based on this collaborative working mode, it becomes possible to integrate the real-time sensing data and computing capabilities of multiple satellites through distributed deep learning technology. Compared with learning the intelligent model on only one satellite, distributed deep learning can achieve the overall optimization and global convergence of the intelligent model without global information or human intervention and thus improve the collaborative perception and cognitive capabilities of the space-based remote sensing system. Depending on how the tasks are parallelized across satellites, distributed training can be divided into two categories: model parallelism and data parallelism [3]. Model parallelism means training different parts of a network with multiple workers, which is mainly used for training very large models [4, 5]. In contrast, data parallelism refers to the strategy of partitioning the dataset into smaller splits [6] or collecting data on different devices independently, which is the scenario we study here.
However, due to the particularity of the operating environment of satellites, which differs from that of cluster systems on the ground, the network bandwidth of a smart constellation is often very limited. Therefore, it is of great significance and practical urgency to develop distributed deep learning under a low-bandwidth environment. To deal with this problem, traditional distributed training methods can be improved in two aspects.
The first aspect is to use decentralized network structures [7–9]. In the traditional centralized network structure, all nodes need to transmit the trained parameters or gradients of their intelligent models to the central server, wait for the parameter or gradient fusion, and then receive the fused parameters or gradients from the central server. Instead, the decentralized network structure removes the central parameter server and allows all nodes to exchange parameters or gradients with adjacent nodes. In this way, the communication load can be shared among the nodes to avoid congestion and improve the real-time capability of distributed training.
The second aspect is to reduce data transmission and save bandwidth usage. This can be achieved by communication delay, quantization, and sparsification. These techniques can be used independently or in combination to develop a comprehensive distributed training framework, such as sparse binary compression [10]. Communication delay means communicating after training several batches locally instead of after every batch, which reduces the frequency of communication. This technique is used in Local SGD (Stochastic Gradient Descent) [11, 12], federated averaging [13], and federated learning [14]. Quantization means using a low-precision value to replace the original precise parameters. For example, QSGD [15] (Quantized Stochastic Gradient Descent) adjusts the number of bits sent per iteration to smoothly trade off the communication bandwidth against convergence time. The TernGrad approach [16] requires only three numerical levels, which aggressively reduces communication time. DoReFa-Net [17] stochastically quantizes gradients to low-bitwidth numbers.
This paper mainly focuses on using the sparsification technique to overcome the communication bottleneck in a low-bandwidth environment. In sparsification methods, only part of the network parameters or gradients is sent. For example, Alistarh et al. [18] proposed sorting the gradients in decreasing order of magnitude and truncating the gradient to its top components, and proved the convergence of this Top-K-based method analytically. Deep gradient compression [19] also uses the gradient magnitude as a simple heuristic for importance and employs momentum correction, local gradient clipping, momentum factor masking, and warm-up training to preserve accuracy. Tsuzuku et al. [20] used the variance of gradients as a signal for compression. AdaComp [21] adaptively tunes the compression rate based on local gradient activity. Amiri and Gündüz [22] considered the physical-layer aspects of wireless communication and proposed an analog computation scheme, A-DSGD (Analog Distributed Stochastic Gradient Descent).
We notice that all these methods choose the magnitude or variance as the indicator of importance, sort the gradients by importance, and then truncate the gradients to the top components. For a deep neural network with millions to billions of parameters, this process can be time-consuming due to its high complexity. In this paper, a novel method named RDPSGD (Randomized Decentralized Parallel Stochastic Gradient Descent) is proposed to reduce the communication bandwidth by parameter sparsification. Unlike existing methods utilizing Top-K sparsification, in each iteration we select the parameters to be transferred at random, which greatly reduces the complexity of parameter screening. We prove, by both theoretical and experimental analysis, that this strategy still guarantees convergence, and we optimize the programming to fully leverage the advantage of random parameter sparsification.
The remainder of this paper is organized as follows: Section 2 proposes the RDPSGD method for smart satellite constellations, and Section 3 presents the programming optimization for the proposed method. Section 4 validates our method by experiments. Conclusions and future work are presented in Section 5.
2. Methodology
In this section, we first introduce the distributed training framework for smart satellite constellations. Then, based on this framework, we briefly review a classic decentralized distributed deep learning method, namely, DPSGD (Decentralized Parallel Stochastic Gradient Descent) [7]. Lastly, motivated by an analysis of the communication complexity of DPSGD, we propose our RDPSGD method, which is more suitable for a low-bandwidth satellite constellation environment.
2.1. Distributed Training Framework for Smart Satellite Constellations
The distributed training framework for smart satellite constellations is shown in Figure 1. In this framework, each satellite collects remote sensing images in real time and stores them locally. Each satellite is also equipped with an intelligent model to perform a certain perception or cognitive task, such as object detection or scene classification on the collected remote sensing images. If every satellite keeps its intelligent model unchanged, it cannot deal with remote sensing images of dynamic scenes or objects; thus, it is necessary to learn and update the intelligent model onboard. However, training the intelligent model only on a satellite's own local dataset makes it hard to achieve overall optimization and global convergence. Instead, for distributed training on a satellite constellation, multiple satellites can be connected by inter-satellite links to form a communication network, in which each satellite is called a worker node. A node not only trains the model using its own dataset but also exchanges and averages model parameters with adjacent nodes. In this way, the perception and cognitive capabilities of the satellite constellation can be fully utilized.
In this paper, we assume that the network has a fixed ring structure and that there is no centralized parameter server in the system. As mentioned earlier, this design can effectively avoid communication congestion. Nevertheless, the RDPSGD method proposed later can easily be applied to other network structures.
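As a concrete illustration of such a ring topology, the mixing (weight) matrix can be sketched as follows. This is our own minimal example, not code from the paper; the uniform 1/3 weights are one illustrative choice that makes the matrix symmetric and doubly stochastic, as decentralized averaging requires.

```python
import numpy as np

def ring_weight_matrix(n: int) -> np.ndarray:
    """Symmetric doubly stochastic weight matrix for a ring of n nodes.

    Each node averages its own parameters with those of its two ring
    neighbors, using uniform weights of 1/3 (an illustrative choice).
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3  # left neighbor on the ring
        W[i, (i + 1) % n] = 1 / 3  # right neighbor on the ring
    return W

W = ring_weight_matrix(8)
assert np.allclose(W, W.T)              # symmetric
assert np.allclose(W.sum(axis=1), 1.0)  # each row sums to 1
```

Because every row and column sums to 1 and all eigenvalues except the leading one have magnitude below 1, repeated multiplication by such a matrix drives the nodes toward the global average, which is the mechanism decentralized SGD relies on.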
2.2. A Review of DPSGD
The distributed training method proposed in this paper is built upon DPSGD [7], a very popular decentralized distributed deep learning technique. Lian et al. [7] proved the convergence of DPSGD and showed that DPSGD can outperform centralized algorithms. DPSGD [7] considers the following stochastic optimization problem:

min_{x ∈ R^d} f(x) := E_{ξ∼D}[F(x; ξ)], (1)

where D is the training dataset and ξ is a data sample. x ∈ R^d denotes the serialized parameter vector of an intelligent model with a specified deep learning network architecture, which is usually a convolutional neural network (such as a ResNet [23]), and F is a predefined loss function. The above optimization problem can be efficiently and effectively solved by the SGD algorithm [24].
To design parallel SGD algorithms on a decentralized network, the data are distributed onto all n nodes such that the original objective defined in (1) can be rewritten as

min_{x ∈ R^d} f(x) = (1/n) Σ_{i=1}^{n} E_{ξ∼D_i}[F_i(x; ξ)]. (2)

Define f_i(x) := E_{ξ∼D_i}[F_i(x; ξ)]. There are two ways to distribute the data: shared data, where D_i = D for all i; or local data with the same distribution, i.e., each node holds its own local dataset D_i drawn from the same distribution as D, which is the setting used in this paper.
2.2.1. Definitions and Notations
Throughout this paper, the following definitions and notations are used:
(i) ‖·‖ denotes the ℓ2 norm of a vector or the spectral norm of a matrix
(ii) ‖·‖_F denotes the Frobenius norm of a matrix
(iii) 1_n denotes the column vector of length n with 1 for all elements
(iv) |S| denotes the size of a set S
(v)
(vi) supp(v) denotes the set of locations of nonzero entries in a vector v
(vii) ∇f denotes the gradient of a function f
(viii) x_{t,i} denotes the parameter vector at iteration t on the i-th node
(ix) X_t := [x_{t,1}, …, x_{t,n}] denotes the concatenation of the local parameter vectors at iteration t
(x) W denotes the weight matrix, i.e., the network topology, satisfying (i) W = W^T and (ii) W 1_n = 1_n. We use deg(G) to denote the degree of the network G; for a ring-structured network, deg(G) = 2
(xi) W_b denotes the matrix constructed from W and b, where b is a vector of independent Bernoulli random variables, each of which takes value 1 with probability p
(xii) B(n, p) denotes the binomial distribution
(xiii)
(xiv)
By the definitions and notations, we have the DPSGD in Algorithm 1.

2.3. RDPSGD
It is easy to check that the communication complexity of DPSGD is proportional to the full parameter dimension per synchronization. In a network with low communication capacity, DPSGD may therefore suffer from latency. Here, we introduce a random transferring technique to reduce communication, named RDPSGD (Randomized Decentralized Parallel Stochastic Gradient Descent). Specifically, in the process of model synchronization with adjacent nodes, only a randomly selected part of the parameters of the intelligent model is transmitted, which reduces the bandwidth cost and the complexity of parameter filtering compared with the Top-K-based methods [18, 19]. The details are stated in Algorithm 2.
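The randomized synchronization step of Algorithm 2 can be sketched as below. This is a simplified NumPy illustration of our own (names are ours, not the paper's code): it assumes every node draws the same Bernoulli mask in a given iteration and that the stacked parameter matrix is visible in one place, which a real implementation would replace with per-node communication of only the masked entries.

```python
import numpy as np

def rdpsgd_sync(params: np.ndarray, W: np.ndarray, p: float, rng) -> np.ndarray:
    """One randomized synchronization step (illustrative sketch).

    params: (n_nodes, d) array of local parameter vectors.
    W:      (n_nodes, n_nodes) symmetric doubly stochastic weight matrix.
    p:      probability that a coordinate is transmitted and averaged.

    Coordinates selected by the Bernoulli(p) mask are replaced by the
    neighborhood-weighted average; all other coordinates keep their
    local values. For simplicity, all nodes share the same mask here.
    """
    n, d = params.shape
    mask = rng.random(d) < p        # Bernoulli(p) per coordinate
    mixed = W @ params              # neighborhood-weighted average
    out = params.copy()
    out[:, mask] = mixed[:, mask]   # update only the transmitted entries
    return out
```

With p = 1 this reduces to the full DPSGD averaging step, and with p = 0 no communication happens at all; intermediate values trade bandwidth against mixing speed.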

Now, we prove the convergence of RDPSGD. Firstly, as in DPSGD, we make a commonly used assumption on the weight matrix.
Assumption 1. W is a symmetric doubly stochastic matrix with ρ := max{|λ_2(W)|, |λ_n(W)|} < 1, where λ_i(W) denotes the i-th largest eigenvalue of W.
From a global view, at the t-th iteration, Algorithm 1 can be viewed as

X_{t+1} = X_t W − γ ∂F(X_t; ξ_t), (3)

where γ is the learning rate and ∂F(X_t; ξ_t) := [∇F_1(x_{t,1}; ξ_{t,1}), …, ∇F_n(x_{t,n}; ξ_{t,n})].
To prove the convergence of DPSGD, a critical property of the weight matrix W is needed, i.e., Lemma 5 in the original publication of DPSGD [7], which we reformulate as follows:
Lemma 2. Under Assumption 1, for any t ≥ 0, we have ‖W^t − (1/n) 1_n 1_n^T‖ ≤ ρ^t.
Similarly, at the t-th iteration, Algorithm 2 can be viewed as

X_{t+1} = X_t W_b − γ ∂F(X_t; ξ_t). (4)

Denote the neighborhood weighted average as X_t W_b, with the sparsity ratio p (0 < p ≤ 1). Then, for each node, we just need to transfer the information at the coordinates specified by b, so the communication cost is almost a fraction p of that of DPSGD. When p < 1, the communication complexity is reduced. Denote by W_{b_t} ⋯ W_{b_1} the operator given by the t-times composition of such randomized weight matrices. Compared with DPSGD, we need to prove a property of this composite operator similar to Lemma 2 to complete the proof of the convergence of RDPSGD:
Lemma 3. Under Assumption 1, for any t ≥ 0, the composite operator W_{b_t} ⋯ W_{b_1} satisfies a decay bound analogous to Lemma 2 with probability at least a threshold determined by p and t, where the decaying rate ρ_p plays the role of ρ.
Proof. Denote by b^{(s)} the random vector used to construct the s-th operator in the composition. Each entry of b^{(s)} indicates whether the corresponding entry of the local optimization parameters will be averaged at the s-th iteration in Algorithm 2; thus, b^{(s)} is a vector of independent Bernoulli random variables. Consider the rows of the composite matrix. Since W is doubly stochastic, each coordinate of the composite operator decays at a rate determined by the number of iterations in which that coordinate is averaged, and this count follows a binomial distribution. Combining this with the independence of the b^{(s)} yields the claimed bound. This ends the proof.
Denote by ρ_p the decaying rate under probability p, which plays the role of ρ. Based on empirical observation, we have
Figure 2 shows the numerical and theoretical decaying rates in a ring network of 8 nodes, with the sparsity ratio varied over its range under different minimal probabilities. The theoretical and numerical values of the convergence rate are in good agreement, indicating that RDPSGD can meet the convergence requirements.
Assuming that the average speed of the inter-satellite link is 2 MB/s, the effect of utilizing RDPSGD is shown in Table 1. It can be seen that RDPSGD can reduce the bandwidth and time cost linearly and thus make distributed training in a low-bandwidth environment practical.
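To make the scale of such savings concrete, a back-of-envelope calculation: the 2 MB/s link speed is the paper's assumption, while the roughly 25.6 million float32 parameters of ResNet-50 are our own assumption for illustration.

```python
n_params = 25.6e6          # approx. ResNet-50 parameter count (assumption)
bytes_per_param = 4        # float32
link_speed = 2 * 1024**2   # 2 MB/s inter-satellite link (paper's assumption)

for p in (1.0, 0.5, 0.1):  # sparsity ratio: fraction of parameters sent
    payload = p * n_params * bytes_per_param        # bytes per transfer
    seconds = payload / link_speed
    print(f"p={p}: {payload / 1024**2:6.1f} MB, {seconds:5.1f} s per transfer")
```

A full float32 model is on the order of 100 MB, i.e., tens of seconds per synchronization at this link speed, so transmitting only a fraction p of the parameters cuts the per-cycle time proportionally (index overhead aside).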
3. Programming Optimization
The refinement of the proposed RDPSGD algorithm over DPSGD also introduces additional programming overhead. For the DPSGD method, a full cycle of parameter transmission for model synchronization, i.e., step 4 in Algorithm 1, can be divided into three parts: serialization of the parameters of the intelligent model (taking time T_ser), communication of the parameters (T_comm), and deserialization of the parameters to recover the deep learning network structure of the intelligent model (T_deser); namely, the time cost of each cycle of parameter transmission for DPSGD is

T_DPSGD = T_ser + T_comm + T_deser.

For RDPSGD, only the randomly selected fraction p of the parameters is communicated (T'_comm ≈ p T_comm), but extra steps are needed: generation of the random index that indicates which parameters need to be transmitted and transmission of this index (T_idx), i.e., step 2 in Algorithm 2, as well as extraction of the parameters to be transmitted according to the random index (T_ext) and expansion of the extracted sparse parameters into dense network parameters (T_exp) in steps 5 and 6 of Algorithm 2; namely, the time cost of each cycle of parameter transmission for RDPSGD is

T_RDPSGD = T_ser + T_idx + T_ext + T'_comm + T_exp + T_deser.

Then, the difference between T_DPSGD and T_RDPSGD is

T_DPSGD − T_RDPSGD = (T_comm − T'_comm) − (T_idx + T_ext + T_exp).

Therefore, RDPSGD has a lower time cost only if

T_idx + T_ext + T_exp < T_comm − T'_comm ≈ (1 − p) T_comm.
These relations show that (i) the generation and transmission of the random index and the extraction and expansion of the parameters of the intelligent model should be optimized as far as possible to give full play to the acceleration effect of RDPSGD, and (ii) the lower the network bandwidth, the larger the communication time saved, and thus the more obvious the acceleration.
In order to improve the acceleration effect of RDPSGD, we optimize the programming from two aspects.
Random index generation. We first examine step 2 of Algorithm 2, in which the random index vector b is generated from a Bernoulli distribution with probability p. Suppose the total number of parameters of the intelligent model is d; this direct approach has to perform d random number generations and thresholding operations regardless of p. Since b is a binary vector, the indices of its elements with value 1 can be collected into a vector I, whose size is about p·d. The differences of adjacent elements of I are

Δ_k = I_k − I_{k−1}.

We can infer that the elements of Δ obey a geometric distribution. Therefore, if we transform the random index vector b into Δ, we need only perform about p·d random number generations. Another advantage of transforming b into Δ is the reduced bandwidth cost when the model is sparsely transmitted. For example, when p is small enough that the gaps rarely exceed 255, an 8-bit integer is enough to represent each element of Δ. In total, Δ then occupies about 8·p·d bits of data, which is less than the d bits of the 0-1 Boolean representation whenever p < 1/8.
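The gap-based index generation described above can be sketched as follows. The function name and the surplus-drawing strategy are our own; the key point is that NumPy's geometric sampler produces the gaps directly, so only about p·d draws are needed instead of d.

```python
import numpy as np

def random_index_gaps(d: int, p: float, rng) -> np.ndarray:
    """Sample the indices of a Bernoulli(p) mask over d coordinates by
    drawing the gaps between selected indices, which are geometrically
    distributed. Only about p*d random draws are needed, not d.
    """
    # Draw a safe surplus of gaps so their cumulative sum exceeds d.
    n_draws = int(p * d + 10 * np.sqrt(p * d) + 10)
    gaps = rng.geometric(p, size=n_draws)   # support {1, 2, ...}
    idx = np.cumsum(gaps) - 1               # 0-based selected positions
    while idx[-1] < d - 1:                  # top up in the rare shortfall
        extra = rng.geometric(p, size=n_draws)
        idx = np.concatenate([idx, idx[-1] + np.cumsum(extra)])
    return idx[idx < d]
```

When the gaps fit below 256, they can additionally be stored as `np.uint8` before transmission, giving the 8-bit encoding discussed above.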
Parameter extraction. Regarding the serialization, deserialization, extraction, and expansion operations in steps 5 and 6 of Algorithm 2, a general implementation would use a series of time-consuming for-loop operations. To speed up RDPSGD, the built-in functions of NumPy and PyTorch are used as much as possible to take advantage of dedicated CPU and GPU acceleration for vector and tensor operations.
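A minimal sketch of the vectorized extraction and expansion (function names are ours for illustration; a real implementation would operate on the serialized PyTorch parameter vector): each step is a single NumPy gather or scatter instead of a Python for-loop.

```python
import numpy as np

def extract(flat_params: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Gather only the selected parameters for transmission."""
    return flat_params[idx]                  # one vectorized gather

def expand(sparse_vals: np.ndarray, idx: np.ndarray, d: int) -> np.ndarray:
    """Scatter the received sparse values back into a dense d-vector,
    leaving unselected coordinates at zero so the caller can merge
    them with its local copy."""
    dense = np.zeros(d, dtype=sparse_vals.dtype)
    dense[idx] = sparse_vals                 # one vectorized scatter
    return dense
```

Fancy indexing dispatches both operations to optimized native loops, which is the source of the speedups reported in Section 4.3.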
4. Experiments
We evaluate our RDPSGD method on several distributed training tasks for image classification on different benchmark datasets and deep learning network architectures by ground simulation. Specifically, we study the performance of ResNet-20 [23] and VGG-16 [25] on CIFAR-10 [26] and ResNet-50 [23] on ImageNet-1k [27]. In our experiments, we test RDPSGD on a ring-structured network consisting of 8 worker nodes, each simulated by a workstation with an RTX 3070 GPU. The dataset is randomly split into 8 subsets to simulate the different data collected by each satellite. The algorithm is implemented in PyTorch with Gloo as the communication backend.
Models are trained using SGD with momentum and weight decay on every node. The hyperparameter setup is as follows:
(1) Batch size: 32 for ResNet-20, 128 for VGG-16, and 64 for ResNet-50.
(2) Learning rate: for ResNet-20, starting from 0.1 and divided by 10 at the 80th and 120th epochs; for VGG-16, starting from 0.5 and divided by 2 every 25 epochs; for ResNet-50, starting from 0.1 and divided by 10 every 30 epochs.
(3) Number of epochs and iterations: for ResNet-20, the maximum number of epochs is 200, with 196 iterations per epoch on a single node; for VGG-16, there are 200 epochs in total; for ResNet-50, there are 90 epochs.
(4) Momentum: 0.9.
(5) Weight decay: 0.0001.
(6) Synchronization delay: for ResNet-20 and VGG-16, parameter synchronization is performed after every batch; for ResNet-50, it is performed after every 100 batches.
The convergence and bandwidth-saving effect of RDPSGD are analyzed. The acceleration effect of programming optimization is evaluated, and the time cost is compared with that of the Top-K-based methods [18, 19].
4.1. Convergence of RDPSGD
Figures 3–5 show the convergence of the loss function and the prediction accuracy of different models with different sparsity ratios using RDPSGD. As shown in Figure 3, training ResNet-20 on CIFAR-10 achieves convergence under different sparsity ratios with no accuracy loss. Even when the sparsity ratio is 0.1, i.e., the transmitted model parameters are reduced by a factor of 10, the training accuracy still reaches more than 90%. A similar phenomenon is observed when training VGG-16 on CIFAR-10 and ResNet-50 on ImageNet-1k, as shown in Figures 4 and 5. These results demonstrate that the proposed RDPSGD method converges on different distributed training tasks under different sparsity ratios.
4.2. Bandwidth Cost
The bandwidth cost of one full cycle of transmitting ResNet-50 from one node to another is shown in Figure 6. When the sparsity ratio is close to 1, the bandwidth cost of RDPSGD is higher than that of DPSGD, because an extra vector containing the indices of the transmitted parameters must also be sent. Below a critical value of around 0.8, as the sparsity ratio continues to decrease, the bandwidth cost decreases approximately linearly.
4.3. Programming Optimization
We evaluate the time cost of training ResNet-50 on ImageNet for one epoch using different methods. Table 2 shows that, without programming optimization, the time cost of RDPSGD is even higher than that of DPSGD. After the programming optimization is applied, the average time cost of random index generation drops from 0.432 s to 0.056 s, a speedup of about 7.7×, while the average time cost of parameter extraction and expansion drops from 12.855 s to 0.431 s, a speedup of about 29.8×. The overall effect of programming optimization is also shown in Table 2: the time cost of RDPSGD is reduced to 781.15 s, less than that of DPSGD.

As mentioned earlier, the lower the bandwidth, the more obvious the acceleration. To verify this, we use the Trickle tool to limit the bandwidth to no more than 200 kb/s. We define the time cost of parameter synchronization in one epoch as the total time cost minus the time needed for gradient calculation and backpropagation. The results in Table 3 show that the speedup ratio is indeed higher when the bandwidth is lower, and that it depends more strongly on the sparsity ratio.

4.4. Comparison with the TopKBased Methods
Table 4 shows the time cost of parameter extraction for the Top-K-based methods [18, 19] and RDPSGD at sparsity ratios of 0.1 and 0.5. The results indicate that RDPSGD accelerates parameter extraction substantially compared with the Top-K-based methods, because it selects the parameters to be transferred at random instead of screening them by magnitude, which requires a sorting operation of high time complexity.
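The difference in selection cost can be illustrated in code (our own sketch, not the paper's implementation): magnitude-based Top-K must inspect every one of the d gradient entries, which costs O(d log d) with a full sort or O(d) with `np.argpartition`, while random selection never reads the gradient values at all. Here a uniform draw of k indices stands in for RDPSGD's Bernoulli mask.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 1_000_000, 0.1
g = rng.standard_normal(d)     # stand-in for a flattened gradient vector
k = int(p * d)

# Top-K: every entry's magnitude must be examined before selection.
topk_idx = np.argpartition(np.abs(g), -k)[-k:]

# Random: draw k indices without touching the gradient values.
rand_idx = rng.choice(d, size=k, replace=False)

# Top-K keeps exactly the largest magnitudes; random keeps a uniform sample.
assert np.abs(g[topk_idx]).min() >= np.abs(np.delete(g, topk_idx)).max()
```

The random draw also composes with the gap-based encoding of Section 3, whereas Top-K indices have no such exploitable structure.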

5. Conclusion and Future Work
This paper proposed RDPSGD, a decentralized distributed training algorithm with low-bandwidth consumption for a smart constellation, which randomly selects a part of the model parameters to transmit. We proved the convergence of this algorithm theoretically and optimized the programming to further speed up its practical application. The experimental results show that the convergence and acceleration requirements in a low-bandwidth environment can be met and that the algorithm outperforms the Top-K-based methods in parameter extraction, which makes it a promising method for future distributed deep learning on a space-based remote sensing system.
The work in this paper can be improved in the future. Firstly, the algorithm was tested on distributed training tasks with labeled datasets, while the data used for onboard training are usually unlabeled; the algorithm can be extended to semi-supervised or unsupervised training. Secondly, our current experiments were conducted in a ground cluster environment with a fixed network topology and homogeneous nodes, using software to simulate the low-bandwidth inter-satellite network. We will study the performance of the algorithm in dynamic heterogeneous networks and carry out onboard verification and the corresponding engineering optimization research in the future.
Data Availability
The datasets used in this paper include CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) and ImageNet-1k (https://imagenet.org/challenges/LSVRC/2012/).
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.
Authors’ Contributions
X. Xiang and N. Liu conceived the idea of this study and supervised the study. X. Xiang performed the theoretical proof and numerical analysis. Q. Meng, M. Huang, and Y. Xu conducted the experiments. Q. Meng and M. Huang performed the data analysis. X. Xiang, Q. Meng, and M. Huang wrote the manuscript. All authors discussed the results and contributed to the final version of the manuscript. Qingliang Meng and Meiyu Huang contributed equally to this work.
Acknowledgments
This work was supported by the Beijing Nova Program of Science and Technology under Grant Z191100001119129 and the National Natural Science Foundation of China under Grant 61702520.
References
[1] G. Giuffrida, L. Diana, F. de Gioia et al., "CloudScout: a deep neural network for on-board cloud detection on hyperspectral images," Remote Sensing, vol. 12, no. 14, p. 2205, 2020.
[2] H. Li, H. Zheng, C. Han, H. Wang, and M. Miao, "Onboard spectral and spatial cloud detection for hyperspectral remote sensing images," Remote Sensing, vol. 10, no. 1, p. 152, 2018.
[3] H. Zhang, Machine Learning Parallelism Could Be Adaptive, Composable and Automated, Ph.D. thesis, Carnegie Mellon University, 2020.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," NAACL, 2019.
[5] T. Brown, B. Mann, N. Ryder et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[6] X. Jia, S. Song, W. He et al., "Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes," 2018, https://arxiv.org/abs/1807.11205.
[7] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
[8] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D2: decentralized training over decentralized data," in International Conference on Machine Learning, PMLR, pp. 4848–4856, 2018.
[9] X. Lian, W. Zhang, C. Zhang, and J. Liu, "Asynchronous decentralized parallel stochastic gradient descent," in International Conference on Machine Learning, PMLR, pp. 3043–3052, Stockholm, Sweden, 2018.
[10] F. Sattler and S. Wiedemann, "Sparse binary compression: towards distributed deep learning with minimal communication," in 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, Budapest, Hungary, 2019.
[11] S. U. Stich, "Local SGD converges fast and communicates little," in 7th International Conference on Learning Representations, pp. 1–17, New Orleans, USA, 2019.
[12] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, "Don't use large mini-batches, use local SGD," in 8th International Conference on Learning Representations, pp. 1–13, Addis Ababa, Ethiopia, 2020.
[13] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
[14] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: strategies for improving communication efficiency," 2016, https://arxiv.org/abs/1610.05492.
[15] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: communication-efficient SGD via gradient quantization and encoding," Advances in Neural Information Processing Systems, vol. 30, pp. 1709–1720, 2017.
[16] W. Wen, C. Xu, F. Yan et al., "TernGrad: ternary gradients to reduce communication in distributed deep learning," in Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1508–1518, Long Beach, CA, USA, 2017.
[17] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients," 2016, https://arxiv.org/abs/1606.06160.
[18] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, "The convergence of sparsified gradient methods," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 5977–5987, Montreal, Canada, 2018.
[19] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: reducing the communication bandwidth for distributed training," in 6th International Conference on Learning Representations, pp. 1–14, Vancouver, BC, Canada, 2018.
[20] Y. Tsuzuku, H. Imachi, and T. Akiba, "Variance-based gradient compression for efficient distributed deep learning," in 6th International Conference on Learning Representations, pp. 1–12, Vancouver, BC, Canada, 2018.
[21] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, "AdaComp: adaptive residual gradient compression for data-parallel distributed training," in Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018.
[22] M. Mohammadi Amiri and D. Gündüz, "Machine learning at the wireless edge: distributed stochastic gradient descent over-the-air," IEEE Transactions on Signal Processing, vol. 68, pp. 2155–2169, 2020.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, 2016.
[24] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, no. 1, pp. 165–202, 2012.
[25] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, pp. 1–14, San Diego, CA, USA, 2015.
[26] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical Report, University of Toronto, 2009.
[27] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
Copyright
Copyright © 2021 Qingliang Meng et al. Exclusive Licensee Beijing Institute of Technology Press. Distributed under a Creative Commons Attribution License (CC BY 4.0).