To exploit the distributed nature of sensors, distributed machine learning has become the mainstream approach; however, the heterogeneous computing capabilities of sensors and network delays strongly affect the accuracy and convergence rate of the machine learning model. This paper presents a parameter communication optimization strategy that balances training overhead against communication overhead. We extend the fault tolerance of iterative-convergent machine learning algorithms and propose Dynamic Finite Fault Tolerance (DFFT). Based on DFFT, we implement a parameter communication optimization strategy for distributed machine learning, named the Dynamic Synchronous Parallel Strategy (DSP), which uses a performance monitoring model to dynamically adjust the parameter synchronization strategy between worker nodes and the Parameter Server (PS). DSP makes full use of the computing power of each sensor, preserves the accuracy of the machine learning model, and prevents model training from being disturbed by tasks unrelated to the sensors.
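To make the idea of dynamically adjusted synchronization concrete, the following is a minimal sketch, not the paper's implementation: it assumes a hypothetical `DynamicSyncServer` class in which the parameter server tracks each worker's smoothed iteration time (a simple stand-in for the performance monitoring model) and widens or narrows the per-worker staleness bound accordingly, so that faster workers are not blocked by slower ones.

```python
# Illustrative sketch only. Class and method names (DynamicSyncServer,
# report_time, max_staleness, push, can_proceed) are hypothetical and
# not taken from the paper.


class DynamicSyncServer:
    def __init__(self, num_workers, base_staleness=2):
        self.base_staleness = base_staleness
        self.clock = {w: 0 for w in range(num_workers)}        # per-worker iteration count
        self.iter_time = {w: 1.0 for w in range(num_workers)}  # smoothed iteration time (s)
        self.params = {}                                       # global model parameters

    def report_time(self, worker_id, seconds):
        """Update the smoothed iteration time for a worker (exponential moving average)."""
        self.iter_time[worker_id] = 0.8 * self.iter_time[worker_id] + 0.2 * seconds

    def max_staleness(self, worker_id):
        """Give relatively fast workers a larger staleness bound, derived from the
        ratio between the slowest worker's iteration time and this worker's."""
        slowest = max(self.iter_time.values())
        ratio = slowest / self.iter_time[worker_id]
        return max(1, round(self.base_staleness * ratio))

    def push(self, worker_id, grads, lr=0.01):
        """Apply a worker's gradient update and advance its logical clock."""
        for key, grad in grads.items():
            self.params[key] = self.params.get(key, 0.0) - lr * grad
        self.clock[worker_id] += 1

    def can_proceed(self, worker_id):
        """A worker may start its next iteration only if it is not further ahead of
        the slowest worker than its dynamically chosen staleness bound allows."""
        slowest_clock = min(self.clock.values())
        return self.clock[worker_id] - slowest_clock <= self.max_staleness(worker_id)
```

In this sketch, fully synchronous training corresponds to a staleness bound of zero and fully asynchronous training to an unbounded one; adjusting the bound per worker from monitored iteration times is one simple way to trade communication overhead against model accuracy, in the spirit of the DSP strategy described above.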