Abstract. To cope with the challenges of memory bottlenecks and algorithmic scalability when massive data sets are involved, we propose a distributed least squares procedure in the framework of the functional linear model and reproducing kernel Hilbert spaces. This approach divides the big data set into multiple subsets, applies regularized least squares regression on each of them, and then averages the individual outputs as a final prediction. We establish non-asymptotic prediction error bounds for the proposed learning strategy under some regularity conditions. When the target function has only weak regularity, we also introduce unlabelled data to construct a semi-supervised approach that enlarges the number of partitioned subsets. Results in the present paper provide a theoretical guarantee that the distributed algorithm can achieve the optimal rate of convergence while allowing the whole data set to be partitioned into a large number of subsets for parallel processing.
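The following is a minimal sketch of the divide-and-conquer idea described above, not the paper's exact estimator: curves are discretized on a grid, regularized least squares with an RKHS penalty on the slope function is solved on each subset in its dual (kernel ridge regression) form, and the local predictions are averaged. The Gaussian kernel, sample sizes, number of subsets, and regularization parameter `lam` below are illustrative assumptions.

```python
# Divide-and-conquer regularized least squares for the functional linear model
#     y_i = \int_0^1 beta(t) x_i(t) dt + eps_i,
# with an RKHS penalty on beta.  All concrete choices below are assumptions
# made for illustration only.
import numpy as np

def induced_gram(Xa, Xb, K, w):
    """Gram matrix G[i, j] ~ \iint x_i(s) k(s, t) x_j(t) ds dt via quadrature.

    Xa, Xb : (n_a, p), (n_b, p) discretized curves on a common grid.
    K      : (p, p) matrix k(t_u, t_v) of the reproducing kernel on [0, 1].
    w      : (p,) quadrature weights for the grid.
    """
    return (Xa * w) @ K @ (Xb * w).T

def fit_local(X, y, K, w, lam):
    """Regularized least squares on one subset; returns dual coefficients alpha."""
    n = X.shape[0]
    G = induced_gram(X, X, K, w)
    return np.linalg.solve(G + n * lam * np.eye(n), y)

def distributed_predict(subsets, X_new, K, w, lam):
    """Average the local predictions over the m subsets (divide and conquer)."""
    preds = []
    for X_j, y_j in subsets:
        alpha_j = fit_local(X_j, y_j, K, w, lam)
        preds.append(induced_gram(X_new, X_j, K, w) @ alpha_j)
    return np.mean(preds, axis=0)

# ---- illustrative simulation -------------------------------------------------
rng = np.random.default_rng(0)
p, n, m = 50, 2000, 10                       # grid size, total sample size, number of subsets
t = np.linspace(0.0, 1.0, p)
w = np.full(p, 1.0 / p)                      # simple quadrature weights
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1)  # Gaussian kernel (an assumption)

beta = np.sin(2 * np.pi * t)                 # true slope function
X = rng.standard_normal((n, p)).cumsum(axis=1) / np.sqrt(p)   # rough random curves
y = (X * w) @ beta + 0.1 * rng.standard_normal(n)

idx = np.array_split(rng.permutation(n), m)  # partition the data into m subsets
subsets = [(X[i], y[i]) for i in idx]

X_new = rng.standard_normal((200, p)).cumsum(axis=1) / np.sqrt(p)
y_new = (X_new * w) @ beta
y_hat = distributed_predict(subsets, X_new, K, w, lam=1e-3)
print("prediction MSE:", np.mean((y_hat - y_new) ** 2))
```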