Modern High Performance Computing (HPC) applications, such as Earth science simulations, produce large amounts of data as computing power surges, while big data applications have become more compute-intensive because of increasingly sophisticated analysis algorithms. Advanced applications therefore need both HPC and big data technologies, which creates a demand for integrated system support. In this study, we introduce Scientific Data Processing (SciDP), which supports both HPC and big data applications through integrated scientific data processing. SciDP can directly process scientific data stored on a Parallel File System (PFS), which is typically deployed in an HPC environment, from within a big data programming environment running atop the Hadoop Distributed File System (HDFS). SciDP seamlessly integrates PFS, HDFS, and the widely used R data analysis system to support highly efficient processing of scientific data. It exploits the merits of both PFS and HDFS for fast data transfer, overlaps computation with data access, and integrates R into the data transfer process. Experimental results show that SciDP accelerates the analysis and visualization of a production NASA Center for Climate Simulation (NCCS) climate and weather application by 6x to 8x compared with existing solutions.
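The idea of overlapping computation with data access can be illustrated with a minimal, generic sketch. This is not SciDP's implementation: the names `overlapped_map` and `fake_fetch` are hypothetical, the "fetch" merely stands in for reading a chunk from a PFS, and a single background thread is assumed to be enough to prefetch the next chunk while the current one is processed.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(chunk_id):
    """Hypothetical stand-in for reading one data chunk from a parallel file system."""
    return [chunk_id, chunk_id + 1, chunk_id + 2]


def overlapped_map(chunk_ids, fetch, compute):
    """Apply compute to each fetched chunk while the next fetch runs in the background."""
    results = []
    ids = iter(chunk_ids)
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            pending = pool.submit(fetch, next(ids))  # start the first fetch
        except StopIteration:
            return results  # no chunks to process
        for next_id in ids:
            chunk = pending.result()               # wait for the current chunk
            pending = pool.submit(fetch, next_id)  # start fetching the next chunk
            results.append(compute(chunk))         # compute while that fetch runs
        results.append(compute(pending.result()))  # process the final chunk
    return results


if __name__ == "__main__":
    print(overlapped_map([0, 10, 20], fake_fetch, sum))  # prints [3, 33, 63]
```

With real I/O, the time spent in `compute` hides part of the latency of the next `fetch`, which is the general effect the abstract refers to; the actual gain depends on the ratio of transfer time to compute time.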