Analyzing astronomical data with Apache Spark [IMA]

http://arxiv.org/abs/1804.07501


We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have so far proved successful in industry, but its main use is often limited to naively structured data. We show how to manage more complex binary data structures, such as those handled in astrophysics experiments, within a distributed environment. To this end, we first designed and implemented spark-fits, a Spark connector that handles sets of arbitrarily large FITS files. The user interface is such that a simple file "drag-and-drop" onto a cluster is enough to take full advantage of the framework. We demonstrate the very high scalability of spark-fits using the LSST fast simulation tool, CoLoRe, and present methodologies for measuring and tuning performance bottlenecks for these workloads, scaling up to terabytes of FITS data on the Cloud@VirtualData, located at Université Paris Sud. We also evaluate its performance on Cori, a High-Performance Computing system located at NERSC and widely used in the scientific community.
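
To give a feel for the kind of binary structure the abstract refers to, here is a minimal sketch, in plain Python, of how a FITS file is laid out: 2880-byte blocks, with a header made of 80-character cards terminated by an END card. It is not taken from spark-fits and does not reflect how the connector is implemented. The helper names (`make_card`, `read_header`, `data_size`) are hypothetical, the card parsing is deliberately simplified, and the size computation only covers a primary image HDU.

```python
import io

BLOCK = 2880  # FITS files are organized in 2880-byte blocks
CARD = 80     # each header record ("card") is 80 ASCII characters


def make_card(key, value=None):
    """Build one 80-character header card (simplified layout, for illustration)."""
    text = key if value is None else f"{key:<8}= {str(value):>20}"
    return text.ljust(CARD).encode("ascii")


def read_header(stream):
    """Read header blocks until the END card.

    Returns (cards, header_bytes), where cards maps keyword -> raw string value.
    Simplification: anything after a '/' is treated as a comment.
    """
    cards = {}
    header_bytes = 0
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        header_bytes += BLOCK
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            key = card[:8].strip()
            if key == "END":
                return cards, header_bytes
            if card[8:10] == "= ":
                cards[key] = card[10:].split("/")[0].strip()


def data_size(cards):
    """Size in bytes of the data unit of a primary image HDU, padded to whole blocks."""
    naxis = int(cards["NAXIS"])
    if naxis == 0:
        return 0
    nbytes = abs(int(cards["BITPIX"])) // 8
    for axis in range(1, naxis + 1):
        nbytes *= int(cards[f"NAXIS{axis}"])
    return -(-nbytes // BLOCK) * BLOCK  # round up to a multiple of BLOCK


# Build a tiny in-memory header: a 100 x 50 image of 16-bit integers.
raw = b"".join([
    make_card("SIMPLE", "T"),
    make_card("BITPIX", 16),
    make_card("NAXIS", 2),
    make_card("NAXIS1", 100),
    make_card("NAXIS2", 50),
    make_card("END"),
]).ljust(BLOCK, b" ")

cards, header_bytes = read_header(io.BytesIO(raw))
payload_bytes = data_size(cards)
print(cards["BITPIX"], header_bytes, payload_bytes)
```

Because the header alone gives the byte offset and length of the data that follows, a reader can in principle work out where each chunk of a large file starts without scanning the data itself, which is the property that makes this format amenable to being split across workers.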


J. Peloton, C. Arnault and S. Plaszczynski
Mon, 23 Apr 18

Comments: 9 pages, 6 figures. Package available at this https URL