Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku [CL]

http://arxiv.org/abs/2304.11002


The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX, in high-performance computing provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established standards in the industry. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs as exposed by the HPX runtime system enables superb portability in terms of code and performance, based on the real-world Octo-Tiger astrophysics application. We report our experience with porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook’s Ookami and Riken’s Supercomputer Fugaku and compare the resulting performance with that achieved on well-established GPU-oriented HPC machines such as ORNL’s Summit, NERSC’s Perlmutter and CSCS’s Piz Daint systems. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes, thanks to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM’s SVE to Octo-Tiger was trivial thanks to using standard C++.

Read this paper on arXiv…

P. Diehl, G. Daiß, K. Huck, et. al.
Mon, 24 Apr 23
11/41

Comments: N/A

Reproducing the results for NICER observation of PSR J0030+0451 [HEAP]

http://arxiv.org/abs/2304.01035


NASA’s Neutron Star Interior Composition Explorer (NICER) observed X-ray emission from the pulsar PSR J0030+0451 in 2018. Riley et al. reported Bayesian parameter measurements of the mass and the radius of the star using pulse-profile modeling of the X-ray data. In this paper, we reproduce their result using the open-source software X-PSI and the publicly available data. We reproduce the main result within the expected statistical error and note the challenges we faced in doing so. We demonstrate not only that the analysis can be reproduced, but also that it can be reused in future work by changing the prior distribution for the radius and by changing the sampler configuration. We find no significant change in the measurement of the mass and radius, demonstrating that the original result is robust to these changes. Finally, we provide a containerized working environment that facilitates third-party reproduction of the measurements of the mass and radius of PSR J0030+0451 using the NICER observations.
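
As a toy illustration of the robustness check described above (not X-PSI itself, and with made-up numbers), the sketch below re-evaluates a one-dimensional radius posterior under two different priors and compares the resulting estimates:

```python
import numpy as np

radius = np.linspace(8.0, 18.0, 4001)              # km, parameter grid
dr = radius[1] - radius[0]

# Stand-in likelihood: pretend the pulse-profile fit prefers ~13 km +/- 1 km.
log_like = -0.5 * ((radius - 13.0) / 1.0) ** 2

def posterior(log_prior):
    log_post = log_like + log_prior
    post = np.exp(log_post - np.max(log_post))
    return post / (post.sum() * dr)

# Prior 1: uniform on [10, 16] km; Prior 2: uniform on [8, 18] km.
prior_narrow = np.where((radius >= 10.0) & (radius <= 16.0), 0.0, -np.inf)
prior_wide = np.zeros_like(radius)

for name, log_prior in [("narrow prior", prior_narrow), ("wide prior", prior_wide)]:
    p = posterior(log_prior)
    mean = np.sum(radius * p) * dr
    std = np.sqrt(np.sum((radius - mean) ** 2 * p) * dr)
    print(f"{name}: R = {mean:.2f} +/- {std:.2f} km")
```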

Read this paper on arXiv…

C. Afle, P. Miles, S. Caino-Lores, et. al.
Tue, 4 Apr 23
49/111

Comments: 12 pages, 4 figures, 2 tables

Asymmetric distribution of data products from WALLABY, an SKA precursor neutral hydrogen survey [IMA]

http://arxiv.org/abs/2303.11670


The Widefield ASKAP L-band Legacy All-sky Blind surveY (WALLABY) is a neutral hydrogen (HI) survey running on the Australian SKA Pathfinder (ASKAP), a precursor telescope for the Square Kilometre Array (SKA). The goal of WALLABY is to use ASKAP’s powerful wide-field phased array feed technology to observe three quarters of the entire sky at the 21 cm neutral hydrogen line with an angular resolution of 30 arcseconds. Post-processing activities at the Australian SKA Regional Centre (AusSRC), Canadian Initiative for Radio Astronomy Data Analysis (CIRADA) and Spanish SKA Regional Centre prototype (SPSRC) will then produce publicly available advanced data products in the form of source catalogues, kinematic models and image cutouts, respectively. These advanced data products will be generated locally at each site and distributed across the network. Over the course of the full survey we expect to replicate up to 10 MB of data per source detection, which could imply an ingestion of tens of GB to be consolidated at the other locations in near real time. Here, we explore the use of an asymmetric database replication model and strategy, using PostgreSQL as the engine and Bucardo as the asynchronous replication service, to enable robust multi-source pool operations with data products from WALLABY. This work serves to evaluate this type of data distribution solution across globally distributed sites. Furthermore, a set of benchmarks has been developed to confirm that the deployed model is sufficient for future scalability and remote collaboration needs.

Read this paper on arXiv…

M. Parra-Royon, A. Shen, T. Reynolds, et. al.
Wed, 22 Mar 23
45/68

Comments: N/A

High Performance W-stacking for Imaging Radio Astronomy Data: a Parallel and Accelerated Solution [IMA]

http://arxiv.org/abs/2301.06061


Current and upcoming radio-interferometers are expected to produce volumes of data of increasing size that need to be processed in order to generate the corresponding sky brightness distributions through imaging. This represents an outstanding computational challenge, especially when large fields of view and/or high resolution observations are processed. We have investigated the adoption of modern High Performance Computing systems, specifically addressing the gridding, FFT-transform and w-correction steps of imaging, combining parallel and accelerated solutions. We have demonstrated that the code we have developed can support datasets and images of any size compatible with the available hardware, efficiently scaling up to thousands of cores or hundreds of GPUs, keeping the time to solution below one hour even when images of the order of billions or tens of billions of pixels are generated. In addition, portability has been targeted as a primary objective, both in terms of usability on different computing platforms and in terms of performance. The presented results have been obtained on two different state-of-the-art High Performance Computing architectures.
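
The core of the w-stacking scheme can be sketched in a few lines of NumPy: visibilities are partitioned into w-planes, each plane is gridded and FFT-ed, and the w-dependent phase correction is applied in the image plane before accumulation. This toy version (nearest-neighbour gridding, simplified sign conventions, no weighting) is only meant to make the algorithm concrete; it is not the parallel, accelerated code described in the paper.

```python
import numpy as np

npix, fov = 256, 0.02                        # image size (pixels), field of view (radians)
l = (np.arange(npix) - npix // 2) * (fov / npix)
L, M = np.meshgrid(l, l, indexing="ij")      # direction cosines across the image
n_minus_1 = np.sqrt(np.maximum(1.0 - L**2 - M**2, 0.0)) - 1.0

# Fake visibilities: random (u, v, w) in wavelengths, unit amplitudes.
rng = np.random.default_rng(1)
nvis = 20000
u, v, w = rng.uniform(-3000.0, 3000.0, (3, nvis))
vis = np.ones(nvis, dtype=complex)

nplanes = 8
edges = np.linspace(w.min(), w.max(), nplanes + 1)
plane = np.clip(np.digitize(w, edges) - 1, 0, nplanes - 1)
image = np.zeros((npix, npix))

for p in range(nplanes):
    sel = plane == p
    if not np.any(sel):
        continue
    # Nearest-neighbour gridding onto the uv plane (real imagers use convolutional gridding).
    grid = np.zeros((npix, npix), dtype=complex)
    iu = np.round(u[sel] * fov).astype(int) + npix // 2
    iv = np.round(v[sel] * fov).astype(int) + npix // 2
    np.add.at(grid, (iu, iv), vis[sel])
    # FFT to the image plane, then undo this plane's w-term (sign convention simplified).
    w_mid = 0.5 * (edges[p] + edges[p + 1])
    dirty_p = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(grid)))
    image += np.real(dirty_p * np.exp(2j * np.pi * w_mid * n_minus_1))

print("peak of the toy dirty image:", image.max())
```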

Read this paper on arXiv…

C. Gheller, G. Taffoni and D. Goz
Wed, 18 Jan 23
10/133

Comments: 16 pages, 12 figures, accepted for publication on RAS Techniques and Instruments

GPU-based high-precision orbital propagation of large sets of initial conditions through Picard-Chebyshev augmentation [CL]

http://arxiv.org/abs/2301.03989


The orbital propagation of large sets of initial conditions under high accuracy requirements is currently a bottleneck in the development of space missions, e.g. for planetary protection compliance analyses. The proposed approach can include any force source in the dynamical model through efficient Picard-Chebyshev (PC) numerical simulations. A two-level augmentation of the integration scheme is proposed to run an arbitrary number of simulations within the same algorithm call, fully exploiting high performance and GPU (Graphics Processing Units) computing facilities. The performance obtained with implementations in the C and NVIDIA CUDA programming languages is shown on a test case taken from the optimization of a Solar Orbiter-like first resonant phase with Venus.
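
To make the numerical scheme concrete, here is a minimal single-trajectory, CPU-only sketch of a Picard-Chebyshev iteration in NumPy, applied to a toy harmonic oscillator rather than an orbital force model; the paper's contribution is the two-level batching of many such propagations on GPUs, which is not reproduced here.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def f(t, s):
    """State derivative for y'' = -y, with state s = [y, y']."""
    return np.column_stack([s[:, 1], -s[:, 0]])

t0, t1 = 0.0, 2.0
N = 32
tau = np.cos(np.pi * np.arange(N + 1) / N)   # Chebyshev-Lobatto nodes on [-1, 1]
t = 0.5 * (t1 - t0) * (tau + 1.0) + t0       # mapped to [t0, t1]
s0 = np.array([1.0, 0.0])                    # initial condition y(0)=1, y'(0)=0

s = np.tile(s0, (N + 1, 1))                  # initial guess: constant trajectory
for _ in range(60):
    g = f(t, s) * 0.5 * (t1 - t0)            # ds/dtau on the mapped interval
    s_new = np.empty_like(s)
    for d in range(s.shape[1]):
        c = C.chebfit(tau, g[:, d], N)       # Chebyshev representation of the derivative
        ci = C.chebint(c)                    # integrate in coefficient space (Picard update)
        s_new[:, d] = s0[d] + C.chebval(tau, ci) - C.chebval(-1.0, ci)
    if np.max(np.abs(s_new - s)) < 1e-12:
        break
    s = s_new

# tau[0] = +1 corresponds to t = t1; compare with the exact solution cos(t1).
print("PC solution at t1:", s[0, 0], " exact:", np.cos(t1))
```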

Read this paper on arXiv…

A. Masat, C. Colombo and A. Boutonnet
Wed, 11 Jan 23
34/80

Comments: N/A

The Gaia AVU-GSR parallel solver: preliminary studies of a LSQR-based application in perspective of exascale systems [IMA]

http://arxiv.org/abs/2212.11675


The Gaia Astrometric Verification Unit-Global Sphere Reconstruction (AVU-GSR) Parallel Solver aims to find the astrometric parameters for $\sim$10$^8$ stars in the Milky Way, the attitude and the instrumental specifications of the Gaia satellite, and the global parameter $\gamma$ of the post Newtonian formalism. The code iteratively solves a system of linear equations, $\mathbf{A} \times \vec{x} = \vec{b}$, where the coefficient matrix $\mathbf{A}$ is large ($\sim$$10^{11} \times 10^8$ elements) and sparse. To solve this system of equations, the code exploits a hybrid implementation of the iterative PC-LSQR algorithm, where the computation related to different horizontal portions of the coefficient matrix is assigned to separate MPI processes. In the original code, each matrix portion is further parallelized over OpenMP threads. To further improve the code performance, we ported the application to the GPU, replacing the OpenMP parallelization language with OpenACC. In this port, $\sim$95% of the data is copied from the host to the device at the beginning of the entire cycle of iterations, making the code compute bound rather than data-transfer bound. The OpenACC code presents a speedup of $\sim$1.5 over the OpenMP version, but further optimizations are in progress to obtain higher gains. The code runs on multiple GPUs and was tested on the CINECA supercomputer Marconi100, in anticipation of a port to the pre-exascale system Leonardo, which will be installed at CINECA in 2022.
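
The core operation is an iterative least-squares solve of A x = b. A small-scale stand-in using SciPy's LSQR on a sparse toy system (nothing like the real ~10^11 x 10^8 Gaia system, and without the MPI/OpenACC parallelisation) looks like this:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n_obs, n_par = 20000, 500                  # toy sizes only
A = sp.random(n_obs, n_par, density=0.01, random_state=0, format="csr")
x_true = rng.standard_normal(n_par)
b = A @ x_true + 1e-3 * rng.standard_normal(n_obs)

# Solve the sparse least-squares problem iteratively, as the production solver does at scale.
x_est, istop, itn = lsqr(A, b, atol=1e-10, btol=1e-10)[:3]
print("LSQR stopped with code", istop, "after", itn, "iterations")
print("relative error:", np.linalg.norm(x_est - x_true) / np.linalg.norm(x_true))
```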

Read this paper on arXiv…

V. Cesare, U. Becciani, A. Vecchiato, et. al.
Fri, 23 Dec 22
57/58

Comments: 18 pages, 8 figures, 3 pseudocodes, published in Astronomy and Computing, Volume 41, October 2022, 100660, accepted for publication on 4th October 2022

Calculation of the High-Energy Neutron Flux for Anticipating Errors and Recovery Techniques in Exascale Supercomputer Centres [CL]

http://arxiv.org/abs/2212.07770


The age of exascale computing has arrived, and the risks associated with neutron and other atmospheric radiation become more critical as computing power increases; the expected Mean Time Between Failures will be reduced because of this radiation. In this work, a new and detailed calculation of the neutron flux for energies above 50 MeV is presented. This has been done by using state-of-the-art Monte Carlo astroparticle techniques and including real atmospheric profiles at each of the next 23 exascale supercomputing facilities. The atmospheric impact on the flux and its seasonal variations were observed and characterised, and the barometric coefficient for high-energy neutrons at each site was obtained. With these coefficients, potential risks of errors associated with an increase in the flux of energetic neutrons, such as the occurrence of single event upsets or transients, and the corresponding failure-in-time rates, can be anticipated just by using the atmospheric pressure before the assignation of resources to critical tasks at each exascale facility. For clarity, examples of how the rate of failures is affected by cosmic rays are included, so administrators can better anticipate which more or less restrictive actions to take to overcome errors.
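
The intended use of the barometric coefficients can be illustrated with a short worked example: scale the high-energy neutron flux, and hence an estimated soft-error rate, with the measured surface pressure. The coefficient and failure-in-time rate below are placeholders, not the site-specific values derived in the paper.

```python
import math

beta = 0.0070        # hypothetical barometric coefficient, per hPa
p_ref = 1013.25      # reference pressure, hPa
fit_ref = 500.0      # hypothetical failure-in-time rate at p_ref (failures / 1e9 device-hours)

def neutron_flux_factor(pressure_hpa):
    """Relative flux N/N_ref = exp(-beta * (P - P_ref)): lower pressure, higher flux."""
    return math.exp(-beta * (pressure_hpa - p_ref))

for p in (990.0, 1013.25, 1030.0):
    factor = neutron_flux_factor(p)
    print(f"P = {p:7.2f} hPa -> flux x{factor:.3f}, anticipated FIT ~ {fit_ref * factor:.0f}")
```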

Read this paper on arXiv…

H. Asorey and R. Mayo-García
Fri, 16 Dec 22
3/72

Comments: 23 pages, 6 figures, 2 tables

Adding Workflow Management Flexibility to LSST Pipelines Execution [IMA]

http://arxiv.org/abs/2211.15795


Data processing pipelines need to be executed at scales ranging from small runs up through large production data release runs resulting in millions of data products. As part of the Rubin Observatory’s pipeline execution system, BPS is the abstraction layer that provides an interface to different Workflow Management Systems (WMS) such as HTCondor and PanDA. During the submission process, the pipeline execution system interacts with the Data Butler to produce a science-oriented execution graph from algorithmic tasks. BPS converts this execution graph to a workflow graph and then uses a WMS-specific plugin to submit and manage the workflow. Here we will discuss the architectural design of this interface and report briefly on the recent production of the Data Preview 0.2 release and how the system is used by pipeline developers.
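
A schematic sketch of the plugin pattern described above (illustrative class and method names, not the actual BPS code): a generic layer converts a science-oriented execution graph into a workflow graph and hands it to a WMS-specific plugin.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class WorkflowGraph:
    jobs: list = field(default_factory=list)     # (name, command)
    edges: list = field(default_factory=list)    # (upstream, downstream)

class WmsPlugin(ABC):
    """Interface each workflow-management-system plugin implements."""
    @abstractmethod
    def submit(self, workflow: WorkflowGraph) -> str: ...

class HTCondorLikePlugin(WmsPlugin):
    def submit(self, workflow: WorkflowGraph) -> str:
        # A real plugin would write a DAG description and call the batch system.
        return f"condor-dag with {len(workflow.jobs)} jobs"

class PandaLikePlugin(WmsPlugin):
    def submit(self, workflow: WorkflowGraph) -> str:
        return f"panda-task with {len(workflow.jobs)} jobs"

def to_workflow(science_graph: dict) -> WorkflowGraph:
    """Convert a science-oriented execution graph {task: [dependencies]} to a workflow graph."""
    wf = WorkflowGraph()
    for task, deps in science_graph.items():
        wf.jobs.append((task, f"run-task {task}"))
        wf.edges.extend((d, task) for d in deps)
    return wf

science_graph = {"isr": [], "characterize": ["isr"], "calibrate": ["characterize"]}
for plugin in (HTCondorLikePlugin(), PandaLikePlugin()):
    print(plugin.submit(to_workflow(science_graph)))
```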

Read this paper on arXiv…

M. Gower, M. Kowalik, N. Lust, et. al.
Wed, 30 Nov 22
74/81

Comments: 4 pages, submitted to Astronomical Data Analysis Software and Systems XXXII, October 2022

Cutting the cost of pulsar astronomy: Saving time and energy when searching for binary pulsars using NVIDIA GPUs [IMA]

http://arxiv.org/abs/2211.13517


Using the Fourier Domain Acceleration Search (FDAS) method to search for binary pulsars is a computationally costly process. Next-generation radio telescopes will have to perform FDAS in real time, as data volumes are too large to store. FDAS is a matched filtering approach for searching time-domain radio astronomy datasets for the signatures of binary pulsars with approximately linear acceleration. In this paper we explore how we have reduced the energy cost of an SKA-like implementation of FDAS in AstroAccelerate, utilising a combination of mixed-precision computing and dynamic frequency scaling on NVIDIA GPUs. Combining the two approaches, we have managed to save 58% of the overall energy cost of FDAS with a small (<3%) sacrifice in numerical sensitivity.
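
The precision trade-off can be illustrated with a toy Fourier-domain matched filter run in double and in single precision; this is not AstroAccelerate's FDAS (no acceleration templates, no GPU, no frequency scaling), just a minimal check that the detection statistic survives the reduced precision.

```python
import numpy as np

def matched_filter(data, template):
    """Correlate data with template via FFTs and return a normalised detection statistic."""
    d = np.fft.rfft(data)
    t = np.fft.rfft(template, n=len(data))
    corr = np.fft.irfft(d * np.conj(t), n=len(data))
    return corr / np.sqrt(np.sum(template.astype(np.float64) ** 2))

rng = np.random.default_rng(3)
n = 1 << 18
signal = np.zeros(n)
signal[5000:5064] = 3.0 * np.sin(2 * np.pi * 0.05 * np.arange(64))   # weak pulsed signal
data = signal + rng.standard_normal(n)
template = signal[5000:5064].copy()

for dtype in (np.float64, np.float32):
    stat = matched_filter(data.astype(dtype), template.astype(dtype))
    peak = int(np.argmax(stat))
    print(f"{np.dtype(dtype).name}: peak at sample {peak}, statistic {stat[peak]:.3f}")
```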

Read this paper on arXiv…

J. White, K. Adamek and W. Armour
Mon, 28 Nov 22
72/93

Comments: N/A

Reproducibility of the First Image of a Black Hole in the Galaxy M87 from the Event Horizon Telescope (EHT) Collaboration [IMA]

http://arxiv.org/abs/2205.10267


This paper presents an interdisciplinary effort aiming to develop and share sustainable knowledge necessary to analyze, understand, and use published scientific results to advance reproducibility in multi-messenger astrophysics. Specifically, we target the breakthrough work associated with the generation of the first image of a black hole, at the centre of the galaxy M87. The image was computed by the Event Horizon Telescope (EHT) Collaboration. Based on the artifacts made available by EHT, we deliver documentation, code, and a computational environment to reproduce the first image of a black hole. Our deliverables support new discovery in multi-messenger astrophysics by providing all the necessary tools for generalizing methods and findings from the EHT use case. Challenges encountered while reproducing the EHT results are reported. The result of our effort is an open-source, containerized software package that enables the public to reproduce the first image of a black hole in the galaxy M87.

Read this paper on arXiv…

R. Patel, B. Roachell, S. Caino-Lores, et. al.
Mon, 23 May 22
47/50

Comments: N/A

Solar-Cycle Variation of quiet-Sun Magnetism and Surface Gravity Oscillation Mode [SSA]

http://arxiv.org/abs/2205.04419


The origin of quiet-Sun magnetism is under debate. Investigating its solar cycle variation observationally in more detail can give us clues about how to resolve the controversies. We investigate the solar cycle variation of the most magnetically quiet regions and their surface gravity oscillation ($f$-) mode integrated energy ($E_f$). We use 12 years of HMI data and apply stringent selection criteria, based on spatial and temporal quietness, to avoid any influence of active regions (ARs). We develop an automated high-throughput pipeline to go through all available magnetogram data and to compute $E_f$ for the selected quiet regions. We observe a clear solar cycle dependence of the magnetic field strength in the most quiet regions containing several supergranular cells. For patch sizes smaller than a supergranular cell, no significant cycle dependence is detected. The $E_f$ at the supergranular scale is not constant over time. During the late ascending phase of Cycle 24 (SC24, 2011-2012), it is roughly constant, but starts diminishing in 2013, as the maximum of SC24 is approached. This trend continues until mid-2017, when hints of strengthening at higher southern latitudes are seen. Slow strengthening continues, stronger at higher latitudes than at the equatorial regions, but $E_f$ never returns back to the values seen in 2011-2012. Also, the strengthening trend continues past the solar minimum, to the years when SC25 is already clearly ascending. Hence the $E_f$ behavior is not in phase with the solar cycle. The anticorrelation of $E_f$ with the solar cycle in gross terms is expected, but the phase shift of several years indicates a connection to the poloidal large-scale magnetic field component rather than the toroidal one. Calibrating AR signals with the QS $E_f$ does not reveal significant enhancement of the $f$-mode prior to AR emergence.

Read this paper on arXiv…

M. Korpi-Lagg, A. Korpi-Lagg, N. Olspert, et. al.
Tue, 10 May 22
51/70

Comments: 10 pages, 11 figures, submitted to Astronomy & Astrophysics

A Novel Cloud-Based Framework for Standardised Simulations in the Latin American Giant Observatory (LAGO) [IMA]

http://arxiv.org/abs/2204.02716


LAGO, the Latin American Giant Observatory, is an extended cosmic ray observatory consisting of a wide network of water Cherenkov detectors located in 10 countries. With different altitudes and geomagnetic rigidity cutoffs, their geographic distribution, combined with the new electronics for control, atmospheric sensing and data acquisition, allows the realisation of diverse astrophysics studies at a regional scale. It is an observatory designed, built and operated by the LAGO Collaboration, a non-centralised alliance of 30 institutions from 11 countries.
While LAGO has access to different computational frameworks, it lacks standardised computational mechanisms to fully exploit its cooperative approach. The European Commission is fostering initiatives aligned with LAGO's objectives, especially to enable Open Science and its long-term sustainability. This work introduces the adaptation of LAGO to this paradigm within the EOSC-Synergy project, focusing on the simulations of the expected astrophysical signatures at detectors deployed at the LAGO sites around the world.

Read this paper on arXiv…

A. Rubio-Montero, R. Pagán-Muñoz, R. Mayo-García, et. al.
Thu, 7 Apr 22
34/45

Comments: 10 pages, 3 figures, Invited Talk at the Winter Simulation Conference WSC2021, Phoenix, AZ, USA

Parthenon — a performance portable block-structured adaptive mesh refinement framework [CL]

http://arxiv.org/abs/2202.12309


On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lags behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model, and provides various levels of abstraction, from multi-dimensional variables, to packages defining and separating components, to launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale, the miniapp reaches a total of 3.5×10^12 zone-cycles/s on 4096 Summit nodes (24,576 GPUs) at ~55% weak-scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.

Read this paper on arXiv…

P. Grete, J. Dolence, J. Miller, et. al.
Mon, 28 Feb 22
29/38

Comments: 17 pages, 9 figures, submitted to IJHPCA

Astronomical data organization, management and access in Scientific Data Lakes [IMA]

http://arxiv.org/abs/2202.01828


The data volumes stored in telescope archives are constantly increasing due to the development and improvement of the instrumentation. Often the archives need to be stored over a distributed storage architecture provided by independent compute centres. Such a distributed data archive requires overarching data management orchestration. This orchestration comprises tools that handle data storage and cataloguing, and that steer transfers integrating different storage systems and protocols, while being aware of data policies and locality. In addition, it needs a common Authorisation and Authentication Infrastructure (AAI) layer which is perceived as a single entity by end users and provides transparent data access.
The scientific domain of particle physics also uses complex and distributed data management systems. The experiments at the Large Hadron Collider (LHC) accelerator at CERN generate several hundred petabytes of data per year. This data is globally distributed to partner sites and users using national compute facilities. Several innovative tools were developed to successfully address the distributed computing challenges in the context of the Worldwide LHC Computing Grid (WLCG).
The work being carried out in the ESCAPE project and in the Data Infrastructure for Open Science (DIOS) work package is to prototype a Scientific Data Lake using the tools developed in the context of the WLCG, bringing together different scientific disciplines while addressing FAIR standards and Open Data. We present how the Scientific Data Lake prototype is applied to address astronomical data use cases. We introduce the software stack and also discuss some of the differences between the domains.
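
A conceptual sketch of the orchestration idea (not the ESCAPE/Rucio software; names and fields are illustrative): a catalogue of replicas spread over sites, plus a locality- and policy-aware choice of which replica a client actually reads.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    site: str
    protocol: str        # e.g. "root", "https", "s3"
    url: str
    qos: str             # e.g. "disk", "tape"

CATALOGUE = {
    "survey/field42/image.fits": [
        Replica("site-A", "https", "https://site-a.example.org/field42/image.fits", "disk"),
        Replica("site-B", "root", "root://site-b.example.org//field42/image.fits", "disk"),
        Replica("archive", "https", "https://archive.example.org/field42/image.fits", "tape"),
    ],
}

def pick_replica(lfn, client_site, allowed_protocols=("https", "root")):
    """Prefer a disk replica at the client's site, then any disk replica, then tape."""
    replicas = [r for r in CATALOGUE[lfn] if r.protocol in allowed_protocols]
    replicas.sort(key=lambda r: (r.qos != "disk", r.site != client_site))
    return replicas[0]

print(pick_replica("survey/field42/image.fits", client_site="site-B").url)
```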

Read this paper on arXiv…

Y. Grange, V. Pandey, X. Espinal, et. al.
Mon, 7 Feb 22
10/46

Comments: 4 pages, 1 figure, to appear in the proceedings of Astronomical Data Analysis Software and Systems XXXI published by ASP

Inference-optimized AI and high performance computing for gravitational wave detection at scale [CL]

http://arxiv.org/abs/2201.11133


We introduce an ensemble of artificial intelligence models for gravitational wave detection that we trained on the Summit supercomputer using 32 nodes, equivalent to 192 NVIDIA V100 GPUs, within 2 hours. Once fully trained, we optimized these models for accelerated inference using NVIDIA TensorRT. We deployed our inference-optimized AI ensemble on the ThetaGPU supercomputer at the Argonne Leadership Computing Facility to conduct distributed inference. Using the entire ThetaGPU supercomputer, consisting of 20 nodes, each of which has 8 NVIDIA A100 Tensor Core GPUs and 2 AMD Rome CPUs, our NVIDIA TensorRT-optimized AI ensemble processed an entire month of advanced LIGO data (including Hanford and Livingston data streams) within 50 seconds. Our inference-optimized AI ensemble retains the same sensitivity as traditional AI models, namely, it identifies all known binary black hole mergers previously identified in this advanced LIGO dataset and reports no misclassifications, while also providing a 3X inference speedup compared to traditional artificial intelligence models. We used time slides to quantify the performance of our AI ensemble on up to 5 years' worth of advanced LIGO data. In this synthetically enhanced dataset, our AI ensemble reports an average of one misclassification for every month of searched advanced LIGO data. We also present the receiver operating characteristic curve of our AI ensemble using this 5-year-long advanced LIGO dataset. This approach provides the required tools to conduct accelerated, AI-driven gravitational wave detection at scale.
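
A back-of-the-envelope reading of the quoted numbers, assuming the common 4096 Hz advanced LIGO strain sampling rate (an assumption, not stated in the abstract):

```python
seconds_per_month = 30 * 24 * 3600
detectors = 2                        # Hanford + Livingston
sample_rate = 4096                   # Hz (assumed strain sampling rate)
wall_time = 50.0                     # seconds to process the month
gpus = 20 * 8                        # ThetaGPU: 20 nodes x 8 A100 GPUs

samples = seconds_per_month * detectors * sample_rate
data_seconds_per_second = seconds_per_month * detectors / wall_time
print(f"strain samples processed: {samples:.2e}")
print(f"data seconds per wall-clock second: {data_seconds_per_second:,.0f}")
print(f"per GPU: {data_seconds_per_second / gpus:,.0f} data seconds/s")
```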

Read this paper on arXiv…

P. Chaturvedi, A. Khan, M. Tian, et. al.
Fri, 28 Jan 22
3/64

Comments: 19 pages, 8 figure

A distributed computing infrastructure for LOFAR Italian community [IMA]

http://arxiv.org/abs/2201.11526


The LOw-Frequency ARray (LOFAR) is a low-frequency radio interferometer composed of observational stations spread across Europe, and it is the largest precursor of the SKA in terms of effective area and generated data rates. In 2018, the Italian community officially joined the LOFAR project and deployed a distributed computing and storage infrastructure dedicated to LOFAR data analysis. The infrastructure is based on 4 nodes distributed across different Italian locations and offers services for pipeline execution, storage of final and intermediate results, and support for the use of the software and infrastructure. As the analysis of LOFAR data requires a very complex computational procedure, a container-based approach has been adopted to distribute software environments to the different computing resources. A science platform approach is used to facilitate interactive access to computational resources. In this paper, we describe the architecture and main features of the infrastructure.

Read this paper on arXiv…

G. Taffoni, U. Becciani, A. Bonafede, et. al.
Fri, 28 Jan 22
38/64

Comments: In Astronomical Data Analysis Software and Systems (ADASS) XXXI

The EOSC-Synergy cloud services implementation for the Latin American Giant Observatory (LAGO) [IMA]

http://arxiv.org/abs/2111.11190


The Latin American Giant Observatory (LAGO) is a distributed cosmic ray observatory at a regional scale in Latin America, built by deploying a large network of Water Cherenkov detectors (WCD) and other astroparticle detectors over a wide range of latitudes, from Antarctica to México, and altitudes, from sea level to more than 5500 m a.s.l. Detector telemetry, atmospheric conditions and the flux of secondary particles at ground level are measured with extreme detail at each LAGO site by using our own-designed hardware and firmware (ACQUA).
To combine and analyse all these data, LAGO developed ANNA, our data analysis framework. Additionally, ARTI is a complete simulation framework designed to compute the expected signals at our detectors from primary cosmic rays entering the Earth's atmosphere, allowing a precise characterization of the sites under realistic atmospheric, geomagnetic and detector conditions.
As the measured and synthetic data start to flow, we face challenging scenarios given the large amount of data emerging from a diversity of detectors, computing architectures and e-infrastructures. These data need to be transferred, analyzed, catalogued, preserved, and provided for internal and public access and data mining under an open e-science environment. In this work, we present the implementation of ARTI in the EOSC-Synergy cloud-based services as the first example of LAGO's frameworks that will follow the FAIR principles for provenance, data curation and re-use of data.
For this, we calculate the flux of secondary particles expected over up to 1 week at detector level for all 26 LAGO sites, and the 1-year flux of high-energy secondaries expected at the ANDES Underground Laboratory and other sites. We thus show how this development can help not only LAGO but also other data-intensive cosmic ray observatories, muography experiments and underground laboratories.

Read this paper on arXiv…

J. Rubio-Montero, R. Pagán-Muñoz, R. Mayo-García, et. al.
Tue, 23 Nov 21
32/84

Comments: N/A

Spatially constrained direction dependent calibration [IMA]

http://arxiv.org/abs/2110.06780


Direction dependent calibration of widefield radio interferometers estimates the systematic errors along multiple directions in the sky. This is necessary because most systematic errors, caused by effects such as the ionosphere or the receiver beam shape, show significant spatial variation. Fortunately, in most situations these variations exhibit some deterministic behavior. We enforce this underlying smooth spatial behavior of the systematic errors as an additional constraint on spectrally constrained direction dependent calibration. Using both analysis and simulations, we show that this additional spatial constraint improves the performance of multi-frequency direction dependent calibration.
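
The effect of such a constraint can be mimicked in a few lines: noisy per-direction gain estimates are shrunk toward a smooth low-order polynomial screen fitted over the sky. This toy shrinkage only illustrates the idea of the spatial constraint, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(7)
ndir = 40
l, m = rng.uniform(-1, 1, (2, ndir))                      # direction coordinates
true_gain = 1.0 + 0.3 * l - 0.2 * m + 0.1 * l * m          # smooth systematic error
raw_est = true_gain + 0.15 * rng.standard_normal(ndir)     # unconstrained, noisy solutions

# Fit a smooth screen: gain ~ c0 + c1*l + c2*m + c3*l*m (least squares).
basis = np.column_stack([np.ones(ndir), l, m, l * m])
coeffs, *_ = np.linalg.lstsq(basis, raw_est, rcond=None)
screen = basis @ coeffs

alpha = 0.7                                                # weight of the spatial constraint
constrained = (1 - alpha) * raw_est + alpha * screen

for name, est in [("unconstrained", raw_est), ("spatially constrained", constrained)]:
    print(f"{name}: rms error = {np.sqrt(np.mean((est - true_gain) ** 2)):.4f}")
```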

Read this paper on arXiv…

S. Yatawatta
Thu, 14 Oct 21
27/62

Comments: N/A

Extreme Scale Survey Simulation with Python Workflows [IMA]

http://arxiv.org/abs/2109.12060


The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will soon carry out an unprecedented wide, fast, and deep survey of the sky in multiple optical bands. The data from LSST will open up a new discovery space in astronomy and cosmology, simultaneously providing clues toward addressing burning issues of the day, such as the origin of dark energy and the nature of dark matter, while at the same time yielding data that will, in turn, pose fresh new questions. To prepare for the imminent arrival of this remarkable data set, it is crucial that the associated scientific communities be able to develop the software needed to analyze it. Computational power now available allows us to generate synthetic data sets that can be used as a realistic training ground for such an effort. This effort raises its own challenges — the need to generate very large simulations of the night sky, scaling up simulation campaigns to large numbers of compute nodes across multiple computing centers with different architectures, and optimizing the complex workload around memory requirements and widely varying wall clock times. We describe here a large-scale workflow that melds together Python code to steer the workflow, Parsl to manage the large-scale distributed execution of workflow components, and containers to carry out the image simulation campaign across multiple sites. Taking advantage of these tools, we developed an extreme-scale computational framework and used it to simulate five years of observations for 300 square degrees of sky area. We describe our experiences and lessons learned in developing this workflow capability, and highlight how the scalability and portability of our approach enabled us to efficiently execute it on up to 4000 compute nodes on two supercomputers.
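
A minimal sketch of the pattern described above, assuming Parsl's python_app decorator and its local-threads example configuration (a real campaign would load an HPC executor configuration and call the actual image simulator instead of the placeholder task):

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app
def simulate_sensor(visit_id, sensor_id):
    # Placeholder for the real image-simulation step (e.g. an imSim invocation).
    import random, time
    time.sleep(random.uniform(0.1, 0.3))     # mimic widely varying run times
    return f"visit {visit_id} sensor {sensor_id} done"

# Python steers the campaign; Parsl turns each call into an asynchronous task.
futures = [simulate_sensor(v, s) for v in range(3) for s in range(4)]
for fut in futures:
    print(fut.result())                      # block until each task completes
```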

Read this paper on arXiv…

A. Villarreal, Y. Babuji, T. Uram, et. al.
Mon, 27 Sep 21
9/68

Comments: Proceeding for eScience 2021, 9 pages, 5 figures

Optimizing the hybrid parallelization of BHAC [CL]

http://arxiv.org/abs/2108.12240


We present our experience with the modernization of the GR-MHD code BHAC, aimed at improving its novel hybrid (MPI+OpenMP) parallelization scheme. In doing so, we showcase the use of performance profiling tools usable on x86 (Intel-based) architectures. Our performance characterization and threading analysis provided guidance in improving the concurrency, and thus the efficiency, of the OpenMP parallel regions. We assess scaling and communication patterns in order to identify and alleviate MPI bottlenecks, with both runtime switches and precise code interventions. The performance of the optimized version of BHAC improved by $\sim28\%$, making it viable for scaling on several hundreds of supercomputer nodes. We finally test whether porting such optimizations to different hardware is likewise beneficial by running on ARM A64FX vector nodes.

Read this paper on arXiv…

S. Cielo, O. Porth, L. Iapichino, et. al.
Mon, 30 Aug 21
27/38

Comments: 10 pages, 9 figures, 1 table; in review

Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit [CL]

http://arxiv.org/abs/2107.10987


Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system HPX, the C++ standard library for parallelism and concurrency, and utilizes CUDA for its gravity solver. Recently, we have remodeled Octo-Tiger’s hydro solver to use a three-dimensional reconstruction scheme. In addition, we have ported the hydro solver to GPU using CUDA kernels. We present scaling results for the new hydro kernels on ORNL’s Summit machine using a Sedov-Taylor blast wave problem. We also compare Octo-Tiger’s new hydro scheme with its old hydro scheme, using a rotating star as a test problem.

Read this paper on arXiv…

P. Diehl, G. Daiß, D. Marcello, et. al.
Mon, 26 Jul 21
7/62

Comments: Accepted to IEEE Cluster

Square Kilometre Array : Processing Voluminous MeerKAT Data on IRIS [IMA]

http://arxiv.org/abs/2105.14613


Processing astronomical data often comes with huge challenges with regards to data management as well as data processing. The MeerKAT telescope is one of the precursor telescopes of the world’s largest observatory, the Square Kilometre Array. So far, MeerKAT data have been processed using the South African computing facility IDIA and exploited to make ground-breaking discoveries. However, processing MeerKAT data on the UK’s IRIS computing facility requires a new implementation of the MeerKAT pipeline. This paper focuses on how to transfer MeerKAT data from the South African site to the UK’s IRIS systems for processing. We discuss our RapifXfer data transfer framework for transferring the MeerKAT data from South Africa to the UK, and the MeerKAT job processing framework pertaining to the UK’s IRIS resources.

Read this paper on arXiv…

P. Thavasimani and A. Scaife
Tue, 1 Jun 21
64/72

Comments: 10 pages, 10 figures

Checkpoint, Restore, and Live Migration for Science Platforms [IMA]

http://arxiv.org/abs/2101.05782


We demonstrate a fully functional implementation of (per-user) checkpoint, restore, and live migration capabilities for JupyterHub platforms. Checkpointing — the ability to freeze and suspend to disk the running state (contents of memory, registers, open files, etc.) of a set of processes — enables the system to snapshot a user’s Jupyter session to permanent storage. The restore functionality brings a checkpointed session back to a running state, to continue where it left off at a later time and potentially on a different machine. Finally, live migration enables moving running Jupyter notebook servers between different machines, transparently to the analysis code and without disconnecting the user. Our implementation of these capabilities works at the system level, with few limitations, and typical checkpoint/restore times of O(10s) with a pathway to O(1s) live migrations. It opens a myriad of interesting use cases, especially for cloud-based deployments: from checkpointing idle sessions without interrupting the user’s work (achieving cost reductions of 4x or more), to execution on spot instances with transparent migration on eviction (with additional cost reductions of up to 3x), to automated migration of workloads to ideally suited instances (e.g. moving an analysis to a machine with more or less RAM or cores based on observed resource utilization). The capabilities we demonstrate can make science platforms fully elastic while retaining an excellent user experience.

Read this paper on arXiv…

M. Juric, S. Stetzler and C. Slater
Fri, 15 Jan 21
40/60

Comments: 4 pages, 2 figures, to appear in the Proceedings of ADASS XXX

Implementing CUDA Streams into AstroAccelerate — A Case Study [IMA]

http://arxiv.org/abs/2101.00941


To be able to run tasks asynchronously on NVIDIA GPUs a programmer must explicitly implement asynchronous execution in their code using the syntax of CUDA streams. Streams allow a programmer to launch independent concurrent execution tasks, providing the ability to utilise different functional units on the GPU asynchronously. For example, it is possible to transfer the results from a previous computation performed on input data n-1, over the PCIe bus whilst computing the result for input data n, by placing different tasks in different CUDA streams. The benefit of such an approach is that the time taken for the data transfer between the host and device can be hidden with computation. This case study deals with the implementation of CUDA streams into AstroAccelerate. AstroAccelerate is a GPU accelerated real-time signal processing pipeline for time-domain radio astronomy.

Read this paper on arXiv…

J. Novotný, K. Adámek and W. Armour
Tue, 5 Jan 21
2/82

Comments: submitted to ADASS XXX, 3 pages

TOPCAT Visualisation over the Web [CL]

http://arxiv.org/abs/2012.10560


The desktop GUI catalogue analysis tool TOPCAT, and its command-line counterpart STILTS, offer among other capabilities visual exploration of locally stored tables containing millions of rows or more. They offer many variations on the theme of scatter plots, density maps and histograms, which can be navigated interactively. These capabilities have now been extended to a client-server model, so that a plot server can be run close to the data storage, and remote lightweight HTML/JavaScript clients can configure and interact with plots based on that data. The interaction can include pan/zoom/rotate navigation, identifying individual points, and potentially subset selection. Since only the pixels and not the row data are transmitted to the client, this enables flexible remote visual exploration of large tables at relatively low bandwidth. The web client can request any of the plot options available from TOPCAT/STILTS. Possible applications include web-based visualisations of static datasets too large to transmit, visual previews of archive search results, service-configured arrays of plots for complex datasets, and embedding visualisations of local or remote tables into Jupyter notebooks.

Read this paper on arXiv…

M. Taylor
Tue, 22 Dec 20
87/89

Comments: 4 pages, 1 figure, to appear in proceedings of ADASS XXX; at submission time, some examples at this https URL

Confluence of Artificial Intelligence and High Performance Computing for Accelerated, Scalable and Reproducible Gravitational Wave Detection [CL]

http://arxiv.org/abs/2012.08545


Finding new ways to use artificial intelligence (AI) to accelerate the analysis of gravitational wave data, and ensuring the developed models are easily reusable promises to unlock new opportunities in multi-messenger astrophysics (MMA), and to enable wider use, rigorous validation, and sharing of developed models by the community. In this work, we demonstrate how connecting recently deployed DOE and NSF-sponsored cyberinfrastructure allows for new ways to publish models, and to subsequently deploy these models into applications using computing platforms ranging from laptops to high performance computing clusters. We develop a workflow that connects the Data and Learning Hub for Science (DLHub), a repository for publishing machine learning models, with the Hardware Accelerated Learning (HAL) deep learning computing cluster, using funcX as a universal distributed computing service. We then use this workflow to search for binary black hole gravitational wave signals in open source advanced LIGO data. We find that using this workflow, an ensemble of four openly available deep learning models can be run on HAL and process the entire month of August 2017 of advanced LIGO data in just seven minutes, identifying all four binary black hole mergers previously identified in this dataset, and reporting no misclassifications. This approach, which combines advances in AI, distributed computing, and scientific data infrastructure opens new pathways to conduct reproducible, accelerated, data-driven gravitational wave detection.

Read this paper on arXiv…

E. Huerta, A. Khan, X. Huang, et. al.
Thu, 17 Dec 20
60/85

Comments: 17 pages, 5 figures

Building Halo Merger Trees from the Q Continuum Simulation [CEA]

http://arxiv.org/abs/2008.08519


Cosmological N-body simulations rank among the most computationally intensive efforts today. A key challenge is the analysis of structure, substructure, and the merger history for many billions of compact particle clusters, called halos. Effectively representing the merging history of halos is essential for many galaxy formation models used to generate synthetic sky catalogs, an important application of modern cosmological simulations. Generating realistic mock catalogs requires computing the halo formation history from simulations with large volumes and billions of halos over many time steps, taking hundreds of terabytes of analysis data. We present fast parallel algorithms for producing halo merger trees and tracking halo substructure from a single-level, density-based clustering algorithm. Merger trees are created from analyzing the halo-particle membership function in adjacent snapshots, and substructure is identified by tracking the “cores” of merging halos — sets of particles near the halo center. Core tracking is performed after creating merger trees and uses the relationships found during tree construction to associate substructures with hosts. The algorithms are implemented with MPI and evaluated on a Cray XK7 supercomputer using up to 16,384 processes on data from HACC, a modern cosmological simulation framework. We present results for creating merger trees from 101 analysis snapshots taken from the Q Continuum, a large volume, high mass resolution, cosmological simulation evolving half a trillion particles.
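
The linking step can be sketched in plain Python: halos in adjacent snapshots are connected by the overlap of their particle memberships. The real code does this in parallel with MPI over distributed data; the toy catalogues below are illustrative.

```python
import numpy as np

# Halo catalogues as {halo_id: array of particle ids} for two adjacent snapshots.
snap_early = {1: np.arange(0, 1000), 2: np.arange(1000, 1600), 3: np.arange(1600, 1900)}
snap_late = {10: np.arange(0, 1550), 11: np.arange(1600, 1900)}   # halos 1 and 2 merged into 10

def merger_links(progenitors, descendants, min_shared=20):
    """Return (progenitor, descendant, n_shared) links from particle-ID overlap."""
    # Invert the descendant catalogue: particle id -> descendant halo id.
    owner = {}
    for hid, pids in descendants.items():
        for p in pids:
            owner[int(p)] = hid
    links = []
    for hid, pids in progenitors.items():
        counts = {}
        for p in pids:
            d = owner.get(int(p))
            if d is not None:
                counts[d] = counts.get(d, 0) + 1
        for d, n in counts.items():
            if n >= min_shared:
                links.append((hid, d, n))
    return links

for prog, desc, shared in merger_links(snap_early, snap_late):
    print(f"halo {prog} -> halo {desc} (shared particles: {shared})")
```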

Read this paper on arXiv…

E. Rangel, N. Frontiere, S. Habib, et. al.
Thu, 20 Aug 20
-1108/48

Comments: 2017 IEEE 24th International Conference on High Performance Computing

Understanding GPU-Based Lossy Compression for Extreme-Scale Cosmological Simulations [CL]

http://arxiv.org/abs/2004.00224


To help understand our universe better, researchers and scientists currently run extreme-scale cosmology simulations on leadership supercomputers. However, such simulations can generate large amounts of scientific data, which often result in expensive data movement and storage costs. Lossy compression techniques have become attractive because they significantly reduce data size and can maintain high data fidelity for post-analysis. In this paper, we propose to use GPU-based lossy compression for extreme-scale cosmological simulations. Our contributions are threefold: (1) we implement multiple GPU-based lossy compressors in our open-source compression benchmark and analysis framework named Foresight; (2) we use Foresight to comprehensively evaluate the practicality of using GPU-based lossy compression on two real-world extreme-scale cosmology simulations, namely HACC and Nyx, based on a series of assessment metrics; and (3) we develop a general optimization guideline on how to determine the best-fit configurations for different lossy compressors and cosmological simulations. Experiments show that GPU-based lossy compression can provide the necessary accuracy on post-analysis for cosmological simulations and high compression ratios of 5-15x on the tested datasets, as well as much higher compression and decompression throughput than CPU-based compressors.

Read this paper on arXiv…

S. Jin, P. Grosset, C. Biwer, et. al.
Thu, 2 Apr 20
34/56

Comments: 11 pages, 10 figures, accepted by IEEE IPDPS ’20

A Catalogue of Locus Algorithm Pointings for Optimal Differential Photometry for 23,779 Quasars [GA]

http://arxiv.org/abs/2003.04590


This paper presents a catalogue of optimised pointings for differential photometry of 23,779 quasars extracted from the Sloan Digital Sky Survey (SDSS) Catalogue and a score for each indicating the quality of the Field of View (FoV) associated with that pointing. Observation of millimagnitude variability on a timescale of minutes typically requires differential observations with reference to an ensemble of reference stars. For optimal performance, these reference stars should have similar colour and magnitude to the target quasar. In addition, the greatest quantity and quality of suitable reference stars may be found by using a telescope pointing which offsets the target object from the centre of the field of view. By comparing each quasar with the stars which appear close to it on the sky in the SDSS Catalogue, an optimum pointing can be calculated, and a figure of merit, referred to as the “score”, calculated for that pointing. Highly flexible software has been developed to enable this process to be automated and implemented in a distributed computing paradigm, which enables the creation of catalogues of pointings given a set of input targets. Applying this technique to a sample of 40,000 targets from the 4th SDSS quasar catalogue resulted in the production of pointings and scores for 23,779 quasars. This catalogue is a useful resource for observers planning differential photometry studies and surveys of quasars, allowing them to select those which have many suitable celestial neighbours for differential photometry.
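
A much-simplified sketch of the pointing-optimisation idea: scan candidate pointing offsets around a target and score each by how many catalogue stars of similar magnitude and colour fall inside the field of view. The ratings, thresholds and flat-sky geometry below are illustrative, not the paper's scoring function.

```python
import numpy as np

rng = np.random.default_rng(11)
fov = 0.25                                     # half-width of a square field of view, degrees
target = dict(ra=150.0, dec=2.0, mag=18.0, colour=0.3)

# Fake star catalogue around the target (RA, Dec in degrees; no cos(dec) correction).
nstars = 400
stars_ra = target["ra"] + rng.uniform(-1.0, 1.0, nstars)
stars_dec = target["dec"] + rng.uniform(-1.0, 1.0, nstars)
stars_mag = rng.uniform(14.0, 22.0, nstars)
stars_col = rng.uniform(-0.5, 1.5, nstars)

def rating(mag, col):
    """1 for a good reference star (similar magnitude and colour), else 0."""
    return ((np.abs(mag - target["mag"]) < 1.0) &
            (np.abs(col - target["colour"]) < 0.2)).astype(float)

def score(ra0, dec0):
    """Sum of ratings of stars inside a square FoV centred on (ra0, dec0) containing the target."""
    if abs(target["ra"] - ra0) > fov or abs(target["dec"] - dec0) > fov:
        return -np.inf
    inside = (np.abs(stars_ra - ra0) < fov) & (np.abs(stars_dec - dec0) < fov)
    return rating(stars_mag[inside], stars_col[inside]).sum()

# Scan a grid of candidate pointings offset from the target and keep the best.
offsets = np.linspace(-fov, fov, 21)
best = max(((score(target["ra"] + dx, target["dec"] + dy), dx, dy)
            for dx in offsets for dy in offsets), key=lambda t: t[0])
print(f"best score {best[0]:.0f} at offset ({best[1]:+.3f}, {best[2]:+.3f}) deg")
```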

Read this paper on arXiv…

O. Creaner, K. Nolan, D. Grennan, et. al.
Wed, 11 Mar 20
9/65

Comments: 7 pages, 5 figures

The Locus Algorithm II: A robust software system to maximise the quality of fields of view for Differential Photometry [IMA]

http://arxiv.org/abs/2003.04574


We present the software system developed to implement the Locus Algorithm, a novel algorithm designed to maximise the performance of differential photometry systems by optimising the number and quality of reference stars in the Field of View with the target. Firstly, we state the design requirements, constraints and ambitions for the software system required to implement this algorithm. Then, a detailed software design is presented for the system in operation. Next, the data design including file structures used and the data environment required for the system are defined. Finally, we conclude by illustrating the scaling requirements which mandate a high-performance computing implementation of this system, which is discussed in the other papers in this series.

Read this paper on arXiv…

K. Nolan, E. Hickey and O. Creaner
Wed, 11 Mar 20
13/65

Comments: 11 Pages, 13 Figures

The Locus Algorithm III: A Grid Computing system to generate catalogues of optimised pointings for Differential Photometry [IMA]

http://arxiv.org/abs/2003.04565


This paper discusses the hardware and software components of the Grid Computing system used to implement the Locus Algorithm to identify optimum pointings for differential photometry of 61,662,376 stars and 23,799 quasars. The scale of the data, together with initial operational assessments, demanded a High Performance Computing (HPC) system to complete the data analysis. Grid computing was chosen as the optimum HPC solution available within this project. The physical and logical structure of the National Grid computing Infrastructure informed the approach that was taken. That approach was one of layered separation of the different project components to enable maximum flexibility and extensibility.

Read this paper on arXiv…

O. Creaner, K. Nolan, J. Walsh, et. al.
Wed, 11 Mar 20
24/65

Comments: 12 Pages, 9 Figures

CUBE — Towards an Optimal Scaling of Cosmological N-body Simulations [CL]

http://arxiv.org/abs/2003.03931


N-body simulations are essential tools in physical cosmology for understanding the large-scale structure (LSS) formation of the Universe. Large-scale simulations with high resolution are important for exploring the substructure of the universe and for determining fundamental physical parameters like the neutrino mass. However, traditional particle-mesh (PM) based algorithms use considerable amounts of memory, which limits the scalability of simulations. Therefore, we designed CUBE, a two-level PM algorithm aimed at optimal performance in memory consumption reduction. By using a fixed-point compression technique, CUBE reduces the memory consumption per N-body particle to 6 bytes, an order of magnitude lower than traditional PM-based algorithms. We scaled CUBE to 512 nodes (20,480 cores) on an Intel Cascade Lake based supercomputer with $\simeq$95% weak-scaling efficiency. This scaling test was performed in “Cosmo-$\pi$” — a cosmological LSS simulation using $\simeq$4.4 trillion particles, tracing the evolution of the universe over $\simeq$13.7 billion years. To the best of our knowledge, Cosmo-$\pi$ is the largest completed cosmological N-body simulation. We believe CUBE has great potential to scale on exascale supercomputers for larger simulations.
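
The memory trick can be demonstrated in isolation: store each particle's position as a 2-byte fixed-point offset within its coarse cell (3 x 2 bytes = 6 bytes per particle) instead of three 8-byte floats. The layout below is illustrative, not CUBE's actual format.

```python
import numpy as np

box = 100.0                      # box size (arbitrary units)
ncell = 64                       # coarse cells per dimension
cell = box / ncell

rng = np.random.default_rng(5)
pos = rng.uniform(0.0, box, (100000, 3))                     # float64 positions: 24 bytes/particle

cell_index = np.minimum((pos // cell).astype(np.int16), ncell - 1)
frac = pos / cell - cell_index                               # offset within the cell, in [0, 1)
code = np.minimum(frac * 65536.0, 65535.0).astype(np.uint16) # 2-byte fixed point per dimension

decoded = (cell_index + (code.astype(np.float64) + 0.5) / 65536.0) * cell
max_err = np.abs(decoded - pos).max()

print(f"compressed size : {code.nbytes / pos.shape[0]:.1f} bytes/particle (+ cell bookkeeping)")
print(f"original size   : {pos.nbytes / pos.shape[0]:.1f} bytes/particle")
print(f"max position err: {max_err:.2e} (cell size {cell:.3f})")
```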

Read this paper on arXiv…

S. Cheng, H. Yu, D. Inman, et. al.
Tue, 10 Mar 20
36/63

Comments: 6 pages, 5 figures. Accepted for SCALE 2020, co-located as part of the proceedings of CCGRID 2020

Stochastic Calibration of Radio Interferometers [IMA]

http://arxiv.org/abs/2003.00986


With ever-increasing data rates produced by modern radio telescopes like LOFAR, and future telescopes like the SKA, many data processing steps are overwhelmed by the amount of data that needs to be handled with limited compute resources. Calibration is one such operation: it dominates the overall data processing computational cost, yet it is essential for reaching many science goals. Calibration algorithms do exist that scale well with the number of stations of an array and the number of directions being calibrated. However, the remaining bottleneck is the raw data volume, which scales with the number of baselines and is therefore proportional to the square of the number of stations. We propose a ‘stochastic’ calibration strategy where we only read in a mini-batch of data for obtaining calibration solutions, as opposed to reading the full batch of data being calibrated. Nonetheless, we obtain solutions that are valid for the full batch of data. Normally, data need to be averaged before calibration is performed to accommodate the data in size-limited compute memory. Stochastic calibration overcomes the need for data averaging before any calibration can be performed, and offers many advantages, including: enabling the mitigation of faint radio frequency interference; better removal of strong celestial sources from the data; and better detection and spatial localization of fast radio transients.
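
A toy version of the idea: obtain gain solutions from only a random mini-batch of time samples rather than the full batch, and check that they are close to the full-batch solutions. The update below is a simple StefCal-style step for a unit point-source model, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
nant, ntime = 32, 200
g_true = 1.0 + 0.2 * (rng.standard_normal(nant) + 1j * rng.standard_normal(nant))

# Visibility "cube": ntime Hermitian matrices V_t[p, q] = g_p conj(g_q) + noise.
V = g_true[:, None] * np.conj(g_true[None, :])
V = np.repeat(V[None, :, :], ntime, axis=0)
noise = 0.3 * (rng.standard_normal(V.shape) + 1j * rng.standard_normal(V.shape))
V = V + (noise + np.conj(noise.transpose(0, 2, 1))) / 2      # keep each sample Hermitian

def calibrate(time_indices, niter=60):
    Vbar = V[time_indices].mean(axis=0)                      # average only the selected samples
    np.fill_diagonal(Vbar, 0.0)                              # ignore autocorrelations
    g = np.ones(nant, dtype=complex)
    for _ in range(niter):
        num = Vbar @ g
        den = np.sum(np.abs(g) ** 2) - np.abs(g) ** 2
        g = 0.5 * (g + num / den)                            # damped StefCal-style update
    return g * np.exp(1j * np.angle(np.vdot(g, g_true)))     # fix the unobservable global phase

full = calibrate(np.arange(ntime))
mini = calibrate(rng.choice(ntime, size=20, replace=False))  # 10% mini-batch
for name, g in [("full batch", full), ("mini-batch", mini)]:
    print(f"{name}: rms gain error = {np.linalg.norm(g - g_true) / np.sqrt(nant):.4f}")
```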

Read this paper on arXiv…

S. Yatawatta
Tue, 3 Mar 20
45/68

Comments: N/A

Honing and proofing Astrophysical codes on the road to Exascale. Experiences from code modernization on many-core systems [CL]

http://arxiv.org/abs/2002.08161


The complexity of modern and upcoming computing architectures poses severe challenges for code developers and application specialists, and forces them to expose the highest possible degree of parallelism in order to make the best use of the available hardware. The second-generation Intel Xeon Phi (code-named Knights Landing, henceforth KNL) is the latest many-core system, which implements several interesting hardware features, for example a large number of cores per node (up to 72), 512-bit-wide vector registers and high-bandwidth memory. The unique features of KNL make this platform a powerful testbed for modern HPC applications. The performance of codes on KNL is therefore a useful proxy for their readiness for future architectures. In this work we describe the lessons learnt during the optimisation of the widely used computational astrophysics codes P-Gadget-3, Flash and Echo. Moreover, we present results for the visualisation and analysis tools VisIt and yt. These examples show that modern architectures benefit from code optimisation at different levels, even more than traditional multi-core systems. However, the level of modernisation of typical community codes still needs to improve for them to fully utilise the resources of novel architectures.

Read this paper on arXiv…

S. Cielo, L. Iapichino, F. Baruffa, et. al.
Thu, 20 Feb 20
36/61

Comments: 16 pages, 10 figures, 4 tables. To be published in Future Generation of Computer Systems (FGCS), Special Issue on “On The Road to Exascale II: Advances in High Performance Computing and Simulations”

CHIPP: INAF pilot project for HTC, HPC and HPDA [IMA]

http://arxiv.org/abs/2002.01283


CHIPP (Computing HTC in INAF Pilot Project) is an Italian project funded by the Italian Institute for Astrophysics (INAF) and promoted by the ICT office of INAF. The main purpose of the CHIPP project is to coordinate the use of, and access to, already existing high-throughput computing, high-performance computing and data processing resources (for small/medium size programs) for the INAF community. Today, Tier2/Tier3 systems (1,200 CPU cores) are provided at the INAF institutes in Trieste and Catania, but in the future the project will evolve to include other computing infrastructures as well. During the last two years, more than 30 programs have been approved for a total request of 30 million CPU hours. Most of the programs involve HPC, data reduction and analysis, and machine learning. In this paper, we describe in detail the CHIPP infrastructures and the results of the first two years of activity.

Read this paper on arXiv…

G. Taffoni, U. Becciani, B. Garilli, et. al.
Wed, 5 Feb 20
47/67

Comments: 4 pages, conference, ADASS 2019

Real-Time RFI Mitigation for the Apertif Radio Transient System [IMA]

http://arxiv.org/abs/2001.03389


Current and upcoming radio telescopes are being designed with increasing sensitivity to detect new and mysterious radio sources of astrophysical origin. While this increased sensitivity improves the likelihood of discoveries, it also makes these instruments more susceptible to the deleterious effects of Radio Frequency Interference (RFI). The challenge posed by RFI is exacerbated by the high data-rates achieved by modern radio telescopes, which require real-time processing to keep up with the data. Furthermore, the high data-rates do not allow for permanent storage of observations at high resolution. Offline RFI mitigation is therefore not possible anymore. The real-time requirement makes RFI mitigation even more challenging because, on one side, the techniques used for mitigation need to be fast and simple, and on the other side they also need to be robust enough to cope with just a partial view of the data.
The Apertif Radio Transient System (ARTS) is the real-time, time-domain, transient detection instrument of the Westerbork Synthesis Radio Telescope (WSRT), processing 73 Gb of data per second. Even with a deep learning classifier, the ARTS pipeline requires state-of-the-art real-time RFI mitigation to reduce the number of false-positive detections. Our solution to this challenge is RFIm, a high-performance, open-source, tuned, and extensible RFI mitigation library. The goal of this library is to provide users with RFI mitigation routines that are designed to run in real-time on many-core accelerators, such as Graphics Processing Units, and that can be highly-tuned to achieve code and performance portability to different hardware platforms and scientific use-cases. Results on the ARTS show that we can achieve real-time RFI mitigation, with a minimal impact on the total execution time of the search pipeline, and considerably reduce the number of false-positives.

Read this paper on arXiv…

A. Sclocco, D. Vohl and R. Nieuwpoort
Mon, 13 Jan 20
7/61

Comments: 6 pages, 10 figures. To appear in Proceedings from the 2019 Radio Frequency Interference workshop (RFI 2019), Toulouse, France (23-26 September)

Two-level Dynamic Load Balancing for High Performance Scientific Applications [CL]

http://arxiv.org/abs/1911.06714


Scientific applications are often complex, irregular, and computationally-intensive. To accommodate the ever-increasing computational demands of scientific applications, high-performance computing (HPC) systems have become larger and more complex, offering parallelism at multiple levels (e.g., nodes, cores per node, threads per core). Scientific applications need to exploit all the available multilevel hardware parallelism to harness the available computational power. The performance of applications executing on such HPC systems may adversely be affected by load imbalance at multiple levels, caused by problem, algorithmic, and systemic characteristics. Nevertheless, most existing load balancing methods do not simultaneously address load imbalance at multiple levels. This work investigates the impact of load imbalance on the performance of three scientific applications at the thread and process levels. We jointly apply and evaluate selected dynamic loop self-scheduling (DLS) techniques to both levels. Specifically, we employ the extended LaPeSD OpenMP runtime library at the thread level and extend the DLS4LB MPI-based dynamic load balancing library at the process level. This approach is generic and applicable to any multiprocess-multithreaded computationally-intensive application (programmed using MPI and OpenMP). We conduct an exhaustive set of experiments to assess and compare six DLS techniques at the thread level and eleven at the process level. The results show that improved application performance, by up to 21%, can only be achieved by jointly addressing load imbalance at the two levels. We offer insights into the performance of the selected DLS techniques and discuss the interplay of load balancing at the thread level and process level.

Read this paper on arXiv…

A. Mohammed, A. Cavelan, F. Ciorba, et. al.
Wed, 20 Nov 19
72/73

Comments: N/A

Visualizing the world's largest turbulence simulation [CL]

http://arxiv.org/abs/1910.07850


In this exploratory submission we present the visualization of the largest interstellar turbulence simulations ever performed, unravelling key astrophysical processes concerning the formation of stars and the relative role of magnetic fields. The simulations, including pure hydrodynamical (HD) and magneto-hydrodynamical (MHD) runs, up to a size of $10048^3$ grid elements, were produced on the supercomputers of the Leibniz Supercomputing Centre and visualized using the hybrid parallel (MPI+TBB) ray-tracing engine OSPRay associated with VisIt. Besides revealing features of turbulence with an unprecedented resolution, the visualizations brilliantly showcase the stretching-and-folding mechanisms through which astrophysical processes such as supernova explosions drive turbulence and amplify the magnetic field in the interstellar gas, and how the first structures, the seeds of newborn stars, are shaped by this process.

Read this paper on arXiv…

S. Cielo, L. Iapichino, J. Günther, et. al.
Fri, 18 Oct 19
39/77

Comments: 6 pages, 5 figures, accompanying paper of SC19 visualization showcase finalist. The full video is publicly available under this https URL

Speeding simulation analysis up with yt and Intel Distribution for Python [IMA]

http://arxiv.org/abs/1910.07855


As modern scientific simulations grow ever more in size and complexity, even their analysis and post-processing becomes increasingly demanding, calling for the use of HPC resources and methods. yt is a parallel, open-source post-processing Python package for numerical simulations in astrophysics, made popular by its cross-format compatibility, its active community of developers and its integration with several other professional Python instruments. The Intel Distribution for Python enhances yt’s performance and parallel scalability through the optimization of the lower-level libraries NumPy and SciPy, which make use of the optimized Intel Math Kernel Library (Intel-MKL) and the Intel MPI library for distributed computing. The yt package is used for several analysis tasks, including integration of derived quantities, volumetric rendering, 2D phase plots, cosmological halo analysis and production of synthetic X-ray observations. In this paper, we provide a brief tutorial for the installation of yt and the Intel Distribution for Python, and the execution of each analysis task. Compared to the Anaconda Python distribution, the provided solution achieves net speedups of up to 4.6x on Intel Xeon Scalable processors (codename Skylake).
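
A minimal yt analysis sketch covering a derived quantity, a 2D phase plot, and a projection (the dataset path is a placeholder); run under the Intel Distribution for Python, the same script transparently uses the MKL-backed NumPy and SciPy underneath:

    import yt

    ds = yt.load("output_00080/info_00080.txt")      # placeholder path to a simulation snapshot
    ad = ds.all_data()

    # Derived quantity: total gas mass in the box.
    print(ad.quantities.total_quantity([("gas", "mass")]))

    # 2D phase plot and a projection of the density field.
    phase = yt.PhasePlot(ad, ("gas", "density"), ("gas", "temperature"), [("gas", "mass")])
    phase.save("phase.png")
    proj = yt.ProjectionPlot(ds, "z", ("gas", "density"))
    proj.save("projection.png")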

Read this paper on arXiv…

S. Cielo, L. Iapichino and F. Baruffa
Fri, 18 Oct 19
42/77

Comments: 3 pages, 1 figure, published on Intel Parallel Universe Magazine

Data Aggregation In The Astroparticle Physics Distributed Data Storage [IMA]

http://arxiv.org/abs/1908.01554


The German-Russian Astroparticle Data Life Cycle Initiative is an international project whose aim is to develop a distributed data storage system that aggregates data from the storage systems of different astroparticle experiments. A prototype of such a system, called the Astroparticle Physics Distributed Data Storage (APPDS), is under development. In this paper, the Data Aggregation Service, one of the core services of APPDS, is presented. The Data Aggregation Service connects all the distributed services of APPDS together to find the necessary data and deliver them to users on demand.

Read this paper on arXiv…

M. Nguyen, A. Kryukov, J. Dubenskaya, et. al.
Tue, 6 Aug 19
38/76

Comments: 6 pages, 2 figures, Proceedings of the 3rd International Workshop on Data Life Cycle in Physics (Irkutsk, Russia, April 2-7, 2019)

Deep Learning for Energy Estimation and Particle Identification in Gamma-ray Astronomy [IMA]

http://arxiv.org/abs/1907.10480


Deep learning techniques, namely convolutional neural networks (CNN), have previously been adapted to select gamma-ray events in the TAIGA experiment, achieving good selection quality compared with the conventional Hillas approach. Another important task of the TAIGA data analysis was also solved with a CNN: gamma-ray energy estimation showed some improvement over the conventional method based on the Hillas analysis. Furthermore, our software was completely redeveloped for the graphics processing unit (GPU), which led to significantly faster calculations in both of these tasks. All the results have been obtained with simulated data from the TAIGA Monte Carlo software; their experimental confirmation is envisaged for the near future.

Read this paper on arXiv…

E. Postnikov, A. Kryukov, S. Polyakov, et. al.
Thu, 25 Jul 19
64/72

Comments: 10 pages, 6 figures. arXiv admin note: text overlap with arXiv:1812.01551

Distributed data storage for modern astroparticle physics experiments [CL]

http://arxiv.org/abs/1907.06863


The German-Russian Astroparticle Data Life Cycle Initiative is an international project launched in 2018. The Initiative aims to develop technologies that provide a unified approach to data management, as well as to demonstrate their applicability on the example of two large astrophysical experiments – KASCADE and TAIGA. One of the key points of the project is the development of a distributed storage which, on the one hand, will allow data of several experiments to be combined into a single repository with a unified interface, and on the other hand, will provide data to all participants of the experimental groups for multi-messenger analysis. Our approach to storage design is based on the single write-multiple read (SWMR) model for accessing raw or centrally processed data for further analysis. The main feature of the distributed storage is the ability to extract data either as a collection of files or as aggregated events from different sources. In the latter case, the storage provides users with a special service that aggregates data from different storages into a single sample. Thanks to this feature, multi-messenger methods used for more sophisticated data exploration can be applied. Users can access the storage through both a web interface and an Application Programming Interface (API). In this paper we describe the architecture of a distributed data storage for astroparticle physics and discuss the current status of our work.

Read this paper on arXiv…

A. Kryukov, M. Nguyen, I. Bychkov, et. al.
Wed, 17 Jul 19
48/75

Comments: N/A

Metadata Extraction from Raw Astroparticle Data of TAIGA Experiment [IMA]

http://arxiv.org/abs/1907.06183


Today, the operating TAIGA (Tunka Advanced Instrument for cosmic rays and Gamma Astronomy) experiment continuously produces and accumulates a large volume of raw astroparticle data. To be available to the scientific community, these data should be well described and formally characterized. The use of metadata makes it possible to search for and aggregate digital objects (e.g. events and runs) by time and equipment through a unified access interface. An important part of the metadata is hidden and scattered across folder and file names and package headers. Such metadata should be extracted from binary files, transformed into a unified form of digital objects, and loaded into the catalog. To address this challenge we developed the concept of a metadata extractor that can be extended by facility-specific extraction modules. It is designed to automatically collect descriptive metadata from raw data files in all TAIGA formats.

Read this paper on arXiv…

I. Bychkov, J. Dubenskaya, E. Korosteleva, et. al.
Tue, 16 Jul 19
36/89

Comments: 9 pages, 3 figures, 3rd International Workshop on Data Life Cycle in Physics

Development of a data infrastructure for a global data and analysis center in astroparticle physics [IMA]

http://arxiv.org/abs/1907.02335


Nowadays astroparticle physics faces a rapid increase in data volume. Meanwhile, there are still challenges in testing theoretical models for clarifying the origin of cosmic rays by applying a multi-messenger approach, machine learning, and investigation of phenomena related to the rare statistics of detected incoming particles. The problems are related to accurate data mapping and data management as well as to distributed storage and high-performance data processing. In particular, one could be interested in employing such solutions in the study of air showers induced by ultra-high-energy cosmic and gamma rays, testing new hypotheses of hadronic interaction, or cross-calibration of different experiments. KASCADE (Karlsruhe, Germany) and TAIGA (Tunka valley, Russia) are experiments in the field of astroparticle physics, aiming at the detection of cosmic-ray air showers induced by primaries in the energy range from about hundreds of TeV to hundreds of PeV. They are located at the same latitude and have an overlap in operation runs. These factors motivate a joint analysis of their data. In the German-Russian Astroparticle Data Life Cycle Initiative (GRADLCI), modern technologies of distributed data management are being employed to establish reliable open access to the experimental cosmic-ray physics data collected by KASCADE and the Tunka-133 setup of TAIGA.

Read this paper on arXiv…

V. Tokareva, A. Haungs, D. Kang, et. al.
Fri, 5 Jul 19
34/52

Comments: 8 pages, 2 figures, The III International Workshop “Data life cycle in physics” (DLC-2019)

AXS: A framework for fast astronomical data processing based on Apache Spark [IMA]

http://arxiv.org/abs/1905.09034


We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark, AXS aims to enable querying and analyzing almost arbitrarily large astronomical catalogs using familiar Python/AstroPy concepts, DataFrame APIs, and SQL statements. We achieve this by i) adding support to Spark for efficient on-line positional cross-matching and ii) supplying a Python library supporting commonly used operations for astronomical data analysis. To support scalable cross-matching, we developed a variant of the ZONES algorithm \citep{there-goes_gray_2004} capable of operating in a distributed, shared-nothing architecture. We couple this to a data partitioning scheme that enables fast catalog cross-matching and handles the data skew often present in deep all-sky data sets. The cross-match and other often-used functionalities are exposed to the end users through an easy-to-use Python API. We demonstrate AXS’ technical and scientific performance on SDSS, ZTF, Gaia DR2, and AllWise catalogs. Using AXS we were able to perform an on-the-fly cross-match of the Gaia DR2 (1.8 billion rows) and AllWise (900 million rows) data sets in ~30 seconds. We discuss how cloud-ready distributed systems like AXS provide a natural way to enable comprehensive end-user analyses of large datasets such as LSST.
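
The zone idea behind the cross-match can be sketched in plain PySpark (this is not the AXS API; file paths and column names are assumptions, and neighbouring-zone and RA wrap-around handling are omitted for brevity):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("zones-crossmatch").getOrCreate()
    zone_h = 1.0 / 60.0          # zone height in degrees
    radius = 2.0 / 3600.0        # match radius: 2 arcsec, in degrees

    # Both catalogs are assumed to have columns id, ra, dec (in degrees).
    def with_zone(df):
        return df.withColumn("zone", F.floor((F.col("dec") + 90.0) / zone_h))

    a = with_zone(spark.read.parquet("catalog_a.parquet"))
    b = with_zone(spark.read.parquet("catalog_b.parquet"))

    # Candidate pairs share a zone and lie within a small RA window; the exact
    # angular distance is checked last (clamping of the acos argument omitted).
    pairs = (
        a.alias("a").join(
            b.alias("b"),
            (F.col("a.zone") == F.col("b.zone"))
            & (F.abs(F.col("a.ra") - F.col("b.ra")) < radius / F.cos(F.radians(F.col("a.dec")))),
        )
        .withColumn(
            "dist_deg",
            F.degrees(F.acos(
                F.sin(F.radians("a.dec")) * F.sin(F.radians("b.dec"))
                + F.cos(F.radians("a.dec")) * F.cos(F.radians("b.dec"))
                  * F.cos(F.radians(F.col("a.ra") - F.col("b.ra")))
            )),
        )
        .filter(F.col("dist_deg") < radius)
    )
    pairs.select("a.id", "b.id", "dist_deg").show()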

Read this paper on arXiv…

P. Zečević, C. Slater, M. Jurić, et. al.
Thu, 23 May 19
67/67

Comments: N/A

K-Athena: a performance portable structured grid finite volume magnetohydrodynamics code [CL]

http://arxiv.org/abs/1905.04341


Large scale simulations are a key pillar of modern research and require ever increasing computational resources. Different novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. We combine Athena++, an existing magnetohydrodynamics (MHD) CPU code, with Kokkos, a performance portable on-node parallel programming paradigm, into K-Athena to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-Athena achieves $>10^8$ cell-updates/s on a single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit (compared to 172,032 CPU cores), reaching $1.94\times10^{12}$ total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we demonstrate that the overall performance is currently limited by DRAM bandwidth and calculate a performance portability metric of 83.1%. Finally, we present the strategies used for implementation and the challenges encountered maximizing performance. This will provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-Athena is available at https://gitlab.com/pgrete/kathena .

Read this paper on arXiv…

P. Grete, F. Glines and B. O’Shea
Tue, 14 May 19
38/91

Comments: 12 pages, 6 figures, 1 table; submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS)

Applicability study of the PRIMAD model to LIGO gravitational wave search workflows [CL]

http://arxiv.org/abs/1904.05211


The PRIMAD model with its six components (i.e., Platform, Research Objective, Implementation, Methods, Actors, and Data), provides an abstract taxonomy to represent computational experiments and enforce reproducibility by design. In this paper, we assess the model applicability to a set of Laser Interferometer Gravitational-Wave Observatory (LIGO) workflows from literature sources (i.e., published papers). Our work outlines potentials and limits of the model in terms of its abstraction levels and application process.

Read this paper on arXiv…

D. Chapp, D. Rorabaugh, D. Brown, et. al.
Thu, 11 Apr 19
27/54

Comments: N/A

The Physics of Eccentric Binary Black Hole Mergers. A Numerical Relativity Perspective [CL]

http://arxiv.org/abs/1901.07038


Gravitational wave observations of eccentric binary black hole mergers will provide unequivocal evidence for the formation of these systems through dynamical assembly in dense stellar environments. The study of these astrophysically motivated sources is timely in view of electromagnetic observations, consistent with the existence of stellar mass black holes in the globular cluster M22 and in the Galactic center, and the proven detection capabilities of ground-based gravitational wave detectors. In order to get insights into the physics of these objects in the dynamical, strong-field gravity regime, we present a catalog of 89 numerical relativity waveforms that describe binary systems of non-spinning black holes with mass-ratios $1\leq q \leq 10$, and initial eccentricities as high as $e_0=0.18$ fifteen cycles before merger. We use this catalog to provide landmark results regarding the loss of energy through gravitational radiation, both for quadrupole and higher-order waveform multipoles, and the astrophysical properties, final mass and spin, of the post-merger black hole as a function of eccentricity and mass-ratio. We discuss the implications of these results for gravitational wave source modeling, and the design of algorithms to search for and identify the complex signatures of these events in realistic detection scenarios.

Read this paper on arXiv…

E. Huerta, R. Haas, S. Habib, et. al.
Wed, 23 Jan 19
90/111

Comments: 11 pages, 5 figures, 2 appendices. A visualization of this numerical relativity waveform catalog is available at this https URL

NEARBY Platform for Detecting Asteroids in Astronomical Images Using Cloud-based Containerized Applications [IMA]

http://arxiv.org/abs/1901.04248


The continuing monitoring and surveying of nearby space to detect Near Earth Objects (NEOs) and Near Earth Asteroids (NEAs) is essential because of the threat that these objects pose to the future of our planet. We need more computational resources and advanced algorithms to deal with the exponential growth in digital camera performance and to be able to process, in near real time, data coming from large surveys. This paper presents a software platform called NEARBY that supports automated detection of moving sources (asteroids) among stars in astronomical images. The detection procedure is based on the classic “blink” approach; after that, the system supports visual analysis techniques to validate the moving sources, assisted by static and dynamic presentations.

Read this paper on arXiv…

V. Bacu, A. Sabou, T. Stefanut, et. al.
Tue, 15 Jan 19
2/83

Comments: IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania

Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience [CL]

http://arxiv.org/abs/1812.07977


We use OpenMP directives to target hardware accelerators (GPUs) on Summit, a newly deployed supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), demonstrating simplified access to GPU devices for users of our astrophysics code GenASiS and useful speedup on a sample fluid dynamics problem. At a lower level, we use the capabilities of Fortran 2003 for C interoperability to provide wrappers to the OpenMP device memory runtime library routines (currently available only in C). At a higher level, we use C interoperability and Fortran 2003 type-bound procedures to modify our workhorse class for data storage to include members and methods that significantly streamline the persistent allocation of and on-demand association to GPU memory. Where the rubber meets the road, users offload computational kernels with OpenMP target directives that are rather similar to constructs already familiar from multi-core parallelization. In this initial example we demonstrate total wall time speedups of ~4X in ‘proportional resource tests’ that compare runs with a given percentage of nodes’ GPUs with runs utilizing instead the same percentage of nodes’ CPU cores, and reasonable weak scaling up to 8000 GPUs vs. 56,000 CPU cores (1333 1/3 Summit nodes). These speedups increase to over 12X when pinned memory is used strategically. We make available the source code from this work at https://github.com/GenASiS/GenASiS_Basics.

Read this paper on arXiv…

R. Budiardja and C. Cardall
Thu, 20 Dec 18
61/62

Comments: Submitted to Parallel Computing: Systems and Applications

A distributed data warehouse system for astroparticle physics [IMA]

http://arxiv.org/abs/1812.01906


A distributed data warehouse system is one of the pressing needs in the field of astroparticle physics. Major experiments, such as TAIGA and KASCADE-Grande, produce tens of terabytes of data measured by their instruments. It is critical to have a smart on-site data warehouse system to store the collected data effectively for further distribution. It is also vital to provide scientists with a convenient and user-friendly interface to access the collected data with proper permissions, not only on-site but also online. The latter case is useful when scientists need to combine data from different experiments for analysis. In this work, we describe an approach to implementing a distributed data warehouse system that allows scientists to acquire just the necessary data from different experiments via the Internet on demand. The implementation is based on CernVM-FS with additional components developed by us to search through all the available data sets and deliver subsets of them to users’ computers.

Read this paper on arXiv…

M. Nguyen, A. Kryukov, J. Dubenskaya, et. al.
Thu, 6 Dec 18
28/52

Comments: 5 pages, 3 figures, The 8th International Conference “Distributed Computing and Grid-technologies in Science and Education” (GRID 2018)

Particle identification in ground-based gamma-ray astronomy using convolutional neural networks [IMA]

http://arxiv.org/abs/1812.01551


Modern detectors of cosmic gamma rays are a special type of imaging telescope (air Cherenkov telescopes) equipped with cameras containing a relatively large number of photomultiplier-based pixels. For example, the camera of the TAIGA-IACT telescope has 560 pixels arranged in a hexagonal structure. Images in such cameras can be analysed by deep learning techniques to extract numerous physical and geometrical parameters and/or for incoming particle identification. The most powerful deep learning technique for image analysis, the so-called convolutional neural network (CNN), was implemented in this study. Two open-source libraries for machine learning, PyTorch and TensorFlow, were tested as possible software platforms for particle identification in imaging air Cherenkov telescopes. Monte Carlo simulations were performed to analyse images of gamma rays and background particles (protons) as well as to estimate the identification accuracy. Further steps for the implementation and improvement of this technique are discussed.
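
A minimal PyTorch sketch of such a classifier (illustrative only: the real camera is hexagonal with 560 pixels, whereas here the images are assumed to have been resampled onto a small square grid):

    import torch
    import torch.nn as nn

    class ParticleClassifier(nn.Module):
        """Tiny CNN separating gamma-ray images from proton images (binary classification)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, 64), nn.ReLU(), nn.Linear(64, 2))

        def forward(self, x):
            return self.head(self.features(x))

    model = ParticleClassifier()
    images = torch.randn(8, 1, 32, 32)                    # batch of camera images on a 32x32 grid
    logits = model(images)                                # shape (8, 2): gamma vs proton scores
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
    loss.backward()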

Read this paper on arXiv…

E. Postnikov, I. Bychkov, J. Dubenskaya, et. al.
Wed, 5 Dec 18
47/73

Comments: 5 pages, 2 figures. Submitted to CEUR Workshop Proceedings, 8th International Conference “Distributed Computing and Grid-technologies in Science and Education” GRID 2018, 10 – 14 September 2018, Dubna, Russia

Using Binary File Format Description Languages for Documenting, Parsing, and Verifying Raw Data in TAIGA Experiment [IMA]

http://arxiv.org/abs/1812.01324


The paper is devoted to the issues of documenting, parsing, and verifying raw binary data in the astroparticle data lifecycle. The long-term preservation of raw data of astroparticle experiments as originally generated is essential for re-running analyses and reproducing research results. The selected high-quality raw data should have detailed documentation and be accompanied by open software tools for accessing them. We consider the applicability of binary file format description languages for specifying, parsing, and verifying raw data of the Tunka Advanced Instrument for cosmic rays and Gamma Astronomy (TAIGA) experiment. Formal specifications are implemented for five data formats of the experiment and provide automatic generation of source code for data reading libraries in target programming languages (e.g. C++, Java, and Python). These libraries were tested on TAIGA data; they showed good performance and helped us locate parts with corrupted data. The format specifications can be used as metadata for exchanging astroparticle raw data. They can also simplify software development for aggregating data from various sources for multi-messenger analysis.

Read this paper on arXiv…

I. Bychkov, A. Demichev, J. Dubenskaya, et. al.
Wed, 5 Dec 18
57/73

Comments: N/A

JOVIAL: Notebook-based Astronomical Data Analysis in the Cloud [IMA]

http://arxiv.org/abs/1812.01477


Performing astronomical data analysis using only personal computers is becoming impractical for the very large data sets produced nowadays. As analysis is not a task that can be fully automated, the idea of moving processing to where the data are located also means moving the whole scientific process towards the archives and data centers. Using Jupyter Notebooks as a remote service is a recent trend in data analysis that aims to deal with this problem, but harnessing the infrastructure to serve the astronomer without increasing the complexity of the service is a challenge. In this paper we present the architecture and features of JOVIAL, a cloud service where astronomers can safely use Jupyter notebooks in a personal space designed for high-performance processing under the high-availability principle. We show that features existing only in specific packages can be adapted to run in the notebooks, and that algorithms can be adapted to run across the data center without necessarily redesigning them.

Read this paper on arXiv…

M. Araya, M. Osorio, M. Díaz, et. al.
Wed, 5 Dec 18
67/73

Comments: 8 pages, 10 figures, special issue of ADASS 2017

GPU Acceleration of an Established Solar MHD Code using OpenACC [CL]

http://arxiv.org/abs/1811.02605


GPU accelerators have had a notable impact on high-performance computing across many disciplines. They provide high performance with low cost/power, and therefore have become a primary compute resource on many of the largest supercomputers. Here, we implement multi-GPU acceleration into our Solar MHD code (MAS) using OpenACC in a fully portable, single-source manner. Our preliminary implementation is focused on MAS running in a reduced physics “zero-beta” mode. While valuable on its own, our main goal is to pave the way for a full physics, thermodynamic MHD implementation. We describe the OpenACC implementation methodology and challenges. “Time-to-solution” performance results of a production-level flux rope eruption simulation on multi-CPU and multi-GPU systems are shown. We find that the GPU-accelerated MAS code has the ability to run “zero-beta” simulations on a single multi-GPU server at speeds previously requiring multiple CPU server-nodes of a supercomputer.

Read this paper on arXiv…

R. Caplan, J. Linker, Z. Mikić, et. al.
Thu, 8 Nov 18
40/72

Comments: 13 pages, 9 figures

Architecture of Distributed Data Storage for Astroparticle Physics [CL]

http://arxiv.org/abs/1811.02403


For the successful development of astrophysics and, accordingly, for obtaining more complete knowledge of the Universe, it is extremely important to combine and comprehensively analyze information of various types (e.g., about charged cosmic particles, gamma rays, neutrinos, etc.) obtained using diverse large-scale experimental setups located throughout the world. It is obvious that all kinds of activities must be performed continually across all stages of the data life cycle to support effective data management, in particular the collection and storage of data, its processing and analysis, refinement of the physical model, preparation for publication, and data reprocessing that takes the refinement into account. In this paper we present a general approach to the construction and architecture of a system able to collect, store, and provide users with access to astrophysical data. We also suggest a new approach to the construction of a metadata registry based on blockchain technology.

Read this paper on arXiv…

A. Kryukov and A. Demichev
Wed, 7 Nov 18
49/94

Comments: 11 pages, 2 figures

ECHO-3DHPC: Advance the performance of astrophysics simulations with code modernization [CL]

http://arxiv.org/abs/1810.04597


We present recent developments in the parallelization scheme of ECHO-3DHPC, an efficient astrophysical code used in the modelling of relativistic plasmas. With the help of the Intel Software Development Tools, such as the Fortran compiler with Profile-Guided Optimization (PGO), the Intel MPI library, VTune Amplifier, and Inspector, we have investigated the performance issues and improved the application scalability and the time to solution. The node-level performance is improved by $2.3 \times$ and, thanks to the improved threading parallelisation, the hybrid MPI-OpenMP version of the code outperforms the MPI-only version, thus lowering the MPI communication overhead.

Read this paper on arXiv…

M. Bugli, L. Iapichino and F. Baruffa
Thu, 11 Oct 18
16/72

Comments: 7 pages, 6 figures. Accepted for publication on The Parallel Universe Magazine ( this https URL )

Supporting High-Performance and High-Throughput Computing for Experimental Science [CL]

http://arxiv.org/abs/1810.03056


The advent of experimental science facilities, instruments and observatories, such as the Large Hadron Collider (LHC), the Laser Interferometer Gravitational Wave Observatory (LIGO), and the upcoming Large Synoptic Survey Telescope (LSST), has brought about challenging, large-scale computational and data processing requirements. Traditionally, the computing infrastructures supporting these facilities’ requirements were organized into separate infrastructures for their high-throughput needs and for their high-performance computing needs. We argue that in order to enable and accelerate scientific discovery at the scale and sophistication that is now needed, this separation between High-Performance Computing (HPC) and High-Throughput Computing (HTC) must be bridged and an integrated, unified infrastructure must be provided. In this paper, we discuss several case studies where such infrastructures have been implemented. These case studies span different science domains, software systems, and application requirements, as well as levels of sustainability. A further aim of this paper is to provide a basis for determining the common characteristics and requirements of such infrastructures, as well as to begin a discussion of how best to support the computing requirements of existing and future experimental science facilities.

Read this paper on arXiv…

E. Huerta, R. Haas, S. Jha, et. al.
Tue, 9 Oct 18
10/77

Comments: 11 pages, 7 figures

SWIFT: Maintaining weak-scalability with a dynamic range of $10^4$ in time-step size to harness extreme adaptivity [CL]

http://arxiv.org/abs/1807.01341


Cosmological simulations require the use of a multiple time-stepping scheme. Without such a scheme, cosmological simulations would be impossible due to their high dynamic range: over eleven orders of magnitude in density. Such a large dynamic range leads to a range of over four orders of magnitude in time-step, which presents a significant load-balancing challenge. In this work, the extreme adaptivity that cosmological simulations present is tackled in three main ways through the use of the code SWIFT. First, an adaptive mesh is used to ensure that only the relevant particle interactions are computed in a given time-step. Second, task-based parallelism is used to ensure efficient load-balancing within a single node, using pthreads and SIMD vectorisation. Finally, a domain decomposition strategy is presented, using the graph domain decomposition library METIS, that bisects the work that must be performed by the simulation between nodes using MPI. These three strategies are shown to give SWIFT near-perfect weak-scaling characteristics, losing only 25% performance when scaling from 1 to 4096 cores on a representative problem, whilst being more than 30x faster than the de-facto standard Gadget-2 code.
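
The power-of-two multiple time-stepping that underlies this dynamic range can be illustrated with a short sketch (dt_max and the binning rule are generic assumptions, not SWIFT's actual scheduler):

    import math

    dt_max = 1.0e-2                 # longest allowed time-step (code units, assumed)

    def timestep_bin(dt):
        """Place a particle requiring time-step dt into a power-of-two bin below dt_max."""
        return max(0, math.ceil(math.log2(dt_max / dt)))

    # A 10^4 dynamic range in dt spans roughly 14 power-of-two bins; particles in
    # the deepest bins (shortest time-steps) must be updated most frequently.
    for dt in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
        b = timestep_bin(dt)
        print(f"dt={dt:.0e} -> bin {b}, stepped every {dt_max * 2**(-b):.1e}")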

Read this paper on arXiv…

J. Borrow, R. Bower, P. Draper, et. al.
Thu, 5 Jul 18
39/60

Comments: N/A

Performance Analysis of Distributed Radio Interferometric Calibration [IMA]

http://arxiv.org/abs/1805.00265


Distributed calibration based on consensus optimization is a computationally efficient method to calibrate large radio interferometers such as LOFAR and SKA. Calibrating along multiple directions in the sky and removing the bright foreground signal is a crucial step in many science cases in radio interferometry. The residual data contain weak signals of huge scientific interest and of particular concern is the effect of incomplete sky models used in calibration on the residual. In order to study this, we consider the mapping between the input uncalibrated data and the output residual data. We derive an analytical relationship between the input and output probability density functions which can be used to study the performance of calibration.

Read this paper on arXiv…

S. Yatawatta
Wed, 2 May 18
18/55

Comments: Draft, to be published in the Proceedings of IEEE Sensor Array and Multichannel Signal Processing Workshop (IEEE SAM 2018), published by IEEE

Analyzing astronomical data with Apache Spark [IMA]

http://arxiv.org/abs/1804.07501


We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark’s attempts to address big data problems have hitherto proved successful in industry, but its main use is often limited to naively structured data. We show how to manage more complex binary data structures, such as those handled in astrophysics experiments, within a distributed environment. To this purpose, we first designed and implemented a Spark connector to handle sets of arbitrarily large FITS files, called spark-fits. The user interface is such that a simple file “drag-and-drop” to a cluster gives full advantage of the framework. We demonstrate the very high scalability of spark-fits using the LSST fast simulation tool, CoLoRe, and present the methodologies for measuring and tuning the performance bottlenecks for the workloads, scaling up to terabytes of FITS data on Cloud@VirtualData, located at Université Paris-Sud. We also evaluate its performance on Cori, a High-Performance Computing system located at NERSC and widely used in the scientific community.

Read this paper on arXiv…

J. Peloton, C. Arnault and S. Plaszczynski
Mon, 23 Apr 18
28/63

Comments: 9 pages, 6 figures. Package available at this https URL

Interactive 3D Visualization for Theoretical Virtual Observatories [IMA]

http://arxiv.org/abs/1803.11399


Virtual Observatories (VOs) are online hubs of scientific knowledge. They encompass a collection of platforms dedicated to the storage and dissemination of astronomical data, from simple data archives to e-research platforms offering advanced tools for data exploration and analysis. Whilst the more mature platforms within VOs primarily serve the observational community, there are also services fulfilling a similar role for theoretical data. Scientific visualization can be an effective tool for analysis and exploration of datasets made accessible through web platforms for theoretical data, which often contain spatial dimensions and properties inherently suitable for visualization via e.g. mock imaging in 2d or volume rendering in 3d. We analyze the current state of 3d visualization for big theoretical astronomical datasets through scientific web portals and virtual observatory services. We discuss some of the challenges for interactive 3d visualization and how it can augment the workflow of users in a virtual observatory context. Finally we showcase a lightweight client-server visualization tool for particle-based datasets allowing quantitative visualization via data filtering, highlighting two example use cases within the Theoretical Astrophysical Observatory.

Read this paper on arXiv…

T. Dykes, A. Hassan, C. Gheller, et. al.
Mon, 2 Apr 18
37/39

Comments: 10 Pages, 13 Figures, Accepted for Publication in Monthly Notices of the Royal Astronomical Society

Cataloging the Visible Universe through Bayesian Inference at Petascale [CL]

http://arxiv.org/abs/1801.10277


Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading and processing 178 TB across 8192 nodes in 14.6 minutes. To achieve this, Celeste exploits parallelism at multiple levels (cluster, node, and thread) and accelerates I/O through Cori’s Burst Buffer. Julia’s native performance enables Celeste to employ high-level constructs without resorting to hand-written or generated low-level code (C/C++/Fortran), and yet achieve petascale performance.

Read this paper on arXiv…

J. Regier, K. Pamnany, K. Fischer, et. al.
Thu, 1 Feb 18
38/55

Comments: accepted to IPDPS 2018

Distributed Model Construction in Radio Interferometric Calibration [IMA]

http://arxiv.org/abs/1801.09747


Calibration of a typical radio interferometric array yields thousands of parameters as solutions. These solutions contain valuable information about the systematic errors in the data (ionosphere and beam shape). This information could be reused in calibration to improve the accuracy and also can be fed into imaging to improve the fidelity. We propose a distributed optimization strategy to construct models for the systematic errors in the data using the calibration solutions. We formulate this as an elastic net regularized distributed optimization problem which we solve using the alternating direction method of multipliers (ADMM) algorithm. We give simulation results to show the feasibility of the proposed distributed model construction scheme.

Read this paper on arXiv…

S. Yatawatta
Wed, 31 Jan 18
24/65

Comments: Draft, to be published in the Proceedings of the 2018 International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP 2018), published by IEEE

A hybrid architecture for astronomical computing [CL]

http://arxiv.org/abs/1801.07548


With many large scientific facilities under construction or coming into use, astronomy has entered the big data era, and new methods and infrastructure for big data processing have become a requirement for many astronomers. Cloud computing, MapReduce, Hadoop, Spark, and other new technologies have sprung up in recent years. In contrast to traditional high-performance computing (HPC), data sits at the center of these new technologies, so a new computing architecture is needed that can be shared by both HPC and big data processing. Based on the Astronomy Cloud project of the Chinese Virtual Observatory (China-VO), we have put considerable effort into optimizing the design of this hybrid computing platform, including the hardware architecture, cluster management, and job and resource scheduling.

Read this paper on arXiv…

C. Li, C. Cui, B. He, et. al.
Wed, 24 Jan 18
28/73

Comments: 4 pages, 2 figures, ADASS XXVI conference

MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program [HEAP]

http://arxiv.org/abs/1712.00343


We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution; however, the efficiency in terms of computing resource usage decreases as the number of processors used in the parallel computation increases.

Read this paper on arXiv…

A. Danehkar, M. Nowak, J. Lee, et. al.
Mon, 4 Dec 17
1/72

Comments: 5 pages, 1 figure, accepted for publication in Publications of the Astronomical Society of the Pacific (PASP)

Cosmological Simulations in Exascale Era [IMA]

http://arxiv.org/abs/1712.00252


The architecture of exascale computing facilities, which involves millions of heterogeneous processing units, will deeply impact scientific applications. Future astrophysical HPC applications must be designed to make such computing systems exploitable. The ExaNeSt H2020 EU-funded project aims to design and develop an exascale-ready prototype based on low-energy-consumption ARM64 cores and FPGA accelerators. We participate in the design of the platform and in the validation of the prototype with cosmological N-body and hydrodynamical codes suited to performing large-scale, high-resolution numerical simulations of cosmic structure formation and evolution. We discuss our activities on astrophysical applications to take advantage of the underlying architecture.

Read this paper on arXiv…

D. Goz, L. Tornatore, G. Taffoni, et. al.
Mon, 4 Dec 17
61/72

Comments: submitted to ASP

Data Multiplexing in Radio Interferometric Calibration [IMA]

http://arxiv.org/abs/1711.10221


New and upcoming radio interferometers will produce unprecedented amounts of data that demand extremely powerful computers for processing. This is a limiting factor due to the large computational power and energy costs involved. Such limitations restrict several key data processing steps in radio interferometry. One such step is calibration where systematic errors in the data are determined and corrected. Accurate calibration is an essential component in reaching many scientific goals in radio astronomy and the use of consensus optimization that exploits the continuity of systematic errors across frequency significantly improves calibration accuracy. In order to reach full consensus, data at all frequencies need to be calibrated simultaneously. In the SKA regime, this can become intractable if the available compute agents do not have the resources to process data from all frequency channels simultaneously. In this paper, we propose a multiplexing scheme that is based on the alternating direction method of multipliers (ADMM) with cyclic updates. With this scheme, it is possible to simultaneously calibrate the full dataset using far fewer compute agents than the number of frequencies at which data are available. We give simulation results to show the feasibility of the proposed multiplexing scheme in simultaneously calibrating a full dataset when a limited number of compute agents are available.

Read this paper on arXiv…

S. Yatawatta, F. Diblen, H. Spreeuw, et. al.
Wed, 29 Nov 17
8/69

Comments: MNRAS under review

Adaptive ADMM in Distributed Radio Interferometric Calibration [IMA]

http://arxiv.org/abs/1710.05656


Distributed radio interferometric calibration based on consensus optimization has been shown to improve the estimation of systematic errors in radio astronomical observations. The intrinsic continuity of systematic errors across frequency is used by a consensus polynomial to penalize traditional calibration. Consensus is achieved via the use of alternating direction method of multipliers (ADMM) algorithm. In this paper, we extend the existing distributed calibration algorithms to use ADMM with an adaptive penalty parameter update. Compared to a fixed penalty, its adaptive update has been shown to perform better in diverse applications of ADMM. In this paper, we compare two such popular penalty parameter update schemes: residual balance penalty update and spectral penalty update (Barzilai-Borwein). We apply both schemes to distributed radio interferometric calibration and compare their performance against ADMM with a fixed penalty parameter. Simulations show that both methods of adaptive penalty update improve the convergence of ADMM but the spectral penalty parameter update shows more stability.
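
The residual-balancing rule can be sketched on a toy consensus least-squares problem (the constants mu and tau, the toy data, and the problem itself are assumptions, not the calibration pipeline; the spectral Barzilai-Borwein update is omitted):

    import numpy as np

    rng = np.random.default_rng(1)
    A = [rng.normal(size=(20, 5)) for _ in range(8)]            # 8 "frequency" agents, toy data
    b = [Ai @ np.ones(5) + 0.01 * rng.normal(size=20) for Ai in A]

    rho, mu, tau = 1.0, 10.0, 2.0                               # initial penalty, balancing constants
    x = [np.zeros(5) for _ in A]
    u = [np.zeros(5) for _ in A]
    z = np.zeros(5)

    for it in range(100):
        # Agent updates (ridge-regularized local solves), then consensus and dual updates.
        x = [np.linalg.solve(Ai.T @ Ai + rho * np.eye(5), Ai.T @ bi + rho * (z - ui))
             for Ai, bi, ui in zip(A, b, u)]
        z_old, z = z, np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
        u = [ui + xi - z for ui, xi in zip(u, x)]

        r = np.linalg.norm(np.concatenate([xi - z for xi in x]))    # primal residual
        s = np.linalg.norm(rho * np.tile(z - z_old, len(A)))        # dual residual
        if r > mu * s:                       # primal residual lagging: increase the penalty
            rho *= tau
            u = [ui / tau for ui in u]       # rescale the scaled duals when rho changes
        elif s > mu * r:                     # dual residual lagging: decrease the penalty
            rho /= tau
            u = [ui * tau for ui in u]

    print(z)   # approaches the all-ones consensus solution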

Read this paper on arXiv…

S. Yatawatta, F. Diblen and H. Spreeuw
Tue, 17 Oct 17
38/163

Comments: Draft, to be published in the Proceedings of the 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP) (IEEE CAMSAP 2017), published by IEEE

Report: Performance comparison between C2075 and P100 GPU cards using cosmological correlation functions [CL]

http://arxiv.org/abs/1709.03264


In this report, some cosmological correlation functions are used to evaluate the relative performance of the C2075 and P100 GPU cards. The correlation functions used in this work have been widely studied in the past and exploited on previous GPU architectures. The analysis of the performance indicates that a speedup in the range from 13 to 15 is achieved without any additional optimization for the P100 card.

Read this paper on arXiv…

M. Cardenas-Montes, I. Mendez-Jimenez, J. Rodriguez-Vazquez, et. al.
Tue, 12 Sep 17
52/71

Comments: N/A

Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy [IMA]

http://arxiv.org/abs/1708.00720


Radio astronomy observatories with high throughput back end instruments require real-time data processing. While computing hardware continues to advance rapidly, development of real-time processing pipelines remains difficult and time-consuming, which can limit scientific productivity. Motivated by this, we have developed Bifrost: an open-source software framework for rapid pipeline development. Bifrost combines a high-level Python interface with highly efficient reconfigurable data transport and a library of computing blocks for CPU and GPU processing. The framework is generalizable, but initially it emphasizes the needs of high-throughput radio astronomy pipelines, such as the ability to process data buffers as if they were continuous streams, the capacity to partition processing into distinct data sequences (e.g., separate observations), and the ability to extract specific intervals from buffered data. Computing blocks in the library are designed for applications such as interferometry, pulsar dedispersion and timing, and transient search pipelines. We describe the design and implementation of the Bifrost framework and demonstrate its use as the backbone in the correlation and beamforming back end of the Long Wavelength Array station in the Sevilleta National Wildlife Refuge, NM.

Read this paper on arXiv…

M. Cranmer, B. Barsdell, D. Price, et. al.
Thu, 3 Aug 17
15/59

Comments: 25 pages, 13 figures, submitted to JAI. For the code, see this https URL

Performance Measurements of Supercomputing and Cloud Storage Solutions [CL]

http://arxiv.org/abs/1708.00544


Increasing amounts of data from varied sources, particularly in the fields of machine learning and graph analytics, are causing storage requirements to grow rapidly. A variety of technologies exist for storing and sharing these data, ranging from parallel file systems used by supercomputers to distributed block storage systems found in clouds. Relatively few comparative measurements exist to inform decisions about which storage systems are best suited for particular tasks. This work provides these measurements for two of the most popular storage technologies: Lustre and Amazon S3. Lustre is an open-source, high performance, parallel file system used by many of the largest supercomputers in the world. Amazon’s Simple Storage Service, or S3, is part of the Amazon Web Services offering, and offers a scalable, distributed option to store and retrieve data from anywhere on the Internet. Parallel processing is essential for achieving high performance on modern storage systems. The performance tests used span the gamut of parallel I/O scenarios, ranging from single-client, single-node Amazon S3 and Lustre performance to a large-scale, multi-client test designed to demonstrate the capabilities of a modern storage appliance under heavy load. These results show that, when parallel I/O is used correctly (i.e., many simultaneous read or write processes), full network bandwidth performance is achievable and ranged from 10 gigabits/s over a 10 GigE S3 connection to 0.35 terabits/s using Lustre on a 1200 port 10 GigE switch. These results demonstrate that S3 is well-suited to sharing vast quantities of data over the Internet, while Lustre is well-suited to processing large quantities of data locally.
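
A sketch of the "many simultaneous readers" pattern for S3 using boto3 (the bucket and object keys are hypothetical):

    import concurrent.futures
    import boto3

    BUCKET = "example-survey-data"
    KEYS = [f"tiles/tile_{i:04d}.fits" for i in range(64)]

    s3 = boto3.client("s3")

    def fetch(key):
        # Aggregate S3 bandwidth comes from many concurrent readers, not one stream.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return len(body)

    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        total = sum(pool.map(fetch, KEYS))
    print(f"read {total / 1e9:.2f} GB")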

Read this paper on arXiv…

M. Jones, J. Kepner, W. Arcand, et. al.
Thu, 3 Aug 17
49/59

Comments: 5 pages, 4 figures, to appear in IEEE HPEC 2017

Methods for compressible fluid simulation on GPUs using high-order finite differences [CL]

http://arxiv.org/abs/1707.08900


We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates $343$ million grid points per second on a Tesla K40t GPU, achieving a $3.6 \times$ speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of $168$ million updates per second.
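
A one-dimensional NumPy sketch of the sixth-order central difference used in such stencils (periodic domain; the cache-blocking and kernel-decomposition strategies of the paper are not reproduced here):

    import numpy as np

    # Sixth-order central-difference coefficients for the first derivative.
    C = np.array([-1.0, 9.0, -45.0, 0.0, 45.0, -9.0, 1.0]) / 60.0

    def ddx(f, dx):
        """First derivative of a periodic 1-D field using a 7-point (6th-order) stencil."""
        return sum(c * np.roll(f, 3 - i) for i, c in enumerate(C)) / dx

    x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
    err = np.max(np.abs(ddx(np.sin(x), x[1] - x[0]) - np.cos(x)))
    print(err)    # ~1e-12: sixth-order truncation error on this grid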

Read this paper on arXiv…

J. Pekkila, M. Vaisala, M. Kapyla, et. al.
Fri, 28 Jul 17
20/48

Comments: 14 pages, 7 figures

Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor [CL]

http://arxiv.org/abs/1707.03515


Knights Landing (KNL) is the code name for the second-generation Intel Xeon Phi product family. KNL has generated significant interest in the data analysis and machine learning communities because its new many-core architecture targets both of these workloads. The KNL many-core vector processor design enables it to exploit much higher levels of parallelism. At the Lincoln Laboratory Supercomputing Center (LLSC), the majority of users are running data analysis applications such as MATLAB and Octave. More recently, machine learning applications, such as the UC Berkeley Caffe deep learning framework, have become increasingly important to LLSC users. Thus, the performance of these applications on KNL systems is of high interest to LLSC users and the broader data analysis and machine learning communities. Our data analysis benchmarks of these application on the Intel KNL processor indicate that single-core double-precision generalized matrix multiply (DGEMM) performance on KNL systems has improved by ~3.5x compared to prior Intel Xeon technologies. Our data analysis applications also achieved ~60% of the theoretical peak performance. Also a performance comparison of a machine learning application, Caffe, between the two different Intel CPUs, Xeon E5 v3 and Xeon Phi 7210, demonstrated a 2.7x improvement on a KNL node.
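
A minimal sketch of the kind of single-node DGEMM measurement behind such figures (NumPy calling whatever BLAS it is linked against; the matrix size is arbitrary and this is not the LLSC benchmark suite):

    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    t0 = time.perf_counter()
    c = a @ b                                   # double-precision GEMM via the underlying BLAS
    elapsed = time.perf_counter() - t0

    gflops = 2.0 * n**3 / elapsed / 1e9         # a matrix product costs ~2*n^3 floating-point ops
    print(f"{gflops:.1f} GFLOP/s in {elapsed:.2f} s")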

Read this paper on arXiv…

C. Byun, J. Kepner, W. Arcand, et. al.
Thu, 13 Jul 17
50/60

Comments: 6 pages; 9 figures; accepted to IEEE HPEC 2017

GPU-Accelerated Algorithms for Compressed Signals Recovery with Application to Astronomical Imagery Deblurring [CL]

http://arxiv.org/abs/1707.02244


Compressive sensing promises to enable bandwidth-efficient on-board compression of astronomical data by lifting the encoding complexity from the source to the receiver. The signal is recovered off-line, exploiting the parallel computation capabilities of GPUs to speed up the reconstruction process. However, inherent GPU hardware constraints limit the size of the recoverable signal and the speedup practically achievable. In this work, we design parallel algorithms that exploit the properties of circulant matrices for efficient GPU-accelerated sparse signal recovery. Our approach reduces the memory requirements, allowing us to recover very large signals with limited memory. In addition, it achieves a tenfold signal recovery speedup thanks to ad-hoc parallelization of matrix-vector multiplications and matrix inversions. Finally, we practically demonstrate our algorithms in a typical application of circulant matrices: deblurring a sparse astronomical image in the compressed domain.
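
The key property exploited, namely that a circulant matrix-vector product reduces to an FFT-based circular convolution, can be sketched in NumPy (CPU only; the GPU kernels of the paper are not reproduced):

    import numpy as np

    def circulant_matvec(c, x):
        """Multiply the circulant matrix with first column `c` by the vector `x`.

        Circulant matrices are diagonalized by the DFT, so the product costs
        O(n log n) and needs only the first column, never the full n x n matrix.
        """
        return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

    # Check against the explicit dense construction on a small example.
    rng = np.random.default_rng(0)
    c, x = rng.normal(size=16), rng.normal(size=16)
    dense = np.array([np.roll(c, k) for k in range(16)]).T   # column k is c rolled by k
    assert np.allclose(dense @ x, circulant_matvec(c, x))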

Read this paper on arXiv…

A. Fiandrotti, S. Fosson, C. Ravazzi, et. al.
Mon, 10 Jul 17
12/64

Comments: N/A

Data Access for LIGO on the OSG [CL]

http://arxiv.org/abs/1705.06202


During 2015 and 2016, the Laser Interferometer Gravitational-Wave Observatory (LIGO) conducted a three-month observing campaign. These observations delivered the first direct detection of gravitational waves from binary black hole mergers. To search for these signals, the LIGO Scientific Collaboration uses the PyCBC search pipeline. To deliver science results in a timely manner, LIGO collaborated with the Open Science Grid (OSG) to distribute the required computation across a series of dedicated, opportunistic, and allocated resources. To deliver the petabytes necessary for such a large-scale computation, our team deployed a distributed data access infrastructure based on the XRootD server suite and the CernVM File System (CVMFS). This data access strategy grew from simply accessing remote storage to a POSIX-based interface underpinned by distributed, secure caches across the OSG.

Read this paper on arXiv…

D. Weitzel, B. Bockelman, D. Brown, et. al.
Thu, 18 May 17
38/60

Comments: 6 pages, 3 figures, submitted to PEARC17

Architecture of processing and analysis system for big astronomical data [IMA]

http://arxiv.org/abs/1703.10979


This work explores the use of big data technologies deployed in the cloud for the processing of astronomical data. We have applied Hadoop and Spark to the task of co-adding astronomical images. We compared the overhead and execution time of these frameworks and conclude that the performance of both is generally on par. The Spark API is more flexible, which allows one to easily construct astronomical data processing pipelines.

Read this paper on arXiv…

I. Kolosov, S. Gerasimov and A. Meshcheryakov
Mon, 3 Apr 17
11/38

Comments: 4 pages, to appear in the Proceedings of ADASS 2016, Astronomical Society of the Pacific (ASP) Conference Series

Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor [IMA]

http://arxiv.org/abs/1703.09707


Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA’s GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup by Intel’s Knights Corner coprocessor is comparable to that by NVIDIA’s Fermi family of GPUs with compute capability 2.0, but less significant than GPUs with higher compute capabilities such as the Kepler. However, the very recently released second generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high performance microlensing simulations.

Read this paper on arXiv…

B. Chen, R. Kantowski, X. Dai, et. al.
Thu, 30 Mar 17
36/69

Comments: 18 pages, 3 figures, accepted by the Astronomy & Computing

AdiosStMan: Parallelizing Casacore Table Data System Using Adaptive IO System [CL]

http://arxiv.org/abs/1703.09257


In this paper, we investigate the Casacore Table Data System (CTDS) used in the casacore and CASA libraries, and methods to parallelize it. CTDS provides a storage manager plugin mechanism for third-party developers to design and implement their own CTDS storage managers. Having this in mind, we looked into various storage backend techniques that could enable parallel I/O for CTDS by implementing new storage managers. After carrying out benchmarks showing the excellent parallel I/O throughput of the Adaptive IO System (ADIOS), we implemented an ADIOS-based parallel CTDS storage manager. We then applied the CASA MSTransform frequency split task to verify the ADIOS storage manager. We also ran a series of performance tests to examine the I/O throughput in a massively parallel scenario.

Read this paper on arXiv…

R. Wang, C. Harris and A. Wicenec
Wed, 29 Mar 17
58/63

Comments: 20 pages, journal article, 2016

Multi-GPU maximum entropy image synthesis for radio astronomy [IMA]

http://arxiv.org/abs/1703.02920


The maximum entropy method (MEM) is a well known deconvolution technique in radio interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has a statistical basis, and it offers better resolution and image quality under certain conditions. This work presents a high-performance GPU version of non-gridded MEM, which is tested using interferometric and simulated data. We propose a single-GPU and a multi-GPU implementation for single and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs, which allow multiple GPUs to be exploited transparently and efficiently. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup from 1000 to 5000 times over a sequential version can be achieved, depending on data and image size. This has allowed us to reconstruct the HD142527 CO(6-5) short baseline data set in 2.1 minutes, instead of the 2.5 days it takes on a CPU.

Read this paper on arXiv…

M. Carcamo, P. Roman, S. Casassus, et. al.
Thu, 9 Mar 17
36/54

Comments: 11 pages, 13 figures

Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs [IMA]

http://arxiv.org/abs/1702.02256


Low-latency detections of gravitational waves (GWs) are crucial to enable prompt follow-up observations of astrophysical transients by conventional telescopes. We have developed a low-latency pipeline using a technique called Summed Parallel Infinite Impulse Response (SPIIR) filtering, realized on a Graphics Processing Unit (GPU). In this paper, we exploit the new \textit{Maxwell} memory access architecture in NVIDIA GPUs, namely the read-only data cache, warp-shuffle, and cross-warp atomic techniques. We report a 3-fold speed-up over our previous implementation of this filtering technique. To tackle SPIIR with relatively few filters, we develop a new GPU thread configuration with a nearly 10-fold speedup. In addition, we implement a multi-rate scheme of SPIIR filtering using Maxwell GPUs. We achieve more than a 100-fold speed-up over a single-core CPU for the multi-rate filtering scheme. This results in an overall 21-fold reduction in CPU usage for the entire SPIIR pipeline.
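
The SPIIR idea itself, a bank of simple IIR filters applied to the same data stream and summed, can be sketched in a few lines of NumPy/SciPy; the Maxwell-specific optimisations the paper contributes (read-only cache, warp shuffles, cross-warp atomics) are not represented:

# Summed Parallel IIR (SPIIR) idea in miniature: many simple IIR filters
# applied to the same data stream and summed.  A CPU sketch only; the
# paper's contribution is the GPU (Maxwell) realisation of this scheme.
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)
x = rng.standard_normal(4096)                 # whitened detector data (toy)

n_filters = 32
# each first-order IIR: y[n] = b0*x[n-d] + a1*y[n-1], with its own delay d
b0 = rng.uniform(0.1, 1.0, n_filters)
a1 = rng.uniform(0.5, 0.95, n_filters)
delays = rng.integers(0, 64, n_filters)

out = np.zeros_like(x)
for k in range(n_filters):
    xd = np.roll(x, delays[k])                # apply the per-filter delay
    xd[:delays[k]] = 0.0
    out += lfilter([b0[k]], [1.0, -a1[k]], xd)

snr_like = np.abs(out) / out.std()
print("peak statistic:", snr_like.max())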

Read this paper on arXiv…

X. Guo, Q. Chu, S. Chung, et. al.
Thu, 9 Feb 17
38/67

Comments: N/A

OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing [IMA]

http://arxiv.org/abs/1701.04907


The volume of data generated by modern astronomical telescopes is extremely large and rapidly growing. However, current high-performance data processing architectures/frameworks are not well suited for astronomers because of their limitations and programming difficulties. In this paper, we therefore present OpenCluster, an open-source distributed computing framework to support rapidly developing high-performance processing pipelines of astronomical big data. We first detail the OpenCluster design principles and implementations and present the APIs facilitated by the framework. We then demonstrate a case in which OpenCluster is used to resolve complex data processing problems for developing a pipeline for the Mingantu Ultrawide Spectral Radioheliograph. Finally, we present our OpenCluster performance evaluation. Overall, OpenCluster provides not only high fault tolerance and simple programming interfaces, but also a flexible means of scaling up the number of interacting entities. OpenCluster thereby provides an easily integrated distributed computing framework for quickly developing a high-performance data processing system of astronomical telescopes and for significantly reducing software development expenses.

Read this paper on arXiv…

S. Wei, F. Wang, H. Deng, et. al.
Thu, 19 Jan 17
3/42

Comments: N/A

Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures [CL]

http://arxiv.org/abs/1612.06090


We describe a strategy for code modernisation of Gadget, a widely used community code for computational astrophysics. The focus of this work is on node-level performance optimisation, targeting current multi/many-core Intel architectures. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm. The code modifications include threading parallelism optimisation, change of the data layout into Structure of Arrays (SoA), auto-vectorisation and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon ($2.6 \times$ on Ivy Bridge) and Xeon Phi ($13.7 \times$ on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimisation solutions to upcoming architectures.
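
The central data-layout change, from an Array of Structures to a Structure of Arrays, is easy to illustrate; the NumPy sketch below (toy particle data, not the Gadget kernel) contrasts the two layouts and shows the vectorisable sweep that SoA enables:

# Array-of-Structures vs Structure-of-Arrays, the layout change described
# in the paper, illustrated with NumPy (a sketch, not the Gadget kernel).
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# AoS: one record per particle -- poor for vectorisation / cache streaming
aos = [{"x": rng.random(), "y": rng.random(), "z": rng.random(),
        "h": 0.05} for _ in range(n)]

# SoA: one contiguous array per field -- what the compiler can vectorise
x = np.fromiter((p["x"] for p in aos), dtype=np.float64, count=n)
y = np.fromiter((p["y"] for p in aos), dtype=np.float64, count=n)
z = np.fromiter((p["z"] for p in aos), dtype=np.float64, count=n)
h = np.fromiter((p["h"] for p in aos), dtype=np.float64, count=n)

# SPH-style neighbour weight of every particle w.r.t. particle 0,
# computed as one vectorised sweep over the SoA arrays
dx, dy, dz = x - x[0], y - y[0], z - z[0]
r = np.sqrt(dx * dx + dy * dy + dz * dz)
q = r / h
w = np.where(q < 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3, 0.0)  # cubic-spline core
print("neighbours of particle 0:", int((w > 0).sum()))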

Read this paper on arXiv…

F. Baruffa, L. Iapichino, N. Hammer, et. al.
Tue, 20 Dec 16
85/88

Comments: 18 pages, 5 figures, submitted

Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference [CL]

http://arxiv.org/abs/1611.03404


Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astronomical datasets. Our algorithmic innovations include a fast numerical optimization routine for Bayesian posterior inference and a statistically efficient scheme for decomposing astronomical optimization problems into subproblems.
Our scalable implementation is written entirely in Julia, a new high-level dynamic programming language designed for scientific and numerical computing. We use Julia’s high-level constructs for shared and distributed memory parallelism, and demonstrate effective load balancing and efficient scaling on up to 8192 Xeon cores on the NERSC Cori supercomputer.

Read this paper on arXiv…

J. Regier, K. Pamnany, R. Giordano, et. al.
Fri, 11 Nov 16
11/40

Comments: submitting to IPDPS’17

A Survey of High Level Frameworks in Block-Structured Adaptive Mesh Refinement Packages [CL]

http://arxiv.org/abs/1610.08833


Over the last decade block-structured adaptive mesh refinement (SAMR) has found increasing use in large, publicly available codes and frameworks. SAMR frameworks have evolved along different paths. Some have stayed focused on specific domain areas, others have pursued a more general functionality, providing the building blocks for a larger variety of applications. In this survey paper we examine a representative set of SAMR packages and SAMR-based codes that have been in existence for half a decade or more, have a reasonably sized and active user base outside of their home institutions, and are publicly available. The set consists of a mix of SAMR packages and application codes that cover a broad range of scientific domains. We look at their high-level frameworks, and their approach to dealing with the advent of radical changes in hardware architecture. The codes included in this survey are BoxLib, Cactus, Chombo, Enzo, FLASH, and Uintah.

Read this paper on arXiv…

A. Dubey, A. Almgren, J. Bell, et. al.
Fri, 28 Oct 16
37/73

Comments: N/A

Extreme Scale-out SuperMUC Phase 2 – lessons learned [CL]

http://arxiv.org/abs/1609.01507


In spring 2015, the Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ) installed its new petascale system, SuperMUC Phase 2. Selected users were invited to a 28-day extreme scale-out block operation during which they were allowed to use the full system for their applications. The following projects participated in the extreme scale-out workshop: BQCD (Quantum Physics), SeisSol (Geophysics, Seismics), GPI-2/GASPI (Toolkit for HPC), Seven-League Hydro (Astrophysics), ILBDC (Lattice Boltzmann CFD), Iphigenie (Molecular Dynamics), FLASH (Astrophysics), GADGET (Cosmological Dynamics), PSC (Plasma Physics), waLBerla (Lattice Boltzmann CFD), Musubi (Lattice Boltzmann CFD), Vertex3D (Stellar Astrophysics), CIAO (Combustion CFD), and LS1-Mardyn (Material Science). The projects were allowed to use the machine exclusively during the 28-day period, which corresponds to a total of 63.4 million core-hours, of which 43.8 million core-hours were used by the applications, resulting in a utilization of 69%. The top three users consumed 15.2, 6.4, and 4.7 million core-hours, respectively.

Read this paper on arXiv…

N. Hammer, F. Jamitzky, H. Satzger, et. al.
Wed, 7 Sep 16
46/61

Comments: 10 pages, 5 figures, presented at ParCo2015 – Advances in Parallel Computing, held in Edinburgh, September 2015. The final publication is available at IOS Press through this http URL

SpECTRE: A Task-based Discontinuous Galerkin Code for Relativistic Astrophysics [HEAP]

http://arxiv.org/abs/1609.00098


We introduce a new relativistic astrophysics code, SpECTRE, that combines a discontinuous Galerkin method with a task-based parallelism model. SpECTRE’s goal is to achieve more accurate solutions for challenging relativistic astrophysics problems such as core-collapse supernovae and binary neutron star mergers. The robustness of the discontinuous Galerkin method allows for the use of high-resolution shock capturing methods in regions where (relativistic) shocks are found, while exploiting high-order accuracy in smooth regions. A task-based parallelism model allows efficient use of the largest supercomputers for problems with a heterogeneous workload over disparate spatial and temporal scales. We argue that the locality and algorithmic structure of discontinuous Galerkin methods will exhibit good scalability within a task-based parallelism framework. We demonstrate the code on a wide variety of challenging benchmark problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the code’s scalability including its strong scaling on the NCSA Blue Waters supercomputer up to the machine’s full capacity of 22,380 nodes using 671,400 threads.

Read this paper on arXiv…

L. Kidder, S. Field, F. Foucart, et. al.
Fri, 2 Sep 16
6/49

Comments: 39 pages, 13 figures, and 7 tables

A Communication Efficient and Scalable Distributed Data Mining for the Astronomical Data [IMA]

http://arxiv.org/abs/1606.07345


By 2020, ~60 PB of archived data will be accessible to astronomers, but analyzing such a volume of data will be a challenging task. This is largely due to the prevailing computational model, in which data are downloaded from complex, geographically distributed archives to a central site and then analyzed on local systems. Because the data have to be downloaded to the central site, network bandwidth limitations become a hindrance to scientific discovery, and analyzing PB-scale data on local machines in a centralized manner is itself challenging. The Virtual Observatory (VO) is a step towards addressing this problem, but it does not provide a data mining model. Adding a distributed data mining layer to the VO can be the solution: astronomers download the extracted knowledge instead of the raw data, and can then either reconstruct the data from the downloaded knowledge or use the knowledge directly for further analysis. Therefore, in this paper, we present Distributed Load Balancing Principal Component Analysis, which optimally distributes the computation among the available nodes to minimize the transmission cost and the downloading cost for the end user. The experimental analysis is done with the Fundamental Plane (FP) data, the Gadotti data and the complex Mfeat data. In terms of transmission cost, our approach performs better than the methods of Qi et al. and Yue et al. The analysis shows that with the complex Mfeat data, ~90% of the downloading cost can be eliminated for the end user, with negligible loss in accuracy.
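
The "download the knowledge, not the data" idea can be sketched as follows: each site computes a local PCA and transmits only its leading components and summary statistics, which a central site combines into an approximate global decomposition. This is a simplified illustration under assumed data shapes, not the paper’s load-balanced algorithm:

# Sketch of the "ship knowledge, not data" idea behind distributed PCA:
# each node sends only its local principal components and statistics.
# A simplified illustration, not the paper's load-balanced method.
import numpy as np

rng = np.random.default_rng(4)
d, k = 10, 3                                  # feature dim, components kept

def local_summary(X):
    """What a node would transmit: sample count, mean, top-k components."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return X.shape[0], mu, s[:k], Vt[:k]

nodes = [rng.standard_normal((500, d)) @ rng.standard_normal((d, d))
         for _ in range(4)]                   # data resident on 4 sites
summaries = [local_summary(X) for X in nodes]

# central site: rebuild an approximate global scatter from the summaries
n_tot = sum(n for n, _, _, _ in summaries)
mu_g = sum(n * mu for n, mu, _, _ in summaries) / n_tot
scatter = np.zeros((d, d))
for n, mu, s, Vt in summaries:
    scatter += Vt.T @ np.diag(s**2) @ Vt + n * np.outer(mu - mu_g, mu - mu_g)

eigvals = np.linalg.eigvalsh(scatter)[::-1]
print("approx. global variance captured by top-3 directions: "
      f"{eigvals[:3].sum() / eigvals.sum():.2%}")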

Read this paper on arXiv…

A. Govada and S. Sahay
Fri, 24 Jun 16
46/47

Comments: Accepted in Astronomy and Computing, 2016, 20 Pages, 19 Figures

Mathematical Foundations of the GraphBLAS [CL]

http://arxiv.org/abs/1606.05790


The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. Mathematically, the GraphBLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix multiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of a small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.
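
The stated connection between incidence and adjacency matrices via matrix multiplication can be shown directly with SciPy sparse matrices; the sketch below is illustrative and is not a GraphBLAS-conformant implementation:

# The abstract's point that incidence and adjacency matrices are connected
# by matrix multiplication, shown with SciPy sparse matrices.
import numpy as np
import scipy.sparse as sp

# a small directed graph given as an edge list (source, target)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]
n_vertices, n_edges = 4, len(edges)

rows = np.arange(n_edges)
src = np.array([s for s, _ in edges])
dst = np.array([t for _, t in edges])

# incidence matrices: E_out marks each edge's source, E_in its target
E_out = sp.csr_matrix((np.ones(n_edges), (rows, src)), shape=(n_edges, n_vertices))
E_in = sp.csr_matrix((np.ones(n_edges), (rows, dst)), shape=(n_edges, n_vertices))

# adjacency matrix recovered by a single matrix multiplication
A = (E_out.T @ E_in).toarray()
print(A)

# one step of a matrix-based graph algorithm: vertices reachable in 2 hops
frontier = np.zeros(n_vertices); frontier[0] = 1
print("2-hop frontier from vertex 0:", (frontier @ A @ A > 0).astype(int))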

Read this paper on arXiv…

J. Kepner, P. Aaltonen, D. Bader, et. al.
Tue, 21 Jun 16
72/75

Comments: 9 pages; 11 figures; accepted to IEEE High Performance Extreme Computing (HPEC) conference 2016

Splotch: porting and optimizing for the Xeon Phi [CL]

http://arxiv.org/abs/1606.04427


With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous High Performance Computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new Many Integrated Core architecture. We discuss the steps taken to offload data to the coprocessor, algorithmic modifications to aid faster processing on the many-core architecture, and use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi coprocessors. Finally, performance is compared against results achieved with the GPU implementation of Splotch.

Read this paper on arXiv…

T. Dykes, C. Gheller, M. Rivi, et. al.
Wed, 15 Jun 16
20/54

Comments: Version 1, 11 pages, 14 figures. Accepted for publication in International Journal of High Performance Computing Applications (IJHPCA)

SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores [CL]

http://arxiv.org/abs/1606.02738


We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches: (1) Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores. (2) Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, rather than just the data (as in most partitioning schemes), is equally distributed across all nodes. (3) Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring tasks that rely on data from other nodes until it arrives. In order to use these approaches, the code had to be re-written from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.
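
Points (1) and (3) describe a pattern in which tasks with dependencies run as soon as their inputs are ready and communication is treated as just another task; a toy thread-pool sketch of that pattern follows (it is not SWIFT's scheduler or its MPI layer):

# A minimal task-graph sketch in the spirit of points (1) and (3):
# tasks declare dependencies, run on a thread pool, and a "receive"
# task is just another node in the graph.  A toy only.
from concurrent.futures import ThreadPoolExecutor
import time

def density(cell):
    time.sleep(0.01)                       # stand-in for a density loop
    return f"rho({cell})"

def recv_halo(cell):
    time.sleep(0.02)                       # stand-in for async communication
    return f"halo({cell})"

def force(cell, rho, halo):                # depends on the two tasks above
    return f"force({cell}) from {rho} + {halo}"

with ThreadPoolExecutor(max_workers=4) as pool:
    rho_f = {c: pool.submit(density, c) for c in range(8)}
    halo_f = {c: pool.submit(recv_halo, c) for c in range(8)}
    # force tasks are submitted with their dependencies' results; each one
    # only starts once density and halo data for its cell are available
    force_f = {c: pool.submit(force, c, rho_f[c].result(), halo_f[c].result())
               for c in range(8)}
    results = {c: f.result() for c, f in force_f.items()}

print(results[0])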

Read this paper on arXiv…

M. Schaller, P. Gonnet, A. Chalk, et. al.
Fri, 10 Jun 16
33/54

Comments: 9 pages, 7 figures. Code, scripts and examples available at this http URL

The Latin American Giant Observatory: a successful collaboration in Latin America based on Cosmic Rays and computer science domains [IMA]

http://arxiv.org/abs/1605.09295


In this work the strategy of the Latin American Giant Observatory (LAGO) to build a Latin American collaboration is presented. By installing cosmic-ray detectors all around the continent, from Mexico to Antarctica, this collaboration is forming a community that embraces both high-energy physicists and computer scientists. This is because the measured data must be analytically processed, and because \textit{a priori} and \textit{a posteriori} simulations representing the effects of the radiation must be performed. To perform these calculations, the collaboration has implemented customized codes. Given the huge amount of data emerging from this network of sensors and from computational simulations performed on a diversity of computing architectures and e-infrastructures, an effort is under way to catalog and preserve the data produced by the water-Cherenkov detector network and the complete LAGO simulation workflow that characterizes each site. Metadata, permanent identifiers and the facilities of the LAGO Data Repository are described in this work together with the simulation codes used. These initiatives allow researchers to produce and find data, and to use them directly in running code by means of a Science Gateway that provides access to different cluster, Grid and Cloud infrastructures worldwide.

Read this paper on arXiv…

H. Asorey, R. Mayo-Garcia, L. Nunez, et. al.
Tue, 31 May 16
4/70

Comments: to be published in Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing

Fine tuning consensus optimization for distributed radio interferometric calibration [IMA]

http://arxiv.org/abs/1605.09219


We recently proposed the use of consensus optimization as a viable and effective way to improve the quality of calibration of radio interferometric data. We showed that it is possible to obtain far more accurate calibration solutions and also to distribute the compute load across a network of computers by using this technique. A crucial aspect in any consensus optimization problem is the selection of the penalty parameter used in the alternating direction method of multipliers (ADMM) iterations. This affects the convergence speed as well as the accuracy. In this paper, we use the Hessian of the cost function used in calibration to appropriately select this penalty. We extend our results to a multi-directional calibration setting, where we propose to use a penalty scaled by the squared intensity of each direction.
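
For readers unfamiliar with where the penalty parameter enters the ADMM iterations, the sketch below runs a toy consensus problem with a hand-picked penalty rho; the paper’s contribution, selecting rho from the Hessian of the calibration cost and scaling it by the squared direction intensity, is only indicated in the comments:

# Consensus ADMM in miniature, showing where the penalty parameter rho
# enters the updates.  The paper selects rho from the Hessian of the
# calibration cost (scaled by squared direction intensity); here rho is
# simply set by hand to keep the sketch self-contained.
import numpy as np

rng = np.random.default_rng(5)
a = rng.standard_normal((4, 3))        # per-worker data (4 workers, 3 params)
rho = 2.0                              # ADMM penalty parameter

x = np.zeros_like(a)                   # local solutions, one per worker
z = np.zeros(3)                        # consensus (global) solution
u = np.zeros_like(a)                   # scaled dual variables

for _ in range(100):
    # local updates: argmin_x 0.5*||x - a_i||^2 + (rho/2)*||x - z + u_i||^2
    x = (a + rho * (z - u)) / (1.0 + rho)
    z = (x + u).mean(axis=0)           # consensus update (simple average)
    u = u + x - z                      # dual ascent

print("consensus solution:", z, " true average:", a.mean(axis=0))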

Read this paper on arXiv…

S. Yatawatta
Tue, 31 May 16
67/70

Comments: Draft, to be published in the Proceedings of the 24th European Signal Processing Conference (EUSIPCO-2016) in 2016, published by EURASIP

Convection in Oblate Solar-Type Stars [SSA]

http://arxiv.org/abs/1603.05299


We present the first global 3D simulations of thermal convection in the oblate envelopes of rapidly-rotating solar-type stars. This has been achieved by exploiting the capabilities of the new Compressible High-ORder Unstructured Spectral difference (CHORUS) code. We consider rotation rates up to 85\% of the critical (breakup) rotation rate, which yields an equatorial radius that is up to 17\% larger than the polar radius. This substantial oblateness enhances the disparity between polar and equatorial modes of convection. We find that the convection redistributes the heat flux emitted from the outer surface, leading to an enhancement of the heat flux in the polar and equatorial regions. This finding implies that lower-mass stars with convective envelopes may not have darker equators as predicted by classical gravity darkening arguments. The vigorous high-latitude convection also establishes elongated axisymmetric circulation cells and zonal jets in the polar regions. Though the overall amplitude of the surface differential rotation, $\Delta \Omega$, is insensitive to the oblateness, the oblateness does limit the fractional kinetic energy contained in the differential rotation to no more than 61\%. Furthermore, we argue that this level of differential rotation is not enough to have a significant impact on the oblateness of the star.

Read this paper on arXiv…

J. Wang, M. Miesch and C. Liang
Fri, 18 Mar 16
4/53

Comments: N/A

PageRank Pipeline Benchmark: Proposal for a Holistic System Benchmark for Big-Data Platforms [CL]

http://arxiv.org/abs/1603.01876


The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these challenges for decades and developed methodologies for creating rigorous scalable benchmarks (e.g., HPC Challenge). The proposed PageRank pipeline benchmark employs supercomputing benchmarking methodologies to create a scalable benchmark that is reflective of many real-world big data processing systems. The PageRank pipeline benchmark builds on prior scalable benchmarks (Graph500, Sort, and PageRank) to create a holistic benchmark with multiple integrated kernels that can be run together or independently. Each kernel is well defined mathematically and can be implemented in any programming environment. The linear algebraic nature of PageRank makes it well suited to being implemented using the GraphBLAS standard. The computations are simple enough that performance predictions can be made based on simple computing hardware models. The surrounding kernels provide the context for each kernel that allows rigorous definition of both the input and the output for each kernel. Furthermore, since the proposed PageRank pipeline benchmark is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present-day and future systems. Serial implementations in C++, Python, Python with Pandas, Matlab, Octave, and Julia have been implemented and their single-threaded performance has been measured.
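
The PageRank kernel at the centre of the pipeline is a sparse matrix-vector iteration; the sketch below implements that kernel alone (random toy graph, no Graph500 generator, sorting or I/O stages) to make its linear-algebraic character concrete:

# The linear-algebraic PageRank kernel at the heart of the pipeline,
# written as a sparse matrix-vector iteration (a sketch of this kernel
# only; the benchmark also specifies generation, sorting and I/O kernels).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(6)
n, nnz = 1000, 8000
src = rng.integers(0, n, nnz)
dst = rng.integers(0, n, nnz)
A = sp.csr_matrix((np.ones(nnz), (src, dst)), shape=(n, n))

out_deg = np.asarray(A.sum(axis=1)).ravel()
out_deg[out_deg == 0] = 1.0                    # avoid division by zero
P = sp.diags(1.0 / out_deg) @ A                # row-stochastic transition matrix

alpha, r = 0.85, np.full(n, 1.0 / n)
for _ in range(50):                            # power iteration
    r = alpha * (P.T @ r) + (1.0 - alpha) / n
    r /= r.sum()

print("top-5 ranked vertices:", np.argsort(r)[-5:][::-1])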

Read this paper on arXiv…

P. Dreher, C. Byun, C. Hill, et. al.
Tue, 8 Mar 16
82/83

Comments: 9 pages, 7 figures, to appear in IPDPS 2016 Graph Algorithms Building Blocks (GABB) workshop

The Matsu Wheel: A Cloud-based Framework for Efficient Analysis and Reanalysis of Earth Satellite Imagery [CL]

http://arxiv.org/abs/1602.06888


Project Matsu is a collaboration between the Open Commons Consortium and NASA focused on developing open source technology for the cloud-based processing of Earth satellite imagery. A particular focus is the development of applications for detecting fires and floods to help support natural disaster detection and relief. Project Matsu has developed an open source cloud-based infrastructure to process, analyze, and reanalyze large collections of hyperspectral satellite image data using OpenStack, Hadoop, MapReduce, Storm and related technologies.
We describe a framework for efficient analysis of large amounts of data called the Matsu “Wheel.” The Matsu Wheel is currently used to process incoming hyperspectral satellite data produced daily by NASA’s Earth Observing-1 (EO-1) satellite. The framework is designed to support scanning queries using cloud computing applications, such as Hadoop and Accumulo. A scanning query processes all, or most, of the data in a database or data repository.
We also describe our preliminary Wheel analytics, including an anomaly detector for rare spectral signatures or thermal anomalies in hyperspectral data and a land cover classifier that can be used for water and flood detection. Each of these analytics can generate visual reports accessible via the web for the public and interested decision makers. The resultant products of the analytics are also made accessible through an Open Geospatial Consortium (OGC)-compliant Web Map Service (WMS) for further distribution. The Matsu Wheel allows many shared data services to be performed together, making efficient use of resources for processing hyperspectral satellite image data and other large datasets, e.g. environmental data, that may be analyzed for many purposes.
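
The scanning-query pattern, in which a single pass over the data lets every registered analytic see each record, can be sketched generically; the toy Python loop below stands in for, and is not, the project’s Hadoop/Accumulo implementation:

# The "wheel" / scanning-query pattern in miniature: a single pass over
# the data during which every registered analytic sees each record, so
# the (expensive) scan is paid once for all analytics.
import numpy as np

rng = np.random.default_rng(7)
scenes = ({"id": i, "pixels": rng.random((64, 64))} for i in range(100))

def anomaly_detector(scene):
    z = (scene["pixels"] - scene["pixels"].mean()) / scene["pixels"].std()
    return {"id": scene["id"], "n_anomalous": int((np.abs(z) > 4).sum())}

def water_classifier(scene):
    return {"id": scene["id"], "water_frac": float((scene["pixels"] < 0.1).mean())}

analytics = [anomaly_detector, water_classifier]
reports = {a.__name__: [] for a in analytics}

for scene in scenes:                 # one scan over the data store
    for analytic in analytics:       # every analytic sees every record
        reports[analytic.__name__].append(analytic(scene))

print({name: len(r) for name, r in reports.items()})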

Read this paper on arXiv…

M. Patterson, N. Anderson, C. Bennett, et. al.
Tue, 23 Feb 16
75/78

Comments: 10 pages, accepted for presentation to IEEE BigDataService 2016

Gravitational wave astrophysics, data analysis and multimessenger astronomy [IMA]

http://arxiv.org/abs/1602.05573


This paper reviews gravitational wave sources and their detection. One of the most exciting potential sources of gravitational waves is coalescing binary black hole systems. They can occur on all mass scales and be formed in numerous ways, many of which are not understood. They are generally invisible in electromagnetic waves, and they provide opportunities for deep investigation of Einstein’s general theory of relativity. Sect. 1 of this paper considers ways that binary black holes can be created in the universe, and includes the prediction that binary black hole coalescence events are likely to be the first gravitational wave sources to be detected. The next parts of this paper address the detection of chirp waveforms from coalescence events in noisy data. Such analysis is computationally intensive. Sect. 2 reviews a new and powerful method of signal detection based on GPU-implemented summed parallel infinite impulse response filters. Such filters are intrinsically real-time algorithms that can be used to rapidly detect and localise signals. Sect. 3 of the paper reviews the use of GPU processors for rapid searching for gravitational wave bursts that can arise from black hole births and coalescences. In Sect. 4 the use of GPU processors to enable fast, efficient statistical significance testing of gravitational wave event candidates is reviewed. Sect. 5 of this paper addresses the method of multimessenger astronomy, where the discovery of electromagnetic counterparts of gravitational wave events can be used to identify sources, understand their nature and obtain much greater science outcomes from each identified event.

Read this paper on arXiv…

H. Lee, E. Bigot, Z. Du, et. al.
Fri, 19 Feb 16
42/50

Comments: N/A

Auto-Tuning Dedispersion for Many-Core Accelerators [CL]

http://arxiv.org/abs/1601.05052


In this paper, we study the parallelization of the dedispersion algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. An important contribution is the computational analysis of the algorithm, from which we conclude that dedispersion is inherently memory-bound in any realistic scenario, in contrast to earlier reports. We also provide empirical proof that, even in unrealistic scenarios, hardware limitations keep the arithmetic intensity low, thus limiting performance. We exploit auto-tuning to adapt the algorithm, not only to different accelerators, but also to different observations, and even telescopes. Our experiments show how the algorithm is tuned automatically for different scenarios and how it exploits and highlights the underlying specificities of the hardware: in some observations, the tuner automatically optimizes device occupancy, while in others it optimizes memory bandwidth. We quantitatively analyze the problem space, and by comparing the results of optimal auto-tuned versions against the best performing fixed codes, we show the impact that auto-tuning has on performance, and conclude that it is statistically relevant.
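
Brute-force incoherent dedispersion is short enough to sketch, and doing so makes the memory-bound character described above plausible: each trial dispersion measure performs essentially one addition per loaded sample. The NumPy toy below uses the standard cold-plasma delay formula and is not the auto-tuned many-core code:

# Brute-force incoherent dedispersion (a sketch, not the auto-tuned
# many-core code).  Each trial DM shifts every frequency channel by its
# dispersion delay and sums the channels.
import numpy as np

rng = np.random.default_rng(8)
n_chan, n_samp, dt = 256, 4096, 6.4e-5          # channels, samples, sec/sample
freqs = np.linspace(1500.0, 1200.0, n_chan)     # channel frequencies in MHz
data = rng.standard_normal((n_chan, n_samp)).astype(np.float32)  # toy noise

def dedisperse(data, dm):
    """Sum channels after removing the cold-plasma dispersion delay."""
    delays = 4.148808e3 * dm * (freqs**-2 - freqs[0]**-2)   # seconds
    shifts = np.round(delays / dt).astype(int)
    out = np.zeros(n_samp, dtype=np.float32)
    for c in range(n_chan):
        out += np.roll(data[c], -shifts[c])     # one add per loaded sample
    return out

dm_trials = np.arange(0.0, 100.0, 2.0)
series = np.stack([dedisperse(data, dm) for dm in dm_trials])
print("strongest candidate (DM index, sample):",
      np.unravel_index(series.argmax(), series.shape))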

Read this paper on arXiv…

A. Sclocco, H. Bal, J. Hessels, et. al.
Wed, 20 Jan 16
5/58

Comments: 10 pages, published in the proceedings of IPDPS 2014

A polyphase filter for many-core architectures [IMA]

http://arxiv.org/abs/1511.03599


In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards, on dual Intel Xeon CPUs and on the Intel Xeon Phi (Knights Corner) platform. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/Texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems; we also present results in terms of bandwidth (GB/s), compute (GFlop/s) and type conversions (GTc/s). We further present our results in terms of the sample rate that can be processed in real time by a chosen platform, which more intuitively describes the expected performance in a signal-processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower-precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.47x to 1.95x greater than our CPU implementation, but that this is not sufficient to compete with the performance of the GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
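
The polyphase filter bank structure being optimised can be sketched in NumPy: demultiplex the input into P branches, weight each branch with the corresponding taps of a prototype filter, sum over the taps, then FFT across the branches. The cache and shared-memory strategies the article evaluates are not represented in this sketch:

# Polyphase filterbank structure in miniature.  A NumPy sketch only; the
# article's point is how to realise this efficiently on GPUs/Xeon Phi.
import numpy as np

rng = np.random.default_rng(9)
P, taps = 64, 8                               # channels (branches), taps/branch
x = rng.standard_normal(P * taps * 200)       # input samples (toy data stream)

# prototype low-pass filter: windowed sinc with P*taps coefficients
n = np.arange(P * taps)
proto = np.sinc((n - P * taps / 2) / P) * np.hamming(P * taps)
h = proto.reshape(taps, P)                    # tap-major polyphase layout

n_blocks = len(x) // P - taps + 1
spectra = np.empty((n_blocks, P), dtype=complex)
frames = x[: (len(x) // P) * P].reshape(-1, P)
for b in range(n_blocks):
    weighted = (frames[b:b + taps] * h).sum(axis=0)   # weighted sum over taps
    spectra[b] = np.fft.fft(weighted)                 # FFT across branches

print("output shape (blocks x channels):", spectra.shape)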

Read this paper on arXiv…

K. Adamek, J. Novotny and W. Armour
Thu, 12 Nov 15
37/61

Comments: 19 pages, 20 figures, 5 tables

SWIFT: task-based hydrodynamics and gravity for cosmological simulations [IMA]

http://arxiv.org/abs/1508.00115


Simulations of galaxy formation follow the gravitational and hydrodynamical interactions between gas, stars and dark matter through cosmic time. The huge dynamic range of such calculations severely limits strong scaling behaviour of the community codes in use, with load-imbalance, cache inefficiencies and poor vectorisation limiting performance. The new SWIFT code exploits task-based parallelism designed for many-core compute nodes interacting via MPI using asynchronous communication to improve speed and scaling. A graph-based domain decomposition schedules interdependent tasks over available resources. Strong scaling tests on realistic particle distributions yield excellent parallel efficiency, and efficient cache usage provides a large speed-up compared to current codes even on a single core. SWIFT is designed to be easy to use by shielding the astronomer from computational details such as the construction of the tasks or MPI communication. The techniques and algorithms used in SWIFT may benefit other computational physics areas as well, for example that of compressible hydrodynamics. For details of this open-source project, see www.swiftsim.com

Read this paper on arXiv…

T. Theuns, A. Chalk, M. Schaller, et. al.
Tue, 4 Aug 15
26/54

Comments: Proceedings of the EASC 2015 conference, Edinburgh, UK, April 21-23, 2015