HLC2: a highly efficient cross-matching framework for large astronomical catalogues on heterogeneous computing environments [IMA]

http://arxiv.org/abs/2301.07331


The cross-matching operation, which finds corresponding data for the same celestial object or region in multiple catalogues, is indispensable to astronomical data analysis and research. Due to the large volume of astronomical catalogues generated by ongoing and next-generation large-scale sky surveys, the computational cost of cross-matching is increasing dramatically. Heterogeneous computing environments provide a theoretical possibility to accelerate cross-matching, but the performance advantages of heterogeneous computing resources have not been fully utilized. To meet the challenge of cross-matching the substantially increasing amount of astronomical observation data, this paper proposes the Heterogeneous-computing-enabled Large Catalogue Cross-matcher (HLC2), a high-performance cross-matching framework based on spherical position deviation on CPU-GPU heterogeneous computing platforms. It supports scalable and flexible cross-matching and can be directly applied to the fusion of large astronomical catalogues from survey missions and astronomical data centres. A performance estimation model is proposed to locate performance bottlenecks and guide the optimizations. A two-level partitioning strategy is designed to generate an optimized data placement according to the positions of celestial objects to increase throughput. To make HLC2 a more adaptive solution, architecture-aware task splitting, thread parallelization, and concurrent scheduling strategies are designed and integrated. Moreover, a novel quad-direction strategy is proposed for the boundary problem to effectively balance performance and completeness. We have experimentally evaluated HLC2 using publicly released catalogue data. Experiments demonstrate that HLC2 scales well on different sizes of catalogues and that the cross-matching speed is significantly improved compared to state-of-the-art cross-matchers.
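The core operation being accelerated here is positional matching by angular separation on the sphere. As a point of reference only (this is not the HLC2 implementation), a minimal single-node sketch of such a match can be written with astropy; the random catalogues and the 1-arcsecond tolerance are assumptions for illustration.

```python
# Minimal positional cross-match sketch (illustrative baseline, not HLC2 itself).
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

# Hypothetical catalogues with RA/Dec in degrees.
rng = np.random.default_rng(0)
cat1 = SkyCoord(ra=rng.uniform(0, 10, 1000) * u.deg,
                dec=rng.uniform(-5, 5, 1000) * u.deg)
cat2 = SkyCoord(ra=rng.uniform(0, 10, 2000) * u.deg,
                dec=rng.uniform(-5, 5, 2000) * u.deg)

# For each object in cat1, find its nearest neighbour in cat2 on the sphere.
idx, sep2d, _ = cat1.match_to_catalog_sky(cat2)

# Keep pairs whose spherical position deviation is below an assumed tolerance.
tolerance = 1.0 * u.arcsec
matched = sep2d < tolerance
print(f"{matched.sum()} of {len(cat1)} objects matched within {tolerance}")
```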

Read this paper on arXiv…

Y. Zhang, C. Yu, C. Sun, et al.
Thu, 19 Jan 23
100/100

Comments: Accepted for publication in Monthly Notices of the Royal Astronomical Society

SciTS: A Benchmark for Time-Series Database in Scientific Experiments and Industrial Internet of Things [CL]

http://arxiv.org/abs/2204.09795


Time-series data is used increasingly widely in the Industrial Internet of Things (IIoT) and in large-scale scientific experiments. Managing time-series data requires a storage engine that can keep up with its constantly growing volume while providing acceptable query latency. While traditional ACID databases favor consistency over performance, many time-series databases with novel storage engines have been developed to provide better ingestion performance and lower query latency. To understand how the unique design of a time-series database affects its performance, we design SciTS, a highly extensible and parameterizable benchmark for time-series data. The benchmark studies the data ingestion capabilities of time-series databases, especially as the stored data grows in size. It also studies the latencies of five practical queries drawn from the scientific-experiment use case. We use SciTS to evaluate the performance of four databases with four distinct storage engines: ClickHouse, InfluxDB, TimescaleDB, and PostgreSQL.
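A central measurement in such a benchmark is ingestion throughput under batched inserts. The sketch below, a generic illustration rather than part of SciTS, times batched inserts into a PostgreSQL/TimescaleDB table using psycopg2; the connection string, table schema and batch size are assumptions.

```python
# Generic batched-ingestion timing sketch (illustrative, not the SciTS benchmark).
import time
from datetime import datetime, timedelta, timezone
import psycopg2

conn = psycopg2.connect("dbname=bench user=bench")  # assumed connection string
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sensor_data "
            "(ts TIMESTAMPTZ NOT NULL, sensor_id INT, value DOUBLE PRECISION)")
conn.commit()

batch_size = 10_000
t0 = datetime.now(timezone.utc)
rows = [(t0 + timedelta(milliseconds=i), i % 100, float(i)) for i in range(batch_size)]

start = time.perf_counter()
cur.executemany("INSERT INTO sensor_data (ts, sensor_id, value) VALUES (%s, %s, %s)",
                rows)
conn.commit()
elapsed = time.perf_counter() - start
print(f"Ingested {batch_size} rows in {elapsed:.3f} s "
      f"({batch_size / elapsed:,.0f} rows/s)")
```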

Read this paper on arXiv…

J. Mostafa, S. Wehbi, S. Chilingaryan, et al.
Fri, 22 Apr 22
51/64

Comments: N/A

The Locus Algorithm IV: Performance metrics of a grid computing system used to create catalogues of optimised pointings [IMA]

http://arxiv.org/abs/2003.04570


This paper discusses the requirements for and performance metrics of the Grid Computing system used to implement the Locus Algorithm to identify optimum pointings for differential photometry of 61,662,376 stars and 23,779 quasars. Initial operational tests indicated a need for a software system to analyse the data and a High Performance Computing system to run that software in a scalable manner. Practical assessments of the performance of the software in a serial computing environment were used to provide a benchmark against which the performance metrics of the HPC solution could be compared, as well as to indicate any bottlenecks in performance. These performance metrics indicated a distinct split in performance dictated more by differences in the input data than by differences in the design of the systems used. This indicates a need for experimental analysis of system performance, and suggests that algorithmic complexity analyses may lead to incorrect or naive conclusions, especially in systems with high data I/O overhead such as grid computing. Further, it implies that systems which reduce or eliminate this bottleneck, such as in-memory processing, could lead to a substantial increase in performance.

Read this paper on arXiv…

O. Creaner, J. Walsh, K. Nolan, et al.
Wed, 11 Mar 20
44/65

Comments: 6 Pages, 1 Figure

Honing and proofing Astrophysical codes on the road to Exascale. Experiences from code modernization on many-core systems [CL]

http://arxiv.org/abs/2002.08161


The complexity of modern and upcoming computing architectures poses severe challenges for code developers and application specialists, and forces them to expose the highest possible degree of parallelism in order to make the best use of the available hardware. The second-generation Intel(R) Xeon Phi(TM) (code-named Knights Landing, henceforth KNL) is the latest many-core system, which implements several interesting hardware features, such as a large number of cores per node (up to 72), 512-bit-wide vector registers and high-bandwidth memory. The unique features of KNL make this platform a powerful testbed for modern HPC applications. The performance of codes on KNL is therefore a useful proxy of their readiness for future architectures. In this work we describe the lessons learnt during the optimisation of the widely used computational astrophysics codes P-Gadget-3, Flash and Echo. Moreover, we present results for the visualisation and analysis tools VisIt and yt. These examples show that modern architectures benefit from code optimisation at different levels, even more than traditional multi-core systems. However, the level of modernisation of typical community codes still needs improvement for them to fully utilise the resources of novel architectures.

Read this paper on arXiv…

S. Cielo, L. Iapichino, F. Baruffa, et al.
Thu, 20 Feb 20
36/61

Comments: 16 pages, 10 figures, 4 tables. To be published in Future Generation of Computer Systems (FGCS), Special Issue on “On The Road to Exascale II: Advances in High Performance Computing and Simulations”

Two-level Dynamic Load Balancing for High Performance Scientific Applications [CL]

http://arxiv.org/abs/1911.06714


Scientific applications are often complex, irregular, and computationally intensive. To accommodate the ever-increasing computational demands of scientific applications, high-performance computing (HPC) systems have become larger and more complex, offering parallelism at multiple levels (e.g., nodes, cores per node, threads per core). Scientific applications need to exploit all the available multilevel hardware parallelism to harness the available computational power. The performance of applications executing on such HPC systems may be adversely affected by load imbalance at multiple levels, caused by problem, algorithmic, and systemic characteristics. Nevertheless, most existing load balancing methods do not simultaneously address load imbalance at multiple levels. This work investigates the impact of load imbalance on the performance of three scientific applications at the thread and process levels. We jointly apply and evaluate selected dynamic loop self-scheduling (DLS) techniques at both levels. Specifically, we employ the extended LaPeSD OpenMP runtime library at the thread level and extend the DLS4LB MPI-based dynamic load balancing library at the process level. This approach is generic and applicable to any multiprocess-multithreaded computationally intensive application (programmed using MPI and OpenMP). We conduct an exhaustive set of experiments to assess and compare six DLS techniques at the thread level and eleven at the process level. The results show that improved application performance, by up to 21%, can only be achieved by jointly addressing load imbalance at the two levels. We offer insights into the performance of the selected DLS techniques and discuss the interplay of load balancing at the thread and process levels.
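At the process level, dynamic loop self-scheduling boils down to workers requesting chunks of loop iterations from a coordinator as they finish their previous chunk. The mpi4py sketch below shows that pattern in its simplest form (self-scheduling with chunk size 1); it is a generic illustration with an assumed task count and dummy work function, not the DLS4LB library evaluated in the paper.

```python
# Generic process-level dynamic self-scheduling sketch (mpi4py),
# illustrative only -- not the DLS4LB library used in the paper.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
N_TASKS = 200          # assumed number of loop iterations (chunks of size 1)
TAG_WORK, TAG_STOP = 1, 2

def work(i):
    # Placeholder for an irregular, computationally intensive iteration.
    time.sleep(0.001 * (i % 7))
    return i * i

if rank == 0:
    # Coordinator: hand out one iteration at a time to whoever asks.
    next_task, active = 0, size - 1
    results = []
    while active > 0:
        status = MPI.Status()
        res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if res is not None:
            results.append(res)
        if next_task < N_TASKS:
            comm.send(next_task, dest=status.Get_source(), tag=TAG_WORK)
            next_task += 1
        else:
            comm.send(None, dest=status.Get_source(), tag=TAG_STOP)
            active -= 1
    print(f"collected {len(results)} results")
else:
    # Worker: request work by sending back the previous result (None at first).
    comm.send(None, dest=0, tag=TAG_WORK)
    while True:
        status = MPI.Status()
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(work(task), dest=0, tag=TAG_WORK)
```

Launched with, e.g., mpiexec -n 4 python selfsched.py, the single coordinator rank keeps the worker ranks busy regardless of how irregular the per-iteration cost is.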

Read this paper on arXiv…

A. Mohammed, A. Cavelan, F. Ciorba, et al.
Wed, 20 Nov 19
72/73

Comments: N/A

Direct N-body application on low-power and energy-efficient parallel architectures [CL]

http://arxiv.org/abs/1910.14496


The aim of this work is to quantitatively evaluate the impact of computation on the energy consumption of ARM MPSoC platforms, exploiting CPUs, embedded GPUs and FPGAs. One of them, a prototype of an Exascale supercomputer, possibly represents the future of High Performance Computing systems. Performance and energy measurements are made using a state-of-the-art direct $N$-body code from the astrophysical domain. We provide a comparison of the time-to-solution and energy delay product metrics for different software configurations. We have shown that FPGA technologies can be used for application kernel acceleration and are emerging as a promising alternative to “traditional” HPC technologies, which focus purely on peak performance rather than on power efficiency.
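The two figures of merit compared here are time-to-solution and the energy delay product (EDP), i.e. the consumed energy multiplied by the elapsed time. Below is a compact sketch of both around a toy direct $N$-body acceleration kernel; the particle count, softening length and especially the constant power draw are assumptions, whereas real measurements would come from on-board sensors or external power meters.

```python
# Toy direct N-body kernel with time-to-solution and energy-delay-product (EDP);
# illustrative only: the power draw below is an assumed constant, not a measurement.
import time
import numpy as np

def accelerations(pos, mass, soft=1e-2):
    """O(N^2) pairwise gravitational accelerations (G = 1)."""
    diff = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]   # (N, N, 3) separations
    dist2 = np.sum(diff**2, axis=-1) + soft**2
    inv_d3 = dist2**-1.5
    np.fill_diagonal(inv_d3, 0.0)                           # no self-interaction
    return np.einsum("ijk,ij,j->ik", diff, inv_d3, mass)

rng = np.random.default_rng(0)
N = 2048                                  # assumed particle count
pos = rng.standard_normal((N, 3))
mass = np.full(N, 1.0 / N)

start = time.perf_counter()
acc = accelerations(pos, mass)
time_to_solution = time.perf_counter() - start

avg_power_watts = 15.0                    # assumed average board power
energy_joules = avg_power_watts * time_to_solution
edp = energy_joules * time_to_solution    # energy delay product [J*s]
print(f"T={time_to_solution:.3f} s, E={energy_joules:.2f} J, EDP={edp:.3f} J*s")
```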

Read this paper on arXiv…

D. Goz, G. Ieronymakis, V. Papaefstathiou, et al.
Fri, 1 Nov 19
27/54

Comments: 10 pages, 5 figure, 2 tables; The final publication will be available at IOS Press

Visualizing the world's largest turbulence simulation [CL]

http://arxiv.org/abs/1910.07850


In this exploratory submission we present the visualization of the largest interstellar turbulence simulations ever performed, unravelling key astrophysical processes concerning the formation of stars and the relative role of magnetic fields. The simulations, including pure hydrodynamical (HD) and magneto-hydrodynamical (MHD) runs, up to a size of $10048^3$ grid elements, were produced on the supercomputers of the Leibniz Supercomputing Centre and visualized using the hybrid parallel (MPI+TBB) ray-tracing engine OSPRay associated with VisIt. Besides revealing features of turbulence with an unprecedented resolution, the visualizations brilliantly showcase the stretching-and-folding mechanisms through which astrophysical processes such as supernova explosions drive turbulence and amplify the magnetic field in the interstellar gas, and how the first structures, the seeds of newborn stars, are shaped by this process.

Read this paper on arXiv…

S. Cielo, L. Iapichino, J. Günther, et al.
Fri, 18 Oct 19
39/77

Comments: 6 pages, 5 figures, accompanying paper of SC19 visualization showcase finalist. The full video is publicly available under this https URL

Speeding simulation analysis up with yt and Intel Distribution for Python [IMA]

http://arxiv.org/abs/1910.07855


As modern scientific simulations grow ever larger and more complex, even their analysis and post-processing becomes increasingly demanding, calling for the use of HPC resources and methods. yt is a parallel, open-source post-processing Python package for numerical simulations in astrophysics, made popular by its cross-format compatibility, its active community of developers and its integration with several other professional Python instruments. The Intel Distribution for Python enhances yt’s performance and parallel scalability through the optimization of the lower-level libraries NumPy and SciPy, which make use of the optimized Intel Math Kernel Library (Intel MKL) and the Intel MPI library for distributed computing. The library package yt is used for several analysis tasks, including integration of derived quantities, volumetric rendering, 2D phase plots, cosmological halo analysis and production of synthetic X-ray observations. In this paper, we provide a brief tutorial for the installation of yt and the Intel Distribution for Python, and for the execution of each analysis task. Compared to the Anaconda Python distribution, the provided solution achieves net speedups of up to 4.6x on Intel Xeon Scalable processors (codename Skylake).
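For readers unfamiliar with yt, a minimal analysis script has the shape below; the dataset name is a placeholder, and the same script runs in parallel when launched through MPI thanks to the yt.enable_parallelism() call.

```python
# Minimal yt analysis sketch (the dataset path is a placeholder).
import yt

yt.enable_parallelism()          # no-op in serial; enables MPI decomposition otherwise

ds = yt.load("output_00042")     # hypothetical simulation output
ad = ds.all_data()

# A derived quantity: total gas mass in the box.
total_mass = ad.quantities.total_quantity(("gas", "cell_mass"))
print(ds, total_mass.to("Msun"))

# A projection of the gas density along the z axis, saved to PNG.
yt.ProjectionPlot(ds, "z", ("gas", "density")).save()
```

In parallel it would be launched with something like mpirun -np 8 python analyse.py (mpi4py required); the Intel Distribution for Python then accelerates the underlying NumPy/SciPy calls transparently.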

Read this paper on arXiv…

S. Cielo, L. Iapichino and F. Baruffa
Fri, 18 Oct 19
42/77

Comments: 3 pages, 1 figure, published on Intel Parallel Universe Magazine

K-Athena: a performance portable structured grid finite volume magnetohydrodynamics code [CL]

http://arxiv.org/abs/1905.04341


Large scale simulations are a key pillar of modern research and require ever increasing computational resources. Different novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. We combine Athena++, an existing magnetohydrodynamics (MHD) CPU code, with Kokkos, a performance portable on-node parallel programming paradigm, into K-Athena to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-Athena achieves $>10^8$ cell-updates/s on a single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit (compared to 172,032 CPU cores), reaching $1.94\times10^{12}$ total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we demonstrate that the overall performance is currently limited by DRAM bandwidth and calculate a performance portability metric of 83.1%. Finally, we present the strategies used for implementation and the challenges encountered maximizing performance. This will provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-Athena is available at https://gitlab.com/pgrete/kathena .
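The quoted performance portability figure is, presumably, the harmonic-mean metric of Pennycook, Sewall and Lee, which for an application $a$, problem $p$ and platform set $H$ reads

$$ \mathrm{PP}(a,p,H) = \frac{|H|}{\sum_{i \in H} \frac{1}{e_i(a,p)}} $$

if $a$ runs on every platform in $H$ (and 0 otherwise), where $e_i(a,p)$ is the efficiency achieved on platform $i$. A value of 83.1% then means that the harmonic mean of the per-platform efficiencies across the tested CPUs, Xeon Phis and GPUs is 0.831.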

Read this paper on arXiv…

P. Grete, F. Glines and B. O’Shea
Tue, 14 May 19
38/91

Comments: 12 pages, 6 figures, 1 table; submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS)

Gravitational octree code performance evaluation on Volta GPU [CL]

http://arxiv.org/abs/1811.02761


In this study, the gravitational octree code originally optimized for the Fermi, Kepler, and Maxwell GPU architectures is adapted to the Volta architecture. The Volta architecture introduces independent thread scheduling, requiring either the insertion of explicit synchronizations at appropriate locations or the enforcement of the same implicit synchronizations as in the Pascal or earlier architectures by specifying -gencode arch=compute_60,code=sm_70. The performance measurements on Tesla V100, the current flagship GPU by NVIDIA, revealed that $N$-body simulations of the Andromeda galaxy model with $2^{23} = 8388608$ particles took $3.8 \times 10^{-2}$ s or $3.3 \times 10^{-2}$ s per step in the two cases, respectively. Tesla V100 achieves a 1.4- to 2.2-fold acceleration in comparison with Tesla P100, the flagship GPU of the previous generation. The observed speed-up of 2.2 is greater than 1.5, the ratio of the theoretical peak performance of the two GPUs. The independence of the units for integer operations from those for floating-point operations enables the overlapped execution of integer and floating-point operations, which hides the execution time of the integer operations and leads to a speed-up above the theoretical peak-performance ratio. Tesla V100 can execute an $N$-body simulation with up to $25 \times 2^{20} = 26214400$ particles, taking $2.0 \times 10^{-1}$ s per step. This corresponds to $3.5$ TFlop/s, which is 22% of the single-precision theoretical peak performance.
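As a quick consistency check, taking the commonly quoted single-precision peak of about 15.7 TFlop/s for Tesla V100 (a figure assumed here, not stated in the abstract):

$$ \frac{3.5\ \mathrm{TFlop/s}}{15.7\ \mathrm{TFlop/s}} \approx 0.22, $$

consistent with the quoted 22% of the theoretical peak.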

Read this paper on arXiv…

Y. Miki
Thu, 8 Nov 18
55/72

Comments: 10 pages, 10 figures, 2 tables, submitted to Computer Physics Communications

Exploiting the Space Filling Curve Ordering of Particles in the Neighbour Search of Gadget3 [IMA]

http://arxiv.org/abs/1810.09898


Gadget3 is nowadays one of the most frequently used high-performance parallel codes for cosmological hydrodynamical simulations. Recent analyses have shown that the Neighbour Search process of Gadget3 is one of the most time-consuming parts. Thus, a considerable speedup can be expected from improvements of the underlying algorithms. In this work we propose a novel approach for speeding up the Neighbour Search which takes advantage of the space-filling-curve particle ordering. Instead of performing the Neighbour Search for all particles individually, nearby active particles can be grouped and a single Neighbour Search can be performed to obtain a common superset of neighbours. Thus, with this approach we reduce the number of searches; on the other hand, tree walks are performed within a larger search radius. There is an optimal grouping size that maximizes the speedup, which we found by numerical experiments. We tested the algorithm on the boxes of the Magneticum project. As a result we obtained speedups of $1.65$ in the Density computation and $1.30$ in the Hydrodynamics computation, and a total speedup of $1.34$.
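The idea can be illustrated with a small sketch: instead of one neighbour query per particle, particles that are consecutive in the space-filling-curve order are grouped, and a single query with an enlarged radius returns a superset of neighbours that each group member then filters. The sketch below uses a SciPy k-d tree and a crude locality-preserving key purely for illustration; Gadget3 itself uses its own tree walk and Peano-Hilbert ordering.

```python
# Grouped neighbour search sketch (illustrative; Gadget3 uses its own tree walk
# and Peano-Hilbert ordering, not SciPy).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
pos = rng.random((100_000, 3))          # assumed particle positions in a unit box
h = 0.01                                # assumed search radius per particle
group_size = 32                         # assumed grouping of SFC-consecutive particles

# Stand-in for a space-filling-curve ordering: sort by a coarse row-major grid
# index (a crude locality-preserving substitute for the Peano-Hilbert key).
grid = (pos * 64).astype(int)
sfc_key = grid[:, 0] * 64 * 64 + grid[:, 1] * 64 + grid[:, 2]
order = np.argsort(sfc_key)

tree = cKDTree(pos)
neighbours = {}
for start in range(0, len(order), group_size):
    members = order[start:start + group_size]
    centre = pos[members].mean(axis=0)
    # One search per group: the radius covers the group extent plus the kernel radius.
    group_radius = np.max(np.linalg.norm(pos[members] - centre, axis=1))
    superset = np.array(tree.query_ball_point(centre, group_radius + h))
    # Each member filters the common superset down to its true neighbours.
    for i in members:
        d = np.linalg.norm(pos[superset] - pos[i], axis=1)
        neighbours[i] = superset[d < h]
```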

Read this paper on arXiv…

A. Ragagnin, N. Tchipev, M. Bader, et al.
Wed, 24 Oct 18
12/75

Comments: 17 pages, 6 figures, published at Parallel Computing (ParCo)

Report: Performance comparison between C2075 and P100 GPU cards using cosmological correlation functions [CL]

http://arxiv.org/abs/1709.03264


In this report, some cosmological correlation functions are used to evaluate the relative performance of the C2075 and P100 GPU cards. The correlation functions used in this work have previously been widely studied and exploited on earlier GPU architectures. The performance analysis indicates that a speedup in the range of 13 to 15 is achieved on the P100 card without any additional optimization.

Read this paper on arXiv…

M. Cardenas-Montes, I. Mendez-Jimenez, J. Rodriguez-Vazquez, et al.
Tue, 12 Sep 17
52/71

Comments: N/A

Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies [CEA]

http://arxiv.org/abs/1709.00086


The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF’s formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.

Read this paper on arXiv…

B. Friesen, M. Patwary, B. Austin, et al.
Mon, 4 Sep 17
42/61

Comments: 11 pages, 7 figures, accepted to SuperComputing 2017

Performance Measurements of Supercomputing and Cloud Storage Solutions [CL]

http://arxiv.org/abs/1708.00544


Increasing amounts of data from varied sources, particularly in the fields of machine learning and graph analytics, are causing storage requirements to grow rapidly. A variety of technologies exist for storing and sharing these data, ranging from parallel file systems used by supercomputers to distributed block storage systems found in clouds. Relatively few comparative measurements exist to inform decisions about which storage systems are best suited for particular tasks. This work provides these measurements for two of the most popular storage technologies: Lustre and Amazon S3. Lustre is an open-source, high-performance, parallel file system used by many of the largest supercomputers in the world. Amazon’s Simple Storage Service, or S3, is part of the Amazon Web Services offering, and provides a scalable, distributed option to store and retrieve data from anywhere on the Internet. Parallel processing is essential for achieving high performance on modern storage systems. The performance tests used span the gamut of parallel I/O scenarios, ranging from single-client, single-node Amazon S3 and Lustre performance to a large-scale, multi-client test designed to demonstrate the capabilities of a modern storage appliance under heavy load. These results show that, when parallel I/O is used correctly (i.e., many simultaneous read or write processes), full network-bandwidth performance is achievable, ranging from 10 gigabits/s over a 10 GigE S3 connection to 0.35 terabits/s using Lustre on a 1200-port 10 GigE switch. These results demonstrate that S3 is well suited to sharing vast quantities of data over the Internet, while Lustre is well suited to processing large quantities of data locally.
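The "many simultaneous processes" point is the crux: a single S3 client rarely saturates the link, but a pool of independent upload processes can. A minimal sketch of that pattern with boto3 is shown below; the bucket name, file list and process count are assumptions, not the benchmark harness used in the paper.

```python
# Parallel S3 upload sketch (illustrative; bucket and file names are placeholders).
from multiprocessing import Pool
import boto3

BUCKET = "example-benchmark-bucket"                      # assumed bucket name
FILES = [f"data/part-{i:04d}.bin" for i in range(64)]    # assumed local files

def upload(path):
    # Each process creates its own client; boto3 clients are not fork-safe.
    s3 = boto3.client("s3")
    s3.upload_file(path, BUCKET, path)                   # key mirrors the local path
    return path

if __name__ == "__main__":
    # Many simultaneous writers is what lets the aggregate transfer
    # approach full network bandwidth.
    with Pool(processes=16) as pool:
        for done in pool.imap_unordered(upload, FILES):
            print("uploaded", done)
```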

Read this paper on arXiv…

M. Jones, J. Kepner, W. Arcand, et al.
Thu, 3 Aug 17
49/59

Comments: 5 pages, 4 figures, to appear in IEEE HPEC 2017

Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor [CL]

http://arxiv.org/abs/1707.03515


Knights Landing (KNL) is the code name for the second-generation Intel Xeon Phi product family. KNL has generated significant interest in the data analysis and machine learning communities because its new many-core architecture targets both of these workloads. The KNL many-core vector processor design enables it to exploit much higher levels of parallelism. At the Lincoln Laboratory Supercomputing Center (LLSC), the majority of users are running data analysis applications such as MATLAB and Octave. More recently, machine learning applications, such as the UC Berkeley Caffe deep learning framework, have become increasingly important to LLSC users. Thus, the performance of these applications on KNL systems is of high interest to LLSC users and the broader data analysis and machine learning communities. Our data analysis benchmarks of these applications on the Intel KNL processor indicate that single-core double-precision generalized matrix multiply (DGEMM) performance on KNL systems has improved by ~3.5x compared to prior Intel Xeon technologies. Our data analysis applications also achieved ~60% of the theoretical peak performance. In addition, a performance comparison of a machine learning application, Caffe, between two different Intel CPUs, the Xeon E5 v3 and the Xeon Phi 7210, demonstrated a 2.7x improvement on the KNL node.
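DGEMM throughput of the kind quoted above is easy to reproduce at small scale: time a double-precision matrix multiply and divide the roughly 2n^3 floating-point operations by the elapsed time. The NumPy sketch below (matrix size and repetition count are arbitrary choices) reports exactly that; with NumPy linked against MKL it exercises the same optimized BLAS path discussed in the paper.

```python
# Single-node DGEMM throughput sketch: GFLOP/s = 2*n^3 / time.
import time
import numpy as np

n = 4096                                  # assumed matrix size
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

a @ b                                     # warm-up (thread pool, page faults)

reps = 5
start = time.perf_counter()
for _ in range(reps):
    c = a @ b
elapsed = (time.perf_counter() - start) / reps

gflops = 2.0 * n**3 / elapsed / 1e9       # DGEMM does ~2*n^3 flops
print(f"n={n}: {elapsed:.3f} s per multiply, {gflops:.1f} GFLOP/s")
```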

Read this paper on arXiv…

C. Byun, J. Kepner, W. Arcand, et al.
Thu, 13 Jul 17
50/60

Comments: 6 pages; 9 figures; accepted to IEEE HPEC 2017

Characterising radio telescope software with the Workload Characterisation Framework [IMA]

http://arxiv.org/abs/1612.00456


We present a modular framework, the Workload Characterisation Framework (WCF), developed to reproducibly obtain, store and compare key characteristics of radio astronomy processing software. As a demonstration, we discuss our experience of using the framework to characterise a LOFAR calibration and imaging pipeline.

Read this paper on arXiv…

Y. Grange, R. Lakhoo, M. Petschow, et al.
Mon, 5 Dec 16
47/61

Comments: 4 pages, 4 figures; to be published in ADASS XXVI (held October 16-20, 2016) proceedings. See this http URL for the poster

PageRank Pipeline Benchmark: Proposal for a Holistic System Benchmark for Big-Data Platforms [CL]

http://arxiv.org/abs/1603.01876


The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these challenges for decades and developed methodologies for creating rigorous scalable benchmarks (e.g., HPC Challenge). The proposed PageRank pipeline benchmark employs supercomputing benchmarking methodologies to create a scalable benchmark that is reflective of many real-world big data processing systems. The PageRank pipeline benchmark builds on prior scalable benchmarks (Graph500, Sort, and PageRank) to create a holistic benchmark with multiple integrated kernels that can be run together or independently. Each kernel is well defined mathematically and can be implemented in any programming environment. The linear algebraic nature of PageRank makes it well suited to being implemented using the GraphBLAS standard. The computations are simple enough that performance predictions can be made based on simple computing hardware models. The surrounding kernels provide the context for each kernel that allows rigorous definition of both the input and the output for each kernel. Furthermore, since the proposed PageRank pipeline benchmark is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present-day and future systems. Serial implementations in C++, Python, Python with Pandas, MATLAB, Octave, and Julia have been written, and their single-threaded performance has been measured.
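The linear-algebraic formulation referred to above is the damped power iteration $r \leftarrow \alpha P^{T} r + (1-\alpha)/N$, with $P$ the row-normalized adjacency matrix. A dense NumPy sketch is given below; the tiny random graph and the damping factor 0.85 are assumptions, and the actual benchmark additionally includes kernels that generate and sort Graph500-style edge lists before this computation.

```python
# Dense PageRank power-iteration sketch (illustrative; the benchmark itself
# operates on generated and sorted edge lists and also admits GraphBLAS variants).
import numpy as np

rng = np.random.default_rng(0)
N = 100                                    # assumed (tiny) graph size
A = (rng.random((N, N)) < 0.05).astype(float)
np.fill_diagonal(A, 0.0)
A[A.sum(axis=1) == 0, :] = 1.0             # dangling nodes link to everyone

P = A / A.sum(axis=1, keepdims=True)       # row-normalized transition matrix
alpha = 0.85                               # assumed damping factor
r = np.full(N, 1.0 / N)

for _ in range(100):
    r_new = alpha * (P.T @ r) + (1.0 - alpha) / N
    converged = np.abs(r_new - r).sum() < 1e-12
    r = r_new
    if converged:
        break

print("top-5 ranked vertices:", np.argsort(r)[::-1][:5])
```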

Read this paper on arXiv…

P. Dreher, C. Byun, C. Hill, et al.
Tue, 8 Mar 16
82/83

Comments: 9 pages, 7 figures, to appear in IPDPS 2016 Graph Algorithms Building Blocks (GABB) workshop

Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100 [CL]

http://arxiv.org/abs/1503.08809


We study the optimisation and porting of the “Modal” code on Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors using methods which should be applicable to more general compute-bound codes. “Modal” is used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum of the cosmic microwave background. We focus on the hot-spot of the code, which is the projection of bispectra from the end of inflation to the spherical shell at decoupling that defines the CMB we observe. This code involves a three-dimensional inner product between two functions, one of which requires an integral, on a non-rectangular sparse domain. We show that, by employing separable methods, this calculation can be reduced to a one-dimensional summation plus two integrations, reducing the dimensionality from four to three. The introduction of separable functions also solves the issue of the domain, allowing efficient vectorisation and load balancing. This method becomes unstable in certain cases, and so we present a discussion of the optimisation of both approaches. By making bispectrum calculations competitive with those for the power spectrum, we are now able to consider joint analysis for cosmological science exploitation of new data. We demonstrate speed-ups of over 100x, arising from a combination of algorithmic improvements and architecture-aware optimizations targeted at improving thread and vectorization behaviour. The resulting MPI/OpenMP code is capable of executing on clusters containing Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors, with a strong-scaling efficiency of 98.6% on up to 16 nodes. We find that a single coprocessor outperforms two processor sockets by a factor of 1.3x and that running the same code across a combination of processors and coprocessors improves performance-per-node by a factor of 3.38x.
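The dimensionality reduction rests on a general separability trick: when the integrand or weight factorizes into per-dimension pieces, a multi-dimensional sum collapses into a product of one-dimensional sums. Schematically (this is the generic principle, not the actual Modal projection integral),

$$ \sum_{i}\sum_{j}\sum_{k} u_i\, v_j\, w_k = \Big(\sum_i u_i\Big)\Big(\sum_j v_j\Big)\Big(\sum_k w_k\Big), $$

turning an $O(n^3)$ triple loop into three $O(n)$ loops. Applied to the Modal projection, the same idea is what reduces the four-dimensional calculation to a one-dimensional summation plus two integrations and makes the domain effectively rectangular, enabling the efficient vectorisation and load balancing mentioned above.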

Read this paper on arXiv…

J. Briggs, J. Jaykka, J. Fergusson, et al.
Fri, 3 Apr 15
1/43

Comments: N/A

Architecture, implementation and parallelization of the software to search for periodic gravitational wave signals [CL]

http://arxiv.org/abs/1410.3677


The parallelization, design and scalability of the code to search for periodic gravitational waves from rotating neutron stars are discussed. The code is based on an efficient implementation of the F-statistic using the Fast Fourier Transform algorithm. To perform an analysis of data from the advanced LIGO and Virgo gravitational wave detectors’ network, which will start operating in 2015, hundreds of millions of CPU hours will be required – a code utilizing the potential of massively parallel supercomputers is therefore mandatory. We have parallelized the code using the Message Passing Interface standard and implemented a mechanism for combining the searches at different sky positions and frequency bands into one extremely scalable program. The parallel I/O interface is used to avoid bottlenecks when writing the generated data to the file system. This allowed us to develop a highly scalable code that enables data analysis at large scales on acceptable time scales. Benchmarking of the code on a Cray XE6 system was performed to show the efficiency of our parallelization concept and to demonstrate scaling up to 50 thousand cores.
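The parallelization pattern described, distributing independent (sky position, frequency band) searches over MPI ranks and writing results through MPI parallel I/O, can be sketched with mpi4py as follows; the grid sizes, the dummy search function, the result length and the file name are assumptions, not the actual pipeline.

```python
# Sketch of distributing (sky position, frequency band) searches over MPI ranks
# and writing results with MPI parallel I/O (illustrative; not the actual code).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_SKY, N_BAND, N_RES = 128, 64, 256       # assumed search grid and result length
jobs = [(s, b) for s in range(N_SKY) for b in range(N_BAND)]
my_jobs = jobs[rank::size]                # simple static round-robin split

def search(sky, band):
    # Placeholder for the F-statistic search of one sky position / frequency band.
    rng = np.random.default_rng(sky * N_BAND + band)
    return rng.standard_normal(N_RES)

results = np.vstack([search(s, b) for s, b in my_jobs])

# Collective parallel write: every rank writes its rows at its own byte offset.
fh = MPI.File.Open(comm, "fstat_results.bin",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
row_bytes = N_RES * results.dtype.itemsize
offsets = np.cumsum([0] + [len(jobs[r::size]) for r in range(size)])[:-1]
fh.Write_at_all(int(offsets[rank]) * row_bytes, results)
fh.Close()
```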

Read this paper on arXiv…

G. Poghosyan, S. Matta, A. Streit, et al.
Tue, 3 Feb 15
37/80

Comments: 11 pages, 9 figures. Submitted to Computer Physics Communications