Asymmetric distribution of data products from WALLABY, an SKA precursor neutral hydrogen survey [IMA]

http://arxiv.org/abs/2303.11670


The Widefield ASKAP L-band Legacy All-sky Blind surveY (WALLABY) is a neutral hydrogen (HI) survey running on the Australian SKA Pathfinder (ASKAP), a precursor telescope for the Square Kilometre Array (SKA). The goal of WALLABY is to use ASKAP’s powerful wide-field phased array feed technology to observe three quarters of the entire sky at the 21 cm neutral hydrogen line with an angular resolution of 30 arcseconds. Post-processing activities at the Australian SKA Regional Centre (AusSRC), the Canadian Initiative for Radio Astronomy Data Analysis (CIRADA) and the Spanish SKA Regional Centre prototype (SPSRC) will then produce publicly available advanced data products in the form of source catalogues, kinematic models and image cutouts, respectively. These advanced data products will be generated locally at each site and distributed across the network. Over the course of the full survey we expect to replicate up to 10 MB of data per source detection, which could imply tens of GB of ingested data to be consolidated at the other locations in near real time. Here, we explore the use of an asymmetric database replication model and strategy, using PostgreSQL as the engine and Bucardo as the asynchronous replication service, to enable robust multi-source pool operations with data products from WALLABY. This work serves to evaluate this type of data distribution solution across globally distributed sites. Furthermore, a set of benchmarks has been developed to confirm that the deployed model is sufficient for future scalability and remote collaboration needs.
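
As an aside for readers interested in the benchmarking mentioned above, the following is a minimal sketch (not code from the paper) of how one might probe replication lag between two sites kept in sync by an asynchronous replicator such as Bucardo, using psycopg2. The connection strings and the wallaby.detection table are hypothetical.

    # Sketch of a replication-lag probe between two database sites.
    # Assumes an asynchronous replicator (e.g. Bucardo) copies rows of the
    # hypothetical table wallaby.detection from the source to the replica.
    import time
    import uuid
    import psycopg2

    SOURCE_DSN = "host=source.example.org dbname=wallaby user=bench"    # hypothetical
    REPLICA_DSN = "host=replica.example.org dbname=wallaby user=bench"  # hypothetical

    def measure_replication_lag(timeout=60.0, poll_interval=0.5):
        """Insert a marker row at the source and wait for it at the replica."""
        marker = str(uuid.uuid4())
        with psycopg2.connect(SOURCE_DSN) as src:
            with src.cursor() as cur:
                cur.execute(
                    "INSERT INTO wallaby.detection (name, ra, dec) VALUES (%s, 0.0, 0.0)",
                    (marker,),
                )
        t0 = time.monotonic()
        with psycopg2.connect(REPLICA_DSN) as dst:
            while time.monotonic() - t0 < timeout:
                with dst.cursor() as cur:
                    cur.execute(
                        "SELECT 1 FROM wallaby.detection WHERE name = %s", (marker,)
                    )
                    if cur.fetchone() is not None:
                        return time.monotonic() - t0
                time.sleep(poll_interval)
        raise TimeoutError("marker row never reached the replica")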

Read this paper on arXiv…

M. Parra-Royon, A. Shen, T. Reynolds, et al.
Wed, 22 Mar 23
45/68

Comments: N/A

SciTS: A Benchmark for Time-Series Database in Scientific Experiments and Industrial Internet of Things [CL]

http://arxiv.org/abs/2204.09795


Time-series data is used increasingly in the Industrial Internet of Things (IIoT) and in large-scale scientific experiments. Managing time-series data requires a storage engine that can keep up with its constantly growing volume while providing acceptable query latency. While traditional ACID databases favor consistency over performance, many time-series databases with novel storage engines have been developed to provide better ingestion performance and lower query latency. To understand how the unique design of a time-series database affects its performance, we design SciTS, a highly extensible and parameterizable benchmark for time-series data. The benchmark studies the data ingestion capabilities of time-series databases, especially as they grow larger in size. It also studies the latencies of 5 practical queries from the scientific-experiments use case. We use SciTS to evaluate the performance of 4 databases with 4 distinct storage engines: ClickHouse, InfluxDB, TimescaleDB, and PostgreSQL.
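
To make the ingestion-rate idea concrete, here is an illustrative measurement loop (not the SciTS code itself) against any PostgreSQL-wire-compatible engine such as TimescaleDB; the table layout and connection string are invented.

    # Illustrative ingest-rate measurement in the spirit of a time-series benchmark.
    # Not the actual SciTS implementation; table layout and DSN are hypothetical.
    import time
    import random
    import psycopg2
    from psycopg2.extras import execute_values

    DSN = "host=localhost dbname=bench user=bench"   # hypothetical endpoint

    def run_ingest(n_batches=100, batch_size=10_000, n_sensors=1_000):
        rows_written = 0
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "CREATE TABLE IF NOT EXISTS readings ("
                " ts DOUBLE PRECISION, sensor_id INT, value DOUBLE PRECISION)"
            )
            t0 = time.monotonic()
            for _ in range(n_batches):
                batch = [
                    (time.time(), random.randrange(n_sensors), random.random())
                    for _ in range(batch_size)
                ]
                # Multi-row INSERT; a fuller benchmark would also exercise COPY.
                execute_values(cur, "INSERT INTO readings VALUES %s", batch)
                conn.commit()
                rows_written += batch_size
            elapsed = time.monotonic() - t0
        return rows_written / elapsed

    if __name__ == "__main__":
        print(f"{run_ingest():,.0f} rows/s")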

Read this paper on arXiv…

J. Mostafa, S. Wehbi, S. Chilingaryan, et al.
Fri, 22 Apr 22
51/64

Comments: N/A

The demise of the filesystem and multi level service architecture [IMA]

http://arxiv.org/abs/1907.13060


Many astronomy data centres still work on filesystems. Industry has moved on; current practice in computing infrastructure is to achieve Big Data scalability using object stores rather than POSIX file systems. This presents us with opportunities for portability and reuse of the software underlying processing and archive systems, but it also causes problems for legacy implementations in current data centres.
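
For illustration, the same frame read through a POSIX path versus an S3-compatible object store; the bucket, key and path below are hypothetical.

    # The same pixel data accessed two ways: a POSIX path and an S3-compatible
    # object store. Bucket, key and path names are hypothetical.
    import boto3

    def read_posix(path="/archive/images/frame0001.fits"):
        # Legacy filesystem layout (hypothetical path).
        with open(path, "rb") as f:
            return f.read()

    def read_object(bucket="archive-images", key="frame0001.fits"):
        # Object stores trade POSIX semantics (rename, append, locking) for
        # horizontally scalable GET/PUT over HTTP.
        s3 = boto3.client("s3")
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()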

Read this paper on arXiv…

W. O’Mullane, N. Gaffney, F. Economou, et al.
Wed, 31 Jul 19
20/65

Comments: Submitted as decadal APC paper 2019. arXiv admin note: text overlap with arXiv:1905.05116

Towards the Tunka-Rex Virtual Observatory [IMA]

http://arxiv.org/abs/1906.10425


The Tunka Radio Extension (Tunka-Rex) is a cosmic-ray detector operating since 2012. The detection principle of Tunka-Rex is based on the radio technique, which impacts data acquisition and storage. In this paper we give a first detailed overview of the concept of the Tunka-Rex Virtual Observatory (TRVO), a framework for open access to the Tunka-Rex data, which currently is under active development and testing. We describe the structure of the data, main features of the interface and possible applications of the TRVO.

Read this paper on arXiv…

P. Bezyazeekov, N. Budnev, O. Fedorov, et al.
Wed, 26 Jun 19
37/68

Comments: Proceedings of the 3rd International Workshop on Data Life Cycle in Physics, Irkutsk, Russia, April 2-7, 2019

Mega-Archive and the EURONEAR Tools for Datamining World Astronomical Images [IMA]

http://arxiv.org/abs/1905.08847


The world’s astronomical image archives represent huge opportunities for time-domain astronomy and for other hot topics such as space defense, and astronomical observatories should curate this wealth and make it more accessible in the big data era. In 2010 we introduced the Mega-Archive database and the Mega-Precovery server for data mining images containing Solar system bodies, with a focus on near Earth asteroids (NEAs). This paper presents the improvements and introduces some new related data mining tools developed during the last five years. Currently, the Mega-Archive has indexed 15 million images available from six major collections (CADC, ESO, ING, LCOGT, NVO and SMOKA) and other instrument archives and surveys. This meta-data index collection has been updated daily (since 2014) by a crawler which performs automated queries of five major collections. Since 2016, these data mining tools have run on the new dedicated EURONEAR server, and the database was migrated to an SQL engine which supports robust and fast queries. To constrain the area in which to search for moving or fixed objects in images taken by large mosaic cameras, we built the graphical tools FindCCD and FindCCD for Fixed Objects, which overlay the targets across one of seven mosaic cameras (Subaru-SuprimeCam, VST-OmegaCam, INT-WFC, VISTA-VIRCAM, CFHT-MegaCam, Blanco-DECam and Subaru-HSC), also plotting the uncertainty ellipse for poorly observed NEAs. In 2017 we improved Mega-Precovery, which now offers two options for computing the ephemerides and three options for the input (objects defined by designation, orbit or observations). Additionally, we developed Mega-Archive for Fixed Objects (MASFO) and Mega-Archive Search for Double Stars (MASDS). We believe that the huge potential of science imaging archives is still insufficiently exploited.

Read this paper on arXiv…

O. Vaduvescu, L. Curelaru and M. Popescu
Thu, 23 May 19
65/67

Comments: Paper submitted to Astronomy and Computing (25 Mar 2019)

A Machine Learning Dataset Prepared From the NASA Solar Dynamics Observatory Mission [SSA]

http://arxiv.org/abs/1903.04538


In this paper we present a curated dataset from the NASA Solar Dynamics Observatory (SDO) mission in a format suitable for machine learning research. Beginning from level 1 scientific products, we have applied various instrumental corrections, downsampled the data to manageable spatial and temporal resolutions, and synchronized the observations spatially and temporally. We illustrate the use of this dataset with two example applications: forecasting future EVE irradiance from present EVE irradiance and translating HMI observations into AIA observations. For each application we provide metrics and baselines for future model comparison. We anticipate this curated dataset will facilitate machine learning research in heliophysics and the physical sciences generally, increasing the scientific return of the SDO mission. This work is a direct result of the 2018 NASA Frontier Development Laboratory Program. Please see the appendix for access to the dataset.

Read this paper on arXiv…

R. Galvez, D. Fouhey, M. Jin, et al.
Wed, 13 Mar 19
79/125

Comments: Accepted to The Astrophysical Journal Supplement Series; 11 pages, 8 figures

Fast in-database cross-matching of high-cadence, high-density source lists with an up-to-date sky model [IMA]

http://arxiv.org/abs/1803.02601


Upcoming high-cadence wide-field optical telescopes will image hundreds of thousands of sources per minute. Beyond the inspection of the near real-time data streams for transient and variability events, the accumulated data archive is a rich laboratory for making complementary scientific discoveries.
The goal of this work is to optimise column-oriented database techniques to enable the construction of a full-source and light-curve database for large-scale surveys that is accessible to the astronomical community.
We adopted LOFAR’s Transients Pipeline as the baseline and modified it to enable the processing of optical images that have much higher source densities. The pipeline adds new source lists to the archive database while cross-matching them with the known catalogued sources in order to build a full light-curve archive. We investigated several techniques for indexing and partitioning the largest tables, allowing for faster positional source look-ups in the cross-matching algorithms. We monitored all query run times in long-term pipeline runs in which we processed a subset of IPHAS data with image source densities peaking above $170,000$ per field of view ($500,000$ deg$^{-2}$).
Our analysis demonstrates that horizontal table partitions with one-degree declination widths keep the query run times under control. An index strategy in which the partitions are densely sorted according to source declination yields a further improvement. Most queries run in sublinear time and a few (<20%) run in linear time, because of dependencies on input source-list and result-set size. We observed that for this logical database partitioning scheme the limiting cadence the pipeline achieved while processing IPHAS data is 25 seconds.
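
As a rough illustration of the partitioning idea (the paper works with a column-oriented store, so the details differ), the following PostgreSQL-flavoured sketch builds one-degree declination partitions and shows a positional association query; table and column names are invented.

    # Sketch of one-degree declination partitioning plus a positional association
    # query, in the spirit of the pipeline above. Illustrative only: the paper's
    # column-store setup differs, and all identifiers here are invented.

    DDL = """
    CREATE TABLE IF NOT EXISTS runningcatalog (
        id    BIGSERIAL,
        ra    DOUBLE PRECISION,
        decl  DOUBLE PRECISION,
        n_det INTEGER DEFAULT 1
    ) PARTITION BY RANGE (decl);
    """

    def partition_statements(dec_min=-90, dec_max=90):
        """One-degree declination stripes, each kept sorted on declination."""
        for d in range(dec_min, dec_max):
            yield (
                f"CREATE TABLE IF NOT EXISTS runningcatalog_d{d + 90:03d} "
                f"PARTITION OF runningcatalog FOR VALUES FROM ({d}) TO ({d + 1})"
            )

    # Association of a new detection: the declination predicate prunes partitions,
    # and the RA window is widened by 1/cos(dec) to keep a fixed angular radius.
    ASSOC_QUERY = """
    SELECT id FROM runningcatalog
    WHERE decl BETWEEN %(dec)s - %(r)s AND %(dec)s + %(r)s
      AND ra   BETWEEN %(ra)s - %(r)s / cos(radians(%(dec)s))
                   AND %(ra)s + %(r)s / cos(radians(%(dec)s))
    """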

Read this paper on arXiv…

B. Scheers, S. Bloemen, H. Muhleisen, et al.
Thu, 8 Mar 18
28/63

Comments: 16 pages, 5 figures; Accepted for publication in Astronomy & Computing

Time Series Cube Data Model [CL]

http://arxiv.org/abs/1702.01393


The purpose of this document is to create a data model and its serialization for expressing generic time-series data. Already existing IVOA data models are reused as much as possible. The model is also made as generic as possible in order to be open to new extensions but at the same time closed to modifications, which enables maintaining interoperability across different versions of the data model. We define the necessary building blocks for metadata discovery, for the serialization of time-series data and for its understanding by clients. We present several categories of time-series science cases with examples of implementation. We also take into account the most pressing topics for time-series providers, such as tracking the original images for every individual point of a light curve, or time-derived axes such as frequency for gravitational wave analysis. The main motivation for the creation of a new model is to provide a unified time-series data publishing standard – not only for light curves but also for more generic time-series data, e.g., radial velocity curves, power spectra, hardness ratios, provenance linkage, etc. Flexibility is the most crucial part of our model – we are not dependent on any physical domain or frame models. While images and spectra are already stable and standardized products, the time-series related domains are still not completely evolved and new ones will likely emerge in the near future. That is why we need to keep models like the Time Series Cube DM independent of any underlying physical models. In our opinion, this is the only correct and sustainable way for the future development of IVOA standards.
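
Purely as an illustration of the kind of structure such a model must carry (this is not the IVOA serialization, and all field names below are invented):

    # Toy illustration of a generic time series with a per-point link back to the
    # originating image. Field names and identifiers are invented for the example.
    light_curve = {
        "target": "V* AB Aur",
        "independent_axis": {"name": "time", "unit": "d", "frame": "TDB"},
        "dependent_axis": {"name": "mag", "unit": "mag"},
        "points": [
            {"time": 2458849.5, "mag": 7.05, "origin": "ivo://example/image?id=123"},
            {"time": 2458850.5, "mag": 7.11, "origin": "ivo://example/image?id=124"},
        ],
    }

    # A time-derived axis (e.g. frequency for gravitational-wave analysis) would
    # simply swap the independent-axis description, leaving the rest intact.
    power_spectrum_axis = {"name": "frequency", "unit": "Hz"}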

Read this paper on arXiv…

J. Nadvornik, P. Skoda, D. Morris, et al.
Tue, 7 Feb 17
61/64

Comments: 27 pages, 17 figures

Photo-z-SQL: integrated, flexible photometric redshift computation in a database [GA]

http://arxiv.org/abs/1611.01560


We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational capabilities of DB hardware. The code is able to perform both maximum likelihood and Bayesian estimation, and can handle inputs of variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with given redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation, coupled with a caching of results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculations would be a significant advantage. Also, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z datasets, PHAT and CAPR/CANDELS. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
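
The core template-fitting step can be sketched in NumPy as follows (the actual framework is C#/SQL with Bayesian options and full covariance support; this only shows a bare maximum-likelihood grid search over synthetic template magnitudes):

    # Minimal maximum-likelihood template photo-z sketch; inputs are synthetic.
    import numpy as np

    def photo_z_ml(mags, mag_errs, template_mags, z_grid):
        """
        mags, mag_errs : observed broad-band magnitudes, shape (n_filters,)
        template_mags  : synthetic magnitudes, shape (n_templates, n_z, n_filters)
        z_grid         : redshift grid, shape (n_z,)
        Returns the maximum-likelihood redshift and the best template index.
        """
        diff = template_mags - mags              # broadcast over templates and z
        w = 1.0 / mag_errs**2
        # Best-fit additive normalisation per (template, z): weighted mean residual.
        offset = (diff * w).sum(axis=-1, keepdims=True) / w.sum()
        chi2 = ((diff - offset) ** 2 * w).sum(axis=-1)   # shape (n_templates, n_z)
        best_t, best_z = np.unravel_index(np.argmin(chi2), chi2.shape)
        return z_grid[best_z], best_t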

Read this paper on arXiv…

R. Beck, L. Dobos, T. Budavari, et al.
Tue, 8 Nov 16
25/75

Comments: 11 pages, 4 figures. Submitted to Astronomy & Computing on 2016 November 04

The Footprint Database and Web Services of the Herschel Space Observatory [IMA]

http://arxiv.org/abs/1606.03957


Data from the Herschel Space Observatory is freely available to the public, but no uniformly processed catalogue of the observations has been published so far. To date, the Herschel Science Archive does not contain the exact sky coverage (footprint) of individual observations and supports searches for measurements based on bounding circles only. Drawing on previous experience in implementing footprint databases, we built the Herschel Footprint Database and Web Services for the Herschel Space Observatory to provide efficient search capabilities for typical astronomical queries. The database was designed with the following main goals in mind: (a) provide a unified data model for meta-data of all instruments and observational modes, (b) quickly find observations covering a selected object and its neighbourhood, (c) quickly find every observation in a larger area of the sky, (d) allow for finding solar system objects crossing observation fields. As a first step, we developed a unified data model of observations of all three Herschel instruments for all pointing and instrument modes. Then, using telescope pointing information and observational meta-data, we compiled a database of footprints. As opposed to methods using pixellation of the sphere, we represent sky coverage in an exact geometric form, allowing for precise area calculations. For easier handling of Herschel observation footprints with rather complex shapes, two algorithms were implemented to reduce the outline. Furthermore, a new visualisation tool to plot footprints with various spherical projections was developed. Indexing the footprints using a Hierarchical Triangular Mesh makes it possible to quickly find observations based on sky coverage, time and meta-data. The database is accessible via a web site (this http URL) and also as a set of REST web service functions.
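
The two-stage search idea (a coarse bounding-circle or index filter followed by an exact footprint test) can be sketched as below; this uses a planar small-field approximation for illustration only, whereas the actual database works with exact spherical geometry and HTM indexing.

    # Illustrative two-stage footprint search; planar approximation, invented names.
    import numpy as np
    from matplotlib.path import Path

    def coarse_match(ra, dec, centre_ra, centre_dec, radius_deg):
        """Bounding-circle pre-filter (what a circle-only search provides)."""
        d_ra = (ra - centre_ra) * np.cos(np.radians(dec))
        return np.hypot(d_ra, dec - centre_dec) <= radius_deg

    def exact_match(ra, dec, footprint_vertices):
        """Exact test against the footprint outline (list of (ra, dec) vertices)."""
        return Path(footprint_vertices).contains_point((ra, dec))

    # A query would first select candidate observations via the spatial index or
    # bounding circle, then keep only those whose exact footprint contains the target.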

Read this paper on arXiv…

L. Dobos, E. Varga-Verebelyi, E. Verdugo, et al.
Tue, 14 Jun 16
39/67

Comments: Accepted for publication in Experimental Astronomy

Real-Time Data Mining of Massive Data Streams from Synoptic Sky Surveys [IMA]

http://arxiv.org/abs/1601.04385


The nature of scientific and technological data collection is evolving rapidly: data volumes and rates grow exponentially, with increasing complexity and information content, and there has been a transition from static data sets to data streams that must be analyzed in real time. Interesting or anomalous phenomena must be quickly characterized and followed up with additional measurements via optimal deployment of limited assets. Modern astronomy presents a variety of such phenomena in the form of transient events in digital synoptic sky surveys, including cosmic explosions (supernovae, gamma ray bursts), relativistic phenomena (black hole formation, jets), potentially hazardous asteroids, etc. We have been developing a set of machine learning tools to detect, classify and plan a response to transient events for astronomy applications, using the Catalina Real-time Transient Survey (CRTS) as a scientific and methodological testbed. The ability to respond rapidly to the potentially most interesting events is a key bottleneck that limits the scientific returns from the current and anticipated synoptic sky surveys. Similar challenges arise in other contexts, from environmental monitoring using sensor networks to autonomous spacecraft systems. Given the exponential growth of data rates, and the time-critical response, we need a fully automated and robust approach. We describe the results obtained to date, and the possible future developments.
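
As a toy illustration of the classification step (not the CRTS pipeline itself), a supervised classifier over a few cheap light-curve features might look like this; the feature set and training data are placeholders.

    # Illustrative transient classification sketch; features and labels are invented.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def light_curve_features(times, mags):
        """A few inexpensive features usable in a real-time context."""
        return [
            np.ptp(mags),                     # amplitude
            np.std(mags),                     # variability
            np.median(np.diff(times)),        # typical sampling gap
            (mags[-1] - mags[0]) / (times[-1] - times[0] + 1e-9),  # overall slope
        ]

    clf = RandomForestClassifier(n_estimators=200)
    # X_train, y_train would come from labelled archival events (SN, CV, blazar, ...):
    # clf.fit(X_train, y_train)
    # probs = clf.predict_proba([light_curve_features(t_new, m_new)])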

Read this paper on arXiv…

S. Djorgovski, M. Graham, C. Donalek, et al.
Tue, 19 Jan 16
30/67

Comments: 14 pages, an invited paper for a special issue of Future Generation Computer Systems, Elsevier Publ. (2015). This is an expanded version of a paper arXiv:1407.3502 presented at the IEEE e-Science 2014 conf., with some new content

Cross-matching Engine for Incremental Photometric Sky Survey [CL]

http://arxiv.org/abs/1506.07208


For light curve generation, a pre-planned photometric survey is needed nowadays, in which all of the exposure coordinates are given in advance and do not change during the survey. This thesis shows that this is not required and that we can data-mine these light curves from astronomical data that was never meant for this purpose. With this approach, we can recycle all of the photometric surveys in the world and generate light curves of the objects they observed.
This thesis mostly addresses the catalogue generation process that is needed for creating the light curves. In practice, it focuses on one of the most important problems in astroinformatics: clustering data volumes at Big Data scale, where most traditional techniques falter. We consider a wide variety of possible solutions from the viewpoints of performance, scalability, distributability, etc. We defined criteria for time and memory complexity, which we evaluated for all of the tested solutions. Furthermore, we created quality standards which we also take into account when evaluating the results.
We use relational databases as the starting point of our implementation and compare them with the newest technologies potentially usable for solving our problem: NoSQL array databases, or offloading the heavy clustering computations to supercomputers using parallelism.
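
A minimal sketch of the incremental catalogue-building idea described above: each new detection either joins the nearest existing source within a match radius or starts a new source. A real implementation would use a spatial index or an in-database join rather than a linear scan; data structures here are simplified for illustration.

    # Incremental cross-match sketch; match radius and structures are illustrative.
    import math

    MATCH_RADIUS_DEG = 1.0 / 3600.0   # 1 arcsec, illustrative

    def ang_dist_deg(ra1, dec1, ra2, dec2):
        d_ra = (ra1 - ra2) * math.cos(math.radians(0.5 * (dec1 + dec2)))
        return math.hypot(d_ra, dec1 - dec2)

    def add_detection(sources, ra, dec, mag, epoch):
        for src in sources:
            if ang_dist_deg(ra, dec, src["ra"], src["dec"]) <= MATCH_RADIUS_DEG:
                src["light_curve"].append((epoch, mag))
                return src
        new_src = {"ra": ra, "dec": dec, "light_curve": [(epoch, mag)]}
        sources.append(new_src)
        return new_src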

Read this paper on arXiv…

I. Nadvornik
Thu, 25 Jun 15
36/45

Comments: 57 pages, 36 figures

Building an Archive with Saada [IMA]

http://arxiv.org/abs/1409.0351


Saada transforms a set of heterogeneous FITS files or VOTables of various categories (images, tables, spectra …) into a database without writing code. Databases created with Saada come with a rich Web interface and an Application Programming Interface (API). They support the four most common VO services. Such databases can mix various categories of data in multiple collections. They allow direct access to the original data while providing a homogeneous view thanks to an internal data model compatible with the characterization axes defined by the VO. The data collections can be bound to each other with persistent links, creating relevant browsing paths and allowing data-mining oriented queries.

Read this paper on arXiv…

L. Michel, C. Motch, H. Nguyen, et al.
Tue, 2 Sep 14
42/72

Comments: 18 pages, 5 figures; Special VO issue

D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database [CL]

http://arxiv.org/abs/1407.3859


Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [this http URL] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.
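
A toy Python rendering of the exploded-schema idea (using a plain dictionary in place of Accumulo and the D4M associative arrays; row keys and records are invented):

    # Every field|value pair becomes its own column with a placeholder value, so
    # any string in the record is directly indexable. Illustrative only.
    def explode(row_key, record):
        """Turn {'src_ip': '10.0.0.1', 'user': 'alice'} into exploded columns."""
        return {(row_key, f"{field}|{value}"): "1" for field, value in record.items()}

    table = {}
    table.update(explode("event-0001", {"src_ip": "10.0.0.1", "user": "alice"}))
    table.update(explode("event-0002", {"src_ip": "10.0.0.2", "user": "alice"}))

    # "Which rows mention user alice?" becomes a scan over a single column key,
    # which is what makes any unique string fast to query under this schema.
    rows_with_alice = [row for (row, col) in table if col == "user|alice"]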

Read this paper on arXiv…

J. Kepner, C. Anderson, W. Arcand, et al.
Wed, 16 Jul 14
41/48

Comments: 6 pages; IEEE HPEC 2013

Computing on Masked Data: a High Performance Method for Improving Big Data Veracity [CL]

http://arxiv.org/abs/1406.5751


The growing gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity and variety. Along with these standard three V’s of big data, an emerging fourth “V” is veracity, which addresses the confidentiality, integrity, and availability of the data. Traditional cryptographic techniques that ensure the veracity of data can have overheads that are too large to apply to big data. This work introduces a new technique called Computing on Masked Data (CMD), which improves data veracity by allowing computations to be performed directly on masked data and ensuring that only authorized recipients can unmask the data. Using the sparse linear algebra of associative arrays, CMD can be performed with significantly less overhead than other approaches while still supporting a wide range of linear algebraic operations on the masked data. Databases with strong support for sparse operations, such as SciDB or Apache Accumulo, are ideally suited to this technique. Examples are shown for the application of CMD to a complex DNA matching algorithm and to database operations over social media data.
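
The masking idea can be sketched as follows: row and column labels of a sparse associative array are deterministically masked (here with HMAC-SHA256), after which sparse linear-algebra operations still apply. This is only an illustration of the concept, not the paper's CMD scheme.

    # Conceptual masking sketch; key, labels and values are invented.
    import hmac, hashlib
    from scipy.sparse import csr_matrix

    KEY = b"secret-masking-key"        # hypothetical shared secret

    def mask(label):
        return hmac.new(KEY, label.encode(), hashlib.sha256).hexdigest()[:16]

    # Associative array as {(row_label, col_label): value}
    A = {("alice", "10.0.0.1"): 1.0, ("bob", "10.0.0.2"): 1.0}
    A_masked = {(mask(r), mask(c)): v for (r, c), v in A.items()}

    # The masked array can still be laid out as a sparse matrix and multiplied,
    # summed or correlated without revealing the original labels.
    rows = sorted({r for r, _ in A_masked})
    cols = sorted({c for _, c in A_masked})
    M = csr_matrix(
        (list(A_masked.values()),
         ([rows.index(r) for r, _ in A_masked], [cols.index(c) for _, c in A_masked])),
        shape=(len(rows), len(cols)),
    )
    co_occurrence = (M @ M.T).toarray()   # which masked rows share masked columns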

Read this paper on arXiv…

J. Kepner, V. Gadepally, P. Michaleas, et al.
Tue, 24 Jun 14
66/82

Comments: to appear in IEEE High Performance Extreme Computing 2014 (ieee-hpec.org)

Achieving 100,000,000 database inserts per second using Accumulo and D4M [CL]

http://arxiv.org/abs/1406.4923


The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved which is 100x larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.

Read this paper on arXiv…

J. Kepner, W. Arcand, D. Bestor, et al.
Fri, 20 Jun 14
2/48

Comments: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 2014

DAS: a data management system for instrument tests and operations [IMA]

http://arxiv.org/abs/1405.7584


The Data Access System (DAS) is a metadata and data management software system, providing a reusable solution for the storage of data acquired both from telescopes and from auxiliary data sources during the instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing and quick-look analysis of the data acquired from scientific instruments. The DAS provides a data access layer mainly targeted at software applications: quick-look displays, pre-processing pipelines and scientific workflows. It is logically organized into three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, aimed at user-defined data types; an Application Programming Interface (DAS API), automatically adding classes and methods supporting the DDL data types and providing an object-oriented query language; and a data management component, which maps the metadata of the DDL data types into a relational Database Management System (DBMS) and stores the data in a shared (network) file system. With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data model specific one in C, C++ and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle and PostgreSQL.
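
An invented fragment of a DDL-style XML type definition, parsed with the Python standard library, just to illustrate the kind of metadata such a data type carries (element and attribute names are hypothetical, not the actual DAS DDL syntax):

    # Hypothetical DDL-like fragment and a minimal parse; illustrative only.
    import xml.etree.ElementTree as ET

    DDL_FRAGMENT = """
    <ddl>
      <type name="HousekeepingFrame" version="1.0">
        <metadata>
          <attribute name="obt_start" type="int64"/>
          <attribute name="temperature" type="float32" unit="K"/>
        </metadata>
        <data format="binaryTable"/>
        <associated type="RawScience" name="parent_frame"/>
      </type>
    </ddl>
    """

    root = ET.fromstring(DDL_FRAGMENT)
    for t in root.findall("type"):
        attrs = [(a.get("name"), a.get("type")) for a in t.iter("attribute")]
        print(t.get("name"), attrs)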

Read this paper on arXiv…

M. Frailis, S. Sartor, A. Zacchei, et al.
Fri, 30 May 14
69/74

Comments: Accepted for publication in the ADASS Conference Series

IVOA Recommendation: TAPRegExt: a VOResource Schema Extension for Describing TAP Services [IMA]

http://arxiv.org/abs/1402.4742


This document describes an XML encoding standard for metadata about services implementing the table access protocol TAP [TAP], referred to as TAPRegExt. Instance documents are part of the service’s registry record or can be obtained from the service itself. They deliver information to both humans and software on the languages, output formats, and upload methods supported by the service, as well as data models implemented by the exposed tables, optional language features, and certain limits enforced by the service.

Read this paper on arXiv…

M. Demleitner, P. Dowler, R. Plante, et al.
Thu, 20 Feb 14
44/52