Galactic ChitChat: Using Large Language Models to Converse with Astronomy Literature [CL]

http://arxiv.org/abs/2304.05406


We demonstrate the potential of the state-of-the-art OpenAI GPT-4 large language model to engage in meaningful interactions with Astronomy papers using in-context prompting. To optimize for efficiency, we employ a distillation technique that effectively reduces the size of the original input paper by 50%, while maintaining the paragraph structure and overall semantic integrity. We then explore the model’s responses using a multi-document context (ten distilled documents). Our findings indicate that GPT-4 excels in the multi-document domain, providing detailed answers contextualized within the framework of related research findings. Our results showcase the potential of large language models for the astronomical community, offering a promising avenue for further exploration, particularly the possibility of utilizing the models for hypothesis generation.
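
As a rough illustration of the multi-document, in-context prompting described above, the sketch below packs several distilled papers into a single GPT-4 prompt via the OpenAI Python client; the distillation step, prompt wording, and function names are illustrative placeholders rather than the authors' actual pipeline.

```python
# Minimal sketch of multi-document in-context prompting, assuming the OpenAI
# Python client (openai>=1.0) and an OPENAI_API_KEY in the environment.
# The distillation step and prompt wording are placeholders, not the
# authors' actual pipeline.
from openai import OpenAI

client = OpenAI()

def distill(paper_text: str, target_ratio: float = 0.5) -> str:
    """Toy 'distillation': keep roughly the first half of each paragraph.

    The paper reports a ~50% reduction that preserves paragraph structure;
    the real method is not specified here, so this only illustrates the
    interface."""
    distilled = []
    for paragraph in paper_text.split("\n\n"):
        sentences = paragraph.split(". ")
        keep = max(1, int(len(sentences) * target_ratio))
        distilled.append(". ".join(sentences[:keep]))
    return "\n\n".join(distilled)

def ask_across_papers(question: str, papers: list[str]) -> str:
    """Pack several distilled papers into one context and query GPT-4."""
    context = "\n\n---\n\n".join(distill(p) for p in papers)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the astronomy papers below.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```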

Read this paper on arXiv…

I. Ciucă and Y. Ting
Thu, 13 Apr 23
42/59

Comments: 3 pages, submitted to RNAAS, comments very welcome from the community

Can AI Put Gamma-Ray Astrophysicists Out of a Job? [CL]

http://arxiv.org/abs/2303.17853


In what will likely be a litany of generative-model-themed arXiv submissions celebrating April the 1st, we evaluate the capacity of state-of-the-art transformer models to create a paper detailing the detection of a Pulsar Wind Nebula with a non-existent Imaging Atmospheric Cherenkov Telescope (IACT) Array. We do this to evaluate the ability of such models to interpret astronomical observations and sources based on language information alone, and to assess potential means by which fraudulently generated scientific papers could be identified during peer review (given that reliable generative model watermarking has yet to be deployed for these tools). We conclude that our jobs as astronomers are safe for the time being. From this point on, prompts given to ChatGPT and Stable Diffusion are shown in orange, text generated by ChatGPT is shown in black, whereas analysis by the (human) authors is in blue.

Read this paper on arXiv…

S. Spencer, V. Joshi and A. Mitchell
Mon, 3 Apr 23
22/53

Comments: N/A

Improving astroBERT using Semantic Textual Similarity [CL]

http://arxiv.org/abs/2212.00744


The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we:
– announce the first public release of the astroBERT language model;
– show how astroBERT improves over existing public language models on astrophysics-specific tasks;
– and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
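
For orientation, the snippet below shows one standard way to score semantic textual similarity with a BERT-style encoder (mean-pooled token embeddings compared by cosine similarity); the checkpoint ID and pooling choice are assumptions, as the abstract does not detail the STS setup.

```python
# Illustrative semantic textual similarity scoring: mean-pooled BERT
# embeddings compared by cosine similarity. The checkpoint ID is an
# assumption; any BERT-style model loads the same way.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "adsabs/astroBERT"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

def similarity(a: str, b: str) -> float:
    return torch.cosine_similarity(embed(a), embed(b)).item()

print(similarity("results from the Planck mission",
                 "cosmic microwave background observations by Planck"))
```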

Read this paper on arXiv…

F. Grezes, T. Allen, S. Blanco-Cuaresma, et al.
Fri, 2 Dec 22
20/81

Comments: N/A

Searching for Carriers of the Diffuse Interstellar Bands Across Disciplines, using Natural Language Processing [CL]

http://arxiv.org/abs/2211.08513


The explosion of scientific publications overloads researchers with information. This is even more dramatic for interdisciplinary studies, where several fields need to be explored. A tool to help researchers overcome this is Natural Language Processing (NLP): a machine-learning (ML) technique that allows scientists to automatically synthesize information from many articles. As a practical example, we have used NLP to conduct an interdisciplinary search for compounds that could be carriers for Diffuse Interstellar Bands (DIBs), a long-standing open question in astrophysics. We have trained an NLP model on a corpus of 1.5 million open-access, cross-domain articles, and fine-tuned this model with a corpus of astrophysical publications about DIBs. Our analysis points us toward several molecules, studied primarily in biology, that have transitions at the wavelengths of several DIBs and are composed of abundant interstellar atoms. Several of these molecules contain chromophores, the small molecular groups responsible for a molecule’s colour, and could be promising candidate carriers. Identifying viable carriers demonstrates the value of using NLP to tackle open scientific questions in an interdisciplinary manner.
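
The abstract does not name the model architecture, but cross-domain literature mining of this kind is often done with word embeddings queried for nearest neighbours; the gensim-based sketch below is an illustrative stand-in, with corpus, hyperparameters, and query term as placeholders.

```python
# Illustrative embedding-based literature search with gensim's Word2Vec:
# train word vectors on a tokenized corpus, then rank terms by similarity
# to a query concept. The corpus, hyperparameters, and query term are
# placeholders; the authors' actual model and corpora differ.
from gensim.models import Word2Vec

def candidate_terms(corpus, query="dib", top_n=20):
    """`corpus` is assumed to be an iterable of tokenized abstracts, e.g.
    [["diffuse", "interstellar", "bands", ...], ["porphyrin", ...], ...]."""
    model = Word2Vec(sentences=corpus, vector_size=200, window=8,
                     min_count=5, workers=4, epochs=10)
    # Nearest neighbours in embedding space: terms used in contexts similar
    # to the query across the cross-domain corpus.
    return model.wv.most_similar(query, topn=top_n)
```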

Read this paper on arXiv…

C. Dobrenan, F. Galliano, J. Minton, et al.
Thu, 17 Nov 22
3/63

Comments: Accepted for publication by Journal of Interdisciplinary Methodologies and Issues in Science (JIMIS)

A New Task: Deriving Semantic Class Targets for the Physical Sciences [IMA]

http://arxiv.org/abs/2210.14760


We define deriving semantic class targets as a novel multi-modal task. By doing so, we aim to improve classification schemes in the physical sciences, which can be severely abstract and obfuscating. We address this task for upcoming radio astronomy surveys and present the derived semantic radio galaxy morphology class targets.

Read this paper on arXiv…

M. Bowles, H. Tang, E. Vardoulaki, et al.
Thu, 27 Oct 22
12/55

Comments: 6 pages, 1 figure, Accepted at Fifth Workshop on Machine Learning and the Physical Sciences (NeurIPS 2022), Neural Information Processing Systems 2022

Building astroBERT, a language model for Astronomy & Astrophysics [CL]

http://arxiv.org/abs/2112.00590


The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but they do not yet allow researchers to fully leverage semantic search. For example, a query for “results from the Planck mission” should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
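
As a hedged sketch of how a named entity recognition tool of this kind might be used, the snippet below runs token classification through the Hugging Face pipeline API; the checkpoint path is hypothetical and the ADS team's actual labels and interface may differ.

```python
# Hypothetical usage of a named entity recognition model through the
# Hugging Face pipeline API. The checkpoint path is a placeholder: a model
# fine-tuned for token classification (not the base masked-LM) is required.
from transformers import pipeline

ner = pipeline("token-classification",
               model="path/to/astrobert-ner-checkpoint",  # placeholder path
               aggregation_strategy="simple")              # merge sub-words

text = "Results from the Planck mission constrain the Hubble constant."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```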

Read this paper on arXiv…

F. Grezes, S. Blanco-Cuaresma, A. Accomazzi, et al.
Mon, 6 Dec 21
6/61

Comments: N/A

Towards Machine-assisted Meta-Studies: The Hubble Constant [IMA]

http://arxiv.org/abs/1902.00027


We present an approach for automatic extraction of measured values from the astrophysical literature, using the Hubble constant for our pilot study. Our rules-based model — a classical technique in natural language processing — has successfully extracted 298 measurements of the Hubble constant, with uncertainties, from the 208,541 available arXiv astrophysics papers. We have also created an artificial neural network classifier to identify papers which report novel measurements. This classifier is applied to the available arXiv data, and is demonstrated to work well in identifying papers which report new measurements. From the analysis of our results, we find that reporting measurements with uncertainties and the correct units is critical for identifying novel measurements in free text. Our results correctly highlight the current tension for measurements of the Hubble constant and recover the $3.5\sigma$ discrepancy — demonstrating that the tool presented in this paper is useful for meta-studies of astrophysical measurements from a large number of publications, and showing the potential to generalise this technique to other areas.
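
To make the rules-based idea concrete, the snippet below shows a simplified pattern for one common way of reporting the Hubble constant (value, symmetric uncertainty, units); it is an illustrative assumption, not the authors' extraction grammar.

```python
# Simplified illustration of a rules-based pattern for a Hubble constant
# measurement with a symmetric uncertainty and units. The authors' grammar
# is richer; this regex is an assumption for demonstration only.
import re

H0_PATTERN = re.compile(
    r"H_?0\s*=\s*(?P<value>\d{2}(?:\.\d+)?)\s*"               # e.g. 67.4
    r"(?:\+/-|±)\s*(?P<error>\d+(?:\.\d+)?)\s*"                # e.g. 0.5
    r"(?P<unit>km\s*/?\s*s(?:\^?-1)?\s*/?\s*Mpc(?:\^?-1)?)"    # km/s/Mpc
)

text = "We find H0 = 67.4 +/- 0.5 km/s/Mpc from the combined fit."
match = H0_PATTERN.search(text)
if match:
    print(match.group("value"), match.group("error"), match.group("unit"))
```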

Read this paper on arXiv…

T. Crossland, P. Stenetorp, S. Riedel, et al.
Mon, 4 Feb 19
36/60

Comments: 13 pages, 6 figures. Submitted to Monthly Notices of the Royal Astronomical Society

Quantifying the Cognitive Extent of Science [CL]

http://arxiv.org/abs/1511.00040


While modern science is characterized by exponential growth in the scientific literature, the increase in publication volume clearly does not reflect the expansion of the cognitive boundaries of science. Nevertheless, most of the metrics for assessing the vitality of science or for making funding and policy decisions are based on productivity. Similarly, the increasing level of knowledge production by large science teams, whose results often enjoy greater visibility, does not necessarily mean that “big science” leads to cognitive expansion. Here we present a novel, big-data method to quantify the extents of cognitive domains of different bodies of scientific literature independently from publication volume, and apply it to 20 million articles published over 60-130 years in physics, astronomy, and biomedicine. The method is based on the lexical diversity of titles of fixed quotas of research articles. Owing to the large size of the quotas, the method overcomes the inherent stochasticity of article titles to achieve <1% precision. We show that the periods of cognitive growth do not necessarily coincide with the trends in publication volume. Furthermore, we show that the articles produced by larger teams cover significantly smaller cognitive territory than (the same quota of) articles from smaller teams. Our findings provide a new perspective on the role of small teams and individual researchers in expanding the cognitive boundaries of science. The proposed method of quantifying the extent of the cognitive territory can also be applied to study many other aspects of “science of science.”
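
To make the quota idea concrete, the sketch below estimates lexical diversity as the number of distinct word types in fixed-size samples of article titles; quota size, tokenization, and averaging are simplifying assumptions rather than the paper's exact procedure.

```python
# Sketch of quota-based lexical diversity: count distinct word types in
# fixed-size random samples of article titles. Quota size, tokenization,
# and averaging are simplifying assumptions, not the paper's exact method.
import random
import re

def lexical_diversity(titles, quota=10_000, n_samples=20, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(titles, quota)        # fixed quota of titles
        tokens = [w for title in sample
                  for w in re.findall(r"[a-z]+", title.lower())]
        scores.append(len(set(tokens)))           # distinct word types
    return sum(scores) / len(scores)              # average over samples
```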

Read this paper on arXiv…

S. Milojevic
Tue, 3 Nov 15
56/90

Comments: Accepted for publication in Journal of Informetrics