Sampling soils for microbiological analyses in Victoria Valley. Photo: Charles Lee, 2013.

Case Study: Machine learning-assisted meta-analysis of biological research in Antarctica

22 August 2022

To understand what Antarctic ecosystems may look like in a warming world, we first need a comprehensive understanding of present-day biodiversity and biogeography. Unfortunately, no spatially and taxonomically comprehensive source of such information exists yet.

A large body of scientific knowledge on terrestrial, freshwater, and marine biology in the Ross Sea region has, however, accumulated from over five decades of research across multiple National Antarctic Programmes. This information (addressing species distribution, connectivity, dispersal, and performance across a range of environmental conditions) exists in journal articles, book chapters, and grey literature such as student theses and institutional reports. Collating such information is challenging, since relevant articles make up a tiny sliver of the scientific literature available from sources like CORE, Semantic Scholar, and Scopus.

As part of the Antarctic Science Platform, we are using cutting-edge data science approaches to harvest this wealth of data. We are using machine learning algorithms to augment human annotators, so that the final product is as inclusive as possible while its quality remains comparable to that of products generated using conventional approaches.

A custom informatics pipeline was built and used, with appropriately broad search terms (e.g., “Antarctic* AND invertebrate~”), to retrieve over 250,000 unique articles that potentially contain useful information. Natural language processing, data structuring, machine learning, and deep learning were iteratively applied to the titles and abstracts of the retrieved articles to generate relevance predictions for human validation. This process took eight months and identified ca. 8,000 relevant articles (i.e., articles containing spatially explicit information for biological species) across the entire Antarctic continent.
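As a rough illustration of how a broad Boolean search term such as “Antarctic* AND invertebrate~” might be evaluated against an article title, the Python sketch below treats a trailing `*` as a prefix wildcard and a trailing `~` as an approximate match. Both interpretations are assumptions, since operator semantics differ between literature databases. The function names and the 0.8 similarity threshold are hypothetical, and this is not the project's pipeline code.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def matches_term(term, words, fuzzy_threshold=0.8):
    """Check one search term against a list of words.

    Assumed semantics (hypothetical, for illustration only):
      'foo*' matches any word starting with 'foo';
      'foo~' matches any word whose similarity to 'foo' is at least
             fuzzy_threshold (difflib ratio);
      'foo'  matches only the exact word.
    """
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        return any(w.startswith(prefix) for w in words)
    if term.endswith("~"):
        stem = term[:-1]
        return any(
            SequenceMatcher(None, stem, w).ratio() >= fuzzy_threshold
            for w in words
        )
    return term in words


def matches_query(query, text):
    """Return True if every AND-joined term in the query matches the text."""
    words = tokens(text)
    return all(matches_term(t, words) for t in query.split(" AND "))


if __name__ == "__main__":
    query = "Antarctic* AND invertebrate~"
    # Made-up titles, used only to exercise the matcher.
    for title in [
        "Terrestrial invertebrates of Antarctica",
        "Penguins of Antarctica",
        "Arctic invertebrates",
    ]:
        print(title, "->", matches_query(query, title))
```

A deliberately permissive match of this kind favours recall over precision, which is consistent with retrieving a very large candidate set and leaving the narrowing to downstream relevance prediction and human validation.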

We are now (in August 2022) finishing the development of a second pipeline that uses rule-based algorithms, natural language processing, and machine learning to identify articles specifically relevant to the Ross Sea region. This pipeline will then be used to extract the biogeographical information, and the data on biological and ecological processes, contained in those articles for further human validation and synthesis.

In general, our final machine learning and deep learning models perform exceptionally well, with excellent sensitivity (i.e., not predicting many false negatives) while exhibiting more variable (but still highly respectable) specificity (i.e., not predicting many false positives). Our yet-to-be-published machine learning pipelines are receiving interest from other researchers looking to adapt our unique philosophy and codebase for non-Antarctic applications.
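For readers less familiar with these terms, the following minimal Python sketch shows how sensitivity and specificity are computed from human-validated labels and model predictions. The function name and the example labels are hypothetical and made up purely for illustration; they are not results from the project.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary relevance labels.

    y_true: human-validated labels (1 = relevant, 0 = not relevant)
    y_pred: model predictions in the same order
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # relevant, predicted relevant
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      # relevant, missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # irrelevant, rejected
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      # irrelevant, let through

    # Sensitivity: share of truly relevant articles the model finds.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of truly irrelevant articles the model rejects.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up labels for ten articles (illustrative only).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
    sens, spec = sensitivity_specificity(y_true, y_pred)
    print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

When predictions are passed on for human validation, high sensitivity matters most: a false positive costs an annotator a little time, whereas a false negative is a relevant article that is never seen.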

Our broad approach to search terms (including discipline-appropriate keywords for Antarctic flora, fauna, and microbiota) has the added benefit of capturing articles containing biogeographic knowledge for other parts of Antarctica (we included everything south of the Antarctic Convergence), making our work more relevant to the international Antarctic science community. The relevant articles will be retained in a database that will form the basis for future research in collaboration with international researchers.

Conducting a grid-based survey of vegetation in the Canada Glacier ASPA to cross-validate a drone-based survey. Photo: Charles Lee, 2018.

We are currently working on representing the extracted biogeographical knowledge using a framework that statistically accounts for spatial uncertainties (many articles were published before GPS was available for civilian use) and for the differing habitat and dispersal ranges of different taxa. This framework will form the basis of process-informed ecological models that underpin our effort to understand the future states of Antarctic ecosystems.

The compiled information will be the most comprehensive body of knowledge on Ross Sea region biodiversity and biogeography in existence. It will form the foundation for understanding present-day patterns, and the relationships between the environment and ecosystems, well enough to project how changes in climate, ice, and ocean currents will alter biogeographic and ecosystem processes.


Using a pulse-amplitude-modulation (PAM) fluorometer to conduct year-round monitoring of photosynthesis in lichens in the McMurdo Dry Valleys. Photo: Rolf Gademann, 2014.

This Case Study was submitted to MBIE as part of the ASP annual reporting for the 2021-2022 year. It showcases a new information pipeline to compile biogeographic knowledge for Antarctica, which will form a comprehensive database to underpin our research and support collaboration with international researchers.