Fig 1 Dry Valleys landscape Charles Lee

Structuring five decades of Antarctic biogeographical records from scientific literature

Comprehensive databases of biogeographic information support research and conservation. But what biogeographic data from across Antarctica is hidden as unstructured information in the scientific literature? Dr. Charles Lee aims to find out in this Opportunity Fund project.


Comprehensive records of Antarctic biogeography (i.e., where biological organisms exist) are fundamental to systematic conservation of Antarctic biota and understanding how their distribution may be altered by climate-driven environmental change.

Scientific publications on physiology, genetics, biochemistry, and other aspects of Antarctic biota often contain incidental biogeographical information (e.g., the sampling location for specific taxa in the materials and methods section). However, Antarctic biogeographical databases typically don’t capture this vast collection of biogeographical data, which exists in the scientific literature only as unstructured information.

Leveraging recent developments in machine learning, the Evolving Biogeography Register (EBR) was developed to fill this gap. Broad search terms were used to retrieve articles that may contain incidental or explicit biogeographic data. Such broad search terms are highly sensitive to relevant articles, but they also retrieve a vast number of irrelevant results. To address this challenge, the EBR informatic pipeline uses machine learning algorithms to predict whether individual articles are likely to be relevant and to identify candidate biogeographical records within them.

In other words, the EBR informatic pipeline helps focus human annotators on articles that are most likely relevant and directs them to plausible biogeographical data in those articles. Consequently, EBR generates biogeographic records that are at least as reliable as those obtained through conventional literature review.
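To make the triage idea concrete, here is a minimal sketch of how a relevance model could rank broadly retrieved articles so that annotators see the most promising ones first. It is not the EBR pipeline itself. The source does not say which algorithms EBR uses, so the small naive Bayes classifier, the names RelevanceModel and rank_for_annotation, and the example texts are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class RelevanceModel:
    """Tiny multinomial naive Bayes classifier with add-one smoothing.

    Hypothetical stand-in for a relevance predictor; not the EBR model.
    """

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, labelled_texts):
        """Count words per class from (text, is_relevant) pairs."""
        for text, is_relevant in labelled_texts:
            self.doc_counts[is_relevant] += 1
            self.word_counts[is_relevant].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_odds(self, text):
        """Return the log-odds that a text is relevant (positive = relevant)."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # words never seen in training carry no evidence
                smoothed = self.word_counts[label][token] + 1
                score += math.log(smoothed / (total_words + len(self.vocab)))
            scores[label] = score
        return scores[True] - scores[False]


def rank_for_annotation(model, articles):
    """Order articles so the most likely relevant ones come first."""
    return sorted(articles, key=model.log_odds, reverse=True)


# Invented training examples: True means the text holds a biogeographic record.
labelled = [
    ("Moss samples were collected at Cape Hallett in the Ross Sea region", True),
    ("Lichen specimens sampled from Taylor Valley soil plots", True),
    ("A numerical model of sea ice thickness and ocean heat flux", False),
    ("Ice core records of atmospheric carbon dioxide", False),
]

# Invented results of a broad literature search, awaiting triage.
retrieved = [
    "Ice shelf melt rates estimated from a numerical ocean model",
    "Springtail specimens were collected from moss beds near Cape Bird",
]

model = RelevanceModel().fit(labelled)
ranked = rank_for_annotation(model, retrieved)
for article in ranked:
    print(f"{model.log_odds(article):+.2f}  {article}")
```

In this sketch the model only reorders the reading queue. A human annotator still confirms every record, in line with the pipeline's stated role of focusing and directing annotators rather than replacing them.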

The initial register contains 25,000+ structured biogeographical records from more than 250,000 scientific articles, but is limited to the Ross Sea region. There is an opportunity to work with international collaborators to use the EBR informatic pipeline to generate an unprecedented collection of biogeographical records across the whole of Antarctica.

Research overview

This project will leverage and add value to the work already done by retrieving and structuring biogeographical records for key terrestrial biota across the entire Antarctic continent.

The research objectives are to:

  1. Improve the informatic and annotation pipelines to optimise them for extension, collaboration, and long-term maintenance.
  2. Develop training materials for the annotation pipeline and train national and international researchers to annotate biogeographic records.
  3. Make the expanded database publicly available.
  4. Generate (annotate, curate, and incorporate) new biogeographic records from across the continent, working with international collaborators.
  5. Generate comprehensive data products from conventional and novel data sources with consistent taxonomic and geographic references.

By expanding the register to the whole of Antarctica (and, fortuitously, many sub-Antarctic islands), this project will generate an invaluable source of information for the Antarctic science community and allow unprecedented insight into the biogeography and community assembly of Antarctic biota.

Fig 2 Plant life in Ant Gabrielle Koerich

Simple plants like algae, moss, and lichen grow in Antarctica. Photo: Gabrielle Koerich


  • Michael Meredyth-Young, the Data Curator for the Antarctic Science Platform and Antarctica New Zealand, will maintain and update the EBR.
  • Augusto Pellegrinetti, a freelance data scientist, will provide programming support and advice.