A beginner’s guide to metadata and ontologies

Maja Magel
Charlie Pauvert

2024-07-18

Learning objectives

  • define metadata and ontology in the context of biology and microbiology
  • identify the importance of reliable metadata in data-centric biology
  • recognize that FAIR is a process and that every step counts

Modern biology

Modern biology = big data?

big data vs data-centric

Book cover Data centric biology by Leonelli S.

[…] the real source of innovation in current biology is the attention paid to data handling and dissemination practices […] rather than the emergence of big data and associated methods per se. Leonelli (2016, 1)

Nothing is ever new

Is the attention to data handling and standardization really new in modern biology?

Model Organisms as seen by Vincent Van Gogh illustrated by SketchingScience

Model organisms

  • Standardization focuses research efforts (and funds)
  • But does not prohibit studying different organisms
  • Kick-starts research on other organisms by reusing findings from model organisms

Key to success: curated data description

Outline

  1. Why curate data?
  2. How to curate data?

Why curate data?

Data curation arguments

We curate data …

  1. to store it in a long-term archive
  2. to give biological context for interpretation
  3. to be able to reanalyze it in the future
  4. so that it can be scrutinized for possible misconduct

(optional) Can you think of novel arguments?

Which argument is pivotal for your research?

Is it different for the research produced by others?

How to curate data?

Biological context

Biology is highly contextual (few rules, many exceptions), so describing context is the key to making data travel across research investigations.

Decontextualisation

Seeing the forest for the trees

Recontextualisation

Looking up the forest's geographic coordinates

Decontextualisation

Borneo rainforest by Maria Stenzel

Figure 1 ENVO paper

The labeling of data through bio-ontologies ensures that they are at least temporarily decoupled from information about the local features of their production. Leonelli (2016, 30)

Make data adaptable to new research settings.
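As a minimal sketch of what labeling data with an ontology term can look like in practice, the snippet below maps lab-specific habitat labels onto a shared term. Everything in it is an assumption made for illustration: the tiny `VOCABULARY`, the `EX:` identifiers (placeholders, not real ENVO identifiers), the `annotate` helper and the sample records are all made up.

```python
# Hypothetical mini-vocabulary standing in for an ontology.
# The EX: identifiers are placeholders, NOT real ENVO term IDs.
VOCABULARY = {
    "EX:0001": {
        "label": "forest biome",
        "synonyms": {"forest", "rainforest", "jungle", "woodland"},
    },
    "EX:0002": {
        "label": "marine biome",
        "synonyms": {"sea", "ocean"},
    },
}


def annotate(local_label):
    """Return (term_id, shared_label) for a local label, or None if unknown."""
    needle = local_label.strip().lower()
    for term_id, term in VOCABULARY.items():
        if needle == term["label"] or needle in term["synonyms"]:
            return term_id, term["label"]
    return None


# Made-up samples from two studies that describe their habitat differently.
samples = [
    {"sample": "S1", "habitat": "Rainforest", "site": "Borneo"},
    {"sample": "S2", "habitat": "Woodland", "site": "Bavaria"},
]

for sample in samples:
    # The shared term decouples the sample from its local wording;
    # the original fields are kept so the context can be recovered later.
    sample["habitat_term"] = annotate(sample["habitat"])
    print(sample)
```

Both samples end up with the same shared term, so they can be compared, while the untouched `site` field remains available as metadata for recontextualisation.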

Earth Microbiome Project

100 individual studies decontextualised but recontextualised for new insights!

Figure 1 EMP paper

How to curate data?

  • Decontextualisation relies on ontologies

  • Recontextualisation?

It enables users to evaluate the potential meaning of data by assessing their provenance through the consultation of metadata. This is necessary to identify the value of data as evidence, thus helping to build an interpretation of their biological significance in a new research setting.

Leonelli (2016, 30)

Metadata definition

Metadata are data about the data, or a “love note to the future” (Scott 2011).

Metadata are “reliability labels” (Leonelli 2016, 28).

Types of metadata:

  • Descriptive: what is the data? e.g., title, description

  • Structural: how is the data organized? e.g., file, collection

  • Administrative: what is the provenance? e.g., versions, license

  • Quality: how good is the data? e.g., quality rank
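The four types listed above can be sketched as a small sorting exercise. This is only an illustration: the field names in `FIELD_TYPES`, the `group_by_type` helper and the example record are hypothetical and do not follow any particular metadata standard.

```python
from collections import defaultdict

# Hypothetical field names, each assigned to one of the four metadata types.
FIELD_TYPES = {
    "title": "descriptive",
    "description": "descriptive",
    "file_format": "structural",
    "collection": "structural",
    "version": "administrative",
    "license": "administrative",
    "quality_rank": "quality",
}


def group_by_type(record):
    """Group the fields of a metadata record by metadata type."""
    grouped = defaultdict(dict)
    for field, value in record.items():
        # Fields that are not classified yet stay visible instead of being dropped.
        kind = FIELD_TYPES.get(field, "unclassified")
        grouped[kind][field] = value
    return dict(grouped)


# Made-up record for a single dataset.
record = {
    "title": "Soil microbiome, plot A",
    "license": "CC-BY-4.0",
    "file_format": "FASTQ",
    "quality_rank": "high",
    "collector": "unknown",
}

for kind, fields in group_by_type(record).items():
    print(kind, fields)
```

Keeping an explicit `unclassified` group is a deliberate choice here: a field that does not fit any type yet is a prompt for discussion, not something to discard silently.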

Exercise: metadata

Task:

  • List one to four metadata fields that you have already encountered

  • Write them in the pad under one of the four types

In a nutshell

Forget-me-not

  • Metadata are important for re-usability of your data.
  • Ontologies give scientists and machines common terms, which helps generalize your data.

What about FAIR data?

FAIR principles

The FAIR principles explained by scibite.com

Source: https://scibite.com

15 principles were outlined by Wilkinson et al. (2016)

Take-home message

FAIR data takes time

Meme about how FAIR data takes time

FAIR data is a process

“Even if you don’t know how to go all the way to zero-to-60 open science, zero-to-20 is also really good”

Ellen Bledsoe in Perkel (2023)

kitten learning to hunt

References

Buttigieg, Pier, Norman Morrison, Barry Smith, Christopher J Mungall, Suzanna E Lewis, and the ENVO Consortium. 2013. “The Environment Ontology: Contextualising Biological and Biomedical Entities.” Journal of Biomedical Semantics 4 (1): 43. https://doi.org/10.1186/2041-1480-4-43.
Leonelli, Sabina. 2016. Data-Centric Biology: A Philosophical Study. Chicago; London: The University of Chicago Press.
Perkel, Jeffrey M. 2023. “How to Make Your Scientific Data Accessible, Discoverable and Useful.” Nature 618 (7967): 1098–99. https://doi.org/10.1038/d41586-023-01929-7.
Scott, Jason. 2011. “The Metadata Mania.” ASCII by Jason Scott. http://ascii.textfiles.com/archives/3181.
Thompson, Luke R., Jon G. Sanders, Daniel McDonald, Amnon Amir, Joshua Ladau, Kenneth J. Locey, Robert J. Prill, et al. 2017. “A Communal Catalogue Reveals Earth’s Multiscale Microbial Diversity.” Nature 551 (7681): 457–63. https://doi.org/10.1038/nature24621.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.