First steps towards FAIR (meta)data descriptions, ontology terms and data deposition
2024-12-04
Biometadata Crashcourse: First steps towards FAIR (meta)data descriptions, ontology terms and data deposition
From The Carpentries Code of Conduct:
What is your background with data deposition and what are your expectations? Note them in the pad.
[…] the real source of innovation in current biology is the attention paid to data handling and dissemination practices […] rather than the emergence of big data and associated methods per se. Leonelli (2016, 1)
Is the attention to data handling and standardization really new in modern biology?
Key to reusability success: curated data description and published (meta)data
100 individual studies decontextualised but recontextualised for new insights!
Curated and well-described data that stay on your hard drive are of limited interest to the scientific community.
National and international efforts exist to create and maintain data repositories for the life sciences.
For nucleotide sequence data, the INSDC integrates and mirrors data repositories from three regions:
USA with the NCBI
Europe with the EMBL-EBI
Japan with the DDBJ
Task
Report the repository in the pad
Homework: Where do your peers usually publish their data?
Alternatively, follow NFDI4Microbiota’s concise cheatsheet
Metadata are data about the data, or a “love note to the future” (Scott 2011).
Metadata are “reliability labels” (Leonelli 2016, 28)
Types of metadata:
Descriptive: what is the data? e.g., title, description
Structural: how is the data organized? e.g., file, collection
Administrative: what is the provenance? e.g., versions, license
Quality: How good is the data? e.g., quality rank
Task:
List 1-4 metadata fields that you have already encountered
Write them in the pad under one of the four types
Metadata field | Type of constraint |
---|---|
description | free-text |
geolocation | coordinates |
biome | ontology term |
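The three constraint types in the table can be illustrated with a minimal Python sketch. The function names, the "latitude longitude" decimal-degree format and the regular expression are assumptions made for illustration; real standards and submission portals define their own rules.

```python
import re

# Assumed shape of an ontology term: "label [PREFIX:ID]".
ONTOLOGY_TERM = re.compile(r"^.+ \[[A-Za-z]+:[A-Za-z0-9_]+\]$")


def check_description(value: str) -> bool:
    """Free text: any non-empty string is accepted."""
    return bool(value.strip())


def check_geolocation(value: str) -> bool:
    """Coordinates: 'latitude longitude' in decimal degrees (assumed format)."""
    parts = value.split()
    if len(parts) != 2:
        return False
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def check_biome(value: str) -> bool:
    """Ontology term: only the written form is checked, not the ontology itself."""
    return ONTOLOGY_TERM.match(value) is not None


CHECKS = {
    "description": check_description,
    "geolocation": check_geolocation,
    "biome": check_biome,
}


def validate(record: dict) -> dict:
    """Return, for each known field in the record, whether its value passes."""
    return {name: CHECKS[name](value) for name, value in record.items() if name in CHECKS}
```

Note that the free-text check can only test for presence, whereas the coordinate and ontology-term checks can be verified by a machine; this is one reason why constrained fields are easier to reuse.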
Fields are organized into a coherent metadata standard for a given type of data
They are built and maintained by a combination of stakeholders
Metadata standards (should) indicate for each field:
the description of the metadata field
the level of requirements (mandatory, recommended, optional)
the cardinality, that is, the number of values expected for the metadata field
a persistent identifier for the field
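As a rough illustration of the four properties listed above, a field specification could be modelled as follows. This is a sketch only: the `FieldSpec` class, the example fields and the `EXAMPLE:` identifiers are hypothetical placeholders, and cardinality is interpreted here as the maximum number of values allowed for a field.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    name: str
    description: str
    requirement: str            # "mandatory", "recommended" or "optional"
    max_values: Optional[int]   # cardinality upper bound; None means unbounded
    identifier: str             # persistent identifier (placeholders below)


# Hypothetical mini-standard; the identifiers are placeholders, not real ones.
SPECS = [
    FieldSpec("description", "Free-text summary of the sample", "mandatory", 1, "EXAMPLE:0001"),
    FieldSpec("geolocation", "Latitude and longitude of the sampling site", "mandatory", 1, "EXAMPLE:0002"),
    FieldSpec("host_diet", "Diet of the host organism", "optional", None, "EXAMPLE:0003"),
]


def check_record(record: Dict[str, List[str]], specs: List[FieldSpec]) -> List[str]:
    """List the problems found when comparing a record against the specs."""
    problems = []
    for spec in specs:
        values = record.get(spec.name, [])
        if spec.requirement == "mandatory" and not values:
            problems.append(f"{spec.name}: mandatory field is empty")
        elif spec.max_values is not None and len(values) > spec.max_values:
            problems.append(f"{spec.name}: at most {spec.max_values} value(s) allowed")
    return problems
```

For example, `check_record({"description": ["Mouse gut sample"]}, SPECS)` reports that the mandatory `geolocation` field is empty, while optional fields can be left out without triggering a problem.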
Task:
Look up a standard that matches your model system or type of sample
Identify the mandatory metadata fields
Compare with your neighbor
The set of mandatory fields is sometimes referred to as the minimal requirements.
Filling out these requirements and all the optional metadata fields would be ideal (if at all possible) but is time-consuming
NFDI4Microbiota has started to collect the overlap of minimal metadata fields across different platforms, data types and biological & environmental metadata
Task:
Biology is highly contextual (few rules, many exceptions), meaning context is the key point to address to make data travel across research investigations.
Why should you share this information?
In space, no one can hear you scream.
In data repositories, no one can hear your data description. Take the time to describe your data for humans and machines to understand!
MIxS lists three mandatory environmental metadata fields that expect ontology terms.
Metadata field | Abbreviation | Definition |
---|---|---|
broad-scale environmental context | env_broad_scale | global correlation; ecosystem |
local environmental context | env_local_scale | in local vicinity; causal influences |
environmental medium | env_medium | immediate surroundings of your sample during sampling |
Metadata field | Abbreviation | Recommended use of subclasses from |
---|---|---|
broad-scale environmental context | env_broad_scale | biome [ENVO:00000428] |
local environmental context | env_local_scale | deeper hierarchy than broad-scale (UBERON terms accepted) |
environmental medium | env_medium | environmental material [ENVO:00010483] |
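A minimal sketch of how the three fields could be checked by a machine, based on the recommendations in the table above. The mapping of fields to ontology acronyms is a simplifying assumption (MIxS allows other ontologies in some cases), and the sketch does not verify that a term is really a subclass of the recommended class; that requires looking the term up in the ontology. The example values reuse the root classes from the table as placeholders and should be replaced by more specific terms.

```python
# Ontology acronyms assumed to be acceptable for each field.
ALLOWED_PREFIXES = {
    "env_broad_scale": {"ENVO"},
    "env_local_scale": {"ENVO", "UBERON"},
    "env_medium": {"ENVO"},
}


def term_prefix(term: str) -> str:
    """Return the ontology acronym of 'label [PREFIX:ID]', or '' if malformed."""
    if not term.endswith("]") or "[" not in term:
        return ""
    identifier = term[term.rindex("[") + 1:-1]
    prefix, separator, local_id = identifier.partition(":")
    return prefix if separator and local_id else ""


def check_triad(triad: dict) -> list:
    """List the problems found in the three environmental fields."""
    problems = []
    for name, allowed in ALLOWED_PREFIXES.items():
        term = triad.get(name, "")
        if not term:
            problems.append(f"{name}: missing")
        elif term_prefix(term) not in allowed:
            problems.append(f"{name}: expected a term from {sorted(allowed)}")
    return problems


# Placeholder values taken from the table; pick more specific subclasses in practice.
example = {
    "env_broad_scale": "biome [ENVO:00000428]",
    "env_local_scale": "intestine [UBERON:0000160]",
    "env_medium": "environmental material [ENVO:00010483]",
}
```

Calling `check_triad(example)` returns an empty list, whereas a record without `env_medium` yields one problem for that field.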
What can you tell us about the biological & environmental provenance of your dataset or sample?
Where are your samples from? Do you use a model organism?
How can you make this information machine-accessible?
Let’s talk about…
cake food product [FOODON:00001278]
?pie
pie [44315]
gene that encodes the pineapple eye protein (fruit fly) [PR:Q9VKW2]
?List of terms, usually taken from the scientific literature
Ontology terms:
have curated textual definitions and synonyms
are arranged in a hierarchy from general to specific
have defined relationships with other terms (e.g., is_a, has_condition)
have persistent identifiers
can be cross-referenced with other resources (ontology or not)
Ontologies should reflect existing knowledge
You can find terms in ontologies using the search bar of:
OLS is the official ontology service of the EMBL-EBI
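Besides the search bar, OLS can also be queried programmatically. The sketch below only builds a search URL and does not send any request. The endpoint and the `q` and `ontology` parameter names are assumptions based on the public OLS API and should be checked against the current OLS documentation before use.

```python
from urllib.parse import urlencode

# Assumed search endpoint of OLS version 4; verify in the OLS documentation.
OLS_SEARCH_ENDPOINT = "https://www.ebi.ac.uk/ols4/api/search"


def build_ols_search_url(keyword: str, ontologies: list) -> str:
    """Build a search URL for a keyword, restricted to the given ontologies."""
    params = {
        "q": keyword,
        # Ontology acronyms are passed in lower case, separated by commas.
        "ontology": ",".join(name.lower() for name in ontologies),
    }
    return f"{OLS_SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_ols_search_url("pond", ["ENVO"])
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is expected to be JSON listing the matching terms.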
Demonstration: navigating OLS for term “intestines” in UBERON, ENVO
Result: we found the term intestine [UBERON:0000160] in both ontologies, UBERON and ENVO (as a term imported from UBERON)
The correct writing convention for ontology terms and their term identifiers is: term [ontology-acronym:sequence-number], e.g. intestine [UBERON:0000160]
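The writing convention above is easy to parse by machine, which is part of its value. The following sketch splits a written term into its parts and writes it back in the normalized form. The `OntologyTerm` class and the regular expression are illustrative assumptions; the identifier part accepts letters as well as digits, because some resources (such as PR:Q9VKW2) do not use purely numeric identifiers.

```python
import re
from typing import NamedTuple, Optional


class OntologyTerm(NamedTuple):
    label: str
    ontology: str
    local_id: str

    def __str__(self) -> str:
        return f"{self.label} [{self.ontology}:{self.local_id}]"


# label, optional space, then [ACRONYM:IDENTIFIER] at the end of the string
TERM_PATTERN = re.compile(
    r"^(?P<label>.+?)\s*\[(?P<ontology>[A-Za-z]+):(?P<local_id>[A-Za-z0-9_]+)\]$"
)


def parse_term(text: str) -> Optional[OntologyTerm]:
    """Parse 'label [ACRONYM:ID]'; return None if the text does not fit."""
    match = TERM_PATTERN.match(text.strip())
    if match is None:
        return None
    return OntologyTerm(match["label"], match["ontology"].upper(), match["local_id"])


term = parse_term("intestine [UBERON:0000160]")
```

Here `term.ontology` is `"UBERON"` and `str(term)` gives back the conventional form. A string without an ontology acronym, such as `pie [44315]`, is rejected because the identifier cannot be traced back to a resource.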
Task: using Ontology Lookup Service v4
Look up the following keywords in that order via the search bar: pond, ear and leaf
Select a term for each using these ontologies:
Uber-anatomy ontology (UBERON)
Plant Ontology (PO)
Environment Ontology (ENVO)
Report the term and the term identifier in the pad
Can we start the next lesson on selecting ontology terms for YOUR dataset descriptions?
TASK: Pick any 3 terms from your dataset description.
MIxS lists three mandatory environmental metadata fields that expect ontology terms.
Metadata field | Abbreviation | Recommended use of subclasses from |
---|---|---|
broad-scale environmental context | env_broad_scale | biome [ENVO:00000428] |
local environmental context | env_local_scale | deeper hierarchy than broad-scale (UBERON terms accepted) |
environmental medium | env_medium | environmental material [ENVO:00010483] |
Task, alone or in pairs:
Browse ENVO or UBERON (see previous table)
List ontology terms fitting your data
Fill out the following template on the pad:
env_broad_scale
env_local_scale
env_medium
Trouble finding an appropriate term downstream of the recommended class? See the instructions for using other ontologies with the MIxS standard
Data and metadata collection can be crucial (e.g., COVID-19)
DataHarmonizer: a super spreadsheet to help (Gill et al. 2023)
Load metadata standards
Fill the template
Validate your metadata against the template
Leverage DataHarmonizer to lower the barriers to collecting study and biosample data
We will not use it for data submission but for data description!
Receive guidance on how to meet standards
See the “full” methods section in the pad!
Task:
Working FAIRly takes time and effort
How much metadata is necessary to understand your research and to enable (inter-) disciplinary research?
“Even if you don’t know how to go all the way to zero-to-60 open science, zero-to-20 is also really good”
Ellen Bledsoe in Perkel (2023)