RO-Crate Tutorial Notebook#
Notebook Repository
RO-Crate Specification
Ongoing Development

Primary Contact: Dr. Michael Hollaway
Any data science project or workflow always starts with the data itself. Working with data often requires an understanding of the data's structure, the context in which it was collected, any underlying assumptions about the data, licensing information and the provenance of the data. This information is collectively known as metadata and is usually supplied with the data, but often in separate files or documentation. In addition, the data itself often comes in multiple files that sometimes need to be downloaded individually and uploaded to separate platforms before any analysis can be performed. This can make working with the data a challenge and lead to large amounts of time spent getting the data ready for analysis.
This notebook demonstrates the use of a Research Object Crate, or RO-Crate (https://www.researchobject.org/ro-crate/), to extract data from the Environmental Information Data Centre (EIDC), inspect its metadata and then read in the data itself. An RO-Crate is a way of packaging up an entire research object (data, metadata, methods, etc.) so that the data and metadata are linked in a coherent context no matter where they are stored. RO-Crate can also be used to read in the data directly from source rather than having to download it manually first.
Running the Notebook:
To run the notebook it is advised to first clone the repository housing it (`git clone NERC-CEH/ds-toolbox-notebook-rocrate-tutorial.git`). This will create a folder in the current directory called ds-toolbox-notebook-rocrate-tutorial, which holds all the relevant files including the notebook, environment file and input data. The next step is to create a conda environment with the right packages installed using the clean yml file (`conda env create -f environment_clean.yml`), which creates the rocrate-tutorial environment; activate it with `conda activate rocrate-tutorial`. At this point you can either run code from the notebook in your preferred IDE or via the Jupyter interface using the command `jupyter notebook`.
Generalisability:
This notebook is set up to be generalisable to other EIDC datasets that have been packaged as RO-Crate objects. To use the notebook with a different EIDC dataset, the user only needs to change the RO-Crate URL that the notebook reads in. The notebook inspects the specific metadata of the example dataset used; however, the concepts demonstrated here should apply to other RO-Crate objects and can easily be adapted to explore their contents.
This notebook demonstrates how to read a detached RO-Crate from the EIDC. A detached RO-Crate is one where the JSON-LD describing the data sits separately from the data and provides remote links to access it. In this case the JSON-LD is read directly from the EIDC URL for the dataset. This provides all of the metadata describing the data and also the ability to read in the data directly.
First, load in the required libraries for the notebook.
Retrieve the RO-Crate object from the EIDC#
The first step of the workflow is to get hold of the RO-Crate from the EIDC for the dataset that we want to explore. In this case we want to look at the COSMOS (cosmic-ray soil moisture) dataset. To get hold of the RO-Crate object we need the unique ID of the dataset, which in this example we will supply as a URL. This is for demonstration purposes only; in the future the hope is that we will be able to provide the unique EIDC ID directly to the notebook and the data will be ready to load. The first step is to create a function that gets hold of the crate using the URL.
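The helper `read_detached_crate` used below is not defined in this excerpt. A minimal sketch of what such a function might look like, assuming the `requests` and `rocrate` packages from the tutorial environment, is:

```python
def read_detached_crate(url):
    """Fetch the JSON-LD for a detached RO-Crate and load it as an ROCrate.

    An illustrative sketch: the EIDC serves the crate metadata as a single
    JSON-LD document, so we save it to a temporary directory under the
    standard file name and let the rocrate library parse it from there.
    """
    # Imports are kept inside the function so the definition stands alone.
    import tempfile
    from pathlib import Path

    import requests
    from rocrate.rocrate import ROCrate

    response = requests.get(url)
    response.raise_for_status()

    # The rocrate library expects a directory containing ro-crate-metadata.json.
    tmpdir = tempfile.mkdtemp()
    (Path(tmpdir) / "ro-crate-metadata.json").write_text(response.text)
    return ROCrate(tmpdir)
```

Saving the JSON-LD under the standard `ro-crate-metadata.json` name lets the library parse a detached crate the same way it would a local one.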
# Specify the URL.
url = "https://catalogue.ceh.ac.uk/documents/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a?format=rocrate"
# Read in the RO-Crate object.
EIDC_crate = read_detached_crate(url)
# Check that we have read in the correct type of object.
print(type(EIDC_crate))
<class 'rocrate.rocrate.ROCrate'>
Let's take a quick look inside the crate#
We have 2 types of entity stored in a crate:
Data Entities - These primarily exist in their own right as a file or directory (either a file in the RO-Crate root or downloadable via a URL).
Contextual Entities - These exist primarily outside the digital sphere (e.g. a person or a place) or are conceptual descriptions that mainly exist as metadata, such as coordinates or a contact point for the data.
To add slight confusion to the mix, some contextual entities can also be considered data entities where the content can be downloaded. A licence is a good example: you can download the licence itself, but it is not usually considered a data/research output, so it is better classed as a contextual entity.
# List the contextual entities of the dataset.
# Also print the type and id.
for entity in EIDC_crate.contextual_entities:
    print(entity.id, entity.type)
#eidc-dataCatalogue DataCatalog
https://ror.org/04xw4m193 Organization
https://doi.org/10.5285/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a PropertyValue
#bbox0 Place
#geoshape0 GeoShape
https://orcid.org/0000-0002-8303-2969 Person
https://orcid.org/0000-0001-6011-4075 Person
#creator2 Person
https://orcid.org/0000-0002-0203-5958 Person
https://orcid.org/0000-0001-7398-9268 Person
https://orcid.org/0000-0002-1033-4712 Person
https://orcid.org/0000-0002-6436-5266 Person
https://orcid.org/0000-0002-7439-0393 Person
#creator8 Person
https://orcid.org/0000-0002-1382-3407 Person
https://orcid.org/0000-0002-7473-7916 Person
https://orcid.org/0000-0001-5704-9006 Person
https://orcid.org/0000-0003-4194-1416 Person
#creator13 Person
https://orcid.org/0000-0003-1142-4039 Person
https://orcid.org/0009-0003-9102-5413 Person
https://orcid.org/0000-0003-4465-2169 Person
https://orcid.org/0000-0002-9822-316X Person
https://orcid.org/0009-0004-4445-8272 Person
#creator19 Person
#creator20 Person
#creator21 Person
https://orcid.org/0000-0003-2987-488X Person
https://orcid.org/0000-0002-1847-3127 Person
https://orcid.org/0000-0003-1611-2042 Person
#creator25 Person
#creator26 Person
https://orcid.org/0000-0002-7426-1153 Person
#creator28 Person
https://orcid.org/0000-0002-6447-6740 Person
#creator30 Person
#creator31 Person
https://orcid.org/0000-0002-5820-7093 Person
#creator33 Person
https://orcid.org/0000-0002-8350-4823 Person
https://orcid.org/0000-0002-6233-8942 Person
#creator36 Person
https://orcid.org/0000-0001-8881-185X Person
#creator38 Person
https://orcid.org/0000-0002-1318-5281 Person
#fund0 Organization
#distribution0 DataDownload
https://ror.org/00pggkr55 Organization
Let's have a look at one of the contextual entities in more detail. We will look at the funder of the work. To get hold of this we need the id of the entity, which in this case is #fund0 from the list of contextual entities above. We can then cycle through the funder entity and list its information. To get a specific entity we use the `get` method.
fund_entity = EIDC_crate.get('#fund0')
for item in fund_entity:
    print(item, fund_entity[item])
@id #fund0
@type Organization
name Natural Environment Research Council
From the above simple printout we can see that the type of entity is an organisation and that its name is Natural Environment Research Council, so we can see that NERC funded this work. How about trying one of the authors of this dataset?
person_entity = EIDC_crate.get('#creator2')
for item in person_entity:
    print(item, person_entity[item])
@id #creator2
@type Person
name Askquith-Ellis, A.
familyName Askquith-Ellis
givenName A.
email enquiries@ceh.ac.uk
affiliation <https://ror.org/00pggkr55 Organization>
Here we can see the name of the author and how to contact them regarding the dataset, which is via the generic UKCEH enquiries address. Under the affiliation entry we can see a link to another contextual entity describing an organisation. Let's look at that and see which organisation they were at when working on this dataset.
org_entity = EIDC_crate.get('https://ror.org/00pggkr55')
for item in org_entity:
    print(item, org_entity[item])
@id https://ror.org/00pggkr55
@type Organization
name UK Centre for Ecology & Hydrology
identifier https://ror.org/00pggkr55
This tells us, as expected, that the person worked at UKCEH, which is an organisation. There is also a link to the Research Organization Registry (ROR) entry for UKCEH, which handily is also the id linking us to this particular organisation entry in the crate.
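The "get an entity, loop over its properties" pattern above repeats for every entity we inspect, so it can be worth wrapping in a small helper. This is an illustrative sketch, not part of the rocrate API; it relies only on entities being mapping-like (iterable over their property names, with `[]` lookup):

```python
def entity_properties(entity):
    """Return an entity's properties as a list of (key, value) pairs.

    Works for anything mapping-like, including rocrate entities, which
    iterate over their property names and support [] lookup.
    """
    return [(key, entity[key]) for key in entity]


def show_entity(crate, entity_id):
    """Look up an entity by id in the crate and print its properties."""
    entity = crate.get(entity_id)
    for key, value in entity_properties(entity):
        print(key, value)
```

With this helper, `show_entity(EIDC_crate, '#fund0')` reproduces the funder printout above in one call.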
Let's get hold of some data#
Now that we have explored the contextual elements of our crate, let's look at the actual data that it contains. A good first step is to print the available entities in the crate. The first one will be the dataset entity itself, which tells us more about the dataset (keywords, access options) and, more importantly, which files it contains.
# List the dataset entities in the crate.
# Also print the type and id.
for data_entity in EIDC_crate.get_entities():
    if data_entity.type == 'Dataset':
        print(data_entity.id, data_entity.type)
https://catalogue.ceh.ac.uk/id/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/ Dataset
This gives us the ID of the dataset which is also the URL of where it sits in the EIDC. We can use this ID to find out a little more about our dataset.
dataset_entity = EIDC_crate.get('https://catalogue.ceh.ac.uk/id/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/')
for item in dataset_entity:
    print(item)
@id
@type
datePublished
name
identifier
creditText
description
isAccessibleForFree
variableMeasured
keywords
creator
contactPoint
temporalCoverage
spatialCoverage
funder
license
distribution
publisher
provider
includedInDataCatalog
hasPart
A note on licensing: if you go to the entry in the EIDC catalogue for this dataset and try to download the data, you will be presented with the following message and a link to the licence under which the data is distributed.
“By accessing or using this dataset, you agree to the terms of the relevant licence agreement(s). You will ensure that this dataset is cited in any publication that describes research in which the data have been used.”
“This dataset is available under the terms of the Open Government Licence”
You will note in the above list of items there is an entry labelled license. If we print out this entry, we get the following information, which contains a URL linking to the conditions under which the data is shared and any conditions that you must agree to in order to access and use it. This URL links to the same information that a user of the data would see if they had gone directly through the EIDC catalogue, and as it is supplied as part of the metadata in the RO-Crate it indicates that the same conditions apply when accessing the data using this programmatic method. It is best practice when accessing any dataset to first check whether there are any licensing conditions supplied with the data and what you need to do to satisfy them.
print(dataset_entity['license'])
https://eidc.ac.uk/licences/ogl/plain
You will also notice that the dataset-specific information has some crossover with the contextual data but also contains some extra metadata about the dataset. Indeed, some of the entries here just point to the respective contextual entity. An example of printing out the funder is below; it simply points to the id of the corresponding contextual entity.
# Print the funder.
print(dataset_entity['funder'])
[<#fund0 Organization>]
Some of the entries, such as keywords, print a list of relevant keywords describing the dataset. We could also look at information describing the temporal coverage of the dataset, held in the 'temporalCoverage' variable. We can also look at the spatial coverage, which links us back to the bounding box contextual entity, which in turn links through to the geoshape contextual entity (this holds the actual bounding coordinates of the dataset). This all gets quite complex but shows the links between all of the respective metadata and information describing the dataset.
# Print the keywords.
print(dataset_entity['keywords'])
['Absolute humidity', 'Atmospheric pressure', 'Cosmic-Ray sensing probe', 'COSMOS-UK', 'Latent heat', 'Longwave radiation', 'Net radiation', 'Potential evapotranspiration', 'Rainfall', 'Relative humidity', 'Sensible heat', 'Shortwave radiation', 'Soil depth 10cm', 'Soil depth 20cm', 'Soil depth 2cm', 'Soil depth 50cm', 'Soil depth 5cm', 'Soil heat flux', 'Soil humidity', 'Soil temperature', 'Soil water content', 'Soil wetness', 'UK', 'Volumetric water content', 'VWC', 'Wind direction', 'Wind speed', 'http://onto.nerc.ac.uk/CEHMD/topic/1', 'http://onto.nerc.ac.uk/CEHMD/topic/12', 'http://onto.nerc.ac.uk/CEHMD/topic/17', 'http://onto.nerc.ac.uk/CEHMD/topic/4', 'http://onto.nerc.ac.uk/CEHMD/topic/6', 'http://onto.nerc.ac.uk/CEHMD/topic/7', 'http://onto.nerc.ac.uk/CEHMD/topic/9', 'http://www.eionet.europa.eu/gemet/concept/281', 'http://www.eionet.europa.eu/gemet/concept/3022', 'http://www.eionet.europa.eu/gemet/concept/626', 'http://www.eionet.europa.eu/gemet/concept/637', 'http://www.eionet.europa.eu/gemet/concept/7843', 'http://www.eionet.europa.eu/gemet/concept/7874']
print(dataset_entity['temporalCoverage'], dataset_entity['spatialCoverage'])
['2013-01-01/2024-12-31'] [<#bbox0 Place>]
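The `temporalCoverage` value is an ISO 8601 interval of the form `start/end`, which the standard library can parse once split. A small sketch using the interval from the output above (`parse_interval` is an illustrative helper, not part of the rocrate API):

```python
from datetime import date


def parse_interval(interval):
    """Split an ISO 8601 'start/end' interval into a pair of dates."""
    start, end = interval.split("/")
    return date.fromisoformat(start), date.fromisoformat(end)


start, end = parse_interval("2013-01-01/2024-12-31")
print(start.year, end.year)  # the dataset spans 2013 to 2024
```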
place_entity = EIDC_crate.get('#bbox0')
for item in place_entity:
    print(item, place_entity[item])
@id #bbox0
@type Place
geo <#geoshape0 GeoShape>
shape_entity = EIDC_crate.get('#geoshape0')
for item in shape_entity:
    print(item, shape_entity[item])
@id #geoshape0
@type GeoShape
box -8.648 49.864, 1.768 60.861
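The `box` string above can be unpacked into numeric bounds with a few lines. A sketch (`parse_box` is an illustrative helper; the corner order, south-west then north-east with longitude first, is inferred from the values, which match the extent of the UK):

```python
def parse_box(box):
    """Parse a 'lon lat, lon lat' bounding-box string into named bounds.

    Assumes the two comma-separated points are the south-west and
    north-east corners, each given as 'longitude latitude'.
    """
    sw, ne = box.split(",")
    west, south = (float(v) for v in sw.split())
    east, north = (float(v) for v in ne.split())
    return {"west": west, "south": south, "east": east, "north": north}


bounds = parse_box("-8.648 49.864, 1.768 60.861")
print(bounds)
```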
Finally, let's list the actual files that make up the dataset. This gives us a rather long list of files and their respective ids in the crate. For simplicity, let's print out the first 10 entries only.
dataset_entity['hasPart'][0:10]
[<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_daily_2015-2024.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_daily_2015-2024_flags.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_sh_2015-2024.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_sh_2015-2024_flags.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_sh_2015-2024_qc_flags.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_balrd_hydrosoil_daily_2014-2024.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_balrd_hydrosoil_daily_2014-2024_flags.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_balrd_hydrosoil_sh_2014-2024.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_balrd_hydrosoil_sh_2014-2024_flags.csv File>,
<https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_balrd_hydrosoil_sh_2014-2024_qc_flags.csv File>]
Now that we can see we have data files lets extract one and look at our data.#
Let's pick the first file from the list, which should give us daily values of soil hydrology information for the first site in the list, which happens to be Alice Holt. We can see from the list above that the type of this data entity is a File and, from its id, that it appears to be a csv file. It should be noted that the list above is just a list of the data entities that make up the dataset held in the crate. If we want to know more about a particular entity we can extract it directly and double-check that it actually is the csv file we are expecting.
# Read in the correct data entity and print the available information about it.
data_entity = EIDC_crate.get(dataset_entity['hasPart'][0].id)
for item in data_entity:
    print(item, data_entity[item])
@id https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_daily_2015-2024.csv
@type File
name cosmos-uk_alic1_hydrosoil_daily_2015-2024.csv
encodingFormat text/csv
sha256 277e1b794455e4bb98e6d38a9f54bbb5f0fca9daa5a61adabaa682888fc57d26
lastModified 2025-06-24T08:54:15
bytes 1040867
contentUrl https://catalogue.ceh.ac.uk/datastore/eidchub/2dce161d-2fab-47bb-9fe6-38e7ed1ae18a/cosmos-uk_alic1_hydrosoil_daily_2015-2024.csv
The above confirms that the data entity is a csv file, and gives its size (~1 MB), when it was last modified and the URL where it can be accessed (which in this case happens to be the same as the id).
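Since the metadata includes a `sha256` checksum, the downloaded bytes can be verified against it before use. A sketch with the standard library (`verify_sha256` is an illustrative helper, not part of the rocrate API):

```python
import hashlib


def verify_sha256(content, expected):
    """Return True if the SHA-256 digest of `content` (bytes) matches
    the expected hex string from the crate metadata."""
    return hashlib.sha256(content).hexdigest() == expected.lower()
```

For the file above, one might fetch `data_entity['contentUrl']` (e.g. with `requests`) and then call `verify_sha256(response.content, data_entity['sha256'])` before reading the data.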
Now it's time to get hold of the actual data, which is as simple as reading it in using the contentUrl and pandas.
df_in = pd.read_csv(data_entity['contentUrl'])
df_in
| DATE_TIME | SITE_ID | LWIN | LWOUT | SWIN | SWOUT | RN | PRECIP | PRECIP_TIPPING | PRECIP_RAINE | ... | STP_TSOIL50 | COSMOS_VWC | CTS_MOD_CORR | D86_75M | SNOW | SNOW_DEPTH | SWE | ALBEDO | PE | GCC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-03-06 | ALIC1 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | -9999 | -9999 | ... | -9999.0 | -9999.0 | -9999.00000 | -9999.00000 | -9999 | -9999 | -9999 | -9999.000 | -9999.0 | -9999 |
| 1 | 2015-03-07 | ALIC1 | 23.7 | 30.5 | 13.3 | 1.9 | 4.6 | 0.0 | -9999 | -9999 | ... | 5.6 | 41.2 | 1127.86170 | 22.12836 | 0 | -9999 | -9999 | 0.126 | 1.7 | -9999 |
| 2 | 2015-03-08 | ALIC1 | 27.1 | 30.5 | 4.8 | 0.7 | 0.7 | 0.0 | -9999 | -9999 | ... | 5.7 | 43.2 | 1118.28578 | 21.68494 | 0 | -9999 | -9999 | 0.116 | 0.6 | -9999 |
| 3 | 2015-03-09 | ALIC1 | 28.3 | 29.7 | 2.4 | 0.4 | 0.7 | 0.1 | -9999 | -9999 | ... | 5.9 | 39.1 | 1138.72068 | 22.63179 | 0 | -9999 | -9999 | 0.111 | 0.3 | -9999 |
| 4 | 2015-03-10 | ALIC1 | 23.5 | 29.8 | 12.4 | 1.8 | 4.3 | 0.0 | -9999 | -9999 | ... | 5.9 | 46.3 | 1105.15844 | 21.05842 | 0 | -9999 | -9999 | 0.121 | 1.6 | -9999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3584 | 2024-12-27 | ALIC1 | 29.5 | 29.5 | 1.2 | 0.1 | 1.0 | 0.5 | -9999 | -9999 | ... | 8.4 | 40.8 | 1129.76166 | 22.22113 | 0 | -9999 | -9999 | 0.101 | 0.3 | -9999 |
| 3585 | 2024-12-28 | ALIC1 | 29.1 | 29.2 | 1.1 | 0.1 | 0.9 | 0.0 | -9999 | -9999 | ... | 8.4 | 40.3 | 1132.54062 | 22.33910 | 0 | -9999 | -9999 | 0.113 | 0.3 | -9999 |
| 3586 | 2024-12-29 | ALIC1 | 29.2 | 29.3 | 1.7 | 0.2 | 1.4 | 0.0 | -9999 | -9999 | ... | 8.3 | 40.2 | 1132.84181 | 22.36297 | 0 | -9999 | -9999 | 0.112 | 0.3 | -9999 |
| 3587 | 2024-12-30 | ALIC1 | 29.5 | 30.4 | 1.8 | 0.2 | 0.6 | 0.0 | -9999 | -9999 | ... | 8.2 | 39.0 | 1138.92442 | 22.65681 | 0 | -9999 | -9999 | 0.121 | 0.4 | -9999 |
| 3588 | 2024-12-31 | ALIC1 | 29.5 | 31.4 | 1.5 | 0.2 | -0.6 | 0.0 | -9999 | -9999 | ... | 8.2 | 39.9 | 1134.20340 | 22.43514 | 0 | -9999 | -9999 | 0.128 | 0.4 | -9999 |
3589 rows × 52 columns
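The `-9999` values in the table are missing-data sentinels (the assumption here, common for this kind of data, is that every `-9999` means "missing"). pandas can convert them to `NaN` either at read time or afterwards; a sketch on a toy frame mimicking the structure above:

```python
import pandas as pd

# A toy frame mimicking the columns above; -9999 marks missing values.
df = pd.DataFrame({"SWIN": [-9999.0, 13.3, 4.8], "PRECIP": [-9999.0, 0.0, 0.0]})

# Option 1: replace the sentinel after reading.
clean = df.replace(-9999, float("nan"))

# Option 2: treat it as missing at read time, e.g.
# pd.read_csv(data_entity['contentUrl'], na_values=[-9999])
print(clean.isna().sum())
```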
What about information on the quality of the data?#
Handily, if we inspect our list of data files again we can see that it also contains a corresponding flags file for that site. Let's read that in and inspect it, using the same approach as before.
df_in_flags = pd.read_csv(EIDC_crate.get(dataset_entity['hasPart'][1].id)['contentUrl'])
df_in_flags
| DATE_TIME | SITE_ID | LWIN_FLAG | LWOUT_FLAG | SWIN_FLAG | SWOUT_FLAG | RN_FLAG | PRECIP_FLAG | PRECIP_TIPPING_FLAG | PRECIP_RAINE_FLAG | ... | STP_TSOIL50_FLAG | COSMOS_VWC_FLAG | CTS_MOD_CORR_FLAG | D86_75M_FLAG | SNOW_FLAG | SNOW_DEPTH_FLAG | SWE_FLAG | ALBEDO_FLAG | PE_FLAG | GCC_FLAG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-03-06 | ALIC1 | M | M | M | M | M | M | M | M | ... | M | M | M | M | M | NaN | NaN | M | M | M |
| 1 | 2015-03-07 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 2 | 2015-03-08 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 3 | 2015-03-09 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 4 | 2015-03-10 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3584 | 2024-12-27 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 3585 | 2024-12-28 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 3586 | 2024-12-29 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 3587 | 2024-12-30 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
| 3588 | 2024-12-31 | ALIC1 | NaN | NaN | NaN | NaN | NaN | NaN | M | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | E | M |
3589 rows × 52 columns
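The data and flags files share `DATE_TIME` and `SITE_ID` columns, so they can be joined to mask flagged values. A sketch on toy frames (column names follow the tables above; the rule "blank out any value whose flag is set" is an illustrative choice, not an EIDC recommendation):

```python
import pandas as pd

data = pd.DataFrame({
    "DATE_TIME": ["2015-03-06", "2015-03-07"],
    "SITE_ID": ["ALIC1", "ALIC1"],
    "SWIN": [-9999.0, 13.3],
})
flags = pd.DataFrame({
    "DATE_TIME": ["2015-03-06", "2015-03-07"],
    "SITE_ID": ["ALIC1", "ALIC1"],
    "SWIN_FLAG": ["M", None],  # 'M' marks a flagged value, None no flag
})

# Join on the shared keys, then mask any value whose flag is set.
merged = data.merge(flags, on=["DATE_TIME", "SITE_ID"])
merged.loc[merged["SWIN_FLAG"].notna(), "SWIN"] = float("nan")
print(merged[["DATE_TIME", "SWIN"]])
```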
There we have it: we have shown how RO-Crate can easily be used to read in some data, inspect its corresponding metadata and understand all the contextual information about the data. This notebook can easily be adapted to other EIDC datasets, and then it's up to the user to feed the data into their data science workflow and do some interesting analysis!