.. _Infrastructure:

******************************************
TOAR Data Centre Infrastructure Components
******************************************

This section details the infrastructure’s systems and services as well as software, backup and other implemented housekeeping functions.

--------------------------------------------
Description of System and Service Components
--------------------------------------------

The TOAR Data Centre infrastructure consists of four main products and services (:numref:`figure-toar-dc-components`). These are:

* the **TOAR data portal**: a one-stop shop to locate and access tropospheric ozone data from a large variety of measurement platforms (https://toar-data.org),
* the **TOAR web services**: an interactive GUI (graphical user interface) for the online analysis of station-based surface ozone measurements and related variables (https://toar-data.fz-juelich.de/gui/v1 | v2) [#f20]_ together with a REST API (representational state transfer application programming interface) (https://toar-data.fz-juelich.de/api/v2 and https://toar-data.fz-juelich.de/api/v1) which allows for machine access to ozone data and ozone analyses,
* the **TOAR database** of station-based ground-level measurements of ozone, ozone precursors and meteorological variables. This database also contains meteorological variables from weather models. Access to the data is provided via the TOAR web services,
* the **TOAR data publication service**, which enables the TOAR data curators to publish data sets to the external service B2SHARE at FZJ (https://b2share.fz-juelich.de/communities/TOAR). B2SHARE offers trusted long-term publication of ozone data sets as well as TOAR analysis products, and it includes DOI (digital object identifier) assignment with DataCite.

.. _figure-toar-dc-components:

.. figure:: ./images/image005.jpg

   The TOAR Data Centre infrastructure and its main service components

In addition to these four core services, several other services run in the background. These are local instances of EUDAT’s B2SHARE, OpenStreetMap, and different geolocation services. In addition, software has been developed for the data ingestion workflow which is at the heart of the TOAR database. The servers housing the services are virtual machines (VMs) in OpenStack and VMware clusters at JSC or systems at third-party providers.

The following sections describe the individual service components, starting with the four core services, followed by some additional or external services which are integrated into the TOAR infrastructure.

~~~~~~~~~~~~~~~~
TOAR Data Portal
~~~~~~~~~~~~~~~~

.. |pic1| image:: ./images/image006.png
   :width: 16%
.. |pic2| image:: ./images/image007.png
   :width: 4%

|pic1| |pic2|

The TOAR data portal is a WordPress website hosted by Xerb. The domain toar-data.org is owned by Forschungszentrum Jülich and linked to the instance at xerb.de. Xerb, as host, takes care of backing up the WordPress instance together with its database, which is critical for recovering the website.

~~~~~~~~~~~~~
TOAR Database
~~~~~~~~~~~~~

.. |pic3| image:: ./images/image008.png
   :width: 16%
.. |pic4| image:: ./images/image009.png
   :width: 4%
.. |pic5| image:: ./images/image010.png
   :width: 4%
.. |pic6| image:: ./images/image011.png
   :width: 4%

|pic3| |pic4| |pic5| |pic6|

The TOAR database is a PostgreSQL database with PostGIS extensions. The database server runs on a VM in the Helmholtz Data Federation (HDF) cloud (see :numref:`chapter-db-setup-operation` for details). The database structure (data model) is described in :external:ref:`metadata-reference`. The database model is also available as a schema dump on GitLab, and the installation instructions are given in the README there.
Software for data ingestion consists of various Python programs which are available from the GitLab repository on request. The details of the ingestion workflow are given in :external:ref:`processing-workflow:the toar data processing workflow`. Data ingestion also makes use of geospatial information (station metadata). The management of geospatial data is described in :numref:`subsection-geopeas` to :numref:`subsection-openstreetmap`.

*For administrators:* the TOAR database code is available from the GitLab repository and the documentation from pages.

~~~~~~~~~~~~~~~~~
TOAR Web Services
~~~~~~~~~~~~~~~~~

.. |pic7| image:: ./images/image012.png
   :width: 16%
.. |pic8| image:: ./images/image013.png
   :width: 4%

|pic7| |pic8|

The user-accessible web services of the TOAR Database Infrastructure consist of a REST API (https://toar-data.fz-juelich.de/api/v1 and https://toar-data.fz-juelich.de/api/v2) for machine access to the TOAR database and a GUI (https://toar-data.fz-juelich.de/gui/v1) [#f20]_ for interactive data analysis.

The REST API is written using the Python package FastAPI (https://fastapi.tiangolo.com/). The source code can be found in the GitLab repository. For the data processing and graphical analysis, a standalone software package (toarstats) has been developed, which is integrated with the TOAR V2 REST API (https://toar-data.fz-juelich.de/api/v2).

A special REST API service for flux-based assessment of vegetation damage due to ozone, based on the DO3SE model, is currently under development. The beta version of its API can be accessed at https://toar-data.fz-juelich.de/do3se/api/v1/.

The TOAR V2 GUI is currently being developed as a dashboard in Python with the help of Plotly’s Dash library. Leaflet and our local instance of the OpenStreetMap tile service described below are used for map displays.

~~~~~~~~~~~~~~~~~~~~~~
TOAR Data Publications
~~~~~~~~~~~~~~~~~~~~~~

.. |pic9| image:: ./images/image014.png
   :width: 16%
.. |pic10| image:: ./images/image015.png
   :width: 4%

|pic9| |pic10|

The TOAR data publication service is realised as a Python tool that the TOAR data curators use to prepare data for publication, specifically to map the metadata to the schema used by B2SHARE. A specific community within the EUDAT B2SHARE instance has been created for TOAR publications (https://b2share.fz-juelich.de/communities/TOAR), and only the TOAR data curators have the right to publish in this B2SHARE community.

~~~~~~~
B2SHARE
~~~~~~~

.. |pic11| image:: ./images/image016.png
   :width: 16%

|pic11|

**B2SHARE** is external to the TOAR Data Centre infrastructure and is used by the data publication service. It is a trustworthy publication archive co-developed by leading European science institutions. The instance used by the TOAR Data Centre infrastructure runs on a VM that is also maintained at JSC. For the publications from the TOAR community, a special community metadata profile has been generated and uploaded to the B2SHARE service. The metadata profile is available from https://b2share.fz-juelich.de/communities/TOAR.

*For administrators:* Information about B2SHARE can be found at https://www.eudat.eu/services/b2share; the server source code can be obtained from https://github.com/EUDAT-B2SHARE. Note, however, that registration as a DOI registry with DataCite (https://datacite.org/dois.html) is necessary to deliver the full functionality of TOAR data publications.

.. _subsection-geopeas:

~~~~~~~~
GEO PEAS
~~~~~~~~

.. |pic12| image:: ./images/geopeas.png
   :width: 16%

|pic12|

**GEO PEAS** (GEOspatial Point Extraction and Aggregation Service) provides harmonised access to information from various geospatial datasets in aggregated form so that it can be included as metadata in the TOAR database or used in special analysis procedures. The GeoLocation service consists of a REST API which is written in Python with the Django web framework.
The source code of the GeoLocation service can be found in the repository https://gitlab.version.fz-juelich.de/esde/toar-data/geolocationservices.

.. note::
   This service runs behind a firewall and cannot be publicly exposed, because many earth observation (EO) datasets don’t allow open re-distribution.

The geospatial datasets which are processed by the GeoLocation service include a variety of EO datasets and vector data from a local instance of OpenStreetMap’s Overpass service (see below).

~~~~~~~~~~~~~~~~~~~~~~~
Rasdaman Array Database
~~~~~~~~~~~~~~~~~~~~~~~

.. |pic13| image:: ./images/rasdaman.png
   :width: 16%

|pic13|

The EO data which is analysed by the GeoLocation service is stored in a **rasdaman array database** (geoCube) which is for internal use only. For the complete list of data sources, version information and processing instructions refer to https://gitlab.version.fz-juelich.de/esde/toar-data/geolocationservices/-/wikis/Rasdaman-Data (access with JSC account).

.. _subsection-openstreetmap:

~~~~~~~~~~~~~~~~~~~~~
OpenStreetMap Service
~~~~~~~~~~~~~~~~~~~~~

.. |pic14| image:: ./images/image019.png
   :width: 16%
.. |pic15| image:: ./images/image020.png
   :width: 8%

|pic14| |pic15|

A local instance of **OpenStreetMap**’s (OSM) tile service is set up on the HIFIS (Helmholtz Federated IT Services) OSM Service VM. It is used to provide the map tiles on the GUI of the TOAR phase I web service (JOIN).

**Overpass API** is a read-only API that serves custom-selected parts of the OSM map data. It is implemented as part of the local OSM instance.

**Nominatim geocoder** uses OSM data to find locations on Earth by name and address, and supports reverse lookup (from latitude/longitude to location). The TOAR database infrastructure uses it to identify the country and state in which a station is located. It is installed together with the OSM tile service and uses PostgreSQL and Apache.
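The country and state lookup described above can be sketched as follows. This is a minimal illustration, not the code used in the TOAR infrastructure: the base URL is a placeholder (the address of the local Nominatim instance is not given here), the helper names are hypothetical, and the sample response only mimics the ``address`` part of a Nominatim reverse-geocoding reply.

```python
from urllib.parse import urlencode

# Placeholder: the address of the local Nominatim instance is an assumption.
NOMINATIM_BASE_URL = "https://nominatim.example.org"


def build_reverse_url(lat, lon, base_url=NOMINATIM_BASE_URL):
    """Build a Nominatim reverse-geocoding URL for one coordinate pair."""
    query = urlencode(
        {"lat": lat, "lon": lon, "format": "jsonv2", "addressdetails": 1}
    )
    return f"{base_url}/reverse?{query}"


def extract_country_state(response):
    """Return (country, state) from a decoded reverse-geocoding response.

    Entries that are missing from the response are returned as None.
    """
    address = response.get("address", {})
    return address.get("country"), address.get("state")


# Illustrative response fragment; the values are made up for this sketch.
sample_response = {
    "address": {"country": "Germany", "state": "North Rhine-Westphalia"}
}
country, state = extract_country_state(sample_response)
```

In practice the URL returned by ``build_reverse_url`` would be fetched with an HTTP client and the decoded JSON passed to ``extract_country_state``. Keeping the parsing separate from the request makes the lookup easy to test without network access.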

~~~~~~~~~~~~~~~~
Analysis Service
~~~~~~~~~~~~~~~~

The user-accessible web services for analysis of the TOAR Database Infrastructure consist of a REST API (https://toar-data.fz-juelich.de/api/v2/analysis/) for machine access to the TOAR database.

The REST API is written using the Python package FastAPI (https://fastapi.tiangolo.com/). The source code is currently not publicly available. For the data processing, a standalone software package (toarstats) has been developed, which is integrated with the TOAR V2 REST API. The source code for the standalone software package can be found in the GitLab repository (https://gitlab.jsc.fz-juelich.de/esde/toar-public/toarstats).

The analysis services provide bulk downloads of time series as well as bulk downloads of aggregated time series.

------------------
Server Environment
------------------

The TOAR database infrastructure is operated on different virtual machines hosted at JSC (:numref:`figure-toar-vms`). In addition to these VMs, the TOAR data infrastructure uses more than 2 terabytes of disk space on the GPFS parallel file system of JSC and a similar amount of archive space on tapes. Backup copies are maintained at JSC and RWTH Aachen (see :numref:`section-backup`). Furthermore, the TOAR database infrastructure makes use of the B2SHARE service (administered outside of the TOAR Data Centre infrastructure, but also hosted at JSC).

.. _figure-toar-vms:

.. figure:: ./images/image021.jpg

   Summary of TOAR-related VMs operated at JSC and the service relations between them; VMs accessible from outside JSC are in dark blue, the others are internal VMs.

Each VM is administered by a small number of JSC staff who are the only ones with ssh access to the system. The OpenStack and VMware cloud infrastructures are also maintained by a small group of administrators, some of whom also act as administrators for the TOAR VMs.
The only systems which are accessible from outside JSC are the toar-data web server, the toar-data-testing instance, and the HIFIS OSM service.

The following table summarises the tasks of the virtual machines in the TOAR data infrastructure:

.. _table-vm-tasks:

.. csv-table:: Tasks of the VMs in the TOAR Database Infrastructure
   :class: longtable
   :file: csv/components_table.csv
   :widths: 25 75
   :delim: |

A list of the software stack installed on each of the VMs listed above can be found in the Annex :numref:`annex-sw-stack`.

.. _section-backup:

------------------------------------
Data Locations and Backup Facilities
------------------------------------

Each of the VMs is set up such that daily incremental backups are automatically created from all data belonging to the respective VM. JSC uses IBM Spectrum Protect (formerly known as Tivoli Storage Manager) for backup. The backup of TOAR DB is replicated to RWTH Aachen University’s computing centre.

In addition, database dumps of the operational TOAR database (on VM TOAR DB) are taken annually and at certain events, for example when the database state is frozen to provide a consistent analysis base for the TOAR phase II assessment report. All database dumps will be accessible via B2SHARE. The code for setting up a new database instance and loading database dumps into this instance is provided publicly in the git repository, including step-by-step documentation in the README.md file (see also :numref:`chapter-db-setup-operation` below).

.. _figure-storage-backup:

.. figure:: ./images/image022.jpg

   Storage and Backup System

Raw data files downloaded from other air quality data archives or sent to the TOAR data centre by email are archived on tapes in the JSC archival storage. At least one additional copy is always available on a local PC or workstation.

.. rubric:: Footnotes

.. [#f20] At the time of writing the GUI is available for the TOAR V1 database only
.. [#f21] Note that the integration of DO3SE is still work in progress at the time of writing of this documentation
.. [#f22] https://airflow.apache.org/