5. TOAR Near Realtime Data Processing

Currently, we collect near real-time data from two data providers: UBA (German Environment Agency 1) and OpenAQ (open air quality data 2). The corresponding data harvesting procedures are described below.

5.1. UBA Data Harvesting

Since 2001, the German Umweltbundesamt (UBA 1) has provided preliminary data from a growing number (currently 1004) of German surface stations. The data exchange is based on the manual „Luftqualitätsdaten- und Informationsaustausch in Deutschland“ (air quality data and information exchange in Germany), Version V 5, April 2019 (in German).

Data for at least ozone, SO2, PM10, PM2.5, NO2, and CO are updated daily for the current day and are provided as continuous hourly values going back up to four days. We fetch data from the UBA service four times per day, at 08:00, 12:00, 18:00, and 22:00 local time.

The software for processing the data from UBA is available at https://gitlab.version.fz-juelich.de/esde/toar-data/toar-db-data/-/tree/master/toar_v2/harvesting/UBA_NRT. The files StationparameterMeta.csv, StationMeta.csv, and uba_%s.csv (where %s denotes a date) are harvested four times daily from http://www.luftdaten.umweltbundesamt.de/files/ (access requires credentials).

_images/uba-snapshot.png

Fig. 5.1 Snapshot from 2020-09-05 17:00 CEST

Table 5.1 Mapping of data from daily files imported to the TOAR database variables

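==================================  ==================================
name of component in original file  name of component in TOAR database
==================================  ==================================
Schwefeldioxid                      so2
Ozon                                o3
Stickstoffdioxid                    no2
Stickstoffmonoxid                   no
Kohlenmonoxid                       co
Temperatur                          temp
Windgeschwindigkeit                 wspeed
Windrichtung                        wdir
PM10                                pm10
PM2_5                               pm2p5
Relative Feuchte                    relhum
Benzol                              benzene
Ethan                               ethane
Methan                              ch4
Propan                              propane
Toluol                              toluene
o-Xylol                             oxylene
mp-Xylol                            mpxylene
Luftdruck                           press
==================================  ==================================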
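The mapping in Table 5.1 can be expressed as a simple lookup. The following is a minimal sketch, not the code from the UBA_NRT repository; the names `UBA_TO_TOAR_VARIABLE` and `toar_variable` are hypothetical, and returning `None` for an unmapped component is an assumption.

```python
# Component names from Table 5.1: UBA original file -> TOAR database.
UBA_TO_TOAR_VARIABLE = {
    "Schwefeldioxid": "so2",
    "Ozon": "o3",
    "Stickstoffdioxid": "no2",
    "Stickstoffmonoxid": "no",
    "Kohlenmonoxid": "co",
    "Temperatur": "temp",
    "Windgeschwindigkeit": "wspeed",
    "Windrichtung": "wdir",
    "PM10": "pm10",
    "PM2_5": "pm2p5",
    "Relative Feuchte": "relhum",
    "Benzol": "benzene",
    "Ethan": "ethane",
    "Methan": "ch4",
    "Propan": "propane",
    "Toluol": "toluene",
    "o-Xylol": "oxylene",
    "mp-Xylol": "mpxylene",
    "Luftdruck": "press",
}


def toar_variable(uba_component):
    """Return the TOAR variable name, or None if the component is not mapped.

    Surrounding whitespace is stripped first (an assumption about the input).
    """
    return UBA_TO_TOAR_VARIABLE.get(uba_component.strip())
```

A caller could skip any column for which `toar_variable` returns `None` instead of failing on the whole file.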

Table 5.2 Mapping of station_type

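=====================================  =====================================
term of station_type in original file  term of station_type in TOAR database
=====================================  =====================================
Hintergrund                            background
Industrie                              industrial
Verkehr                                traffic
=====================================  =====================================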

Table 5.3 Mapping of station_type_of_area

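=============================================  =============================================
term of station_type_of_area in original file  term of station_type_of_area in TOAR database
=============================================  =============================================
ländlich abgelegen                             rural
ländliches Gebiet                              rural
ländlich regional                              rural
ländlich stadtnah                              rural
städtisches Gebiet                             urban
vorstädtisches Gebiet                          suburban
=============================================  =============================================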
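Tables 5.2 and 5.3 together translate the two station classification terms; note that four different German terms collapse to `rural`. The sketch below assumes station metadata arrives as a dictionary with the keys `station_type` and `station_type_of_area`; that layout and the function name `translate_station` are hypothetical. Raising a `KeyError` on an unknown term is a design choice of this sketch, so that new terms in the source files are noticed rather than silently dropped.

```python
# Terms from Table 5.2.
STATION_TYPE = {
    "Hintergrund": "background",
    "Industrie": "industrial",
    "Verkehr": "traffic",
}

# Terms from Table 5.3 (several original terms map to "rural").
STATION_TYPE_OF_AREA = {
    "ländlich abgelegen": "rural",
    "ländliches Gebiet": "rural",
    "ländlich regional": "rural",
    "ländlich stadtnah": "rural",
    "städtisches Gebiet": "urban",
    "vorstädtisches Gebiet": "suburban",
}


def translate_station(meta):
    """Return a copy of `meta` with both classification terms translated.

    Raises KeyError if a term is not listed in Table 5.2 or Table 5.3.
    """
    translated = dict(meta)
    translated["station_type"] = STATION_TYPE[meta["station_type"]]
    translated["station_type_of_area"] = STATION_TYPE_OF_AREA[
        meta["station_type_of_area"]
    ]
    return translated
```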

Table 5.4 Mapping of units and unit conversions

=========  ==============  ===============  ===============================
component  original unit   unit in TOAR DB  unit conversion while ingesting
=========  ==============  ===============  ===============================
co         mg m-3          ppb              858.95
no         ug m-3          ppb              0.80182
no2        ug m-3          ppb              0.52297
o3         ug m-3          ppb              0.50124
so2        ug m-3          ppb              0.37555
benzene    ug m-3          ppb              0.30802
ethane     ug m-3          ppb              0.77698
ch4        ug m-3          ppb              1.49973
propane    ug m-3          ppb              0.52982
toluene    ug m-3          ppb              0.26113
oxylene    ug m-3          ppb              0.22662
mpxylene   ug m-3          ppb              0.22662
pm1        ug m-3          ug m-3
pm10       ug m-3          ug m-3
pm2p5      ug m-3          ug m-3
press      hPa             hPa
temp       degree celsius  degree celsius
wdir       degree          degree
wspeed     m s-1           m s-1
relhum     %               %
=========  ==============  ===============  ===============================
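As an illustration of Table 5.4, the sketch below applies the listed factors during ingestion. It assumes the numbers are multiplicative factors (value in original unit times factor gives ppb) and that components without a listed factor are stored unchanged; both points are our reading of the table, and the names `TO_PPB_FACTOR` and `convert_on_ingest` are hypothetical.

```python
# Conversion factors from Table 5.4 (original unit -> ppb).
# Assumption: the factor is multiplied with the original value.
TO_PPB_FACTOR = {
    "co": 858.95,       # original unit: mg m-3
    "no": 0.80182,      # original unit for all entries below: ug m-3
    "no2": 0.52297,
    "o3": 0.50124,
    "so2": 0.37555,
    "benzene": 0.30802,
    "ethane": 0.77698,
    "ch4": 1.49973,
    "propane": 0.52982,
    "toluene": 0.26113,
    "oxylene": 0.22662,
    "mpxylene": 0.22662,
}


def convert_on_ingest(component, value):
    """Convert `value` to the TOAR database unit for `component`.

    Components without a factor in Table 5.4 (particulate matter and the
    meteorological variables) keep their original unit and value.
    """
    return value * TO_PPB_FACTOR.get(component, 1.0)
```

With these factors, an ozone value of 100 ug m-3 becomes roughly 50.1 ppb, while a PM10 value passes through unchanged.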

Validated data from the previous year becomes available by 31 May at the latest. We request this data by email and then process it from the database dumps we receive. The validated data supersedes the preliminary near real-time data. The near real-time data remains in the database but is hidden from the standard user access procedures via the data quality flag settings.

5.2. OpenAQ

OpenAQ 2 collects data from real-time government and research-grade sources in 93 countries. Since 26 November 2016, OpenAQ has gathered more than one billion records with a total size of 306 gigabytes, covering the air quality variables BC, CO, NO2, O3, PM10, PM2.5, and SO2.

5.2.1. Data Provision

OpenAQ provides its real-time data on Amazon Web Services 3 in daily directories. Data files containing records of measurement values are added to the directory of the current day at irregular intervals. The directories and their data files are stored on Amazon Web Services permanently.

Each data file contains up to hundreds of thousands of records. Records are JSON 4 objects that keep the same structure throughout the entire life cycle. The task of our real-time data harvesting procedure is to go through these records and store them in the TOAR database, following the TOAR database schema, in near real time.

A key element for processing the OpenAQ data is a separate intermediate database. Data is uploaded to the TOAR database only once it is ready to be stored there.

The implemented real-time data harvesting procedure consists of four steps: the first two download the data and store it in the intermediate database, while the last two parse the fields and map them to the TOAR database schema.

The first two steps (workflow1) handle the transfer between Amazon Web Services and the intermediate database; the other two steps (workflow2) handle the transfer between the intermediate database and the TOAR database.

Technically, the open-source software Apache Airflow is used for workflow automation, so that the workflows are triggered at regular intervals throughout the day.

_images/openaq-processing-steps.png

Fig. 5.2 Overview of Processing Steps

5.2.2. The Intermediate Database

The intermediate database makes data parsing and mapping easier and enables pre-evaluation, statistics, and visualisation. To this end, we flatten the long lists of tree-structured records into a two-dimensional table.
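To illustrate what flattening a tree-structured record means, the following sketch turns one nested JSON object into a single flat row whose keys could serve as table columns. The sample record and its field names are hypothetical and only loosely modelled on an air quality measurement; the actual column layout of the intermediate database is not shown here.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Turn a nested dict into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested objects and prefix their keys.
            flat.update(flatten(value, name, sep))
        else:
            # Scalars (and lists) are kept as they are.
            flat[name] = value
    return flat


# Hypothetical record, for illustration only.
raw = """
{"parameter": "o3", "value": 41.0, "unit": "ppb",
 "date": {"utc": "2020-09-05T15:00:00.000Z"},
 "coordinates": {"latitude": 50.9, "longitude": 6.4}}
"""
row = flatten(json.loads(raw))
print(sorted(row))
```

Each nested field becomes one column (for example `date_utc` or `coordinates_latitude`), so a list of such rows maps directly onto a two-dimensional table.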

5.2.3. The Harvesting Workflow

  • Workflow 1

    We use Python and the boto3 5 Python module to query Amazon Web Services (AWS).

    First, the sub directories newly created 6 on AWS are identified and retrieved, and then inserted into the sub directory table and the data file table of the intermediate database.

    With that, the intermediate database is synchronised with AWS, and all unprocessed records are ready in the intermediate database for import into the TOAR database.

  • Workflow 2

    The second workflow identifies the station and the time series in the TOAR database to which a new record belongs, as described in Steps 10 to 17 in Section 3.2.

    With the id of the identified time series, the value of the record is finally inserted into the data table of the TOAR database. In this way, each record from the intermediate database is matched and saved into the TOAR database (Fig. 5.3).

_images/intermediate-db.png

Fig. 5.3 Simplified model of mapping records from intermediate database (left) into TOAR database (right)
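The first part of Workflow 1, finding the sub directories that are new compared to the last run, can be sketched with the boto3 `list_objects_v2` paginator. This is a minimal sketch and not our production code: the bucket name is taken from the URL in footnote 3, while the `realtime/` prefix, anonymous (unsigned) access, and all function names are assumptions. In practice the set of known sub directories would come from the sub directory table of the intermediate database.

```python
def list_sub_directories(s3_client, bucket, prefix):
    """Return the set of sub directory prefixes directly below `prefix`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    found = set()
    # With Delimiter="/", S3 reports each "directory" once as a common prefix.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            found.add(entry["Prefix"])
    return found


def new_sub_directories(s3_client, known,
                        bucket="openaq-fetches", prefix="realtime/"):
    """Return sub directories on AWS that are not yet in `known`, sorted."""
    return sorted(list_sub_directories(s3_client, bucket, prefix) - set(known))


def make_anonymous_client():
    """Create an S3 client that sends unsigned requests.

    Assumption: the bucket allows public read access.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))
```

A run would then call `new_sub_directories(make_anonymous_client(), known)` and insert the returned prefixes into the intermediate database before listing their data files.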

5.2.4. Workflow Automation with Apache Airflow

The data harvesting process described in the previous subsection can be executed in one batch or divided into two isolated workflows. In both cases, it should be scheduled, executed, and monitored automatically. To this end, we use the Apache Airflow workflow management software 7, installed on the same server as the intermediate database. Apache Airflow is registered as a system service so that it starts automatically on system boot. We define two separate workflows in Apache Airflow, as depicted in Fig. 5.2. Both workflows are scheduled hourly. The Apache Airflow web interface lets us monitor and control the workflows.

Footnotes

1 https://www.umweltbundesamt.de/en

2 https://www.openaq.org

3 https://openaq-fetches.s3.amazonaws.com/index.html

4 https://www.json.org

5 https://www.github.com/boto/boto3

6 Compared to the directories retrieved in the last run

7 https://airflow.apache.org