Exploring the openSenseMap Dataset

Norwin Roosen

2023-02-23

This package provides data ingestion functions for almost any data stored on https://opensensemap.org, the open data platform for environmental sensor data. Its main goals are to provide means for exploring the dataset and for analyzing the sensor data, both of which are walked through below.

Exploring the dataset

Before we look at actual observations, let’s get a grasp of the openSenseMap dataset’s structure.

library(magrittr)
library(opensensmapr)

all_sensors = osem_boxes()
summary(all_sensors)
## boxes total: 11367
## 
## boxes by exposure:
##  indoor  mobile outdoor unknown 
##    2344     591    8413      19 
## 
## boxes by model:
##                   custom          hackair_home_v2             homeEthernet 
##                     2776                       73                       73 
##    homeEthernetFeinstaub           homeV2Ethernet  homeV2EthernetFeinstaub 
##                       55                       21                       40 
##               homeV2Lora               homeV2Wifi      homeV2WifiFeinstaub 
##                      246                      578                      743 
##                 homeWifi        homeWifiFeinstaub        luftdaten_pms1003 
##                      215                      222                        9 
## luftdaten_pms1003_bme280        luftdaten_pms3003 luftdaten_pms3003_bme280 
##                       10                        1                        7 
##        luftdaten_pms5003 luftdaten_pms5003_bme280        luftdaten_pms7003 
##                        7                       60                        6 
## luftdaten_pms7003_bme280         luftdaten_sds011  luftdaten_sds011_bme280 
##                       78                      285                     3060 
##  luftdaten_sds011_bmp180   luftdaten_sds011_dht11   luftdaten_sds011_dht22 
##                      114                      135                     2553 
## 
## $last_measurement_within
##    1h    1d   30d  365d never 
##  3601  3756  4252  5938  2052 
## 
## oldest box: 2016-08-09 19:34:42 (OBS Bohmte UK_02)
## newest box: 2023-02-23 07:56:59 (Steinbrink 29)
## 
## sensors per box:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.981   5.000  76.000

This gives a good overview already: as of writing this, there are more than 11,000 sensor stations, of which about a third submitted a measurement within the last day. Most of them are placed outdoors and have around 5 sensors each. The oldest station is from August 2016, while the newest station was registered on the day of writing.
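The shares quoted in such an overview can be recomputed from the counts printed by summary(). The following is a minimal sketch that copies those counts by hand from the output above instead of querying the API again, so the numbers are only valid for this particular snapshot:

```r
# Counts copied by hand from the summary() output above (not fetched live).
boxes_total = 11367
by_exposure = c(indoor = 2344, mobile = 591, outdoor = 8413, unknown = 19)
last_measurement_within = c(
  "1h" = 3601, "1d" = 3756, "30d" = 4252, "365d" = 5938, never = 2052
)

# Share of all boxes per category, in percent.
exposure_share = round(100 * by_exposure / boxes_total, 1)
activity_share = round(100 * last_measurement_within / boxes_total, 1)

print(exposure_share)
print(activity_share)
```

Roughly three quarters of the boxes are outdoors, and about a third reported within the last day.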

Another feature of interest is the spatial distribution of the boxes: plot() can help us out here. This function requires a bunch of optional dependencies though.

if (!require('maps'))     install.packages('maps')
if (!require('maptools')) install.packages('maptools')
if (!require('rgeos'))    install.packages('rgeos')

plot(all_sensors)

It seems we have to reduce our area of interest to Germany.

But what do these sensor stations actually measure? Let’s find out. osem_phenomena() gives us a named list of the counts of each observed phenomenon for the given set of sensor stations:

phenoms = osem_phenomena(all_sensors)
str(phenoms)
## List of 3289
##  $ Temperatur                                                   : int 9385
##  $ rel. Luftfeuchte                                             : int 8317
##  $ PM10                                                         : int 8147
##  $ PM2.5                                                        : int 8135
##  $ Luftdruck                                                    : int 5667
##  $ Beleuchtungsstärke                                           : int 1674
##  $ UV-Intensität                                                : int 1665
##  $ Temperature                                                  : int 643
##  $ Humidity                                                     : int 473
##  $ VOC                                                          : int 422
##  $ Luftfeuchte                                                  : int 362
##  $ Lufttemperatur                                               : int 356
##  $ CO₂                                                          : int 304
##  $ Pressure                                                     : int 293
##  $ Bodenfeuchte                                                 : int 284
##  $ Luftfeuchtigkeit                                             : int 272
##  $ atm. Luftdruck                                               : int 245
##  $ Lautstärke                                                   : int 240
##  $ PM01                                                         : int 206
##  $ IAQ                                                          : int 162
##  $ Kalibrierungswert                                            : int 156
##  $ rel. Luftfeuchte SCD30                                       : int 156
##  $ Bodentemperatur                                              : int 155
##  $ Temperatur SCD30                                             : int 154
##  $ CO2eq                                                        : int 153
##  $ Windgeschwindigkeit                                          : int 152
##  $ pH-Wert                                                      : int 123
##  $ Gesamthärte                                                  : int 122
##  $ Blei                                                         : int 120
##  $ Eisen                                                        : int 120
##  $ GesamthaerteLabor                                            : int 120
##  $ Gesamthärte 2                                                : int 120
##  $ Kupfer C                                                     : int 120
##  $ Kupfer D                                                     : int 120
##  $ Kupfer1                                                      : int 120
##  $ Kupfer2                                                      : int 120
##  $ Nitrat                                                       : int 120
##  $ Nitrit                                                       : int 120
##  $ CO2                                                          : int 112
##  $ Feinstaub PM10                                               : int 98
##  $ Windrichtung                                                 : int 82
##  $ rel. Luftfeuchte (HECA)                                      : int 74
##  $ Temperatur (HECA)                                            : int 72
##  $ Temperatura                                                  : int 69
##  $ Helligkeit                                                   : int 67
##  $ Feinstaub PM2.5                                              : int 65
##  $ Taupunkt                                                     : int 62
##  $ Latitude                                                     : int 61
##  $ Longtitude                                                   : int 58
##  $ Durchschnitt Umgebungslautstärke                             : int 51
##  $ Minimum Umgebungslautstärke                                  : int 51
##  $ UV-Index                                                     : int 49
##  $ temperature                                                  : int 46
##  $ Batterie                                                     : int 45
##  $ Feinstaub PM1.0                                              : int 41
##  $ Umgebungslautstärke                                          : int 41
##  $ UV                                                           : int 40
##  $ humidity                                                     : int 38
##  $ Abstand nach links                                           : int 34
##  $ Beschleunigung Z-Achse                                       : int 34
##  $ Beschleunigung X-Achse                                       : int 33
##  $ Beschleunigung Y-Achse                                       : int 33
##  $ Geschwindigkeit                                              : int 33
##  $ Niederschlag                                                 : int 33
##  $ Feinstaub PM25                                               : int 32
##  $ PM1                                                          : int 32
##  $ Abstand nach rechts                                          : int 31
##  $ PM1.0                                                        : int 30
##  $ rel. Luftfeuchtigkeit                                        : int 30
##  $ Relative Humidity                                            : int 29
##  $ Sonnenstrahlung                                              : int 29
##  $ Luftdruck relativ                                            : int 28
##  $ Luftdruck absolut                                            : int 26
##  $ Rain                                                         : int 26
##  $ Regenrate                                                    : int 26
##  $ CO2 Konzentration                                            : int 25
##  $ RSSI                                                         : int 22
##  $ gefühlte Temperatur                                          : int 22
##  $ PM 2.5                                                       : int 21
##  $ Battery                                                      : int 20
##  $ Ciśnienie                                                    : int 20
##  $ Air Pressure                                                 : int 19
##  $ Regen                                                        : int 19
##  $ Schall                                                       : int 19
##  $ Signal                                                       : int 19
##  $ Ilmanpaine                                                   : int 18
##  $ Lämpötila                                                    : int 18
##  $ UV Index                                                     : int 18
##  $ Wind speed                                                   : int 18
##  $ PM 10                                                        : int 17
##  $ PM4                                                          : int 17
##  $ Air pressure                                                 : int 16
##  $ Temperatur DHT22                                             : int 16
##  $ Wind Direction                                               : int 16
##  $ Altitude                                                     : int 15
##  $ Illuminance                                                  : int 15
##  $ Speed                                                        : int 15
##  $ Wind Speed                                                   : int 15
##  $ pressure                                                     : int 15
##   [list output truncated]

That’s quite some noise there, with many phenomena being measured by a single sensor only, or many duplicated phenomena due to slightly different spellings. We should clean that up, but for now let’s just filter out the noise and find those phenomena with high sensor numbers:

phenoms[phenoms > 20]
## $Temperatur
## [1] 9385
## 
## $`rel. Luftfeuchte`
## [1] 8317
## 
## $PM10
## [1] 8147
## 
## $PM2.5
## [1] 8135
## 
## $Luftdruck
## [1] 5667
## 
## $Beleuchtungsstärke
## [1] 1674
## 
## $`UV-Intensität`
## [1] 1665
## 
## $Temperature
## [1] 643
## 
## $Humidity
## [1] 473
## 
## $VOC
## [1] 422
## 
## $Luftfeuchte
## [1] 362
## 
## $Lufttemperatur
## [1] 356
## 
## $`CO₂`
## [1] 304
## 
## $Pressure
## [1] 293
## 
## $Bodenfeuchte
## [1] 284
## 
## $Luftfeuchtigkeit
## [1] 272
## 
## $`atm. Luftdruck`
## [1] 245
## 
## $Lautstärke
## [1] 240
## 
## $PM01
## [1] 206
## 
## $IAQ
## [1] 162
## 
## $Kalibrierungswert
## [1] 156
## 
## $`rel. Luftfeuchte SCD30`
## [1] 156
## 
## $Bodentemperatur
## [1] 155
## 
## $`Temperatur SCD30`
## [1] 154
## 
## $CO2eq
## [1] 153
## 
## $Windgeschwindigkeit
## [1] 152
## 
## $`pH-Wert`
## [1] 123
## 
## $Gesamthärte
## [1] 122
## 
## $Blei
## [1] 120
## 
## $Eisen
## [1] 120
## 
## $GesamthaerteLabor
## [1] 120
## 
## $`Gesamthärte 2`
## [1] 120
## 
## $`Kupfer C`
## [1] 120
## 
## $`Kupfer D`
## [1] 120
## 
## $Kupfer1
## [1] 120
## 
## $Kupfer2
## [1] 120
## 
## $Nitrat
## [1] 120
## 
## $Nitrit
## [1] 120
## 
## $CO2
## [1] 112
## 
## $`Feinstaub PM10`
## [1] 98
## 
## $Windrichtung
## [1] 82
## 
## $`rel. Luftfeuchte (HECA)`
## [1] 74
## 
## $`Temperatur (HECA)`
## [1] 72
## 
## $Temperatura
## [1] 69
## 
## $Helligkeit
## [1] 67
## 
## $`Feinstaub PM2.5`
## [1] 65
## 
## $Taupunkt
## [1] 62
## 
## $Latitude
## [1] 61
## 
## $Longtitude
## [1] 58
## 
## $`Durchschnitt Umgebungslautstärke`
## [1] 51
## 
## $`Minimum Umgebungslautstärke`
## [1] 51
## 
## $`UV-Index`
## [1] 49
## 
## $temperature
## [1] 46
## 
## $Batterie
## [1] 45
## 
## $`Feinstaub PM1.0`
## [1] 41
## 
## $Umgebungslautstärke
## [1] 41
## 
## $UV
## [1] 40
## 
## $humidity
## [1] 38
## 
## $`Abstand nach links`
## [1] 34
## 
## $`Beschleunigung Z-Achse`
## [1] 34
## 
## $`Beschleunigung X-Achse`
## [1] 33
## 
## $`Beschleunigung Y-Achse`
## [1] 33
## 
## $Geschwindigkeit
## [1] 33
## 
## $Niederschlag
## [1] 33
## 
## $`Feinstaub PM25`
## [1] 32
## 
## $PM1
## [1] 32
## 
## $`Abstand nach rechts`
## [1] 31
## 
## $PM1.0
## [1] 30
## 
## $`rel. Luftfeuchtigkeit`
## [1] 30
## 
## $`Relative Humidity`
## [1] 29
## 
## $Sonnenstrahlung
## [1] 29
## 
## $`Luftdruck relativ`
## [1] 28
## 
## $`Luftdruck absolut`
## [1] 26
## 
## $Rain
## [1] 26
## 
## $Regenrate
## [1] 26
## 
## $`CO2 Konzentration`
## [1] 25
## 
## $RSSI
## [1] 22
## 
## $`gefühlte Temperatur`
## [1] 22
## 
## $`PM 2.5`
## [1] 21
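The cleanup of differently spelled phenomena mentioned earlier is not part of this document, but one possible approach is a hand-written lookup table that maps each spelling to a canonical name and then sums the counts. The sketch below uses only a few counts copied from the output above; the `canonical` mapping is a hypothetical example, and whether two names really describe the same quantity remains a judgment call:

```r
# A few counts copied from the output above; not the full list.
phenoms_sample = c(
  Temperatur = 9385, Temperature = 643, temperature = 46, Temperatura = 69,
  "rel. Luftfeuchte" = 8317, Humidity = 473, humidity = 38, Luftfeuchte = 362
)

# Hypothetical hand-written mapping from spelling variants to one canonical name.
canonical = c(
  Temperatur = "temperature", Temperature = "temperature",
  temperature = "temperature", Temperatura = "temperature",
  "rel. Luftfeuchte" = "humidity", Humidity = "humidity",
  humidity = "humidity", Luftfeuchte = "humidity"
)

# Sum the sensor counts of all variants that map to the same canonical name.
merged = tapply(phenoms_sample, canonical[names(phenoms_sample)], sum)
print(merged)
```

A lookup table keeps every merge decision explicit, at the cost of having to maintain it by hand as new spellings appear.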

Alright, temperature it is! Fine particulate matter (PM2.5) seems to be more interesting to analyze though. We should check how many sensor stations provide useful data: we want only those boxes with a PM2.5 sensor that are placed outdoors and are currently submitting measurements:

pm25_sensors = osem_boxes(
  exposure = 'outdoor',
  date = Sys.time(), # ±4 hours
  phenomenon = 'PM2.5'
)
summary(pm25_sensors)
## boxes total: 3002
## 
## boxes by exposure:
## outdoor 
##    3002 
## 
## boxes by model:
##                   custom          hackair_home_v2    homeEthernetFeinstaub 
##                      174                        8                       12 
##  homeV2EthernetFeinstaub               homeV2Lora               homeV2Wifi 
##                       10                       21                        2 
##      homeV2WifiFeinstaub                 homeWifi        homeWifiFeinstaub 
##                      126                        3                       30 
##        luftdaten_pms1003 luftdaten_pms1003_bme280        luftdaten_pms5003 
##                        1                        2                        3 
## luftdaten_pms5003_bme280        luftdaten_pms7003 luftdaten_pms7003_bme280 
##                       11                        2                       26 
##         luftdaten_sds011  luftdaten_sds011_bme280  luftdaten_sds011_bmp180 
##                      115                     1365                       59 
##   luftdaten_sds011_dht11   luftdaten_sds011_dht22 
##                       45                      987 
## 
## $last_measurement_within
##    1h    1d   30d  365d never 
##  2977  3002  3002  3002     0 
## 
## oldest box: 2017-03-03 18:20:43 (Witten Heven Dorf)
## newest box: 2023-02-23 07:56:59 (Steinbrink 29)
## 
## sensors per box:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   5.000   4.838   5.000  26.000
plot(pm25_sensors)

That still leaves about 3,000 measuring stations; we can work with that.

Analyzing sensor data

Having analyzed the available data sources, let’s finally get some measurements. We could call osem_measurements(pm25_sensors) now, but we are focusing on a restricted area of interest, the city of Berlin. Luckily, we can get the measurements filtered by a bounding box:

library(sf)
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(units)
## udunits database from C:/Software/RPackages/units/share/udunits/udunits2.xml
library(lubridate)
library(dplyr)

# construct a bounding box: 12 kilometers around Berlin
berlin = st_point(c(13.4034, 52.5120)) %>%
  st_sfc(crs = 4326) %>%
  st_transform(3857) %>% # allow setting a buffer in meters
  st_buffer(set_units(12, km)) %>%
  st_transform(4326) %>% # the opensensemap expects WGS 84
  st_bbox()
pm25 = osem_measurements(
  berlin,
  phenomenon = 'PM2.5',
  from = now() - days(3), # defaults to 2 days
  to = now()
)

plot(pm25)
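One caveat about the buffer above: distances in Web Mercator (EPSG:3857) are stretched with increasing latitude, so a buffer of 12 km in map units covers less than 12 km on the ground. The sketch below estimates the effect using the spherical Mercator approximation (scale factor 1 / cos(latitude)); it is a rough estimate, not an exact geodesic computation:

```r
# Latitude of the buffer centre used above, in degrees.
lat = 52.5120
buffer_3857_km = 12

# In (spherical) Web Mercator, map distances are stretched by 1 / cos(latitude),
# so a buffer of d map units covers roughly d * cos(latitude) on the ground.
scale_factor = 1 / cos(lat * pi / 180)
ground_km = buffer_3857_km / scale_factor

print(round(scale_factor, 2))
print(round(ground_km, 1))
```

At Berlin’s latitude the stretch is about 1.64, so the buffer reaches roughly 7.3 km on the ground. If a true 12 km radius matters, a projection with little distortion around Berlin, such as UTM zone 33N (EPSG:32633), could be used in place of 3857.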

Now we can get started with actual spatiotemporal data analysis. First, let’s mask the seemingly uncalibrated sensors:

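# sensors that reported at least one implausibly high value
outliers = filter(pm25, value > 100)$sensorId
# drop unused factor levels, leaving only the IDs of the flagged sensors
bad_sensors = outliers[, drop = TRUE] %>% levels()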

pm25 = mutate(pm25, invalid = sensorId %in% bad_sensors)
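To make the masking logic easier to follow, here is the same idea on a small made-up data frame using base R only. The column names mirror those used above, but the data and the variable `toy` are purely illustrative:

```r
# Toy data standing in for `pm25`; the values are made up for illustration.
toy = data.frame(
  sensorId = factor(c("a", "a", "b", "b", "c")),
  value    = c(12, 15, 250, 9, 30)
)

# IDs of sensors that reported at least one value above the threshold.
bad_sensors = unique(as.character(toy$sensorId[toy$value > 100]))

# Flag every measurement of those sensors, not only the extreme ones.
toy$invalid = toy$sensorId %in% bad_sensors
print(toy)
```

Note that a single reading above the threshold marks all measurements of that sensor as invalid, including plausible ones such as the value 9 of sensor "b".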

Then plot the measuring locations, flagging the outliers:

st_as_sf(pm25) %>% st_geometry() %>% plot(col = factor(pm25$invalid), axes = TRUE)

Removing these sensors yields a nicer time series plot:

pm25 %>% filter(invalid == FALSE) %>% plot()

Further analysis: comparison with LANUV data TODO