Analyzing environmental sensor data from openSenseMap.org in R

This package provides data ingestion functions for almost any data stored on the open data platform for environemental sensordata https://opensensemap.org. Its main goals are to provide means for:

Exploring the dataset

Before we look at actual observations, lets get a grasp of the openSenseMap datasets' structure.

library(magrittr)
library(opensensmapr)

all_sensors = osem_boxes()
summary(all_sensors)
## boxes total: 1779
## 
## boxes by exposure:
##  indoor  mobile outdoor unknown 
##     288      55    1416      20 
## 
## boxes by model:
##                   custom             homeEthernet    homeEthernetFeinstaub 
##                      335                       92                       49 
##                 homeWifi        homeWifiFeinstaub        luftdaten_pms1003 
##                      192                      144                        1 
## luftdaten_pms1003_bme280 luftdaten_pms5003_bme280 luftdaten_pms7003_bme280 
##                        1                        5                        2 
##         luftdaten_sds011  luftdaten_sds011_bme280  luftdaten_sds011_bmp180 
##                       57                      197                       19 
##   luftdaten_sds011_dht11   luftdaten_sds011_dht22 
##                       46                      639 
## 
## $last_measurement_within
##    1h    1d   30d  365d never 
##   921   960  1089  1427   235 
## 
## oldest box: 2014-05-28 15:36:14 (CALIMERO)
## newest box: 2018-05-24 20:29:50 (Stadthalle)
## 
## sensors per box:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   4.000   4.601   5.000  33.000

This gives a good overview already: As of writing this, there are more than 700 sensor stations, of which ~50% are currently running. Most of them are placed outdoors and have around 5 sensors each. The oldest station is from May 2014, while the latest station was registered a couple of minutes ago.

Another feature of interest is the spatial distribution of the boxes: plot() can help us out here. This function requires a bunch of optional dependencies though.

if (!require('maps'))     install.packages('maps')
if (!require('maptools')) install.packages('maptools')
if (!require('rgeos'))    install.packages('rgeos')

plot(all_sensors)

plot of chunk unnamed-chunk-3

It seems we have to reduce our area of interest to Germany.

But what do these sensor stations actually measure? Lets find out. osem_phenomena() gives us a named list of of the counts of each observed phenomenon for the given set of sensor stations:

phenoms = osem_phenomena(all_sensors)
str(phenoms)
## List of 432
##  $ Temperatur                                      : int 1607
##  $ rel. Luftfeuchte                                : int 1421
##  $ PM10                                            : int 1200
##  $ PM2.5                                           : int 1198
##  $ Luftdruck                                       : int 824
##  $ Beleuchtungsstärke                              : int 480
##  $ UV-Intensität                                   : int 471
##  $ Luftfeuchtigkeit                                : int 84
##  $ Temperature                                     : int 49
##  $ Humidity                                        : int 42
##  $ Helligkeit                                      : int 25
##  $ Lautstärke                                      : int 21
##  $ Schall                                          : int 20
##  $ UV                                              : int 20
##  $ Pressure                                        : int 19
##  $ Licht                                           : int 18
##  $ Luftfeuchte                                     : int 14
##  $ Umgebungslautstärke                             : int 14
##  $ Lämpötila                                       : int 13
##  $ Ilmanpaine                                      : int 12
##  $ Signal                                          : int 12
##  $ Feinstaub PM10                                  : int 10
##  $ Feinstaub PM2.5                                 : int 9
##  $ Kosteus                                         : int 8
##  $ Valonmäärä                                      : int 8
##  $ temperature                                     : int 8
##  $ PM01                                            : int 7
##  $ Temperatur DHT22                                : int 7
##  $ UV-säteily                                      : int 7
##  $ Niederschlag                                    : int 6
##  $ UV-Strahlung                                    : int 6
##  $ Wind speed                                      : int 6
##  $ Windgeschwindigkeit                             : int 6
##  $ humidity                                        : int 6
##  $ Ilmankosteus                                    : int 5
##  $ Wassertemperatur                                : int 5
##  $ Windrichtung                                    : int 5
##  $ rel. Luftfeuchtigkeit                           : int 5
##  $ Druck                                           : int 4
##  $ Light                                           : int 4
##  $ Temperature 1                                   : int 4
##  $ UV Index                                        : int 4
##  $ UV-Säteily                                      : int 4
##  $ lautstärke                                      : int 4
##  $ rel. Luftfeuchte 1                              : int 4
##  $ relative Luftfeuchtigkeit                       : int 4
##  $ Air pressure                                    : int 3
##  $ Batterie                                        : int 3
##  $ Battery                                         : int 3
##  $ DS18B20_Probe01                                 : int 3
##  $ DS18B20_Probe02                                 : int 3
##  $ DS18B20_Probe03                                 : int 3
##  $ DS18B20_Probe04                                 : int 3
##  $ DS18B20_Probe05                                 : int 3
##  $ Licht (digital)                                 : int 3
##  $ Luftdruck (BME280)                              : int 3
##  $ PM 10                                           : int 3
##  $ PM 2.5                                          : int 3
##  $ Temp                                            : int 3
##  $ Temperatur (BME280)                             : int 3
##  $ Temperatur HDC1008                              : int 3
##  $ Temperatura                                     : int 3
##  $ Temperature 2                                   : int 3
##  $ UV-Index                                        : int 3
##  $ Valoisuus                                       : int 3
##  $ Wind Gust                                       : int 3
##  $ pressure                                        : int 3
##  $ rel. Luftfeuchte DHT22                          : int 3
##  $ 1                                               : int 2
##  $ 10                                              : int 2
##  $ 2                                               : int 2
##  $ 3                                               : int 2
##  $ 4                                               : int 2
##  $ 5                                               : int 2
##  $ 6                                               : int 2
##  $ 7                                               : int 2
##  $ 8                                               : int 2
##  $ 9                                               : int 2
##  $ Air Pressure                                    : int 2
##  $ Anderer                                         : int 2
##  $ Battery voltage                                 : int 2
##  $ CO2                                             : int 2
##  $ Feuchte                                         : int 2
##  $ Illuminance                                     : int 2
##  $ Intensity                                       : int 2
##  $ Leitfähigkeit                                   : int 2
##  $ Lichtintensität                                 : int 2
##  $ Luftdruck BMP180                                : int 2
##  $ Luftfeuchte (BME280)                            : int 2
##  $ Luftqualität                                    : int 2
##  $ Lufttemperatur                                  : int 2
##  $ PM25                                            : int 2
##  $ Radioactivity                                   : int 2
##  $ Radioaktivität                                  : int 2
##  $ Regen                                           : int 2
##  $ Relative Humidity                               : int 2
##  $ Sound                                           : int 2
##  $ Temperatur (DHT22)                              : int 2
##  $ Temperatur BMP180                               : int 2
##   [list output truncated]

Thats quite some noise there, with many phenomena being measured by a single sensor only, or many duplicated phenomena due to slightly different spellings. We should clean that up, but for now let's just filter out the noise and find those phenomena with high sensor numbers:

phenoms[phenoms > 20]
## $Temperatur
## [1] 1607
## 
## $`rel. Luftfeuchte`
## [1] 1421
## 
## $PM10
## [1] 1200
## 
## $PM2.5
## [1] 1198
## 
## $Luftdruck
## [1] 824
## 
## $Beleuchtungsstärke
## [1] 480
## 
## $`UV-Intensität`
## [1] 471
## 
## $Luftfeuchtigkeit
## [1] 84
## 
## $Temperature
## [1] 49
## 
## $Humidity
## [1] 42
## 
## $Helligkeit
## [1] 25
## 
## $Lautstärke
## [1] 21

Alright, temperature it is! Fine particulate matter (PM2.5) seems to be more interesting to analyze though. We should check how many sensor stations provide useful data: We want only those boxes with a PM2.5 sensor, that are placed outdoors and are currently submitting measurements:

pm25_sensors = osem_boxes(
  exposure = 'outdoor',
  date = Sys.time(), # ±4 hours
  phenomenon = 'PM2.5'
)
summary(pm25_sensors)
## boxes total: 788
## 
## boxes by exposure:
## outdoor 
##     788 
## 
## boxes by model:
##                   custom    homeEthernetFeinstaub                 homeWifi 
##                       28                       37                        6 
##        homeWifiFeinstaub luftdaten_pms1003_bme280 luftdaten_pms5003_bme280 
##                       57                        1                        2 
## luftdaten_pms7003_bme280         luftdaten_sds011  luftdaten_sds011_bme280 
##                        2                       33                      135 
##  luftdaten_sds011_bmp180   luftdaten_sds011_dht11   luftdaten_sds011_dht22 
##                       14                       31                      442 
## 
## $last_measurement_within
##    1h    1d   30d  365d never 
##   764   777   780   785     3 
## 
## oldest box: 2016-06-02 12:09:47 (BalkonBox Mindener Str.)
## newest box: 2018-05-24 20:29:50 (Stadthalle)
## 
## sensors per box:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   4.000   4.615   5.000  12.000
plot(pm25_sensors)

plot of chunk unnamed-chunk-7

Thats still more than 200 measuring stations, we can work with that.

Analyzing sensor data

Having analyzed the available data sources, let's finally get some measurements. We could call osem_measurements(pm25_sensors) now, however we are focussing on a restricted area of interest, the city of Berlin. Luckily we can get the measurements filtered by a bounding box:

library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.4, proj.4 4.9.3
library(units)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:rgeos':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# construct a bounding box: 12 kilometers around Berlin
berlin = st_point(c(13.4034, 52.5120)) %>%
  st_sfc(crs = 4326) %>%
  st_transform(3857) %>% # allow setting a buffer in meters
  st_buffer(set_units(12, km)) %>%
  st_transform(4326) %>% # the opensensemap expects WGS 84
  st_bbox()
pm25 = osem_measurements(
  berlin,
  phenomenon = 'PM2.5',
  from = now() - days(20), # defaults to 2 days
  to = now()
)

plot(pm25)

plot of chunk unnamed-chunk-9

Now we can get started with actual spatiotemporal data analysis. First, lets mask the seemingly uncalibrated sensors:

outliers = filter(pm25, value > 100)$sensorId
bad_sensors = outliers[, drop = T] %>% levels()

pm25 = mutate(pm25, invalid = sensorId %in% bad_sensors)

Then plot the measuring locations, flagging the outliers:

st_as_sf(pm25) %>% st_geometry() %>% plot(col = factor(pm25$invalid), axes = T)

plot of chunk unnamed-chunk-11

Removing these sensors yields a nicer time series plot:

pm25 %>% filter(invalid == FALSE) %>% plot()

plot of chunk unnamed-chunk-12

Further analysis: comparison with LANUV data TODO