This vignette serves as an example on data wrangling & visualization with opensensmapr, dplyr and ggplot2.

# required packages:
library(opensensmapr) # data download
library(dplyr)        # data wrangling
library(ggplot2)      # plotting
library(lubridate)    # date arithmetic
library(zoo)          # rollmean()

openSenseMap.org has grown quite a bit in the last years; it would be interesting to see how we got to the current 11367 sensor stations, split up by various attributes of the boxes.

While opensensmapr provides extensive methods of filtering boxes by attributes on the server, we do the filtering within R to save time and gain flexibility.

So the first step is to retrieve all the boxes.

# if you want to see results for a specific subset of boxes,
# just specify a filter such as grouptag='ifgi' here
boxes_all = osem_boxes()
boxes = boxes_all

Introduction

In the following we just want to have a look at the boxes created in 2022, so we filter for them.

boxes = filter(boxes, locationtimestamp >= "2022-01-01" & locationtimestamp <="2022-12-31")
summary(boxes) -> summary.data.frame

## boxes total: 2108
## 
## boxes by exposure:
##  indoor  mobile outdoor unknown 
##     524     200    1383       1 
## 
## boxes by model:
##                   custom          hackair_home_v2             homeEthernet 
##                      932                        5                        5 
##    homeEthernetFeinstaub           homeV2Ethernet  homeV2EthernetFeinstaub 
##                        2                        5                        5 
##               homeV2Lora               homeV2Wifi      homeV2WifiFeinstaub 
##                       58                      220                      114 
##                 homeWifi        homeWifiFeinstaub        luftdaten_pms1003 
##                       14                       17                        0 
## luftdaten_pms1003_bme280        luftdaten_pms3003 luftdaten_pms3003_bme280 
##                        1                        0                        0 
##        luftdaten_pms5003 luftdaten_pms5003_bme280        luftdaten_pms7003 
##                        2                        8                        0 
## luftdaten_pms7003_bme280         luftdaten_sds011  luftdaten_sds011_bme280 
##                        8                       28                      462 
##  luftdaten_sds011_bmp180   luftdaten_sds011_dht11   luftdaten_sds011_dht22 
##                       27                        7                      188 
## 
## $last_measurement_within
##    1h    1d   30d  365d never 
##   708   758   889  1559   498 
## 
## oldest box: 2020-02-29 23:00:31 (Kirchardt 1)
## newest box: 2022-12-30 09:19:46 (Balkon)
## 
## sensors per box:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.978   6.000  29.000

Another feature of interest is the spatial distribution of the boxes: plot() can help us out here. This function requires a bunch of optional dependencies though.

if (!require('maps'))     install.packages('maps')
if (!require('maptools')) install.packages('maptools')
if (!require('rgeos'))    install.packages('rgeos')

plot(boxes)

But what do these sensor stations actually measure? Lets find out. osem_phenomena() gives us a named list of of the counts of each observed phenomenon for the given set of sensor stations:

phenoms = osem_phenomena(boxes)
str(phenoms)

## List of 966
##  $ Temperatur                             : int 1520
##  $ rel. Luftfeuchte                       : int 1289
##  $ PM10                                   : int 1070
##  $ PM2.5                                  : int 1069
##  $ Luftdruck                              : int 994
##  $ Beleuchtungsstärke                     : int 282
##  $ UV-Intensität                          : int 282
##  $ VOC                                    : int 224
##  $ Lufttemperatur                         : int 208
##  $ CO₂                                    : int 178
##  $ Bodenfeuchte                           : int 170
##  $ Temperature                            : int 168
##  $ Luftfeuchte                            : int 134
##  $ Lautstärke                             : int 131
##  $ Humidity                               : int 124
##  $ atm. Luftdruck                         : int 114
##  $ Kalibrierungswert                      : int 108
##  $ CO2eq                                  : int 107
##  $ IAQ                                    : int 107
##  $ Temperatur SCD30                       : int 107
##  $ rel. Luftfeuchte SCD30                 : int 107
##  $ Pressure                               : int 98
##  $ Luftfeuchtigkeit                       : int 62
##  $ PM01                                   : int 58
##  $ Bodentemperatur                        : int 55
##  $ Windgeschwindigkeit                    : int 46
##  $ Feinstaub PM10                         : int 41
##  $ Feinstaub PM2.5                        : int 32
##  $ Feinstaub PM1.0                        : int 28
##  $ Batterie                               : int 22
##  $ Taupunkt                               : int 19
##  $ Windrichtung                           : int 19
##  $ rel. Luftfeuchte (HECA)                : int 16
##  $ Temperatur (HECA)                      : int 15
##  $ Temperatura                            : int 15
##  $ Durchschnitt Umgebungslautstärke       : int 14
##  $ Helligkeit                             : int 14
##  $ Latitude                               : int 14
##  $ Longtitude                             : int 14
##  $ Minimum Umgebungslautstärke            : int 14
##  $ RSSI                                   : int 14
##  $ UV-Index                               : int 11
##  $ CO2                                    : int 10
##  $ PM1                                    : int 10
##  $ rel-. Luftfeuchte                      : int 10
##  $ temperature                            : int 10
##  $ Abstand nach links                     : int 9
##  $ Abstand nach rechts                    : int 9
##  $ Beschleunigung X-Achse                 : int 9
##  $ Beschleunigung Y-Achse                 : int 9
##  $ Beschleunigung Z-Achse                 : int 9
##  $ Feinstaub PM25                         : int 9
##  $ Luftdruck absolut                      : int 9
##  $ Luftdruck relativ                      : int 9
##  $ Regenrate                              : int 9
##  $ Sonnenstrahlung                        : int 9
##  $ gefühlte Temperatur                    : int 9
##  $ Geschwindigkeit                        : int 8
##  $ humidity                               : int 8
##  $ Bodenfeuchtigkeit                      : int 7
##  $ Baterie                                : int 6
##  $ Bodenfeuchte 10cm                      : int 6
##  $ Bodenfeuchte 30cm                      : int 6
##  $ Bodentemperatur 10cm                   : int 6
##  $ Bodentemperatur 30cm                   : int 6
##  $ Lumina                                 : int 6
##  $ Taupunktdifferenz (Spread)             : int 6
##  $ UV                                     : int 6
##  $ Umiditate                              : int 6
##  $ Air pressure                           : int 5
##  $ Battery                                : int 5
##  $ PM4                                    : int 5
##  $ Pegel                                  : int 5
##  $ Prezenta-foc                           : int 5
##  $ Regenmenge                             : int 5
##  $ W-LAN                                  : int 5
##  $ Bodenfeuchte 1                         : int 4
##  $ Bodenfeuchte 2                         : int 4
##  $ Bodentemperatur 1                      : int 4
##  $ Bodentemperatur 2                      : int 4
##  $ CO2 Konzentration                      : int 4
##  $ Gesamt Gewicht                         : int 4
##  $ PM 2.5                                 : int 4
##  $ Température                            : int 4
##  $ Wilgotność                             : int 4
##  $ Windspeed                              : int 4
##  $ absolute Luftfeuchtigkeit              : int 4
##  $ Battery Voltage                        : int 3
##  $ Befehl                                 : int 3
##  $ Bodenfeuchte 50cm                      : int 3
##  $ Bodentemperatur 60cm                   : int 3
##  $ Ciśnienie                              : int 3
##  $ Feinstaub                              : int 3
##  $ Humedad                                : int 3
##  $ Höhe                                   : int 3
##  $ Laufzeit                               : int 3
##  $ LuftfeuchteBME                         : int 3
##  $ Luftfeuchtigkeit_01                    : int 3
##  $ LufttemperaturBME                      : int 3
##   [list output truncated]

Thats quite some noise there, with many phenomena being measured by a single sensor only, or many duplicated phenomena due to slightly different spellings. We should clean that up, but for now let’s just filter out the noise and find those phenomena with high sensor numbers:

phenoms[phenoms > 50]

## $Temperatur
## [1] 1520
## 
## $`rel. Luftfeuchte`
## [1] 1289
## 
## $PM10
## [1] 1070
## 
## $PM2.5
## [1] 1069
## 
## $Luftdruck
## [1] 994
## 
## $Beleuchtungsstärke
## [1] 282
## 
## $`UV-Intensität`
## [1] 282
## 
## $VOC
## [1] 224
## 
## $Lufttemperatur
## [1] 208
## 
## $`CO₂`
## [1] 178
## 
## $Bodenfeuchte
## [1] 170
## 
## $Temperature
## [1] 168
## 
## $Luftfeuchte
## [1] 134
## 
## $Lautstärke
## [1] 131
## 
## $Humidity
## [1] 124
## 
## $`atm. Luftdruck`
## [1] 114
## 
## $Kalibrierungswert
## [1] 108
## 
## $CO2eq
## [1] 107
## 
## $IAQ
## [1] 107
## 
## $`Temperatur SCD30`
## [1] 107
## 
## $`rel. Luftfeuchte SCD30`
## [1] 107
## 
## $Pressure
## [1] 98
## 
## $Luftfeuchtigkeit
## [1] 62
## 
## $PM01
## [1] 58
## 
## $Bodentemperatur
## [1] 55

Plot count of boxes by time

By looking at the createdAt attribute of each box we know the exact time a box was registered. Because of some database migration issues the createdAt values are mostly wrong (~80% of boxes created 2022-03-30), so we are using the timestamp attribute of the currentlocation which should in most cases correspond to the creation date.

With this approach we have no information about boxes that were deleted in the meantime, but that’s okay for now.

…and exposure

exposure_counts = boxes %>%
  group_by(exposure) %>%
  mutate(count = row_number(locationtimestamp))

exposure_colors = c(indoor = 'red', outdoor = 'lightgreen', mobile = 'blue', unknown = 'darkgrey')
ggplot(exposure_counts, aes(x = locationtimestamp, y = count, colour = exposure)) +
  geom_line() +
  scale_colour_manual(values = exposure_colors) +
  xlab('Registration Date') + ylab('senseBox count')

Outdoor boxes are growing fast! We can also see the introduction of mobile sensor “stations” in 2017.

Let’s have a quick summary:

exposure_counts %>%
  summarise(
    oldest = min(locationtimestamp),
    newest = max(locationtimestamp),
    count = max(count)
  ) %>%
  arrange(desc(count))

exposure	oldest	newest	count
outdoor	2022-01-01 11:59:16	2022-12-30 09:19:46	1383
indoor	2022-01-02 11:06:08	2022-12-23 20:46:45	524
mobile	2022-01-06 13:20:00	2022-12-21 21:35:16	200
unknown	2022-03-01 07:04:30	2022-03-01 07:04:30	1

…and grouptag

We can try to find out where the increases in growth came from, by analysing the box count by grouptag.

Caveats: Only a small subset of boxes has a grouptag, and we should assume that these groups are actually bigger. Also, we can see that grouptag naming is inconsistent (Luftdaten, luftdaten.info, …)

grouptag_counts = boxes %>%
  group_by(grouptag) %>%
  # only include grouptags with 15 or more members
  filter(length(grouptag) >= 15 & !is.na(grouptag) & grouptag != '') %>%
  mutate(count = row_number(locationtimestamp))

# helper for sorting the grouptags by boxcount
sortLvls = function(oldFactor, ascending = TRUE) {
  lvls = table(oldFactor) %>% sort(., decreasing = !ascending) %>% names()
  factor(oldFactor, levels = lvls)
}
grouptag_counts$grouptag = sortLvls(grouptag_counts$grouptag, ascending = FALSE)

ggplot(grouptag_counts, aes(x = locationtimestamp, y = count, colour = grouptag)) +
  geom_line(aes(group = grouptag)) +
  xlab('Registration Date') + ylab('senseBox count')

grouptag_counts %>%
  summarise(
    oldest = min(locationtimestamp),
    newest = max(locationtimestamp),
    count = max(count)
  ) %>%
  arrange(desc(count))

grouptag	oldest	newest	count
edu	2022-01-02 11:06:08	2022-12-18 12:38:27	127
HU Explorers	2022-04-01 09:07:41	2022-12-14 10:11:34	123
321heiss	2022-07-09 01:29:37	2022-09-01 06:27:35	91
Captographies	2022-06-03 11:25:27	2022-11-16 13:26:39	58
SUGUCS	2022-11-30 15:25:32	2022-12-03 10:11:41	23
SekSeeland	2022-03-14 13:17:17	2022-03-22 20:23:58	19
BurgerMeetnet	2022-01-24 15:33:19	2022-05-10 21:22:35	16
AGIN	2022-11-28 17:33:12	2022-11-28 17:42:18	15
BRGL	2022-11-06 19:23:43	2022-11-06 22:08:36	15
BRGW	2022-11-02 10:28:52	2022-11-02 13:32:12	15
Burgermeetnet	2022-01-15 20:43:16	2022-02-11 17:59:05	15
HTLJ	2022-11-21 22:04:17	2022-11-21 22:05:47	15
MSGB	2022-11-14 09:08:57	2022-11-14 10:19:24	15
MSHO	2022-12-20 09:28:40	2022-12-20 10:01:38	15
MSIN	2022-11-21 17:02:39	2022-11-21 23:06:22	15
Mikroprojekt Mitmachklima	2022-02-09 10:28:40	2022-08-23 13:14:11	15

Plot rate of growth and inactivity per week

First we group the boxes by locationtimestamp into bins of one week:

bins = 'week'
mvavg_bins = 6

growth = boxes %>%
  mutate(week = cut(as.Date(locationtimestamp), breaks = bins)) %>%
  group_by(week) %>%
  summarize(count = length(week)) %>%
  mutate(event = 'registered')

We can do the same for updatedAt, which informs us about the last change to a box, including uploaded measurements. As a lot of boxes were “updated” by the database migration, many of them are updated at 2022-03-30, so we try to use the lastMeasurement attribute instead of updatedAt. This leads to fewer boxes but also automatically excludes boxes which were created but never made a measurement.

This method of determining inactive boxes is fairly inaccurate and should be considered an approximation, because we have no information about intermediate inactive phases. Also deleted boxes would probably have a big impact here.

inactive = boxes %>%
  # remove boxes that were updated in the last two days,
  # b/c any box becomes inactive at some point by definition of updatedAt
  filter(lastMeasurement < now() - days(2)) %>%
  mutate(week = cut(as.Date(lastMeasurement), breaks = bins)) %>%
  filter(as.Date(week) > as.Date("2021-12-31")) %>%
  group_by(week) %>%
  summarize(count = length(week)) %>%
  mutate(event = 'inactive')

Now we can combine both datasets for plotting:

boxes_by_date = bind_rows(growth, inactive) %>% group_by(event)

ggplot(boxes_by_date, aes(x = as.Date(week), colour = event)) +
  xlab('Time') + ylab(paste('rate per ', bins)) +
  scale_x_date(date_breaks="years", date_labels="%Y") +
  scale_colour_manual(values = c(registered = 'lightgreen', inactive = 'grey')) +
  geom_point(aes(y = count), size = 0.5) +
  # moving average, make first and last value NA (to ensure identical length of vectors)
  geom_line(aes(y = rollmean(count, mvavg_bins, fill = list(NA, NULL, NA))))

And see in which weeks the most boxes become (in)active:

boxes_by_date %>%
  filter(count > 50) %>%
  arrange(desc(count))

week	count	event
2022-11-21	92	registered
2022-06-06	77	registered
2022-08-29	76	registered
2022-10-31	71	registered
2022-11-14	67	registered
2022-11-28	65	registered
2022-08-22	61	registered
2022-02-28	57	registered
2022-08-29	55	inactive
2022-03-21	54	registered
2022-12-12	54	registered
2022-01-24	51	registered

Plot duration of boxes being active

While we are looking at locationtimestamp and lastMeasurement, we can also extract the duration of activity of each box, and look at metrics by exposure and grouptag once more:

…by exposure

durations = boxes %>%
  group_by(exposure) %>%
  filter(!is.na(lastMeasurement)) %>%
  mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
  filter(duration >= 0)

ggplot(durations, aes(x = exposure, y = duration)) +
  geom_boxplot() +
  coord_flip() + ylab('Duration active in Days')

The time of activity averages at only 145 days, though there are boxes with 418 days of activity, spanning a large chunk of openSenseMap’s existence.

…by grouptag

durations = boxes %>%
  filter(!is.na(lastMeasurement)) %>%
  group_by(grouptag) %>%
  # only include grouptags with 20 or more members
  filter(length(grouptag) >= 15 & !is.na(grouptag) & !is.na(lastMeasurement)) %>%
  mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
  filter(duration >= 0)
  
ggplot(durations, aes(x = grouptag, y = duration)) +
  geom_boxplot() +
  coord_flip() + ylab('Duration active in Days')

durations %>%
  summarize(
    duration_avg = round(mean(duration)),
    duration_min = round(min(duration)),
    duration_max = round(max(duration)),
    oldest_box = round(max(difftime(now(), locationtimestamp, units='days')))
  ) %>%
  arrange(desc(duration_avg))

grouptag	duration_avg	duration_min	duration_max	oldest_box
Burgermeetnet	236 days	0 days	404 days	404 days
BurgerMeetnet	169 days	0 days	395 days	395 days
Captographies	122 days	0 days	261 days	263 days
BRGL	107 days	85 days	109 days	109 days
MSGB	87 days	39 days	101 days	101 days
edu	87 days	0 days	408 days	408 days
AGIN	86 days	86 days	86 days	87 days
HTLJ	84 days	58 days	94 days	94 days
MSHO	57 days	36 days	65 days	65 days
HU Explorers	44 days	0 days	319 days	328 days
321heiss	0 days	0 days	0 days	229 days

The time of activity averages at only 70 days, though there are boxes with 408 days of activity, spanning a large chunk of openSenseMap’s existence.

…by year of registration

This is less useful, as older boxes are active for a longer time by definition. If you have an idea how to compensate for that, please send a Pull Request!

# NOTE: boxes older than 2016 missing due to missing updatedAt in database
duration = boxes %>%
  mutate(year = cut(as.Date(locationtimestamp), breaks = 'year')) %>%
  group_by(year) %>%
  filter(!is.na(lastMeasurement)) %>%
  mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
  filter(duration >= 0)

ggplot(duration, aes(x = substr(as.character(year), 0, 4), y = duration)) +
  geom_boxplot() +
  coord_flip() + ylab('Duration active in Days') + xlab('Year of Registration')

More Visualisations

Other visualisations come to mind, and are left as an exercise to the reader. If you implemented some, feel free to add them to this vignette via a Pull Request.

growth by phenomenon
growth by location -> (interactive) map
set inactive rate in relation to total box count
filter timespans with big dips in growth rate, and extrapolate the amount of senseBoxes that could be on the platform today, assuming there were no production issues ;)

Visualising the Development of openSenseMap.org in 2022

Jan Stenkamp

2023-02-23