This vignette serves as an example on data wrangling & visualization with
opensensmapr
,dplyr
andggplot2
.
# required packages:
library(opensensmapr) # data download
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(lubridate) # date arithmetic
library(zoo) # rollmean()
openSenseMap.org has grown quite a bit in the last years; it would be interesting to see how we got to the current 11448 sensor stations, split up by various attributes of the boxes.
While opensensmapr
provides extensive methods of
filtering boxes by attributes on the server, we do the filtering within
R to save time and gain flexibility.
So the first step is to retrieve all the boxes.
# if you want to see results for a specific subset of boxes,
# just specify a filter such as grouptag='ifgi' here
# boxes = osem_boxes(cache = '.')
boxes = readRDS('boxes_precomputed.rds') # read precomputed file to save resources
In the following we just want to have a look at the boxes created in 2022, so we filter for them.
boxes = filter(boxes, locationtimestamp >= "2022-01-01" & locationtimestamp <="2022-12-31")
summary(boxes) -> summary.data.frame
## boxes total: 2107
##
## boxes by exposure:
## indoor mobile outdoor unknown
## 524 200 1382 1
##
## boxes by model:
## custom hackair_home_v2 homeEthernet
## 931 5 5
## homeEthernetFeinstaub homeV2Ethernet homeV2EthernetFeinstaub
## 2 5 5
## homeV2Lora homeV2Wifi homeV2WifiFeinstaub
## 58 220 114
## homeWifi homeWifiFeinstaub luftdaten_pms1003
## 14 17 0
## luftdaten_pms1003_bme280 luftdaten_pms3003 luftdaten_pms3003_bme280
## 1 0 0
## luftdaten_pms5003 luftdaten_pms5003_bme280 luftdaten_pms7003
## 2 8 0
## luftdaten_pms7003_bme280 luftdaten_sds011 luftdaten_sds011_bme280
## 8 28 462
## luftdaten_sds011_bmp180 luftdaten_sds011_dht11 luftdaten_sds011_dht22
## 27 7 188
##
## $last_measurement_within
## 1h 1d 30d 365d never
## 0 0 851 1542 497
##
## oldest box: 2020-02-29 23:00:31 (Kirchardt 1)
## newest box: 2022-12-30 09:19:46 (Balkon)
##
## sensors per box:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.978 6.000 29.000
Another feature of interest is the spatial distribution of the boxes:
plot()
can help us out here. This function requires a bunch
of optional dependencies though.
plot(boxes)
But what do these sensor stations actually measure? Lets find out.
osem_phenomena()
gives us a named list of of the counts of
each observed phenomenon for the given set of sensor stations:
phenoms = osem_phenomena(boxes)
str(phenoms)
## List of 966
## $ Temperatur : int 1520
## $ rel. Luftfeuchte : int 1289
## $ PM10 : int 1069
## $ PM2.5 : int 1068
## $ Luftdruck : int 994
## $ Beleuchtungsstärke : int 282
## $ UV-Intensität : int 282
## $ VOC : int 224
## $ Lufttemperatur : int 208
## $ CO₂ : int 178
## $ Bodenfeuchte : int 170
## $ Temperature : int 167
## $ Luftfeuchte : int 134
## $ Lautstärke : int 131
## $ Humidity : int 123
## $ atm. Luftdruck : int 114
## $ Kalibrierungswert : int 108
## $ CO2eq : int 107
## $ IAQ : int 107
## $ Temperatur SCD30 : int 107
## $ rel. Luftfeuchte SCD30 : int 107
## $ Pressure : int 97
## $ Luftfeuchtigkeit : int 62
## $ PM01 : int 57
## $ Bodentemperatur : int 55
## $ Windgeschwindigkeit : int 46
## $ Feinstaub PM10 : int 41
## $ Feinstaub PM2.5 : int 32
## $ Feinstaub PM1.0 : int 28
## $ Batterie : int 22
## $ Taupunkt : int 19
## $ Windrichtung : int 19
## $ rel. Luftfeuchte (HECA) : int 16
## $ Temperatur (HECA) : int 15
## $ Temperatura : int 15
## $ Durchschnitt Umgebungslautstärke : int 14
## $ Helligkeit : int 14
## $ Latitude : int 14
## $ Longtitude : int 14
## $ Minimum Umgebungslautstärke : int 14
## $ RSSI : int 14
## $ UV-Index : int 11
## $ CO2 : int 10
## $ PM1 : int 10
## $ rel-. Luftfeuchte : int 10
## $ temperature : int 10
## $ Abstand nach links : int 9
## $ Abstand nach rechts : int 9
## $ Beschleunigung X-Achse : int 9
## $ Beschleunigung Y-Achse : int 9
## $ Beschleunigung Z-Achse : int 9
## $ Feinstaub PM25 : int 9
## $ Luftdruck absolut : int 9
## $ Luftdruck relativ : int 9
## $ Regenrate : int 9
## $ Sonnenstrahlung : int 9
## $ gefühlte Temperatur : int 9
## $ Geschwindigkeit : int 8
## $ humidity : int 8
## $ Bodenfeuchtigkeit : int 7
## $ Baterie : int 6
## $ Bodenfeuchte 10cm : int 6
## $ Bodenfeuchte 30cm : int 6
## $ Bodentemperatur 10cm : int 6
## $ Bodentemperatur 30cm : int 6
## $ Lumina : int 6
## $ Taupunktdifferenz (Spread) : int 6
## $ UV : int 6
## $ Umiditate : int 6
## $ Air pressure : int 5
## $ Battery : int 5
## $ PM4 : int 5
## $ Pegel : int 5
## $ Prezenta-foc : int 5
## $ Regenmenge : int 5
## $ W-LAN : int 5
## $ Bodenfeuchte 1 : int 4
## $ Bodenfeuchte 2 : int 4
## $ Bodentemperatur 1 : int 4
## $ Bodentemperatur 2 : int 4
## $ CO2 Konzentration : int 4
## $ Gesamt Gewicht : int 4
## $ PM 2.5 : int 4
## $ Température : int 4
## $ Wilgotność : int 4
## $ Windspeed : int 4
## $ absolute Luftfeuchtigkeit : int 4
## $ Battery Voltage : int 3
## $ Befehl : int 3
## $ Bodenfeuchte 50cm : int 3
## $ Bodentemperatur 60cm : int 3
## $ Ciśnienie : int 3
## $ Feinstaub : int 3
## $ Humedad : int 3
## $ Höhe : int 3
## $ Laufzeit : int 3
## $ LuftfeuchteBME : int 3
## $ Luftfeuchtigkeit_01 : int 3
## $ LufttemperaturBME : int 3
## [list output truncated]
Thats quite some noise there, with many phenomena being measured by a single sensor only, or many duplicated phenomena due to slightly different spellings. We should clean that up, but for now let’s just filter out the noise and find those phenomena with high sensor numbers:
phenoms[phenoms > 50]
## $Temperatur
## [1] 1520
##
## $`rel. Luftfeuchte`
## [1] 1289
##
## $PM10
## [1] 1069
##
## $PM2.5
## [1] 1068
##
## $Luftdruck
## [1] 994
##
## $Beleuchtungsstärke
## [1] 282
##
## $`UV-Intensität`
## [1] 282
##
## $VOC
## [1] 224
##
## $Lufttemperatur
## [1] 208
##
## $`CO₂`
## [1] 178
##
## $Bodenfeuchte
## [1] 170
##
## $Temperature
## [1] 167
##
## $Luftfeuchte
## [1] 134
##
## $Lautstärke
## [1] 131
##
## $Humidity
## [1] 123
##
## $`atm. Luftdruck`
## [1] 114
##
## $Kalibrierungswert
## [1] 108
##
## $CO2eq
## [1] 107
##
## $IAQ
## [1] 107
##
## $`Temperatur SCD30`
## [1] 107
##
## $`rel. Luftfeuchte SCD30`
## [1] 107
##
## $Pressure
## [1] 97
##
## $Luftfeuchtigkeit
## [1] 62
##
## $PM01
## [1] 57
##
## $Bodentemperatur
## [1] 55
By looking at the createdAt
attribute of each box we
know the exact time a box was registered. Because of some database
migration issues the createdAt
values are mostly wrong
(~80% of boxes created 2022-03-30), so we are using the
timestamp
attribute of the currentlocation
which should in most cases correspond to the creation date.
With this approach we have no information about boxes that were deleted in the meantime, but that’s okay for now.
exposure_counts = boxes %>%
group_by(exposure) %>%
mutate(count = row_number(locationtimestamp))
exposure_colors = c(indoor = 'red', outdoor = 'lightgreen', mobile = 'blue', unknown = 'darkgrey')
ggplot(exposure_counts, aes(x = locationtimestamp, y = count, colour = exposure)) +
geom_line() +
scale_colour_manual(values = exposure_colors) +
xlab('Registration Date') + ylab('senseBox count')
Outdoor boxes are growing fast! We can also see the
introduction of mobile
sensor “stations” in 2017.
Let’s have a quick summary:
exposure_counts %>%
summarise(
oldest = min(locationtimestamp),
newest = max(locationtimestamp),
count = max(count)
) %>%
arrange(desc(count))
exposure | oldest | newest | count |
---|---|---|---|
outdoor | 2022-01-01 11:59:16 | 2022-12-30 09:19:46 | 1382 |
indoor | 2022-01-02 11:06:08 | 2022-12-23 20:46:45 | 524 |
mobile | 2022-01-06 13:20:00 | 2022-12-21 21:35:16 | 200 |
unknown | 2022-03-01 07:04:30 | 2022-03-01 07:04:30 | 1 |
We can try to find out where the increases in growth came from, by analysing the box count by grouptag.
Caveats: Only a small subset of boxes has a grouptag, and we should
assume that these groups are actually bigger. Also, we can see that
grouptag naming is inconsistent (Luftdaten
,
luftdaten.info
, …)
grouptag_counts = boxes %>%
group_by(grouptag) %>%
# only include grouptags with 15 or more members
filter(length(grouptag) >= 15 & !is.na(grouptag) & grouptag != '') %>%
mutate(count = row_number(locationtimestamp))
# helper for sorting the grouptags by boxcount
sortLvls = function(oldFactor, ascending = TRUE) {
lvls = table(oldFactor) %>% sort(., decreasing = !ascending) %>% names()
factor(oldFactor, levels = lvls)
}
grouptag_counts$grouptag = sortLvls(grouptag_counts$grouptag, ascending = FALSE)
ggplot(grouptag_counts, aes(x = locationtimestamp, y = count, colour = grouptag)) +
geom_line(aes(group = grouptag)) +
xlab('Registration Date') + ylab('senseBox count')
grouptag_counts %>%
summarise(
oldest = min(locationtimestamp),
newest = max(locationtimestamp),
count = max(count)
) %>%
arrange(desc(count))
grouptag | oldest | newest | count |
---|---|---|---|
edu | 2022-01-02 11:06:08 | 2022-12-18 12:38:27 | 127 |
HU Explorers | 2022-04-01 09:07:41 | 2022-12-14 10:11:34 | 123 |
321heiss | 2022-07-09 01:29:37 | 2022-09-01 06:27:35 | 91 |
Captographies | 2022-06-03 11:25:27 | 2022-11-16 13:26:39 | 57 |
SUGUCS | 2022-11-30 15:25:32 | 2022-12-03 10:11:41 | 23 |
SekSeeland | 2022-03-14 13:17:17 | 2022-03-22 20:23:58 | 19 |
BurgerMeetnet | 2022-01-24 15:33:19 | 2022-05-10 21:22:35 | 16 |
AGIN | 2022-11-28 17:33:12 | 2022-11-28 17:42:18 | 15 |
BRGL | 2022-11-06 19:23:43 | 2022-11-06 22:08:36 | 15 |
BRGW | 2022-11-02 10:28:52 | 2022-11-02 13:32:12 | 15 |
Burgermeetnet | 2022-01-15 20:43:16 | 2022-02-11 17:59:05 | 15 |
HTLJ | 2022-11-21 22:04:17 | 2022-11-21 22:05:47 | 15 |
Mikroprojekt Mitmachklima | 2022-02-09 10:28:40 | 2022-08-23 13:14:11 | 15 |
MSGB | 2022-11-14 09:08:57 | 2022-11-14 10:19:24 | 15 |
MSHO | 2022-12-20 09:28:40 | 2022-12-20 10:01:38 | 15 |
MSIN | 2022-11-21 17:02:39 | 2022-11-21 23:06:22 | 15 |
First we group the boxes by locationtimestamp
into bins
of one week:
bins = 'week'
mvavg_bins = 6
growth = boxes %>%
mutate(week = cut(as.Date(locationtimestamp), breaks = bins)) %>%
group_by(week) %>%
summarize(count = length(week)) %>%
mutate(event = 'registered')
We can do the same for updatedAt
, which informs us about
the last change to a box, including uploaded measurements. As a lot of
boxes were “updated” by the database migration, many of them are updated
at 2022-03-30, so we try to use the lastMeasurement
attribute instead of updatedAt
. This leads to fewer boxes
but also automatically excludes boxes which were created but never made
a measurement.
This method of determining inactive boxes is fairly inaccurate and should be considered an approximation, because we have no information about intermediate inactive phases. Also deleted boxes would probably have a big impact here.
inactive = boxes %>%
# remove boxes that were updated in the last two days,
# b/c any box becomes inactive at some point by definition of updatedAt
filter(lastMeasurement < now() - days(2)) %>%
mutate(week = cut(as.Date(lastMeasurement), breaks = bins)) %>%
filter(as.Date(week) > as.Date("2021-12-31")) %>%
group_by(week) %>%
summarize(count = length(week)) %>%
mutate(event = 'inactive')
Now we can combine both datasets for plotting:
boxes_by_date = bind_rows(growth, inactive) %>% group_by(event)
ggplot(boxes_by_date, aes(x = as.Date(week), colour = event)) +
xlab('Time') + ylab(paste('rate per ', bins)) +
scale_x_date(date_breaks="years", date_labels="%Y") +
scale_colour_manual(values = c(registered = 'lightgreen', inactive = 'grey')) +
geom_point(aes(y = count), size = 0.5) +
# moving average, make first and last value NA (to ensure identical length of vectors)
geom_line(aes(y = rollmean(count, mvavg_bins, fill = list(NA, NULL, NA))))
And see in which weeks the most boxes become (in)active:
boxes_by_date %>%
filter(count > 50) %>%
arrange(desc(count))
week | count | event |
---|---|---|
2023-02-27 | 769 | inactive |
2022-11-21 | 92 | registered |
2022-06-06 | 76 | registered |
2022-08-29 | 76 | registered |
2022-10-31 | 71 | registered |
2022-11-14 | 67 | registered |
2022-11-28 | 65 | registered |
2022-08-22 | 61 | registered |
2022-02-28 | 57 | registered |
2022-08-29 | 55 | inactive |
2022-03-21 | 54 | registered |
2022-12-12 | 54 | registered |
2022-01-24 | 51 | registered |
While we are looking at locationtimestamp
and
lastMeasurement
, we can also extract the duration of
activity of each box, and look at metrics by exposure and grouptag once
more:
durations = boxes %>%
group_by(exposure) %>%
filter(!is.na(lastMeasurement)) %>%
mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
filter(duration >= 0)
ggplot(durations, aes(x = exposure, y = duration)) +
geom_boxplot() +
coord_flip() + ylab('Duration active in Days')
The time of activity averages at only 148 days, though there are boxes with 423 days of activity, spanning a large chunk of openSenseMap’s existence.
durations = boxes %>%
filter(!is.na(lastMeasurement)) %>%
group_by(grouptag) %>%
# only include grouptags with 20 or more members
filter(length(grouptag) >= 15 & !is.na(grouptag) & !is.na(lastMeasurement)) %>%
mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
filter(duration >= 0)
ggplot(durations, aes(x = grouptag, y = duration)) +
geom_boxplot() +
coord_flip() + ylab('Duration active in Days')
durations %>%
summarize(
duration_avg = round(mean(duration)),
duration_min = round(min(duration)),
duration_max = round(max(duration)),
oldest_box = round(max(difftime(now(), locationtimestamp, units='days')))
) %>%
arrange(desc(duration_avg))
grouptag | duration_avg | duration_min | duration_max | oldest_box |
---|---|---|---|---|
Burgermeetnet | 238 days | 0 days | 409 days | 417 days |
BurgerMeetnet | 171 days | 0 days | 400 days | 408 days |
Captographies | 125 days | 0 days | 266 days | 276 days |
BRGL | 113 days | 109 days | 114 days | 122 days |
MSGB | 92 days | 39 days | 106 days | 114 days |
AGIN | 91 days | 87 days | 92 days | 100 days |
HTLJ | 91 days | 58 days | 99 days | 107 days |
edu | 89 days | 0 days | 413 days | 421 days |
MSHO | 61 days | 36 days | 70 days | 78 days |
HU Explorers | 44 days | 0 days | 319 days | 341 days |
321heiss | 0 days | 0 days | 0 days | 243 days |
The time of activity averages at only 72 days, though there are boxes with 413 days of activity, spanning a large chunk of openSenseMap’s existence.
This is less useful, as older boxes are active for a longer time by definition. If you have an idea how to compensate for that, please send a Pull Request!
# NOTE: boxes older than 2016 missing due to missing updatedAt in database
duration = boxes %>%
mutate(year = cut(as.Date(locationtimestamp), breaks = 'year')) %>%
group_by(year) %>%
filter(!is.na(lastMeasurement)) %>%
mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
filter(duration >= 0)
ggplot(duration, aes(x = substr(as.character(year), 0, 4), y = duration)) +
geom_boxplot() +
coord_flip() + ylab('Duration active in Days') + xlab('Year of Registration')
Other visualisations come to mind, and are left as an exercise to the reader. If you implemented some, feel free to add them to this vignette via a Pull Request.