diff --git a/inst/doc/osem-history_revised.Rmd b/inst/doc/osem-history_revised.Rmd
new file mode 100644
index 0000000..1820d6c
--- /dev/null
+++ b/inst/doc/osem-history_revised.Rmd
@@ -0,0 +1,302 @@
+---
+title: "Visualising the Development of openSenseMap.org in 2022"
+author: "Jan Stenkamp"
+date: '`r Sys.Date()`'
+output:
+  html_document:
+    code_folding: hide
+    df_print: kable
+    theme: lumen
+    toc: yes
+    toc_float: yes
+  rmarkdown::html_vignette:
+    df_print: kable
+    fig_height: 5
+    fig_width: 7
+    toc: yes
+vignette: >
+  %\VignetteIndexEntry{Visualising the History of openSenseMap.org}
+  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+---
+
+> This vignette serves as an example of data wrangling & visualization with
+`opensensmapr`, `dplyr` and `ggplot2`.
+
+```{r setup, results='hide', message=FALSE, warning=FALSE}
+# required packages:
+# library(opensensmapr) # data download
+library(devtools)
+load_all(".")
+library(dplyr) # data wrangling
+library(ggplot2) # plotting
+library(lubridate) # date arithmetic
+library(zoo) # rollmean()
+```
+
+openSenseMap.org has grown quite a bit in recent years; it would be interesting
+to see how we got to the current `r osem_counts()$boxes` sensor stations,
+split up by various attributes of the boxes.
+
+While `opensensmapr` provides extensive methods of filtering boxes by attributes
+on the server, we do the filtering within R to save time and gain flexibility.
+
+
+So the first step is to retrieve *all the boxes*.
+
+```{r download, results='hide', message=FALSE, warning=FALSE}
+# if you want to see results for a specific subset of boxes,
+# just specify a filter such as grouptag='ifgi' here
+boxes_all = osem_boxes()
+boxes = boxes_all
+```
+# Introduction
+In the following we just want to have a look at the boxes created in 2022, so we filter for them.
+
+```{r}
+boxes = filter(boxes, locationtimestamp >= "2022-01-01" & locationtimestamp <="2022-12-31")
+summary(boxes) -> summary.data.frame
+```
+
+
+
+
+
+
+
+Another feature of interest is the spatial distribution of the boxes: `plot()`
+can help us out here. This function requires a bunch of optional dependencies though.
+
+```{r message=F, warning=F}
+if (!require('maps')) install.packages('maps')
+if (!require('maptools')) install.packages('maptools')
+if (!require('rgeos')) install.packages('rgeos')
+
+plot(boxes)
+```
+
+But what do these sensor stations actually measure? Let's find out.
+`osem_phenomena()` gives us a named list of the counts of each observed
+phenomenon for the given set of sensor stations:
+
+```{r}
+phenoms = osem_phenomena(boxes)
+str(phenoms)
+```
+
+That's quite a bit of noise, with many phenomena being measured by a single
+sensor only, or many duplicated phenomena due to slightly different spellings.
+We should clean that up, but for now let's just filter out the noise and find
+those phenomena with high sensor numbers:
+
+```{r}
+phenoms[phenoms > 50]
+```
+
+
+# Plot count of boxes by time {.tabset}
+In principle, the `createdAt` attribute of each box gives the exact time a box
+was registered. Because of some database migration issues, however, the `createdAt` values are mostly wrong (~80% of boxes created 2022-03-30), so we are using the `timestamp` attribute of the `currentlocation`, which should in most cases correspond to the creation date.
+
+With this approach we have no information about boxes that were deleted in the
+meantime, but that's okay for now.
+
+## ...and exposure
+```{r exposure_counts, message=FALSE}
+exposure_counts = boxes %>%
+  group_by(exposure) %>%
+  mutate(count = row_number(locationtimestamp))
+
+exposure_colors = c(indoor = 'red', outdoor = 'lightgreen', mobile = 'blue', unknown = 'darkgrey')
+ggplot(exposure_counts, aes(x = locationtimestamp, y = count, colour = exposure)) +
+  geom_line() +
+  scale_colour_manual(values = exposure_colors) +
+  xlab('Registration Date') + ylab('senseBox count')
+```
+
+Outdoor boxes are growing *fast*!
+We can also see the introduction of `mobile` sensor "stations" in 2017.
+
+Let's have a quick summary:
+```{r exposure_summary}
+exposure_counts %>%
+  summarise(
+    oldest = min(locationtimestamp),
+    newest = max(locationtimestamp),
+    count = max(count)
+  ) %>%
+  arrange(desc(count))
+```
+
+## ...and grouptag
+We can try to find out where the increases in growth came from, by analysing the
+box count by grouptag.
+
+Caveats: Only a small subset of boxes has a grouptag, and we should assume
+that these groups are actually bigger. Also, we can see that grouptag naming is
+inconsistent (`Luftdaten`, `luftdaten.info`, ...)
+
+```{r grouptag_counts, message=FALSE}
+grouptag_counts = boxes %>%
+  group_by(grouptag) %>%
+  # only include grouptags with 15 or more members
+  # (vectorised `&`: `&&` expects length-one operands and warns or errors on vectors)
+  filter(length(grouptag) >= 15 & !is.na(grouptag) & grouptag != '') %>%
+  mutate(count = row_number(locationtimestamp))
+
+# helper for sorting the grouptags by boxcount
+sortLvls = function(oldFactor, ascending = TRUE) {
+  lvls = table(oldFactor) %>% sort(., decreasing = !ascending) %>% names()
+  factor(oldFactor, levels = lvls)
+}
+grouptag_counts$grouptag = sortLvls(grouptag_counts$grouptag, ascending = FALSE)
+
+ggplot(grouptag_counts, aes(x = locationtimestamp, y = count, colour = grouptag)) +
+  geom_line(aes(group = grouptag)) +
+  xlab('Registration Date') + ylab('senseBox count')
+```
+
+```{r grouptag_summary}
+grouptag_counts %>%
+  summarise(
+    oldest = min(locationtimestamp),
+    newest = max(locationtimestamp),
+    count = max(count)
+  ) %>%
+  arrange(desc(count))
+```
+
+# Plot rate of growth and inactivity per week
+First we group the boxes by `locationtimestamp` into bins of one week:
+```{r growthrate_registered, warning=FALSE, message=FALSE, results='hide'}
+bins = 'week'
+mvavg_bins = 6
+
+growth = boxes %>%
+  mutate(week = cut(as.Date(locationtimestamp), breaks = bins)) %>%
+  group_by(week) %>%
+  summarize(count = length(week)) %>%
+  mutate(event = 'registered')
+```
+
+We can do the same for `updatedAt`, which informs us about the last change to
+a box, including uploaded measurements. As a lot of boxes were "updated" by the database
+migration, many of them show an update on 2022-03-30, so we use the `lastMeasurement`
+attribute instead of `updatedAt`. This leads to fewer boxes but also automatically excludes
+boxes which were created but never made a measurement.
+
+This method of determining inactive boxes is fairly inaccurate and should be
+considered an approximation, because we have no information about intermediate
+inactive phases.
+Also, deleted boxes would probably have a big impact here.
+```{r growthrate_inactive, warning=FALSE, message=FALSE, results='hide'}
+inactive = boxes %>%
+  # remove boxes that reported a measurement in the last two days,
+  # b/c by this definition every box becomes inactive at some point
+  filter(lastMeasurement < now() - days(2)) %>%
+  mutate(week = cut(as.Date(lastMeasurement), breaks = bins)) %>%
+  filter(as.Date(week) > as.Date("2021-12-31")) %>%
+  group_by(week) %>%
+  summarize(count = length(week)) %>%
+  mutate(event = 'inactive')
+```
+
+Now we can combine both datasets for plotting:
+```{r growthrate, warning=FALSE, message=FALSE, results='hide'}
+boxes_by_date = bind_rows(growth, inactive) %>% group_by(event)
+
+ggplot(boxes_by_date, aes(x = as.Date(week), colour = event)) +
+  xlab('Time') + ylab(paste('rate per ', bins)) +
+  scale_x_date(date_breaks="years", date_labels="%Y") +
+  scale_colour_manual(values = c(registered = 'lightgreen', inactive = 'grey')) +
+  geom_point(aes(y = count), size = 0.5) +
+  # moving average, make first and last value NA (to ensure identical length of vectors)
+  geom_line(aes(y = rollmean(count, mvavg_bins, fill = list(NA, NULL, NA))))
+```
+
+And see in which weeks the most boxes become (in)active:
+```{r table_mostregistrations}
+boxes_by_date %>%
+  filter(count > 50) %>%
+  arrange(desc(count))
+```
+
+# Plot duration of boxes being active {.tabset}
+While we are looking at `locationtimestamp` and `lastMeasurement`, we can also extract the duration of activity
+of each box, and look at metrics by exposure and grouptag once more:
+
+## ...by exposure
+```{r exposure_duration, message=FALSE}
+durations = boxes %>%
+  group_by(exposure) %>%
+  filter(!is.na(lastMeasurement)) %>%
+  mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
+  filter(duration >= 0)
+
+ggplot(durations, aes(x = exposure, y = duration)) +
+  geom_boxplot() +
+  coord_flip() + ylab('Duration active in Days')
+```
+
+The time of activity averages at only `r round(mean(durations$duration))`
+ days,
+though there are boxes with `r round(max(durations$duration))` days of activity,
+spanning a large chunk of openSenseMap's existence.
+
+## ...by grouptag
+```{r grouptag_duration, message=FALSE}
+durations = boxes %>%
+  filter(!is.na(lastMeasurement)) %>%
+  group_by(grouptag) %>%
+  # only include grouptags with 15 or more members
+  filter(length(grouptag) >= 15 & !is.na(grouptag) & !is.na(lastMeasurement)) %>%
+  mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
+  filter(duration >= 0)
+
+ggplot(durations, aes(x = grouptag, y = duration)) +
+  geom_boxplot() +
+  coord_flip() + ylab('Duration active in Days')
+
+durations %>%
+  summarize(
+    duration_avg = round(mean(duration)),
+    duration_min = round(min(duration)),
+    duration_max = round(max(duration)),
+    oldest_box = round(max(difftime(now(), locationtimestamp, units='days')))
+  ) %>%
+  arrange(desc(duration_avg))
+```
+
+The time of activity averages at only `r round(mean(durations$duration))` days,
+though there are boxes with `r round(max(durations$duration))` days of activity,
+spanning a large chunk of openSenseMap's existence.
+
+## ...by year of registration
+This is less useful, as older boxes are active for a longer time by definition.
+If you have an idea how to compensate for that, please send a [Pull Request][PR]!
+
+```{r year_duration, message=FALSE}
+# NOTE: boxes older than 2016 missing due to missing updatedAt in database
+duration = boxes %>%
+  mutate(year = cut(as.Date(locationtimestamp), breaks = 'year')) %>%
+  group_by(year) %>%
+  filter(!is.na(lastMeasurement)) %>%
+  mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
+  filter(duration >= 0)
+
+ggplot(duration, aes(x = substr(as.character(year), 0, 4), y = duration)) +
+  geom_boxplot() +
+  coord_flip() + ylab('Duration active in Days') + xlab('Year of Registration')
+```
+
+# More Visualisations
+Other visualisations come to mind, and are left as an exercise to the reader.
+If you implemented some, feel free to add them to this vignette via a [Pull Request][PR].
+
+* growth by phenomenon
+* growth by location -> (interactive) map
+* set inactive rate in relation to total box count
+* filter timespans with big dips in growth rate, and extrapolate the amount of
+  senseBoxes that could be on the platform today, assuming there were no production issues ;)
+
+[PR]: https://github.com/sensebox/opensensmapr/pulls
+
+
diff --git a/inst/doc/osem-history_revised.html b/inst/doc/osem-history_revised.html
new file mode 100644
index 0000000..2fd728f
--- /dev/null
+++ b/inst/doc/osem-history_revised.html
@@ -0,0 +1,2493 @@
+
+
+
+
+
+This vignette serves as an example of data wrangling &
+visualization with `opensensmapr`, `dplyr` and `ggplot2`.
# required packages:
+#library(opensensmapr) # data download
+library(devtools)
+load_all(".")
+library(dplyr) # data wrangling
+library(ggplot2) # plotting
+library(lubridate) # date arithmetic
+library(zoo) # rollmean()
+openSenseMap.org has grown quite a bit in recent years; it would be
+interesting to see how we got to the current 11307 sensor stations,
+split up by various attributes of the boxes.
+While `opensensmapr` provides extensive methods of
+filtering boxes by attributes on the server, we do the filtering within
+R to save time and gain flexibility.
+So the first step is to retrieve all the boxes.
+# if you want to see results for a specific subset of boxes,
+# just specify a filter such as grouptag='ifgi' here
+boxes_all = osem_boxes()
+boxes = boxes_all
+In the following we just want to have a look at the boxes created in
+2022, so we filter for them.
+boxes = filter(boxes, locationtimestamp >= "2022-01-01" & locationtimestamp <="2022-12-31")
+summary(boxes) -> summary.data.frame
+## boxes total: 2132
+##
+## boxes by exposure:
+##  indoor  mobile outdoor unknown
+##     532     201    1398       1
+##
+## boxes by model:
+##                   custom          hackair_home_v2             homeEthernet
+##                      939                        5                        5
+##    homeEthernetFeinstaub           homeV2Ethernet  homeV2EthernetFeinstaub
+##                        2                        5                        5
+##               homeV2Lora               homeV2Wifi      homeV2WifiFeinstaub
+##                       62                      226                      116
+##                 homeWifi        homeWifiFeinstaub        luftdaten_pms1003
+##                       14                       17                        0
+## luftdaten_pms1003_bme280        luftdaten_pms3003 luftdaten_pms3003_bme280
+##                        1                        0                        0
+##        luftdaten_pms5003 luftdaten_pms5003_bme280        luftdaten_pms7003
+##                        2                        8                        0
+## luftdaten_pms7003_bme280         luftdaten_sds011  luftdaten_sds011_bme280
+##                        8                       29                      465
+##  luftdaten_sds011_bmp180   luftdaten_sds011_dht11   luftdaten_sds011_dht22
+##                       27                        7                      189
+##
+## $last_measurement_within
+##    1h    1d   30d  365d never
+##   731   765   874  1571   522
+##
+## oldest box: 2020-02-29 23:00:31 (Kirchardt 1)
+## newest box: 2022-12-30 09:19:46 (Balkon)
+##
+## sensors per box:
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
+##   1.000   3.000   5.000   4.959   6.000  29.000
+
+
+
+
+
+Another feature of interest is the spatial distribution of the boxes:
+`plot()` can help us out here. This function requires a bunch
+of optional dependencies though.
if (!require('maps')) install.packages('maps')
+if (!require('maptools')) install.packages('maptools')
+if (!require('rgeos')) install.packages('rgeos')
+
+plot(boxes)
+
+But what do these sensor stations actually measure? Let’s find out.
+`osem_phenomena()` gives us a named list of the counts of
+each observed phenomenon for the given set of sensor stations:
phenoms = osem_phenomena(boxes)
+str(phenoms)
+## List of 969
+## $ Temperatur : int 1543
+## $ rel. Luftfeuchte : int 1306
+## $ PM10 : int 1080
+## $ PM2.5 : int 1079
+## $ Luftdruck : int 1004
+## $ Beleuchtungsstärke : int 290
+## $ UV-Intensität : int 290
+## $ VOC : int 224
+## $ Lufttemperatur : int 208
+## $ CO₂ : int 179
+## $ Bodenfeuchte : int 173
+## $ Temperature : int 167
+## $ Lautstärke : int 136
+## $ Luftfeuchte : int 134
+## $ Humidity : int 124
+## $ atm. Luftdruck : int 114
+## $ Kalibrierungswert : int 108
+## $ CO2eq : int 107
+## $ IAQ : int 107
+## $ rel. Luftfeuchte SCD30 : int 107
+## $ Temperatur SCD30 : int 107
+## $ Pressure : int 98
+## $ Luftfeuchtigkeit : int 61
+## $ Bodentemperatur : int 58
+## $ PM01 : int 58
+## $ Windgeschwindigkeit : int 46
+## $ Feinstaub PM10 : int 33
+## $ Feinstaub PM2.5 : int 24
+## $ Batterie : int 22
+## $ Feinstaub PM1.0 : int 20
+## $ Taupunkt : int 19
+## $ Windrichtung : int 19
+## $ rel. Luftfeuchte (HECA) : int 16
+## $ Temperatur (HECA) : int 15
+## $ Temperatura : int 15
+## $ Durchschnitt Umgebungslautstärke : int 14
+## $ Helligkeit : int 14
+## $ Latitude : int 14
+## $ Longtitude : int 14
+## $ Minimum Umgebungslautstärke : int 14
+## $ RSSI : int 14
+## $ UV-Index : int 11
+## $ CO2 : int 10
+## $ PM1 : int 10
+## $ rel-. Luftfeuchte : int 10
+## $ temperature : int 10
+## $ Abstand nach links : int 9
+## $ Abstand nach rechts : int 9
+## $ Beschleunigung X-Achse : int 9
+## $ Beschleunigung Y-Achse : int 9
+## $ Beschleunigung Z-Achse : int 9
+## $ Feinstaub PM25 : int 9
+## $ gefühlte Temperatur : int 9
+## $ Luftdruck absolut : int 9
+## $ Luftdruck relativ : int 9
+## $ Regenrate : int 9
+## $ Sonnenstrahlung : int 9
+## $ Geschwindigkeit : int 8
+## $ humidity : int 8
+## $ Bodenfeuchtigkeit : int 7
+## $ Baterie : int 6
+## $ Bodenfeuchte 10cm : int 6
+## $ Bodenfeuchte 30cm : int 6
+## $ Bodentemperatur 10cm : int 6
+## $ Bodentemperatur 30cm : int 6
+## $ Lumina : int 6
+## $ Taupunktdifferenz (Spread) : int 6
+## $ Umiditate : int 6
+## $ UV : int 6
+## $ Air pressure : int 5
+## $ Battery : int 5
+## $ Höhe (barometrisch) : int 5
+## $ Pegel : int 5
+## $ PM4 : int 5
+## $ Prezenta-foc : int 5
+## $ Regenmenge : int 5
+## $ W-LAN : int 5
+## $ absolute Luftfeuchtigkeit : int 4
+## $ Bodenfeuchte 1 : int 4
+## $ Bodenfeuchte 2 : int 4
+## $ Bodentemperatur 1 : int 4
+## $ Bodentemperatur 2 : int 4
+## $ CO2 Konzentration : int 4
+## $ Gesamt Gewicht : int 4
+## $ PM 2.5 : int 4
+## $ Température : int 4
+## $ Wilgotność : int 4
+## $ Windspeed : int 4
+## $ air pressure : int 3
+## $ Battery Voltage : int 3
+## $ Befehl : int 3
+## $ Bodenfeuchte 50cm : int 3
+## $ Bodentemperatur 60cm : int 3
+## $ Ciśnienie : int 3
+## $ Feinstaub : int 3
+## $ Höhe : int 3
+## $ Humedad : int 3
+## $ Laufzeit : int 3
+## $ LuftfeuchteBME : int 3
+## [list output truncated]
+That’s quite a bit of noise, with many phenomena being measured by a
+single sensor only, or many duplicated phenomena due to slightly
+different spellings. We should clean that up, but for now let’s just
+filter out the noise and find those phenomena with high sensor
+numbers:
+phenoms[phenoms > 50]
+## $Temperatur
+## [1] 1543
+##
+## $`rel. Luftfeuchte`
+## [1] 1306
+##
+## $PM10
+## [1] 1080
+##
+## $PM2.5
+## [1] 1079
+##
+## $Luftdruck
+## [1] 1004
+##
+## $Beleuchtungsstärke
+## [1] 290
+##
+## $`UV-Intensität`
+## [1] 290
+##
+## $VOC
+## [1] 224
+##
+## $Lufttemperatur
+## [1] 208
+##
+## $`CO₂`
+## [1] 179
+##
+## $Bodenfeuchte
+## [1] 173
+##
+## $Temperature
+## [1] 167
+##
+## $Lautstärke
+## [1] 136
+##
+## $Luftfeuchte
+## [1] 134
+##
+## $Humidity
+## [1] 124
+##
+## $`atm. Luftdruck`
+## [1] 114
+##
+## $Kalibrierungswert
+## [1] 108
+##
+## $CO2eq
+## [1] 107
+##
+## $IAQ
+## [1] 107
+##
+## $`rel. Luftfeuchte SCD30`
+## [1] 107
+##
+## $`Temperatur SCD30`
+## [1] 107
+##
+## $Pressure
+## [1] 98
+##
+## $Luftfeuchtigkeit
+## [1] 61
+##
+## $Bodentemperatur
+## [1] 58
+##
+## $PM01
+## [1] 58
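The duplicated spellings in the output above (for example `Temperatur`, `Temperature` and `temperature`) could be merged before counting. The following is only a minimal sketch in base R, assuming a hand-written synonym table: the `counts` list is a small stand-in for the result of `osem_phenomena()` using a few values from the output above, and `synonyms` and `merge_phenomena()` are illustrative names that are not part of `opensensmapr`.

```r
# Stand-in for the named list returned by osem_phenomena();
# the values are taken from the output shown above.
counts = list(Temperatur = 1543L, Temperature = 167L, temperature = 10L,
              PM2.5 = 1079L, `Feinstaub PM2.5` = 24L, `PM 2.5` = 4L)

# Hand-written lookup (an assumption): lower-cased spelling -> canonical name.
synonyms = c('temperatur' = 'temperature',
             'feinstaub pm2.5' = 'pm2.5',
             'pm 2.5' = 'pm2.5')

merge_phenomena = function(counts, synonyms) {
  keys = tolower(trimws(names(counts)))
  # replace known variants, keep every other spelling as it is
  known = keys %in% names(synonyms)
  keys[known] = synonyms[keys[known]]
  # sum the counts per canonical name, largest first
  sort(tapply(unlist(counts), keys, sum), decreasing = TRUE)
}

merged = merge_phenomena(counts, synonyms)
print(merged)
```

A lookup table keeps the merging explicit and reviewable; fuzzy string matching would catch more variants but could also merge phenomena that are genuinely different.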
+In principle, the `createdAt` attribute of each box gives
+the exact time a box was registered. Because of some database
+migration issues, however, the `createdAt` values are mostly wrong
+(~80% of boxes created 2022-03-30), so we are using the
+`timestamp` attribute of the `currentlocation`,
+which should in most cases correspond to the creation date.
+With this approach we have no information about boxes that were
+deleted in the meantime, but that’s okay for now.
+exposure_counts = boxes %>%
+ group_by(exposure) %>%
+ mutate(count = row_number(locationtimestamp))
+
+exposure_colors = c(indoor = 'red', outdoor = 'lightgreen', mobile = 'blue', unknown = 'darkgrey')
+ggplot(exposure_counts, aes(x = locationtimestamp, y = count, colour = exposure)) +
+ geom_line() +
+ scale_colour_manual(values = exposure_colors) +
+ xlab('Registration Date') + ylab('senseBox count')
+
+Outdoor boxes are growing fast! We can also see the
+introduction of `mobile` sensor “stations” in 2017.
Let’s have a quick summary:
+exposure_counts %>%
+ summarise(
+ oldest = min(locationtimestamp),
+ newest = max(locationtimestamp),
+ count = max(count)
+ ) %>%
+ arrange(desc(count))
exposure | oldest | newest | count
---|---|---|---
outdoor | 2022-01-01 11:59:16 | 2022-12-30 09:19:46 | 1398
indoor | 2022-01-02 11:06:08 | 2022-12-23 20:46:45 | 532
mobile | 2022-01-06 13:20:00 | 2022-12-21 21:35:16 | 201
unknown | 2022-03-01 07:04:30 | 2022-03-01 07:04:30 | 1
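The `count` column used for the plot and the summary comes from `row_number(locationtimestamp)`, which ranks the boxes within each group by registration time, so the n-th registered box of a group gets the value n. A small sketch on made-up data (the `toy` data frame is hypothetical):

```r
library(dplyr)

# made-up registrations, deliberately not sorted by time
toy = data.frame(
  exposure = c('outdoor', 'indoor', 'outdoor', 'outdoor', 'indoor'),
  locationtimestamp = as.POSIXct(
    c('2022-03-01', '2022-01-15', '2022-01-10', '2022-02-20', '2022-06-01'),
    tz = 'UTC'
  )
)

toy_counts = toy %>%
  group_by(exposure) %>%
  # rank by timestamp within the group = running number of registered boxes
  mutate(count = row_number(locationtimestamp)) %>%
  ungroup()

print(toy_counts)
```

Because the rank is computed per group, `max(count)` in the summary equals the number of boxes in that group.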
+We can try to find out where the increases in growth came from, by
+analysing the box count by grouptag.
+Caveats: Only a small subset of boxes has a grouptag, and we should
+assume that these groups are actually bigger. Also, we can see that
+grouptag naming is inconsistent (`Luftdaten`,
+`luftdaten.info`, …)
grouptag_counts = boxes %>%
+ group_by(grouptag) %>%
+ # only include grouptags with 15 or more members
+ filter(length(grouptag) >= 15 && !is.na(grouptag) && grouptag != '') %>%
+ mutate(count = row_number(locationtimestamp))
+## Warning: There were 33 warnings in `filter()`.
+## The first warning was:
+## ℹ In argument: `length(grouptag) >= 15 && !is.na(grouptag) && grouptag != ""`.
+## ℹ In group 11: `grouptag = "321heiss"`.
+## Caused by warning in `length(grouptag) >= 15 && !is.na(grouptag)`:
+## ! 'length(x) = 91 > 1' in coercion to 'logical(1)'
+## ℹ Run `dplyr::last_dplyr_warnings()` to see the 32 remaining warnings.
+# helper for sorting the grouptags by boxcount
+sortLvls = function(oldFactor, ascending = TRUE) {
+ lvls = table(oldFactor) %>% sort(., decreasing = !ascending) %>% names()
+ factor(oldFactor, levels = lvls)
+}
+grouptag_counts$grouptag = sortLvls(grouptag_counts$grouptag, ascending = FALSE)
+
+ggplot(grouptag_counts, aes(x = locationtimestamp, y = count, colour = grouptag)) +
+ geom_line(aes(group = grouptag)) +
+ xlab('Registration Date') + ylab('senseBox count')
+
+grouptag_counts %>%
+ summarise(
+ oldest = min(locationtimestamp),
+ newest = max(locationtimestamp),
+ count = max(count)
+ ) %>%
+ arrange(desc(count))
grouptag | oldest | newest | count
---|---|---|---
edu | 2022-01-02 11:06:08 | 2022-12-18 12:38:27 | 130
HU Explorers | 2022-04-01 09:07:41 | 2022-12-14 10:11:34 | 128
321heiss | 2022-07-09 01:29:37 | 2022-09-01 06:27:35 | 91
Captographies | 2022-06-03 11:25:27 | 2022-11-16 13:26:39 | 58
SUGUCS | 2022-11-30 15:25:32 | 2022-12-03 10:11:41 | 23
SekSeeland | 2022-03-14 13:17:17 | 2022-03-22 20:23:58 | 19
BurgerMeetnet | 2022-01-24 15:33:19 | 2022-05-10 21:22:35 | 16
AGIN | 2022-11-28 17:33:12 | 2022-11-28 17:42:18 | 15
BRGL | 2022-11-06 19:23:43 | 2022-11-06 22:08:36 | 15
BRGW | 2022-11-02 10:28:52 | 2022-11-02 13:32:12 | 15
Burgermeetnet | 2022-01-15 20:43:16 | 2022-02-11 17:59:05 | 15
HTLJ | 2022-11-21 22:04:17 | 2022-11-21 22:05:47 | 15
Mikroprojekt Mitmachklima | 2022-02-09 10:28:40 | 2022-08-23 13:14:11 | 15
MSGB | 2022-11-14 09:08:57 | 2022-11-14 10:19:24 | 15
MSHO | 2022-12-20 09:28:40 | 2022-12-20 10:01:38 | 15
MSIN | 2022-11-21 17:02:39 | 2022-11-21 23:06:22 | 15
+First we group the boxes by `locationtimestamp` into bins
+of one week:
bins = 'week'
+mvavg_bins = 6
+
+growth = boxes %>%
+ mutate(week = cut(as.Date(locationtimestamp), breaks = bins)) %>%
+ group_by(week) %>%
+ summarize(count = length(week)) %>%
+ mutate(event = 'registered')
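`cut()` on a `Date` vector with `breaks = 'week'` assigns each date to the week it falls in and labels it with that week's first day (a Monday by default). A short sketch with made-up dates, using `table()` in place of the `group_by()` and `summarize()` steps:

```r
# made-up registration dates; 2022-01-03 and 2022-01-10 are Mondays
dates = as.Date(c('2022-01-03', '2022-01-05', '2022-01-09', '2022-01-10'))

# each date is labelled with the first day of its week
week = cut(dates, breaks = 'week')

# number of registrations per week
weekly = table(week)
print(weekly)
```

The result of `cut()` is a factor, which is why the plotting code converts `week` back with `as.Date()`.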
+We can do the same for `updatedAt`, which informs us about
+the last change to a box, including uploaded measurements. As a lot of
+boxes were “updated” by the database migration, many of them show an update
+on 2022-03-30, so we use the `lastMeasurement`
+attribute instead of `updatedAt`. This leads to fewer boxes
+but also automatically excludes boxes which were created but never made
+a measurement.
+This method of determining inactive boxes is fairly inaccurate and
+should be considered an approximation, because we have no information
+about intermediate inactive phases. Also, deleted boxes would probably
+have a big impact here.
+inactive = boxes %>%
+ # remove boxes that reported a measurement in the last two days,
+ # b/c by this definition every box becomes inactive at some point
+ filter(lastMeasurement < now() - days(2)) %>%
+ mutate(week = cut(as.Date(lastMeasurement), breaks = bins)) %>%
+ filter(as.Date(week) > as.Date("2021-12-31")) %>%
+ group_by(week) %>%
+ summarize(count = length(week)) %>%
+ mutate(event = 'inactive')
+Now we can combine both datasets for plotting:
+boxes_by_date = bind_rows(growth, inactive) %>% group_by(event)
+
+ggplot(boxes_by_date, aes(x = as.Date(week), colour = event)) +
+ xlab('Time') + ylab(paste('rate per ', bins)) +
+ scale_x_date(date_breaks="years", date_labels="%Y") +
+ scale_colour_manual(values = c(registered = 'lightgreen', inactive = 'grey')) +
+ geom_point(aes(y = count), size = 0.5) +
+ # moving average, make first and last value NA (to ensure identical length of vectors)
+ geom_line(aes(y = rollmean(count, mvavg_bins, fill = list(NA, NULL, NA))))
+
+And see in which weeks the most boxes become (in)active:
+boxes_by_date %>%
+ filter(count > 50) %>%
+ arrange(desc(count))
week | count | event
---|---|---
2022-11-21 | 93 | registered
2022-06-06 | 77 | registered
2022-08-29 | 76 | registered
2022-10-31 | 72 | registered
2022-11-14 | 68 | registered
2022-11-28 | 66 | registered
2022-08-22 | 61 | registered
2022-02-28 | 57 | registered
2022-12-12 | 56 | registered
2022-08-29 | 56 | inactive
2022-03-21 | 54 | registered
2022-01-24 | 51 | registered
2022-03-07 | 51 | registered
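The smoothed line in the plot is a centred moving average whose ends are padded with `NA`, so that it has as many values as there are weeks. The same idea can be sketched without `zoo` by using `stats::filter()` from base R. This is a stand-in for illustration, not the call used in this vignette, and the window size of 3 and the `weekly_counts` vector are made up:

```r
# made-up weekly counts
weekly_counts = c(10, 20, 30, 40, 50)
k = 3  # window size (the vignette uses mvavg_bins = 6)

# centred moving average; positions without a full window become NA.
# stats:: is spelled out because dplyr masks filter().
smoothed = as.numeric(stats::filter(weekly_counts, rep(1 / k, k), sides = 2))
print(smoothed)
```

Keeping the `NA` padding means the smoothed vector lines up with the weeks on the x axis, and `geom_line()` simply leaves the padded positions undrawn.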
+While we are looking at `locationtimestamp` and
+`lastMeasurement`, we can also extract the duration of
+activity of each box, and look at metrics by exposure and grouptag once
+more:
durations = boxes %>%
+ group_by(exposure) %>%
+ filter(!is.na(lastMeasurement)) %>%
+ mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
+ filter(duration >= 0)
+
+ggplot(durations, aes(x = exposure, y = duration)) +
+ geom_boxplot() +
+ coord_flip() + ylab('Duration active in Days')
+
+The time of activity averages at only 130 days, though there are
+boxes with 395 days of activity, spanning a large chunk of
+openSenseMap’s existence.
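`difftime()` with `units = 'days'` returns the span between two timestamps as a (possibly fractional) number of days, which is what the `duration` column holds. A minimal sketch with made-up timestamps for a single box:

```r
# made-up timestamps for a single box
registered = as.POSIXct('2022-01-01 00:00:00', tz = 'UTC')
last_seen  = as.POSIXct('2022-03-01 12:00:00', tz = 'UTC')

duration = difftime(last_seen, registered, units = 'days')

# as.numeric() drops the "days" unit, e.g. for further arithmetic
duration_days = as.numeric(duration)
print(duration)
```

Fixing `units` explicitly matters here: left to its default, `difftime()` picks a unit per call based on the size of the difference, so durations from different calls would not be directly comparable.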
+durations = boxes %>%
+ filter(!is.na(lastMeasurement)) %>%
+ group_by(grouptag) %>%
+ # only include grouptags with 15 or more members
+ filter(length(grouptag) >= 15 && !is.na(grouptag) && !is.na(lastMeasurement)) %>%
+ mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
+ filter(duration >= 0)
+## Warning: There were 21 warnings in `filter()`.
+## The first warning was:
+## ℹ In argument: `length(grouptag) >= 15 && !is.na(grouptag) && ...`.
+## ℹ In group 11: `grouptag = "321heiss"`.
+## Caused by warning in `length(grouptag) >= 15 && !is.na(grouptag)`:
+## ! 'length(x) = 81 > 1' in coercion to 'logical(1)'
+## ℹ Run `dplyr::last_dplyr_warnings()` to see the 20 remaining warnings.
+ggplot(durations, aes(x = grouptag, y = duration)) +
+ geom_boxplot() +
+ coord_flip() + ylab('Duration active in Days')
+
+durations %>%
+ summarize(
+ duration_avg = round(mean(duration)),
+ duration_min = round(min(duration)),
+ duration_max = round(max(duration)),
+ oldest_box = round(max(difftime(now(), locationtimestamp, units='days')))
+ ) %>%
+ arrange(desc(duration_avg))
grouptag | duration_avg | duration_min | duration_max | oldest_box
---|---|---|---|---
Burgermeetnet | 225 days | 0 days | 381 days | 381 days
BurgerMeetnet | 160 days | 0 days | 372 days | 372 days
Captographies | 109 days | 0 days | 238 days | 240 days
BRGL | 85 days | 81 days | 86 days | 86 days
edu | 83 days | 0 days | 385 days | 389 days
MSGB | 72 days | 39 days | 78 days | 78 days
HTLJ | 65 days | 30 days | 71 days | 71 days
MSHO | 41 days | 36 days | 42 days | 42 days
HU Explorers | 28 days | 0 days | 189 days | 305 days
321heiss | 0 days | 0 days | 0 days | 207 days
+The time of activity averages at only 61 days, though there are boxes
+with 385 days of activity, spanning a large chunk of openSenseMap’s
+existence.
+This is less useful, as older boxes are active for a longer time by
+definition. If you have an idea how to compensate for that, please send
+a Pull Request!
+# NOTE: boxes older than 2016 missing due to missing updatedAt in database
+duration = boxes %>%
+ mutate(year = cut(as.Date(locationtimestamp), breaks = 'year')) %>%
+ group_by(year) %>%
+ filter(!is.na(lastMeasurement)) %>%
+ mutate(duration = difftime(lastMeasurement, locationtimestamp, units='days')) %>%
+ filter(duration >= 0)
+
+ggplot(duration, aes(x = substr(as.character(year), 0, 4), y = duration)) +
+ geom_boxplot() +
+ coord_flip() + ylab('Duration active in Days') + xlab('Year of Registration')
+
+Other visualisations come to mind, and are left as an exercise to the
+reader. If you implemented some, feel free to add them to this vignette
+via a Pull Request.
+