diff --git a/.gitignore b/.gitignore
index 4cad8a4..5c9569f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,3 +7,4 @@
 *.log
 opensensmapr_*.tar.gz
+inst/doc
diff --git a/vignettes/osem-serialization.Rmd b/vignettes/osem-serialization.Rmd
new file mode 100644
index 0000000..a56d196
--- /dev/null
+++ b/vignettes/osem-serialization.Rmd
@@ -0,0 +1,136 @@
+---
+title: "opensensmapr reproducibility: Loading openSenseMap Data from Files"
+author: "Norwin Roosen"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{opensensmapr reproducibility: Loading openSenseMap Data from Files}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+It may be useful to download data from openSenseMap only once.
+For reproducible results, the data can be saved to disk and reloaded at a
+later point.
+
+This avoids:
+
+- changed results for queries without date parameters,
+- unnecessary wait times,
+- risk of API changes / API unavailability,
+- stress on the openSenseMap server.
+
+```{r setup, results='hide'}
+# this vignette requires:
+library(opensensmapr)
+library(jsonlite)
+library(readr)
+
+# first get our example data:
+boxes = osem_boxes(grouptag = 'ifgi')
+measurements = osem_measurements(boxes, phenomenon = 'PM10')
+```
+
+## (De-)Serializing Data
+The standard way to serialize objects in R is through the binary `.rds` (single object)
+or `.RData` (full environment) formats:
+```{r serialize_rds}
+# serializing measurements to RDS, and loading them from the file again:
+saveRDS(measurements, 'measurements.rds')
+measurements_from_file = readRDS('measurements.rds')
+```
+
+Or, if you are paranoid and worry about `.rds` files not being decodable anymore
+in the (distant) future, you could serialize to a plain text format such as JSON.
+This of course comes at the cost of storage space and performance.
+```{r serialize_json}
+# serializing senseBoxes to JSON, and loading from file again:
+write(jsonlite::serializeJSON(boxes), 'boxes.json')
+boxes_from_file = jsonlite::unserializeJSON(readr::read_file('boxes.json'))
+```
+
+Both methods also persist the R object metadata (classes, attributes).
+If you use a serialization method that can't persist object metadata, you
+can re-apply it with the following functions:
+
+```{r serialize_attrs}
+# note the toJSON call (instead of serializeJSON), which drops classes and attributes
+write(jsonlite::toJSON(boxes), 'boxes_bad.json')
+boxes_without_attrs = jsonlite::fromJSON('boxes_bad.json')
+
+boxes_with_attrs = osem_as_sensebox(boxes_without_attrs)
+class(boxes_with_attrs)
+```
+The same goes for measurements via `osem_as_measurements()`.
+
+## Workflow for reproducible code
+For truly reproducible code you want it to work and return the same results,
+no matter whether you run it for the first time or a subsequent time, and without
+making changes to it.
+
+Therefore we need a wrapper around the save-to-file & load-from-file logic.
+The following examples show a way to do just that, and were inspired by
+[this reproducible analysis by Daniel Nuest](https://github.com/nuest/sensebox-binder).
+
+```{r osem_offline}
+# offline logic; pass the arguments in `...` by name, as a third positional argument would bind to `format`
+osem_offline = function (func, file, format='rds', ...) {
+  # deserialize if file exists, otherwise download and serialize
+  if (file.exists(file)) {
+    if (format == 'json')
+      jsonlite::unserializeJSON(readr::read_file(file))
+    else
+      readRDS(file)
+  } else {
+    data = func(...)
+    if (format == 'json')
+      write(jsonlite::serializeJSON(data), file = file)
+    else
+      saveRDS(data, file)
+    data
+  }
+}
+
+# wrappers for each download function
+osem_measurements_offline = function (file, ...) {
+  osem_offline(opensensmapr::osem_measurements, file, ...)
+}
+osem_boxes_offline = function (file, ...) {
+  osem_offline(opensensmapr::osem_boxes, file, ...)
+}
+osem_box_offline = function (file, ...) {
+  osem_offline(opensensmapr::osem_box, file, ...)
+}
+osem_counts_offline = function (file, ...) {
+  osem_offline(opensensmapr::osem_counts, file, ...)
+}
+```
+
+That's it! Now let's try it out:
+
+```{r test}
+# first run; will download and save to disk
+b1 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
+
+# subsequent runs; will read from disk
+b2 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
+class(b1) == class(b2)
+
+# we can even omit the arguments now (though that's not really the point here)
+b3 = osem_boxes_offline('mobileboxes.rds')
+nrow(b1) == nrow(b3)
+
+# verify that the custom sensebox methods are still working
+summary(b2)
+plot(b3)
+```
+
+To re-download the data, just remove the files that were created in the process:
+```{r cleanup, results='hide'}
+file.remove('mobileboxes.rds', 'boxes_bad.json', 'boxes.json', 'measurements.rds')
+```
+
+A possible extension to this scheme comes to mind: omit the specification of a
+filename, and assign a unique ID to the request instead.
+For example, one could calculate the SHA-1 hash of the parameters, and use it
+as the filename.
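
The extension proposed in the vignette's closing paragraph could look roughly like the sketch below. It is not part of this patch or of opensensmapr: `osem_offline_hashed` and `cache_dir` are hypothetical names, and the `digest` package is an assumed extra dependency used for the SHA-1 hash.

```r
# Sketch only: derive the cache filename from the request parameters.
# Assumes the `digest` package is installed; `osem_offline_hashed` and
# `cache_dir` are hypothetical names chosen for illustration.
osem_offline_hashed = function (func, ..., cache_dir = tempdir()) {
  # hash the arguments, so identical requests map to the same file
  id = digest::digest(list(...), algo = 'sha1')
  file = file.path(cache_dir, paste0(id, '.rds'))

  # deserialize if the file exists, otherwise call func and serialize
  if (file.exists(file)) {
    readRDS(file)
  } else {
    data = func(...)
    saveRDS(data, file)
    data
  }
}

# possible usage (requires network access on the first run):
# b4 = osem_offline_hashed(opensensmapr::osem_boxes, exposure = 'mobile')
```

Only the arguments are hashed here, so two different download functions called with identical arguments would share one file; prefixing the filename with the function's name would be one way around that. Named and positional forms of the same argument also produce different hashes.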