add vignette osem-serialization

#13
measurements_archive
noerw 6 years ago
parent 8975cbc664
commit b79f3dff8b

1
.gitignore vendored

@ -7,3 +7,4 @@
*.log
opensensmapr_*.tar.gz
inst/doc

@ -0,0 +1,136 @@
---
title: "opensensmapr reproducibility: Loading openSenseMap Data from Files"
author: "Norwin Roosen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{opensensmapr reproducibility: Loading openSenseMap Data from Files}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
It may be useful to download data from openSenseMap only once.
For reproducible results, the data can be saved to disk and reloaded at a
later point.
This avoids:

- changed results for queries without date parameters,
- unnecessary wait times,
- the risk of API changes or API unavailability,
- stress on the openSenseMap server.
```{r setup, results='hide'}
# this vignette requires:
library(opensensmapr)
library(jsonlite)
library(readr)
# first get our example data:
boxes = osem_boxes(grouptag = 'ifgi')
measurements = osem_measurements(boxes, phenomenon = 'PM10')
```
## (De-)Serializing Data
The standard way of serialization in R is through the custom binary `.rds` (single object)
or `.RData` (one or more named objects, up to the full workspace) formats:
```{r serialize_rds}
# serializing measurements to RDS, and loading it from the file again:
saveRDS(measurements, 'measurements.rds')
measurements_from_file = readRDS('measurements.rds')
```
Or, if you are paranoid and worry about `.rds` files not being decodable anymore
in the (distant) future, you could serialize to a plain text format such as JSON.
This of course comes at the cost of storage space and performance.
```{r serialize_json}
# serializing senseBoxes to JSON, and loading from file again:
write(jsonlite::serializeJSON(boxes), 'boxes.json')
boxes_from_file = jsonlite::unserializeJSON(readr::read_file('boxes.json'))
```
Both methods also persist the R object metadata (classes, attributes).
If you were to use a serialization method that can't persist object metadata, you
could re-apply it with the following functions:
```{r serialize_attrs}
# note the toJSON call, which does not store classes and attributes
write(jsonlite::toJSON(boxes), 'boxes_bad.json')
boxes_without_attrs = jsonlite::fromJSON('boxes_bad.json')
boxes_with_attrs = osem_as_sensebox(boxes_without_attrs)
class(boxes_with_attrs)
```
The same goes for measurements via `osem_as_measurements()`.
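
The difference between the methods can also be checked without network access. The following is a minimal sketch on a toy object: the class `toy_box` and the attribute `note` are hypothetical and not part of opensensmapr, and CSV stands in here for any format that stores only the table itself.

```{r serialize_toy}
library(jsonlite)

# hypothetical toy object: a data.frame with a custom class and attribute
toy = structure(
  data.frame(id = c('a', 'b'), value = c(1.5, 2.5), stringsAsFactors = FALSE),
  class = c('toy_box', 'data.frame'),
  note = 'hypothetical attribute'
)

# RDS keeps the class and the attribute
rds_file = tempfile(fileext = '.rds')
saveRDS(toy, rds_file)
from_rds = readRDS(rds_file)
class(from_rds)
attr(from_rds, 'note')

# serializeJSON() keeps them as well
from_json = jsonlite::unserializeJSON(jsonlite::serializeJSON(toy))
class(from_json)
attr(from_json, 'note')

# CSV only stores the table, so the metadata has to be re-applied by hand
csv_file = tempfile(fileext = '.csv')
write.csv(toy, csv_file, row.names = FALSE)
from_csv = read.csv(csv_file, stringsAsFactors = FALSE)
class(from_csv)  # plain data.frame
class(from_csv) = class(toy)
attr(from_csv, 'note') = attr(toy, 'note')

# remove the temporary files again
file.remove(rds_file, csv_file)
```

For real senseBox data, re-applying the metadata by hand is what `osem_as_sensebox()` and `osem_as_measurements()` take care of.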
## Workflow for reproducible code
For truly reproducible code you want it to work and return the same results,
whether you run it for the first time or a subsequent time, and without making
changes to it.
Therefore we need a wrapper around the save-to-file & load-from-file logic.
The following examples show a way to do just that, and were inspired by
[this reproducible analysis by Daniel Nuest](https://github.com/nuest/sensebox-binder).
```{r osem_offline}
# offline logic
osem_offline = function (func, file, format='rds', ...) {
  # deserialize if file exists, otherwise download and serialize
  if (file.exists(file)) {
    if (format == 'json')
      jsonlite::unserializeJSON(readr::read_file(file))
    else
      readRDS(file)
  } else {
    data = func(...)
    if (format == 'json')
      write(jsonlite::serializeJSON(data), file = file)
    else
      saveRDS(data, file)
    data
  }
}

# wrappers for each download function
osem_measurements_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_measurements, file, ...)
}
osem_boxes_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_boxes, file, ...)
}
osem_box_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_box, file, ...)
}
osem_counts_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_counts, file, ...)
}
```
That's it! Now let's try it out:
```{r test}
# first run; will download and save to disk
b1 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
# subsequent runs; will read from disk
b2 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
class(b1) == class(b2)
# we can even omit the arguments now (though that's not really the point here)
b3 = osem_boxes_offline('mobileboxes.rds')
nrow(b1) == nrow(b3)
# verify that the custom sensebox methods are still working
summary(b2)
plot(b3)
```
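
The caching behaviour itself can be verified without contacting the API. The sketch below is an illustration under stated assumptions: `fake_download()` is a hypothetical stand-in for a download function that counts its own invocations, and `cache_rds()` is a simplified, RDS-only variant of the pattern used in `osem_offline()`.

```{r offline_standin}
# hypothetical stand-in for a download function; counts how often it is called
calls = 0
fake_download = function (n) {
  calls <<- calls + 1
  data.frame(id = seq_len(n), value = seq_len(n) * 2)
}

# simplified, RDS-only variant of the caching pattern (an assumption for this demo)
cache_rds = function (func, file, ...) {
  # read from disk if the file exists, otherwise call func and save its result
  if (file.exists(file)) return(readRDS(file))
  data = func(...)
  saveRDS(data, file)
  data
}

cache_file = tempfile(fileext = '.rds')
d1 = cache_rds(fake_download, cache_file, n = 3)  # calls fake_download()
d2 = cache_rds(fake_download, cache_file, n = 3)  # reads from disk
identical(d1, d2)
calls_after_two_runs = calls  # the stand-in has run only once so far

# deleting the file forces a fresh call on the next run
file.remove(cache_file)
d3 = cache_rds(fake_download, cache_file, n = 3)
calls_after_removal = calls

# remove the temporary file again
file.remove(cache_file)
```

The counter increases only when the file is missing, which is the property that keeps repeated runs of an analysis from hitting the server.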
To re-download the data, just clear the files that were created in the process:
```{r cleanup, results='hide'}
file.remove('mobileboxes.rds', 'boxes_bad.json', 'boxes.json', 'measurements.rds')
```
A possible extension to this scheme comes to mind: omit the specification of a
filename, and assign a unique ID to the request instead.
For example, one could calculate the SHA-1 hash of the parameters, and use it
as the filename.
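
A minimal sketch of that idea follows. It assumes the `digest` package as an additional dependency, which this vignette does not otherwise require. The helper name `osem_cache_file()` and its parameters are hypothetical.

```{r cache_filename}
# hypothetical helper: derive a cache filename from the request parameters
# (assumes the digest package is installed)
osem_cache_file = function (func_name, ..., dir = tempdir(), ext = 'rds') {
  params = list(...)
  # sort fully named parameters so that argument order does not change the hash
  if (!is.null(names(params)) && all(names(params) != ''))
    params = params[order(names(params))]
  hash = digest::digest(list(func_name, params), algo = 'sha1')
  file.path(dir, paste0(func_name, '_', hash, '.', ext))
}

f1 = osem_cache_file('osem_boxes', exposure = 'mobile', grouptag = 'ifgi')
f2 = osem_cache_file('osem_boxes', grouptag = 'ifgi', exposure = 'mobile')
f3 = osem_cache_file('osem_boxes', exposure = 'indoor')

f1 == f2  # same parameters in a different order map to the same file
f1 == f3  # different parameters map to a different file
```

The resulting path could then be passed as the `file` argument, for example `osem_boxes_offline(osem_cache_file('osem_boxes', exposure = 'mobile'), exposure = 'mobile')`. Queries without date parameters would still hash to the same file on every run, so the cached result stays fixed until the file is removed.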