clean up osem-serialization, #22

2025-07-19 02:40:14 +02:00 · 2018-06-05 20:22:22 +02:00 · 2018-06-05 20:22:22 +02:00 · 97768e7cdb
commit 97768e7cdb
parent 54b0994671
1 changed file with 22 additions and 89 deletions
--- a/vignettes/osem-serialization.Rmd
+++ b/vignettes/osem-serialization.Rmd
@@ -21,19 +21,24 @@ This avoids..
 - stress on the openSenseMap-server.
 This vignette shows how to use this built-in `opensensmapr` feature, and
-how to do it yourself, if you want to store to other data formats.
+how to do it yourself, if you want to save to other data formats.
-## Using openSensMapr Caching Feature
+```{r setup, results='hide'}
 # this vignette requires:
 library(opensensmapr)
 library(jsonlite)
 library(readr)
 ```
 ## Using the opensensmapr Caching Feature
 All data retrieval functions of `opensensmapr` have a built-in caching feature,
 which serializes an API response to disk.
 Subsequent identical requests will then return the serialized data instead of making
 another request.
 To do so, each request is given a unique ID based on its parameters.
 To use this feature, just add a path to a directory to the `cache` parameter:
 ```{r cache}
 b = osem_boxes(cache = tempdir())
 list.files(tempdir(), pattern = 'osemcache\\..*\\.rds')
 # the next identical request will hit the cache only!
 b = osem_boxes(cache = tempdir())
@@ -42,8 +47,12 @@ b = osem_boxes(cache = tempdir())
 b = osem_boxes()
 ```
-You can maintain multiple caches simultaneously which allows to store only
+Looking at the cache directory we can see one file for each request, which is identified through a hash of the request URL:
-serialized data related to a script in its directory:
+```{r cachelisting}
 list.files(tempdir(), pattern = 'osemcache\\..*\\.rds')
 ```
 You can maintain multiple caches simultaneously, which lets you store only the data related to a script in the same directory:
 ```{r cache_custom}
 cacheDir = getwd() # current working directory
 b = osem_boxes(cache = cacheDir)
@@ -62,15 +71,9 @@ osem_clear_cache(getwd()) # clears a custom cache
 If you want to roll your own serialization method to support custom data formats,
 here's how:
-```{r setup, results='hide'}
+```{r data, results='hide'}
 # this section requires:
 library(opensensmapr)
 library(jsonlite)
 library(readr)
 # first get our example data:
 boxes = osem_boxes(grouptag = 'ifgi')
 measurements = osem_measurements(boxes, phenomenon = 'PM10')
 ```
 If you are paranoid and worry about `.rds` files not being decodable anymore
@@ -78,92 +81,22 @@ in the (distant) future, you could serialize to a plain text format such as JSON
 This of course comes at the cost of storage space and performance.
 ```{r serialize_json}
 # serializing senseBoxes to JSON, and loading from file again:
-write(jsonlite::serializeJSON(measurements), 'boxes.json')
+write(jsonlite::serializeJSON(boxes), 'boxes.json')
 boxes_from_file = jsonlite::unserializeJSON(readr::read_file('boxes.json'))
 class(boxes_from_file)
 ```
-Both methods also persist the R object metadata (classes, attributes).
+This method also persists the R object metadata (classes, attributes).
 If you were to use a serialization method that can't persist object metadata, you
 could re-apply it with the following functions:
 ```{r serialize_attrs}
-# note the toJSON call
+# note the toJSON call instead of serializeJSON
-write(jsonlite::toJSON(measurements), 'boxes_bad.json')
+write(jsonlite::toJSON(boxes), 'boxes_bad.json')
 boxes_without_attrs = jsonlite::fromJSON('boxes_bad.json')
 class(boxes_without_attrs)
 boxes_with_attrs = osem_as_sensebox(boxes_without_attrs)
 class(boxes_with_attrs)
 ```
 The same goes for measurements via `osem_as_measurements()`.
 ## Workflow for reproducible code
 For truly reproducible code you want it to work and return the same results --
 no matter if you run it the first time or a consecutive time, and without making
 changes to it.
 Therefore we need a wrapper around the save-to-file & load-from-file logic.
 The following examples show a way to do just that, and were inspired by
 [this reproducible analysis by Daniel Nuest](https://github.com/nuest/sensebox-binder).
 ```{r osem_offline}
 # offline logic
 osem_offline = function (func, file, format='rds', ...) {
  # deserialize if file exists, otherwise download and serialize
  if (file.exists(file)) {
    if (format == 'json')
      jsonlite::unserializeJSON(readr::read_file(file))
    else
      readRDS(file)
  } else {
    data = func(...)
    if (format == 'json')
      write(jsonlite::serializeJSON(data), file = file)
    else
      saveRDS(data, file)
    data
  }
 }
 # wrappers for each download function
 osem_measurements_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_measurements, file, ...)
 }
 osem_boxes_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_boxes, file, ...)
 }
 osem_box_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_box, file, ...)
 }
 osem_counts_offline = function (file, ...) {
  osem_offline(opensensmapr::osem_counts, file, ...)
 }
 ```
 That's it! Now let's try it out:
 ```{r test}
 # first run; will download and save to disk
 b1 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
 # consecutive runs; will read from disk
 b2 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
 class(b1) == class(b2)
 # we can even omit the arguments now (though that's not really the point here)
 b3 = osem_boxes_offline('mobileboxes.rds')
 nrow(b1) == nrow(b3)
 # verify that the custom sensebox methods are still working
 summary(b2)
 plot(b3)
 ```
 To re-download the data, just clear the files that were created in the process:
 ```{r cleanup, results='hide'}
 file.remove('mobileboxes.rds', 'boxes_bad.json', 'boxes.json', 'measurements.rds')
 ```
 A possible extension to this scheme comes to mind: Omit the specification of a
 filename, and assign a unique ID to the request instead.
 For example, one could calculate the SHA-1 hash of the parameters, and use it
 as filename.
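
The extension described in the closing paragraph could be sketched roughly as follows. This is only an illustration, not part of the commit: `cache_filename()` and `osem_offline_hashed()` are hypothetical helpers, and the sketch assumes the `digest` package is installed to compute the SHA-1 hash.

```r
library(digest)  # assumption: 'digest' is available; it is not loaded by the vignette

# hypothetical helper: derive a file name from a label and the request parameters
cache_filename = function (name, params, dir = tempdir()) {
  # sort named parameters so that argument order does not change the hash
  if (!is.null(names(params)))
    params = params[order(names(params))]
  hash = digest::digest(list(name, params), algo = 'sha1')
  file.path(dir, paste0(name, '.', hash, '.rds'))
}

# hypothetical variant of osem_offline() that needs no explicit file name
osem_offline_hashed = function (func, name, ..., dir = tempdir()) {
  file = cache_filename(name, list(...), dir)
  # deserialize if a file for these parameters exists
  if (file.exists(file))
    return(readRDS(file))
  # otherwise call the download function and serialize the result
  data = func(...)
  saveRDS(data, file)
  data
}
```

With this sketch, a call such as `osem_offline_hashed(opensensmapr::osem_boxes, 'boxes', exposure = 'mobile')` would download once and read from disk on later runs with the same parameters, while a call with different parameters would get its own file.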