mirror of
https://github.com/sensebox/opensensmapr
synced 2025-02-22 06:23:57 +01:00
clean up osem-serialization, #22
parent
54b0994671
commit
97768e7cdb
1 changed files with 22 additions and 89 deletions
|
@@ -21,19 +21,24 @@ This avoids..
|
||||||
- stress on the openSenseMap-server.
|
- stress on the openSenseMap-server.
|
||||||
|
|
||||||
This vignette shows how to use this built in `opensensmapr` feature, and
|
This vignette shows how to use this built in `opensensmapr` feature, and
|
||||||
how to do it yourself, if you want to store to other data formats.
|
how to do it yourself, if you want to save to other data formats.
|
||||||
|
|
||||||
## Using openSensMapr Caching Feature
|
```{r setup, results='hide'}
|
||||||
|
# this vignette requires:
|
||||||
|
library(opensensmapr)
|
||||||
|
library(jsonlite)
|
||||||
|
library(readr)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Using the opensensmapr Caching Feature
|
||||||
All data retrieval functions of `opensensmapr` have a built in caching feature,
|
All data retrieval functions of `opensensmapr` have a built in caching feature,
|
||||||
which serializes an API response to disk.
|
which serializes an API response to disk.
|
||||||
Subsequent identical requests will then return the serialized data instead of making
|
Subsequent identical requests will then return the serialized data instead of making
|
||||||
another request.
|
another request.
|
||||||
To do so, each request is given a unique ID based on its parameters.
|
|
||||||
|
|
||||||
To use this feature, just add a path to a directory to the `cache` parameter:
|
To use this feature, just add a path to a directory to the `cache` parameter:
|
||||||
```{r cache}
|
```{r cache}
|
||||||
b = osem_boxes(cache = tempdir())
|
b = osem_boxes(cache = tempdir())
|
||||||
list.files(tempdir(), pattern = 'osemcache\\..*\\.rds')
|
|
||||||
|
|
||||||
# the next identical request will hit the cache only!
|
# the next identical request will hit the cache only!
|
||||||
b = osem_boxes(cache = tempdir())
|
b = osem_boxes(cache = tempdir())
|
||||||
|
@@ -42,8 +47,12 @@ b = osem_boxes(cache = tempdir())
|
||||||
b = osem_boxes()
|
b = osem_boxes()
|
||||||
```
|
```
|
||||||
|
|
||||||
You can maintain multiple caches simultaneously which allows to store only
|
Looking at the cache directory we can see one file for each request, which is identified through a hash of the request URL:
|
||||||
serialized data related to a script in its directory:
|
```{r cachelisting}
|
||||||
|
list.files(tempdir(), pattern = 'osemcache\\..*\\.rds')
|
||||||
|
```
|
||||||
|
|
||||||
|
You can maintain multiple caches simultaneously which allows to only store data related to a script in the same directory:
|
||||||
```{r cache_custom}
|
```{r cache_custom}
|
||||||
cacheDir = getwd() # current working directory
|
cacheDir = getwd() # current working directory
|
||||||
b = osem_boxes(cache = cacheDir)
|
b = osem_boxes(cache = cacheDir)
|
||||||
|
@@ -62,15 +71,9 @@ osem_clear_cache(getwd()) # clears a custom cache
|
||||||
If you want to roll your own serialization method to support custom data formats,
|
If you want to roll your own serialization method to support custom data formats,
|
||||||
here's how:
|
here's how:
|
||||||
|
|
||||||
```{r setup, results='hide'}
|
```{r data, results='hide'}
|
||||||
# this section requires:
|
|
||||||
library(opensensmapr)
|
|
||||||
library(jsonlite)
|
|
||||||
library(readr)
|
|
||||||
|
|
||||||
# first get our example data:
|
# first get our example data:
|
||||||
boxes = osem_boxes(grouptag = 'ifgi')
|
boxes = osem_boxes(grouptag = 'ifgi')
|
||||||
measurements = osem_measurements(boxes, phenomenon = 'PM10')
|
|
||||||
```
|
```
|
||||||
|
|
||||||
If you are paranoid and worry about `.rds` files not being decodable anymore
|
If you are paranoid and worry about `.rds` files not being decodable anymore
|
||||||
|
@@ -78,92 +81,22 @@ in the (distant) future, you could serialize to a plain text format such as JSON
|
||||||
This of course comes at the cost of storage space and performance.
|
This of course comes at the cost of storage space and performance.
|
||||||
```{r serialize_json}
|
```{r serialize_json}
|
||||||
# serializing senseBoxes to JSON, and loading from file again:
|
# serializing senseBoxes to JSON, and loading from file again:
|
||||||
write(jsonlite::serializeJSON(measurements), 'boxes.json')
|
write(jsonlite::serializeJSON(boxes), 'boxes.json')
|
||||||
boxes_from_file = jsonlite::unserializeJSON(readr::read_file('boxes.json'))
|
boxes_from_file = jsonlite::unserializeJSON(readr::read_file('boxes.json'))
|
||||||
|
class(boxes_from_file)
|
||||||
```
|
```
|
||||||
|
|
||||||
Both methods also persist the R object metadata (classes, attributes).
|
This method also persists the R object metadata (classes, attributes).
|
||||||
If you were to use a serialization method that can't persist object metadata, you
|
If you were to use a serialization method that can't persist object metadata, you
|
||||||
could re-apply it with the following functions:
|
could re-apply it with the following functions:
|
||||||
|
|
||||||
```{r serialize_attrs}
|
```{r serialize_attrs}
|
||||||
# note the toJSON call
|
# note the toJSON call instead of serializeJSON
|
||||||
write(jsonlite::toJSON(measurements), 'boxes_bad.json')
|
write(jsonlite::toJSON(boxes), 'boxes_bad.json')
|
||||||
boxes_without_attrs = jsonlite::fromJSON('boxes_bad.json')
|
boxes_without_attrs = jsonlite::fromJSON('boxes_bad.json')
|
||||||
|
class(boxes_without_attrs)
|
||||||
|
|
||||||
boxes_with_attrs = osem_as_sensebox(boxes_without_attrs)
|
boxes_with_attrs = osem_as_sensebox(boxes_without_attrs)
|
||||||
class(boxes_with_attrs)
|
class(boxes_with_attrs)
|
||||||
```
|
```
|
||||||
The same goes for measurements via `osem_as_measurements()`.
|
The same goes for measurements via `osem_as_measurements()`.
|
||||||
|
|
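The difference between `serializeJSON()` and `toJSON()` described in the hunk above can be tried without network access. The following is a minimal sketch on a toy data.frame, not code from the vignette: the class name `toy_sensebox` is made up for illustration and is not part of `opensensmapr`. Only `jsonlite` is required.

```r
library(jsonlite)

# toy stand-in for a sensebox data.frame; 'toy_sensebox' is a hypothetical class
toy = data.frame(id = c('a', 'b'), value = c(1.5, 2.5), stringsAsFactors = FALSE)
class(toy) = c('toy_sensebox', 'data.frame')

# serializeJSON stores values together with attributes (including the class)
roundtrip_full = unserializeJSON(serializeJSON(toy))

# toJSON stores only the plain values, so the custom class is lost
roundtrip_plain = fromJSON(toJSON(toy))

class(roundtrip_full)
class(roundtrip_plain)

# re-apply the lost class by hand, which is the role osem_as_sensebox()
# plays for real senseBox data in the vignette
restored = roundtrip_plain
class(restored) = c('toy_sensebox', class(roundtrip_plain))
class(restored)
```

For real data, the `osem_as_sensebox()` and `osem_as_measurements()` helpers named in the diff should be used instead of setting the class manually, since they may restore more than the class attribute.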
||||||
## Workflow for reproducible code
|
|
||||||
For truly reproducible code you want it to work and return the same results --
|
|
||||||
no matter if you run it the first time or a consecutive time, and without making
|
|
||||||
changes to it.
|
|
||||||
|
|
||||||
Therefore we need a wrapper around the save-to-file & load-from-file logic.
|
|
||||||
The following examples show a way to do just that, and where inspired by
|
|
||||||
[this reproducible analysis by Daniel Nuest](https://github.com/nuest/sensebox-binder).
|
|
||||||
|
|
||||||
```{r osem_offline}
|
|
||||||
# offline logic
|
|
||||||
osem_offline = function (func, file, format='rds', ...) {
|
|
||||||
# deserialize if file exists, otherwise download and serialize
|
|
||||||
if (file.exists(file)) {
|
|
||||||
if (format == 'json')
|
|
||||||
jsonlite::unserializeJSON(readr::read_file(file))
|
|
||||||
else
|
|
||||||
readRDS(file)
|
|
||||||
} else {
|
|
||||||
data = func(...)
|
|
||||||
if (format == 'json')
|
|
||||||
write(jsonlite::serializeJSON(data), file = file)
|
|
||||||
else
|
|
||||||
saveRDS(data, file)
|
|
||||||
data
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
# wrappers for each download function
|
|
||||||
osem_measurements_offline = function (file, ...) {
|
|
||||||
osem_offline(opensensmapr::osem_measurements, file, ...)
|
|
||||||
}
|
|
||||||
osem_boxes_offline = function (file, ...) {
|
|
||||||
osem_offline(opensensmapr::osem_boxes, file, ...)
|
|
||||||
}
|
|
||||||
osem_box_offline = function (file, ...) {
|
|
||||||
osem_offline(opensensmapr::osem_box, file, ...)
|
|
||||||
}
|
|
||||||
osem_counts_offline = function (file, ...) {
|
|
||||||
osem_offline(opensensmapr::osem_counts, file, ...)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Thats it! Now let's try it out:
|
|
||||||
|
|
||||||
```{r test}
|
|
||||||
# first run; will download and save to disk
|
|
||||||
b1 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
|
|
||||||
|
|
||||||
# consecutive runs; will read from disk
|
|
||||||
b2 = osem_boxes_offline('mobileboxes.rds', exposure='mobile')
|
|
||||||
class(b1) == class(b2)
|
|
||||||
|
|
||||||
# we can even omit the arguments now (though thats not really the point here)
|
|
||||||
b3 = osem_boxes_offline('mobileboxes.rds')
|
|
||||||
nrow(b1) == nrow(b3)
|
|
||||||
|
|
||||||
# verify that the custom sensebox methods are still working
|
|
||||||
summary(b2)
|
|
||||||
plot(b3)
|
|
||||||
```
|
|
||||||
|
|
||||||
To re-download the data, just clear the files that were created in the process:
|
|
||||||
```{r cleanup, results='hide'}
|
|
||||||
file.remove('mobileboxes.rds', 'boxes_bad.json', 'boxes.json', 'measurements.rds')
|
|
||||||
```
|
|
||||||
|
|
||||||
A possible extension to this scheme comes to mind: Omit the specification of a
|
|
||||||
filename, and assign a unique ID to the request instead.
|
|
||||||
For example, one could calculate the SHA-1 hash of the parameters, and use it
|
|
||||||
as filename.
|
|
||||||
|
|
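The extension suggested in the removed text (deriving the filename from a hash of the request parameters instead of passing it explicitly) could be sketched roughly as follows. This is a sketch under assumptions, not the caching implementation of `opensensmapr`: `cached_call()` and `fake_download()` are hypothetical names, and the `digest` package is an extra dependency that the vignette itself does not load.

```r
library(digest)

# hypothetical wrapper: cache the result of func(...) in a file whose name
# is the SHA-1 hash of the arguments passed via `...`
cached_call = function (func, ..., dir = tempdir()) {
  key = digest::digest(list(...), algo = 'sha1')
  file = file.path(dir, paste0('request-', key, '.rds'))
  # deserialize if the file exists, otherwise call func and serialize
  if (file.exists(file)) return(readRDS(file))
  data = func(...)
  saveRDS(data, file)
  data
}

# stand-in for a download function; counts how often it is really called
calls = 0
fake_download = function (n) {
  calls <<- calls + 1
  seq_len(n)
}

cache_dir = tempfile('cache-')
dir.create(cache_dir)

first = cached_call(fake_download, n = 3, dir = cache_dir)  # calls fake_download
second = cached_call(fake_download, n = 3, dir = cache_dir) # reads from disk
identical(first, second)
calls
```

Note that the key covers only the arguments, not the function, so two different functions called with identical arguments would share a cache file. A more careful version would probably include a function name in the hashed value.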