---
title: "Exploring the openSenseMap Dataset"
author: "Norwin Roosen"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_margin: 0
    fig_width: 6
    fig_height: 4
vignette: >
  %\VignetteIndexEntry{Exploring the openSenseMap Dataset}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This package provides data ingestion functions for almost any data stored on the
open data platform for environmental sensor data, <https://opensensemap.org>.
Its main goals are to provide means for:

- big data analysis of the measurements stored on the platform
- sensor metadata analysis (sensor counts, spatial distribution, temporal trends)
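In practice this boils down to three entry points, which the rest of this vignette
walks through: `osem_boxes()` for station metadata, `osem_phenomena()` for the
observed phenomena, and `osem_measurements()` for the measurements themselves.
A minimal sketch, not evaluated here (see the calls further down for the exact
arguments):

```{r, eval=FALSE}
library(opensensmapr)

boxes = osem_boxes()              # metadata of all sensor stations ("boxes")
phenomena = osem_phenomena(boxes) # sensor counts per observed phenomenon
pm25 = osem_measurements(boxes, phenomenon = 'PM2.5') # the measurements themselves
```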
### Exploring the dataset

Before we look at actual observations, let's get a grasp of the openSenseMap
dataset's structure.
```{r results = FALSE}
library(magrittr)
library(opensensmapr)
# all_sensors = osem_boxes(cache = '.')
all_sensors = readRDS('boxes_precomputed.rds') # read precomputed file to save resources
```
```{r}
summary(all_sensors)
```
This gives a good overview already: as of writing this, there are more than 700
sensor stations, of which ~50% are currently running. Most of them are placed
outdoors and have around 5 sensors each.
The oldest station is from May 2014, while the latest station was registered a
couple of minutes ago.
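If you prefer to verify such figures yourself, a few of them can be pulled out of
the data frame directly. A small sketch, assuming the columns `exposure`,
`createdAt` and `lastMeasurement` as returned by `osem_boxes()` at the time of
writing:

```{r, eval=FALSE}
# reproduce a few of the summary figures by hand
table(all_sensors$exposure)               # indoor vs. outdoor placement
mean(!is.na(all_sensors$lastMeasurement)) # share of boxes that ever submitted data
range(all_sensors$createdAt)              # oldest and newest registration date
```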
Another feature of interest is the spatial distribution of the boxes: `plot()`
can help us out here. This function requires a bunch of optional dependencies though.
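If you are unsure whether they are installed, a quick check along these lines
helps. The package names below are an assumption; the authoritative list is the
`Suggests` field of the opensensmapr `DESCRIPTION`:

```{r, eval=FALSE}
# check for the optional spatial plotting dependencies before calling plot()
needed = c('sf', 'maps') # assumed names, see the package DESCRIPTION
missing = needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```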
```{r, message=FALSE, warning=FALSE}
plot(all_sensors)
```
It seems we have to reduce our area of interest to Germany.
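For the metadata we could already do that on the client side. A rough sketch,
assuming the boxes data frame exposes `lon` and `lat` columns and using a coarse
bounding box around Germany:

```{r, eval=FALSE}
# keep only the boxes within a coarse bounding box around Germany
in_germany = with(all_sensors, lon > 5.9 & lon < 15.1 & lat > 47.2 & lat < 55.1)
germany_sensors = all_sensors[which(in_germany), ]
summary(germany_sensors)
```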
But what do these sensor stations actually measure? Let's find out.
`osem_phenomena()` gives us a named list of the counts of each observed
phenomenon for the given set of sensor stations:
```{r}
phenoms = osem_phenomena(all_sensors)
str(phenoms)
```
That's quite some noise there, with many phenomena being measured by a single
sensor only, or many duplicated phenomena due to slightly different spellings.
We should clean that up, but for now let's just filter out the noise and find
those phenomena with high sensor numbers:
```{r}
phenoms[phenoms > 20]
```
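If we wanted to merge the near-duplicate spellings instead of just dropping them,
a simple normalisation could look like the following sketch. It is purely
illustrative; the real spellings differ in more ways than case and whitespace:

```{r, eval=FALSE}
# merge counts of phenomena whose names differ only in case or surrounding whitespace
counts = unlist(phenoms)
merged = tapply(counts, tolower(trimws(names(counts))), sum)
sort(merged, decreasing = TRUE)[1:10]
```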
Alright, temperature it is! Fine particulate matter (PM2.5) seems to be more
interesting to analyze though.
We should check how many sensor stations provide useful data: we want only those
boxes with a PM2.5 sensor that are placed outdoors and are currently submitting
measurements:
```{r results = FALSE, eval=FALSE}
pm25_sensors = osem_boxes(
  exposure = 'outdoor',
  date = Sys.time(), # ±4 hours
  phenomenon = 'PM2.5'
)
```
```{r}
pm25_sensors = readRDS('pm25_sensors.rds') # read precomputed file to save resources
summary(pm25_sensors)
plot(pm25_sensors)
```
That's still more than 200 measuring stations; we can work with that.
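The exact count is simply the number of rows of the result:

```{r, eval=FALSE}
nrow(pm25_sensors)
```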
### Analyzing sensor data
Having analyzed the available data sources, let's finally get some measurements.
We could call `osem_measurements(pm25_sensors)` now; however, we are focusing on
a restricted area of interest, the city of Berlin.
Luckily we can get the measurements filtered by a bounding box:
```{r, results=FALSE, message=FALSE}
library(sf)
library(units)
library(lubridate)
library(dplyr)
```
Since the API can take quite a while to respond to measurement queries, especially when filtering on space and time, we do not run the following chunks when publishing the package on CRAN and use precomputed results instead.
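The precomputed `.rds` files used throughout this vignette can be regenerated by
running the `eval = FALSE` chunks once against the live API and serializing their
results, roughly like this:

```{r, eval=FALSE}
# run the live queries once, then cache the results for offline vignette builds
saveRDS(all_sensors, 'boxes_precomputed.rds')
saveRDS(pm25_sensors, 'pm25_sensors.rds')
saveRDS(pm25, 'pm25_berlin.rds')
```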
```{r bbox, results = FALSE, eval=FALSE}
# construct a bounding box: 12 kilometers around Berlin
berlin = st_point(c(13.4034, 52.5120)) %>%
  st_sfc(crs = 4326) %>%
  st_transform(3857) %>%           # allow setting a buffer in meters
  st_buffer(set_units(12, km)) %>%
  st_transform(4326) %>%           # the openSenseMap API expects WGS 84
  st_bbox()

pm25 = osem_measurements(
  berlin,
  phenomenon = 'PM2.5',
  from = now() - days(3), # defaults to 2 days
  to = now()
)
```
```{r}
pm25 = readRDS('pm25_berlin.rds') # read precomputed file to save resources
plot(pm25)
```
Now we can get started with the actual spatiotemporal data analysis.
First, let's mask the seemingly uncalibrated sensors:
```{r, warning=FALSE}
# sensors with any reading above 100 are considered uncalibrated outliers here
outliers = filter(pm25, value > 100)$sensorId
bad_sensors = outliers[, drop = TRUE] %>% levels() # drop unused factor levels, keeping only the offending sensor IDs
pm25 = mutate(pm25, invalid = sensorId %in% bad_sensors)
```
Then plot the measuring locations, flagging the outliers:
```{r}
st_as_sf(pm25) %>% st_geometry() %>% plot(col = factor(pm25$invalid), axes = TRUE)
```
Removing these sensors yields a nicer time series plot:
```{r}
pm25 %>% filter(invalid == FALSE) %>% plot()
```
Further analysis: comparison with LANUV data `TODO`