---
title: "Exploring the openSenseMap Dataset"
author: "Norwin Roosen"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_margin: 0
    fig_width: 6
    fig_height: 4
vignette: >
  %\VignetteIndexEntry{Exploring the openSenseMap Dataset}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This package provides data ingestion functions for almost any data stored on the
open data platform for environmental sensor data, <https://opensensemap.org>.
Its main goals are to provide means for:

- big data analysis of the measurements stored on the platform
- sensor metadata analysis (sensor counts, spatial distribution, temporal trends)
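In practice this boils down to three entry points, which the rest of this vignette
walks through: `osem_boxes()` for station metadata, `osem_phenomena()` for the
observed phenomena, and `osem_measurements()` for the measurements themselves.
A minimal sketch, not evaluated here (see the calls further down for the exact
arguments):

```{r, eval=FALSE}
library(opensensmapr)

boxes = osem_boxes()              # metadata of all sensor stations ("boxes")
phenomena = osem_phenomena(boxes) # sensor counts per observed phenomenon
pm25 = osem_measurements(boxes, phenomenon = 'PM2.5') # the measurements themselves
```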
### Exploring the dataset

Before we look at actual observations, let's get a grasp of the openSenseMap
dataset's structure.
```{r results = FALSE}
library(magrittr)
library(opensensmapr)
# all_sensors = osem_boxes(cache = '.')
all_sensors = readRDS('boxes_precomputed.rds') # read precomputed file to save resources
```
```{r}
summary(all_sensors)
```
This gives a good overview already: as of writing this, there are more than 700
sensor stations, of which ~50% are currently running. Most of them are placed
outdoors and have around 5 sensors each.
The oldest station is from May 2014, while the latest station was registered a
couple of minutes ago.
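If you prefer to verify such figures yourself, a few of them can be pulled out of
the data frame directly. A small sketch, assuming the columns `exposure`,
`createdAt` and `lastMeasurement` as returned by `osem_boxes()` at the time of
writing:

```{r, eval=FALSE}
# reproduce a few of the summary figures by hand
table(all_sensors$exposure)               # indoor vs. outdoor placement
mean(!is.na(all_sensors$lastMeasurement)) # share of boxes that ever submitted data
range(all_sensors$createdAt)              # oldest and newest registration date
```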
Another feature of interest is the spatial distribution of the boxes: `plot()`
can help us out here. This function requires a bunch of optional dependencies though.
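If you are unsure whether they are installed, a quick check along these lines
helps. The package names below are an assumption; the authoritative list is the
`Suggests` field of the opensensmapr `DESCRIPTION`:

```{r, eval=FALSE}
# check for the optional spatial plotting dependencies before calling plot()
needed = c('sf', 'maps') # assumed names, see the package DESCRIPTION
missing = needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```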
```{r, message=FALSE, warning=FALSE}
plot(all_sensors)
```
It seems we have to reduce our area of interest to Germany.
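For the metadata we could already do that on the client side. A rough sketch,
assuming the boxes data frame exposes `lon` and `lat` columns and using a coarse
bounding box around Germany:

```{r, eval=FALSE}
# keep only the boxes within a coarse bounding box around Germany
in_germany = with(all_sensors, lon > 5.9 & lon < 15.1 & lat > 47.2 & lat < 55.1)
germany_sensors = all_sensors[which(in_germany), ]
summary(germany_sensors)
```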
But what do these sensor stations actually measure? Let's find out.
`osem_phenomena()` gives us a named list of the counts of each observed
phenomenon for the given set of sensor stations:
```{r}
phenoms = osem_phenomena(all_sensors)
str(phenoms)
```
That's quite some noise there, with many phenomena being measured by a single
sensor only, or many duplicated phenomena due to slightly different spellings.
We should clean that up, but for now let's just filter out the noise and find
those phenomena with high sensor numbers:
```{r}
phenoms[phenoms > 20]
```
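If we wanted to merge the near-duplicate spellings instead of just dropping them,
a simple normalisation could look like the following sketch. It is purely
illustrative; the real spellings differ in more ways than case and whitespace:

```{r, eval=FALSE}
# merge counts of phenomena whose names differ only in case or surrounding whitespace
counts = unlist(phenoms)
merged = tapply(counts, tolower(trimws(names(counts))), sum)
sort(merged, decreasing = TRUE)[1:10]
```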
Alright, temperature it is! Fine particulate matter (PM2.5) seems to be more
interesting to analyze though.
We should check how many sensor stations provide useful data: we want only those
boxes with a PM2.5 sensor that are placed outdoors and are currently submitting
measurements:
```{r results = FALSE, eval=FALSE}
pm25_sensors = osem_boxes(
  exposure = 'outdoor',
  date = Sys.time(), # ±4 hours
  phenomenon = 'PM2.5'
)
```
```{r}
pm25_sensors = readRDS('pm25_sensors.rds') # read precomputed file to save resources
summary(pm25_sensors)
plot(pm25_sensors)
```
That's still more than 200 measuring stations; we can work with that.
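The exact count is simply the number of rows of the result:

```{r, eval=FALSE}
nrow(pm25_sensors)
```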
### Analyzing sensor data
Having analyzed the available data sources, let's finally get some measurements.
We could call `osem_measurements(pm25_sensors)` now; however, we are focusing on
a restricted area of interest, the city of Berlin.
Luckily we can get the measurements filtered by a bounding box:
```{r, results=FALSE, message=FALSE}
library(sf)
library(units)
library(lubridate)
library(dplyr)
```
Since the API can take quite a while to respond to measurement queries, especially when filtering on space and time, we do not run the following chunks when publishing the package on CRAN and use precomputed results instead.
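The precomputed `.rds` files used throughout this vignette can be regenerated by
running the `eval = FALSE` chunks once against the live API and serializing their
results, roughly like this:

```{r, eval=FALSE}
# run the live queries once, then cache the results for offline vignette builds
saveRDS(all_sensors, 'boxes_precomputed.rds')
saveRDS(pm25_sensors, 'pm25_sensors.rds')
saveRDS(pm25, 'pm25_berlin.rds')
```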
```{r bbox, results = FALSE, eval=FALSE}
# construct a bounding box: 12 kilometers around Berlin
berlin = st_point(c(13.4034, 52.5120)) %>%
  st_sfc(crs = 4326) %>%
  st_transform(3857) %>%           # allow setting a buffer in meters
  st_buffer(set_units(12, km)) %>%
  st_transform(4326) %>%           # the openSenseMap API expects WGS 84
  st_bbox()

pm25 = osem_measurements(
  berlin,
  phenomenon = 'PM2.5',
  from = now() - days(3), # defaults to 2 days
  to = now()
)
```
```{r}
pm25 = readRDS('pm25_berlin.rds') # read precomputed file to save resources
plot(pm25)
```
Now we can get started with the actual spatiotemporal data analysis.
First, let's mask the seemingly uncalibrated sensors:
```{r, warning=FALSE}
# sensors with any reading above 100 are considered uncalibrated outliers here
outliers = filter(pm25, value > 100)$sensorId
bad_sensors = outliers[, drop = TRUE] %>% levels() # drop unused factor levels, keeping only the offending sensor IDs
pm25 = mutate(pm25, invalid = sensorId %in% bad_sensors)
```
Then plot the measuring locations, flagging the outliers:
```{r}
st_as_sf(pm25) %>% st_geometry() %>% plot(col = factor(pm25$invalid), axes = TRUE)
```
Removing these sensors yields a nicer time series plot:
```{r}
pm25 %>% filter(invalid == FALSE) %>% plot()
```
Further analysis: comparison with LANUV data `TODO`