padr
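When getting time series data ready for analysis, you might be confronted by two challenges. The observations may be recorded at too low a level, for instance to the second when the analysis is at the daily level, and there may be no records for the time points at which nothing was observed.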
padr aims to make light work of preparing time series data by offering the two main functions thicken and pad. A small example before we get into detail. Say I want to make a line plot of my daily expenses at the coffee place. The data for a few days might look like this:
library(padr)
coffee
## time_stamp amount
## 1 2016-07-07 09:11:21 3.14
## 2 2016-07-07 09:46:48 2.98
## 3 2016-07-09 13:25:17 4.11
## 4 2016-07-10 10:45:11 3.14
Using padr in combination with dplyr, this plot is made in the following way:
library(ggplot2)
library(dplyr)
coffee %>%
  thicken('day') %>%
  group_by(time_stamp_day) %>%
  summarise(day_amount = sum(amount)) %>%
  pad() %>%
  fill_by_value(day_amount) %>%
  ggplot(aes(time_stamp_day, day_amount)) + geom_line()
Quite a lot is going on here, so let’s go through the functions one by one to see what they do.
thicken adds a column to a data frame that is of a higher interval than that of the original datetime variable. The interval in the padr context is the heartbeat of the data, the recurrence of the observations (see the note on lubridate at the end of this document). The original variable “time_stamp” had the interval second; the added variable was of interval day.
coffee2 <- coffee %>% thicken('day')
coffee2$time_stamp %>% get_interval
## [1] "sec"
coffee2$time_stamp_day %>% get_interval
## [1] "day"
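The interval is about the recurrence of the observations, not about the unit in which they are stored. As a small illustration with made-up vectors (not part of the package data), two Date vectors can have different intervals. I would expect get_interval to report day for the first and month for the second, since observations do not need to be present at every time point for the interval to hold.

```r
library(padr)

# Made-up example: observations on arbitrary days, with gaps.
daily <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-05"))

# Made-up example: stored as dates, but only ever on the first of a month.
monthly <- as.Date(c("2016-01-01", "2016-02-01", "2016-04-01"))

daily_interval   <- get_interval(daily)
monthly_interval <- get_interval(monthly)

daily_interval
monthly_interval
```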
thicken figures out a few things for you. First it finds the datetime variable in your data frame (given there is only one). Next it determines the interval of this variable, which is one of the following: year, quarter, month, week, day, hour, minute, or second. Finally it adds a variable to the data frame that is of a higher interval than the interval of the original datetime variable. You can then use this variable to aggregate to the higher level, for instance using dplyr’s group_by and summarise. By default the interval of the added variable is one level higher than the interval of the original variable; if you want a different level, specify the interval argument.
data.frame(day_var = as.Date(c('2016-08-12', '2016-08-29'))) %>% thicken
## day_var day_var_week
## 1 2016-08-12 2016-08-07
## 2 2016-08-29 2016-08-28
So in the above example the interval of “day_var” was day. thicken then added a variable of one interval higher, which is week. We did not specify the beginning of the week, so thicken fell back on its default behavior, which is weeks starting on Sundays. In many situations the user will be content with thicken’s defaults; however, some flexibility is offered.
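To make the Sunday default concrete, here is a minimal base R sketch of rounding a date down to the preceding Sunday. The helper floor_to_sunday is hypothetical and only mimics the result shown above; it is not how thicken is implemented.

```r
# Hypothetical helper, not part of padr: round a Date down to the preceding Sunday.
floor_to_sunday <- function(d) {
  # as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday,
  # so subtracting it lands on the Sunday that starts the week.
  d - as.POSIXlt(d)$wday
}

day_var <- as.Date(c("2016-08-12", "2016-08-29"))
week_start <- floor_to_sunday(day_var)
format(week_start)
# "2016-08-07" "2016-08-28"
```

The two results match the day_var_week column in the output above.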
We use the emergency data set for further illustration. It contains 120,450 emergency calls in Montgomery County, PA, between 2015-12-10 and 2016-10-17. It has four columns that contain information about the location of the emergency, a title field indicating the type of the emergency, and a time stamp. The data set was created from a Google API; thanks to Mike Chirico for maintaining it.
head(emergency)
## # A tibble: 6 × 6
## lat lng zip title time_stamp
## <dbl> <dbl> <int> <chr> <dttm>
## 1 40.29788 -75.58129 19525 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00
## 2 40.25806 -75.26468 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00
## 3 40.12118 -75.35198 19401 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00
## 4 40.11615 -75.34351 19401 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01
## 5 40.25149 -75.60335 NA EMS: DIZZINESS 2015-12-10 17:40:01
## 6 40.25347 -75.28324 19446 EMS: HEAD INJURY 2015-12-10 17:40:01
## # ... with 1 more variables: twp <chr>
Say we are interested in the number of overdoses that occurred daily. However, we don’t want incidents during the same night to be split over two days, which is what would happen with the default behavior. Instead, we reset the count at 8 am, grouping all nightly cases into the same day. The interval is still day, but each new day starts at 8 am instead of midnight. The start_val argument serves as an offset.
emergency %>%
  filter(title == 'EMS: OVERDOSE') %>%
  thicken('day',
          start_val = as.POSIXct('2015-12-11 08:00:00', tz = 'EST'),
          colname = 'daystart') %>%
  group_by(daystart) %>%
  summarise(nr_od = n()) %>%
  head()
## # A tibble: 6 × 2
## daystart nr_od
## <dttm> <int>
## 1 2015-12-11 08:00:00 2
## 2 2015-12-12 08:00:00 6
## 3 2015-12-13 08:00:00 7
## 4 2015-12-14 08:00:00 8
## 5 2015-12-15 08:00:00 1
## 6 2015-12-16 08:00:00 4
Note also that we specified the column name of the added column. If we don’t, thicken takes the column name of the original datetime variable and appends the interval of the thickened variable, separated by an underscore.
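The effect of an 8 am offset can be sketched in base R: shift every time stamp back by eight hours, round down to the calendar day, and shift forward again. The time stamps below are made up, and UTC is assumed to keep daylight saving time out of the picture; this illustrates the idea only and is not thicken's implementation.

```r
# Made-up time stamps around the 08:00 boundary (UTC assumed for simplicity).
calls <- as.POSIXct(c("2015-12-11 23:15:00",
                      "2015-12-12 02:30:00",
                      "2015-12-12 07:59:59",
                      "2015-12-12 08:00:00"), tz = "UTC")

offset <- 8 * 60 * 60  # the day boundary sits at 08:00, expressed in seconds

# Shift back by the offset and round down to the calendar day ...
shifted_day <- as.Date(calls - offset, tz = "UTC")
# ... then shift forward again so each label is the 08:00 start of its "day".
daystart <- as.POSIXct(format(shifted_day), tz = "UTC") + offset

table(format(daystart, "%Y-%m-%d %H:%M"))
```

The first three calls end up in the day that starts on 2015-12-11 at 08:00, and only the last one opens a new day.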
Two final points on intervals before we move on to pad:
The second workhorse of padr is pad. It does date padding:
account <- data.frame(day     = as.Date(c('2016-10-21', '2016-10-23', '2016-10-26')),
                      balance = c(304.46, 414.76, 378.98))
account %>% pad
## day balance
## 1 2016-10-21 304.46
## 2 2016-10-22 NA
## 3 2016-10-23 414.76
## 4 2016-10-24 NA
## 5 2016-10-25 NA
## 6 2016-10-26 378.98
The account data frame has three observations on different days. Like thicken, the pad function figures out what the datetime variable in the data frame is and then assesses its interval. Next it notices that within the interval, day in this case, rows are lacking between the first and last observation. It inserts a row in the data frame for every time point that is lacking from the data set. All non-datetime columns get missing values at the padded rows.
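For intuition, the same result can be reproduced by hand in base R: build the full sequence of days and left join the observations onto it. This is only a sketch of the idea, assuming a single datetime column of interval day; pad itself detects the column and the interval for you.

```r
account <- data.frame(day     = as.Date(c("2016-10-21", "2016-10-23", "2016-10-26")),
                      balance = c(304.46, 414.76, 378.98))

# Every day between the first and the last observation.
all_days <- data.frame(day = seq(min(account$day), max(account$day), by = "day"))

# A left join keeps every day; days without an observation get NA for balance.
padded <- merge(all_days, account, by = "day", all.x = TRUE)
padded
```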
It is up to the user what to do with the missing records. In the case of an account balance we want to carry the last observation forward, which takes tidyr::fill to arrive at the tidy data set.
account %>% pad %>% tidyr::fill(balance)
## day balance
## 1 2016-10-21 304.46
## 2 2016-10-22 304.46
## 3 2016-10-23 414.76
## 4 2016-10-24 414.76
## 5 2016-10-25 414.76
## 6 2016-10-26 378.98
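If you are curious what carrying the last observation forward amounts to, here is a small base R sketch. The helper carry_forward is hypothetical and is shown only to illustrate the technique; in practice tidyr::fill does the job.

```r
# Hypothetical helper: carry the last non-missing value forward.
carry_forward <- function(x) {
  # Position of the most recent non-missing element (0 if none seen yet).
  last_seen <- cummax(seq_along(x) * !is.na(x))
  # Prepending NA makes index 0 + 1 map to NA, so leading gaps stay missing.
  c(NA, x)[last_seen + 1]
}

balance <- c(304.46, NA, 414.76, NA, NA, 378.98)
filled <- carry_forward(balance)
filled
```

Values before the first observation are left missing, since there is nothing to carry forward yet.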
pad also allows for deviations from its default behavior. By default it pads all observations between the first and the last observation, but you can use start_val and end_val to deviate from this. You can also specify a lower interval than the one of the variable, using pad as the inverse of thicken.
account %>% pad('hour', start_val = as.POSIXct('2016-10-20 22:00:00')) %>% head
## day balance
## 1 2016-10-20 22:00:00 NA
## 2 2016-10-20 23:00:00 NA
## 3 2016-10-21 00:00:00 304.46
## 4 2016-10-21 01:00:00 NA
## 5 2016-10-21 02:00:00 NA
## 6 2016-10-21 03:00:00 NA
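The end_val argument works the same way at the other end of the series. As a small sketch, assuming the account data from before, the padding can be extended two days past the last observation while keeping the day interval:

```r
library(padr)

account <- data.frame(day     = as.Date(c("2016-10-21", "2016-10-23", "2016-10-26")),
                      balance = c(304.46, 414.76, 378.98))

# Keep the day interval, but pad up to and including 2016-10-28.
extended <- pad(account, end_val = as.Date("2016-10-28"))
range(extended$day)
```

The rows added after the last observation have a missing balance, just like the rows padded in between.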
Padding within groups is supported since version 0.2.0, when the group parameter was added to pad. This parameter takes one or several column names that indicate the groups. Take the following example of the emergency data.
padded_groups <- emergency %>%
  thicken('day') %>%
  count(time_stamp_day, title) %>%
  pad(group = 'title')
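A smaller, made-up example may make grouped padding easier to inspect than the full emergency data. The visits data frame below is hypothetical. Note that the exact range over which each group is padded may differ between padr versions, so the sketch only looks at the number of rows per group rather than asserting a specific count.

```r
library(padr)

# Made-up data: daily counts for two shops, each with missing days.
visits <- data.frame(
  day  = as.Date(c("2016-11-01", "2016-11-03", "2016-11-02", "2016-11-05")),
  shop = c("a", "a", "b", "b"),
  n    = c(5, 7, 2, 4)
)

# Pad within each shop; the grouping column is filled in on the padded rows.
padded <- pad(visits, group = "shop")
table(padded$shop)
```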
We already saw tidyr::fill coming in handy for filling missing values after padding. padr comes with three more fill functions: fill_by_value, fill_by_function, and fill_by_prevalent. They fill missing values with, respectively, a single value, a function of the nonmissing values, and the most prevalent value among the nonmissing values.
counts <- data.frame(x = as.Date(c('2016-11-21', '2016-11-23', '2016-11-24')),
                     y = c(2, 4, 4))
counts %>% pad() %>% fill_by_value(y, value = 42)
## x y
## 1 2016-11-21 2
## 2 2016-11-22 42
## 3 2016-11-23 4
## 4 2016-11-24 4
counts %>% pad() %>% fill_by_function(y, fun = mean)
## x y
## 1 2016-11-21 2.000000
## 2 2016-11-22 3.333333
## 3 2016-11-23 4.000000
## 4 2016-11-24 4.000000
counts %>% pad() %>% fill_by_prevalent(y)
## x y
## 1 2016-11-21 2
## 2 2016-11-22 4
## 3 2016-11-23 4
## 4 2016-11-24 4
thicken and pad together make a strong pair, as was already seen in the coffee example. Let’s go back to the emergency data set and see how many cases of dehydration there are each day. Would there be more cases in summer?
dehydration_day <- emergency %>%
  filter(title == 'EMS: DEHYDRATION') %>%
  thicken(interval = 'day') %>%
  group_by(time_stamp_day) %>%
  summarise(nr = n()) %>%
  pad() %>%
  fill_by_value(nr)
ggplot(dehydration_day, aes(time_stamp_day, nr)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
There is a second vignette, called padr_implementation. Here you can find more information about how padr handles daylight saving time, what it does with different time zones, and how exactly thicken is implemented.
Found a bug? Ideas for improving or expanding padr? Your input is much appreciated. The code is maintained at https://github.com/EdwinTh/padr and you are most welcome to file an issue or open a pull request.
Many users who work with date and time variables will be using the lubridate package. The definition of an interval in lubridate is different from the definition in padr. In lubridate an interval is a period between two time points and has nothing to do with recurrence. Please keep this in mind.