The goal of this vignette is to illustrate how event data can be used for descriptive analysis in R. The data from the first municipality of the BPI Challenge 2015 will be used throughout this vignette. It is made available by the package under the name BPIC15_1 and is already preprocessed to an object of the class eventlog. For more information on the preprocessing of event data, see the corresponding vignette.
library(edeaR)
data("BPIC15_1")
The most high-level way to describe an eventlog is to use the generic R function summary.
summary(BPIC15_1)
## Number of events: 52217
## Number of cases: 1199
## Number of traces: 1099
## Number of activities: 398
## Average trace length: 43.55046
##
## Start eventlog: 2010-10-04 22:00:00
## End eventlog: 2015-07-31 22:00:00
## case_concept.name event_question event_dateFinished
## Length:52217 Length:52217 Length:52217
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## event_dueDate event_action_code event_activityNameEN
## Length:52217 Length:52217 Length:52217
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## event_planned event_time.timestamp event_monitoringResource
## Length:52217 Min. :2010-10-04 22:00:00 Length:52217
## Class :character 1st Qu.:2011-11-07 09:32:10 Class :character
## Mode :character Median :2012-11-19 08:25:49 Mode :character
## Mean :2012-12-12 19:44:44
## 3rd Qu.:2014-01-15 23:00:00
## Max. :2015-07-31 22:00:00
## event_org.resource event_activityNameNL event_concept.name
## Length:52217 Length:52217 Length:52217
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## event_lifecycle.transition event_dateStop activity_instance
## Length:52217 Length:52217 Min. : 1
## Class :character Class :character 1st Qu.:13055
## Mode :character Mode :character Median :26109
## Mean :26109
## 3rd Qu.:39163
## Max. :52217
As can be observed above, the summary contains the number of events, activities, traces and cases, as well as the time span covered by the event log.
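The average trace length in the summary is consistent with the number of events divided by the number of cases. The short check below uses only the figures printed above. It is an observation about the output, not a statement about how summary computes the value internally, and the variable names are illustrative.

```r
# Figures copied from the summary output above
n_events <- 52217
n_cases <- 1199

# Events per case, rounded to the precision shown in the summary
average_trace_length <- n_events / n_cases
round(average_trace_length, 5)
```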
The cases function returns a data.frame which contains general descriptives about each individual case.
case_information <- cases(BPIC15_1)
case_information
## # A tibble: 1,199 × 10
## case_concept.name trace_length number_of_activities start_timestamp
## <chr> <int> <int> <dttm>
## 1 10009138 45 45 2014-04-10 22:00:00
## 2 10051383 57 56 2014-04-16 22:00:00
## 3 10053042 57 56 2014-04-13 22:00:00
## 4 10083315 58 57 2014-04-16 22:00:00
## 5 10093171 46 46 2014-04-21 22:00:00
## 6 10128431 56 55 2014-04-24 22:00:00
## 7 10153084 58 57 2014-04-28 22:00:00
## 8 10154600 47 47 2014-04-29 22:00:00
## 9 10186016 71 70 2014-05-01 22:00:00
## 10 10186644 55 54 2014-04-30 22:00:00
## # ... with 1,189 more rows, and 6 more variables:
## # complete_timestamp <dttm>, trace <chr>, trace_id <dbl>,
## # duration_in_days <dbl>, first_activity <fctr>, last_activity <fctr>
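For each case, the following values are reported: trace_length, number_of_activities, start_timestamp, complete_timestamp, trace, trace_id, duration_in_days, first_activity and last_activity.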
The resulting data.frame has little value on its own, as there might be hundreds of cases. However, it can be further summarized and visualized. Below, the most common start and end activities of a case are shown. While almost all cases start with 01_HOOFD_010, there is much more variance in the last activity.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(select(case_information, first_activity, last_activity))
## first_activity last_activity
## 01_HOOFD_010 :1182 01_HOOFD_530 :302
## 11_AH_II_040b : 7 01_HOOFD_510_2a:106
## 01_HOOFD_030_2: 2 01_HOOFD_820 : 95
## 01_HOOFD_065_2: 2 01_HOOFD_510_2 : 92
## 01_HOOFD_011 : 1 01_HOOFD_516 : 82
## 01_HOOFD_080 : 1 01_HOOFD_510_4 : 48
## (Other) : 4 (Other) :474
Using the package ggplot2, we can also visualize this information. The next code visualizes the distribution of throughput time, i.e. the duration of a case.
library(ggplot2)
ggplot(case_information) +
geom_histogram(aes(duration_in_days), binwidth = 30, fill = "#0072B2") +
scale_x_continuous(limits = c(0,500)) +
xlab("Duration (in days)") +
ylab("Number of cases")
## Warning: Removed 23 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
The activities function shows the frequencies of the different activities.
activity_information <- activities(BPIC15_1)
activity_information
## # A tibble: 398 × 3
## event_concept.name absolute_frequency relative_frequency
## <chr> <int> <dbl>
## 1 01_HOOFD_010 1199 0.02296187
## 2 01_HOOFD_015 1199 0.02296187
## 3 01_HOOFD_020 1194 0.02286612
## 4 01_HOOFD_180 1116 0.02137235
## 5 01_HOOFD_030_1 1111 0.02127660
## 6 01_HOOFD_030_2 1053 0.02016585
## 7 01_HOOFD_200 974 0.01865293
## 8 01_HOOFD_375 933 0.01786774
## 9 09_AH_I_010 933 0.01786774
## 10 01_HOOFD_380 930 0.01781029
## # ... with 388 more rows
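The relative_frequency column is consistent with the absolute frequency divided by the total number of events in the log. The sketch below reproduces the first rows of the table from the printed figures. The interpretation is inferred from the output rather than taken from the package documentation.

```r
# Absolute frequencies of 01_HOOFD_010, 01_HOOFD_020 and 01_HOOFD_180 (from the table above)
absolute_frequency <- c(1199, 1194, 1116)

# Total number of events, taken from the summary of the event log
n_events <- 52217

# Share of all events accounted for by each activity
relative_frequency <- absolute_frequency / n_events
round(relative_frequency, 8)
```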
The following graph shows a cumulative distribution function for the absolute frequency of activities. It shows that about 75% of the activities occur fewer than 100 times.
ggplot(activity_information) +
stat_ecdf(aes(absolute_frequency), lwd = 1, col = "#0072B2") +
scale_x_continuous(breaks = seq(0, 1000, by = 100)) +
xlab("Absolute activity frequencies") +
ylab("Cumulative percentage")
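A share such as the one read from the graph can also be computed directly. The sketch below uses a small made-up vector in place of activity_information$absolute_frequency, so the resulting number is illustrative only. Applying the same two lines to the real column would give the share for this event log.

```r
# Hypothetical frequencies standing in for activity_information$absolute_frequency
absolute_frequency <- c(1199, 933, 410, 95, 60, 42, 17, 8)

# Share of activities that occur fewer than 100 times
share_below_100 <- mean(absolute_frequency < 100)
share_below_100

# The same value via the empirical cumulative distribution function
frequency_cdf <- ecdf(absolute_frequency)
frequency_cdf(99)
```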
Next to the more general descriptives seen so far, a series of specific descriptive metrics has been defined. Three different analysis levels are distinguished: log, trace and activity. The metrics look at aspects of time as well as the structuredness of the eventlog. Some of the metrics will be illustrated below.
The next piece of code computes the number of selfloops at the level of activities.
activity_selfloops <- number_of_selfloops(BPIC15_1, level_of_analysis = "activity")
activity_selfloops
## event_concept.name absolute relative
## 1 01_HOOFD_205 86 0.565789474
## 2 01_HOOFD_100 31 0.086834734
## 3 01_HOOFD_190_2 9 0.068181818
## 4 08_AWB45_005 5 0.006684492
## 5 01_HOOFD_065_2 2 0.003067485
## 6 01_HOOFD_110 1 0.001858736
## 7 01_HOOFD_120 1 0.001972387
## 8 01_HOOFD_180 1 0.000896861
## 9 01_HOOFD_200 1 0.001027749
## 10 01_HOOFD_510_2 1 0.001108647
## 11 01_HOOFD_790 1 0.015873016
## 12 02_DRZ_030_2 1 0.200000000
## 13 10_UOV_065 1 0.076923077
The output shows that 13 activities sometimes occur in a selfloop. The activity 01_HOOFD_205 shows the most selfloops, namely 86.
Visualized:
ggplot(activity_selfloops) +
geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Activity") +
ylab("Number of selfloops")
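To make the notion of a selfloop concrete, the sketch below marks them in a single made-up trace using base R. It assumes a selfloop is an activity that is directly followed by itself within a case. How number_of_selfloops counts them exactly (per run or per repeated event) is not specified here, so this illustrates the concept rather than reimplementing the function.

```r
# Hypothetical trace of one case, in order of occurrence
trace <- c("A", "B", "B", "B", "C", "A", "C", "C")

# Run-length encoding groups directly consecutive occurrences of the same activity
runs <- rle(trace)

# Runs longer than one event are selfloops
selfloop_activities <- runs$values[runs$lengths > 1]
selfloop_activities
```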
Complementary to selfloops are repetitions: activities which are repeated in a case, but not directly following each other.
activity_repetitions <- repetitions(BPIC15_1, level_of_analysis = "activity")
activity_repetitions
## event_concept.name relative_frequency absolute relative
## 1 01_HOOFD_180 0.021372350 78 0.0650542118
## 2 01_HOOFD_200 0.018652929 37 0.0308590492
## 3 01_HOOFD_510_2 0.017293219 3 0.0025020851
## 4 08_AWB45_005 0.014420591 143 0.1192660550
## 5 01_HOOFD_065_2 0.012524657 1 0.0008340284
## 6 01_HOOFD_110 0.010322309 71 0.0592160133
## 7 01_HOOFD_120 0.009728632 67 0.0558798999
## 8 01_HOOFD_100 0.007430530 156 0.1301084237
## 9 01_HOOFD_205 0.004557903 3 0.0025020851
## 10 01_HOOFD_190_2 0.002700270 10 0.0083402836
## 11 01_HOOFD_790 0.001225654 12 0.0100083403
Visualized:
ggplot(activity_repetitions) +
geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Activity") +
ylab("Number of repetitions")
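The difference from selfloops can be illustrated on a made-up trace. The sketch below assumes a repetition is an activity that returns later in the same case after at least one other activity has occurred in between. It is a conceptual illustration in base R, not the implementation used by repetitions.

```r
# Hypothetical trace of one case, in order of occurrence
trace <- c("A", "B", "B", "B", "C", "A", "C", "C")

# Collapse directly consecutive occurrences, so selfloops count as a single run
runs <- rle(trace)

# Activities with more than one run come back after something else happened
run_counts <- table(runs$values)
repeated_activities <- names(which(run_counts > 1))
repeated_activities
```

In this trace, B only occurs in a selfloop, while A and C are repeated.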
Using some data manipulation in R, we can plot both descriptives together, to easily see whether repetitions and selfloops occur often for the same activities.
data <- bind_rows(mutate(activity_selfloops, type = "selfloops"),
mutate(select(activity_repetitions, event_concept.name, absolute), type = "repetitions"))
ggplot(data) +
geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") +
facet_grid(type ~ .) +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Activity") +
ylab("Number of selfloops and repetitions")
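As an alternative to a faceted plot, the two results can be joined into one table with a column per metric. The sketch below uses base R merge on a few rows copied from the outputs above. The object names selfloops, repetitions and comparison are illustrative. In practice the data frames activity_selfloops and activity_repetitions would take their place.

```r
# A few rows copied from the selfloop and repetition outputs above
selfloops <- data.frame(
  event_concept.name = c("01_HOOFD_205", "01_HOOFD_100", "02_DRZ_030_2"),
  absolute = c(86, 31, 1),
  stringsAsFactors = FALSE
)
repetitions <- data.frame(
  event_concept.name = c("01_HOOFD_100", "01_HOOFD_180", "01_HOOFD_205"),
  absolute = c(156, 78, 3),
  stringsAsFactors = FALSE
)

# Full outer join: activities missing from one of the tables get NA for that metric
comparison <- merge(
  selfloops, repetitions,
  by = "event_concept.name", all = TRUE,
  suffixes = c("_selfloops", "_repetitions")
)
comparison
```

An NA in one of the columns means the activity does not appear in that table. For example, 01_HOOFD_180 has repetitions but is not among the selected selfloop rows.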
Other available descriptives and the supported analysis levels are listed below: