Event data descriptives

Gert Janssenswillen

2/12/2015

The goal of this vignette is to illustrate how event data can be used for descriptive analysis in R. The data from the first municipality of the BPI Challenge 2015 is used throughout. It is made available by the package under the name BPIC15_1 and has already been preprocessed into an object of class eventlog. For more information on the preprocessing of event data, see the corresponding vignette.

library(edeaR)
data("BPIC15_1")

Event log summary

The most high-level way to describe an eventlog is to use the generic R function summary.

summary(BPIC15_1)
## Number of events:  52217
## Number of cases:  1199
## Number of traces:  1099
## Number of activities:  398
## Average trace length:  43.55046
## 
## Start eventlog:  2010-10-04 22:00:00
## End eventlog:  2015-07-31 22:00:00
##  case_concept.name  event_question     event_dateFinished
##  Length:52217       Length:52217       Length:52217      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  event_dueDate      event_action_code  event_activityNameEN
##  Length:52217       Length:52217       Length:52217        
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
##  event_planned      event_time.timestamp          event_monitoringResource
##  Length:52217       Min.   :2010-10-04 22:00:00   Length:52217            
##  Class :character   1st Qu.:2011-11-07 09:32:10   Class :character        
##  Mode  :character   Median :2012-11-19 08:25:49   Mode  :character        
##                     Mean   :2012-12-12 19:44:44                           
##                     3rd Qu.:2014-01-15 23:00:00                           
##                     Max.   :2015-07-31 22:00:00                           
##  event_org.resource event_activityNameNL event_concept.name
##  Length:52217       Length:52217         Length:52217      
##  Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character     Mode  :character  
##                                                            
##                                                            
##                                                            
##  event_lifecycle.transition event_dateStop     activity_instance
##  Length:52217               Length:52217       Min.   :    1    
##  Class :character           Class :character   1st Qu.:13055    
##  Mode  :character           Mode  :character   Median :26109    
##                                                Mean   :26109    
##                                                3rd Qu.:39163    
##                                                Max.   :52217

As can be observed above, the summary contains the number of events, activities, traces and cases, as well as the time span covered by the event log.

Cases

The cases function returns a data.frame which contains general descriptives about each individual case.

case_information <- cases(BPIC15_1)
case_information
## # A tibble: 1,199 × 10
##    case_concept.name trace_length number_of_activities     start_timestamp
##                <chr>        <int>                <int>              <dttm>
## 1           10009138           45                   45 2014-04-10 22:00:00
## 2           10051383           57                   56 2014-04-16 22:00:00
## 3           10053042           57                   56 2014-04-13 22:00:00
## 4           10083315           58                   57 2014-04-16 22:00:00
## 5           10093171           46                   46 2014-04-21 22:00:00
## 6           10128431           56                   55 2014-04-24 22:00:00
## 7           10153084           58                   57 2014-04-28 22:00:00
## 8           10154600           47                   47 2014-04-29 22:00:00
## 9           10186016           71                   70 2014-05-01 22:00:00
## 10          10186644           55                   54 2014-04-30 22:00:00
## # ... with 1,189 more rows, and 6 more variables:
## #   complete_timestamp <dttm>, trace <chr>, trace_id <dbl>,
## #   duration_in_days <dbl>, first_activity <fctr>, last_activity <fctr>

For each case, the following values are reported:

  1. Trace length
  2. Number of activities
  3. Start timestamp
  4. Complete timestamp
  5. Trace
  6. Trace ID
  7. Duration (days)
  8. First activity
  9. Last activity

The resulting data.frame has little value on its own, as there might be hundreds of cases. However, it can be further summarized and visualized. Below, the most common first and last activities of a case are shown. While almost all cases start with 01_HOOFD_010, there is much more variation in the last activity.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
summary(select(case_information, first_activity, last_activity))
##         first_activity         last_activity
##  01_HOOFD_010  :1182   01_HOOFD_530   :302  
##  11_AH_II_040b :   7   01_HOOFD_510_2a:106  
##  01_HOOFD_030_2:   2   01_HOOFD_820   : 95  
##  01_HOOFD_065_2:   2   01_HOOFD_510_2 : 92  
##  01_HOOFD_011  :   1   01_HOOFD_516   : 82  
##  01_HOOFD_080  :   1   01_HOOFD_510_4 : 48  
##  (Other)       :   4   (Other)        :474
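
The same kind of frequency table can also be built explicitly with dplyr, which makes it easy to add shares or to sort. The sketch below is only an illustration: toy_cases is a hypothetical stand-in for the output of cases(), with made-up values and only the columns needed here.

```r
library(dplyr)

# Hypothetical stand-in for the output of cases(); the values are made up.
toy_cases <- data.frame(
    case_id        = c("c1", "c2", "c3", "c4", "c5"),
    first_activity = c("A", "A", "A", "B", "A"),
    last_activity  = c("X", "Y", "X", "Z", "X"),
    stringsAsFactors = TRUE
)

# Frequency table of last activities, most common first,
# with the share of cases each one accounts for.
last_counts <- toy_cases %>%
    count(last_activity, sort = TRUE) %>%
    mutate(share = n / sum(n))

last_counts
```

Replacing toy_cases with case_information should give the full table rather than the top six levels that summary prints.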

Using the package ggplot2, we can also visualize this information. The next code visualizes the distribution of throughput time, i.e. the duration of a case.

library(ggplot2)
ggplot(case_information) + 
    geom_histogram(aes(duration_in_days), binwidth = 30, fill = "#0072B2") + 
    scale_x_continuous(limits = c(0,500)) +
    xlab("Duration (in days)") + 
    ylab("Number of cases") 
## Warning: Removed 23 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Activities

The activities function shows the frequencies of the different activities.

activity_information <- activities(BPIC15_1)
activity_information
## # A tibble: 398 × 3
##    event_concept.name absolute_frequency relative_frequency
##                 <chr>              <int>              <dbl>
## 1        01_HOOFD_010               1199         0.02296187
## 2        01_HOOFD_015               1199         0.02296187
## 3        01_HOOFD_020               1194         0.02286612
## 4        01_HOOFD_180               1116         0.02137235
## 5      01_HOOFD_030_1               1111         0.02127660
## 6      01_HOOFD_030_2               1053         0.02016585
## 7        01_HOOFD_200                974         0.01865293
## 8        01_HOOFD_375                933         0.01786774
## 9         09_AH_I_010                933         0.01786774
## 10       01_HOOFD_380                930         0.01781029
## # ... with 388 more rows

The following graph shows a cumulative distribution function for the absolute frequency of activities. It shows that about 75% of the activities occur fewer than 100 times.

ggplot(activity_information) +
    stat_ecdf(aes(absolute_frequency), lwd = 1, col = "#0072B2") + 
    scale_x_continuous(breaks = seq(0, 1000, by = 100)) + 
    xlab("Absolute activity frequencies") +
    ylab("Cumulative percentage")
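
A share like the one read from the graph can also be computed directly. The following is a minimal sketch using base R only; the vector absolute_frequency holds made-up numbers standing in for activity_information$absolute_frequency, so the result is not the figure for BPIC15_1.

```r
# Hypothetical absolute frequencies standing in for
# activity_information$absolute_frequency; the values are made up.
absolute_frequency <- c(1199, 933, 420, 95, 80, 40, 12, 3)

# ecdf() returns a function giving the share of values <= x.
# Frequencies are integers, so evaluating it at 99 gives the share
# of activities that occur fewer than 100 times.
freq_ecdf <- ecdf(absolute_frequency)
share_below_100 <- freq_ecdf(99)

# The same quantity computed without the ECDF.
share_direct <- mean(absolute_frequency < 100)

share_below_100
```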

Predefined descriptive metrics

Next to the more general descriptives seen so far, a series of specific descriptive metrics has been defined. Three different analysis levels are distinguished: log, trace and activity. The metrics look at aspects of time as well as the structuredness of the eventlog. Some of the metrics are illustrated below.

Selfloops

The next piece of code computes the number of selfloops at the level of activities.

activity_selfloops <- number_of_selfloops(BPIC15_1, level_of_analysis = "activity")
activity_selfloops
##    event_concept.name absolute    relative
## 1        01_HOOFD_205       86 0.565789474
## 2        01_HOOFD_100       31 0.086834734
## 3      01_HOOFD_190_2        9 0.068181818
## 4        08_AWB45_005        5 0.006684492
## 5      01_HOOFD_065_2        2 0.003067485
## 6        01_HOOFD_110        1 0.001858736
## 7        01_HOOFD_120        1 0.001972387
## 8        01_HOOFD_180        1 0.000896861
## 9        01_HOOFD_200        1 0.001027749
## 10     01_HOOFD_510_2        1 0.001108647
## 11       01_HOOFD_790        1 0.015873016
## 12       02_DRZ_030_2        1 0.200000000
## 13         10_UOV_065        1 0.076923077

The output shows that 13 activities sometimes occur in a selfloop. The activity 01_HOOFD_205 shows the most selfloops, i.e. 86.

Visualized:

ggplot(activity_selfloops) + 
    geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") + 
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Activity") + 
    ylab("Number of selfloops")

Repetitions

Complementary to selfloops are repetitions: activities which are repeated in a case, but not directly following each other.

activity_repetitions <- repetitions(BPIC15_1, level_of_analysis = "activity")
activity_repetitions
##    event_concept.name relative_frequency absolute     relative
## 1        01_HOOFD_180        0.021372350       78 0.0650542118
## 2        01_HOOFD_200        0.018652929       37 0.0308590492
## 3      01_HOOFD_510_2        0.017293219        3 0.0025020851
## 4        08_AWB45_005        0.014420591      143 0.1192660550
## 5      01_HOOFD_065_2        0.012524657        1 0.0008340284
## 6        01_HOOFD_110        0.010322309       71 0.0592160133
## 7        01_HOOFD_120        0.009728632       67 0.0558798999
## 8        01_HOOFD_100        0.007430530      156 0.1301084237
## 9        01_HOOFD_205        0.004557903        3 0.0025020851
## 10     01_HOOFD_190_2        0.002700270       10 0.0083402836
## 11       01_HOOFD_790        0.001225654       12 0.0100083403

Visualized:

ggplot(activity_repetitions) + 
    geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") + 
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Activity") + 
    ylab("Number of repetitions")
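
To make the distinction between selfloops and repetitions concrete, the sketch below classifies the activities of a single made-up trace using base R. It only illustrates the two concepts as described in this vignette; it is not the implementation used by the package, and the package may count occurrences differently (for instance, how a run of three identical activities is counted).

```r
# One hypothetical trace: the ordered activities of a single case.
trace <- c("A", "A", "B", "A", "C", "C", "C", "D")

# rle() collapses consecutive duplicates into runs.
runs <- rle(trace)
# runs$values:  "A" "B" "A" "C" "D"
# runs$lengths:  2   1   1   3   1

# Selfloop: a run longer than one, i.e. an activity directly following itself.
selfloop_activities <- unique(runs$values[runs$lengths > 1])

# Repetition: an activity that starts more than one run,
# i.e. it recurs after at least one other activity has occurred.
run_counts <- table(runs$values)
repeated_activities <- names(run_counts)[as.vector(run_counts) > 1]

selfloop_activities
repeated_activities
```

In this toy trace, A is involved in both a selfloop and a repetition, while C only occurs in a selfloop.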

Combining descriptives

Using some data manipulation in R, we can plot both descriptives together, to easily see whether repetitions and selfloops occur often for the same activities.

data <- bind_rows(mutate(activity_selfloops, type = "selfloops"),
              mutate(select(activity_repetitions, event_concept.name, absolute), type = "repetitions"))

ggplot(data) + 
    geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") + 
    facet_grid(type ~ .) +
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Activity") + 
    ylab("Number of selfloops and repetitions")
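
As an alternative to stacking the two tables, they can be joined so that each activity has one row with both counts. The sketch below uses made-up stand-ins (toy_selfloops, toy_repetitions) with a hypothetical activity column. Treating an activity that is missing from one table as having a count of zero is an assumption made for this illustration.

```r
library(dplyr)

# Hypothetical stand-ins for activity_selfloops and activity_repetitions;
# the counts are made up.
toy_selfloops <- data.frame(
    activity = c("A", "B", "C"),
    absolute = c(86, 31, 1),
    stringsAsFactors = FALSE
)
toy_repetitions <- data.frame(
    activity = c("A", "B", "D"),
    absolute = c(3, 156, 12),
    stringsAsFactors = FALSE
)

# Keep activities that appear in either table.
# Assumption: an activity absent from a table has a count of zero.
side_by_side <- full_join(
    rename(toy_selfloops, selfloops = absolute),
    rename(toy_repetitions, repetitions = absolute),
    by = "activity"
) %>%
    mutate(
        selfloops   = coalesce(selfloops, 0),
        repetitions = coalesce(repetitions, 0)
    )

side_by_side
```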

Other descriptives

Other available descriptives and the supported analysis levels are listed below:

Time

Structuredness

Variance

  • Activity presence in cases (activity)
  • Activity type frequency (trace, activity)
  • Start activities (log, activity)
  • End activities (log, activity)
  • Trace length (log, trace)
  • Trace coverage (log)
  • Trace frequency (trace)
  • Number of traces (log)

Repetitions

  • Number of repetitions (log, trace, activity)

Selfloops

  • Size of selfloops (log, trace, activity)
  • Number of selfloops per traces (log, trace)
  • Number of traces with selfloop (log)