Not only does edeaR allow one to handle, describe and filter event data in a convenient way, it also provides the analyst with a broad spectrum of analysis techniques which are already available within R. This vignette highlights this by means of some simple though illustrative examples. The examples in this vignette use the event log of the BPI Challenge 2014. More specifically, a sample of incident activities with their corresponding case attributes will be used. Both data sets are available within the package.

library(edeaR)
library(dplyr)
library(ggplot2)
data("BPIC14_incident_log")
data("BPIC14_incident_case_attributes")
BPIC14_incident_log %>% print
## Event log consisting of:
## 39911 events
## 2430 traces
## 4000 cases
## 36 activities
## 39911 activity instances
## 
## Source: local data frame [39,911 x 8]
## 
##    incident_id          data_stamp incident_activity_number
##         (fctr)              (time)                   (fctr)
## 1    IM0000013 2013-02-13 12:03:35              001A3970723
## 2    IM0000013 2013-11-08 13:54:13              001A5891812
## 3    IM0000013 2013-11-08 13:50:18              001A5891796
## 4    IM0000013 2013-11-08 13:50:17              001A5891794
## 5    IM0000013 2013-11-08 13:50:18              001A5891795
## 6    IM0000013 2013-11-08 13:54:13              001A5891813
## 7    IM0000048 2013-02-06 13:10:52              001A3926349
## 8    IM0000048 2013-02-06 13:10:52              001A3926348
## 9    IM0000048 2013-02-06 15:15:03              001A3926671
## 10   IM0000048 2013-10-17 13:54:33              001A5709262
## ..         ...                 ...                      ...
## Variables not shown: incident_activity_type (fctr), assignment_group
##   (fctr), km_number (fctr), interaction_id (fctr), lifecycle (chr)
BPIC14_incident_case_attributes %>% print
## Source: local data frame [4,000 x 28]
## 
##    ci_name_aff    ci_type_aff           ci_subtype_aff
##         (fctr)         (fctr)                   (fctr)
## 1    WBA000124    application    Web Based Application
## 2    SBA000662    application Server Based Application
## 3    SBA000607    application Server Based Application
## 4    SAP000005    application                      SAP
## 5    SUB000508 subapplication    Web Based Application
## 6    WBA000124    application    Web Based Application
## 7    WBA000124    application    Web Based Application
## 8    WBA000133    application    Web Based Application
## 9    WSR001424       computer           Windows Server
## 10   WBA000124    application    Web Based Application
## ..         ...            ...                      ...
## Variables not shown: service_component_wbs_aff (fctr), incident_id (fctr),
##   status (fctr), impact (fctr), urgency (fctr), priority (fctr), category
##   (fctr), km_number (fctr), alert_status (fctr), x_reassignments (int),
##   open_time (time), reopen_time (time), resolved_time (time), close_time
##   (time), handle_time_hours (dbl), closure_code (fctr),
##   x_related_interactions (int), related_interaction (fctr),
##   x_related_incidents (int), x_related_changes (int), related_change
##   (fctr), ci_name_cby (fctr), ci_type_cby (fctr), ci_subtype_cby (fctr),
##   servicecomp_wbs_cby (fctr)

Suppose we are interested in inspecting the structuredness of this event log, and any possible relationship with performance. A first inspection is just to look at the number of different traces, i.e. variants in terms of activity sequences, which are present in the event log.

trace_coverage(BPIC14_incident_log, level_of_analysis = "trace") %>% print(width = Inf)
## Source: local data frame [2,028 x 4]
## 
##                                                                        trace
##                                                                        (chr)
## 1                                                   Open,Closed,Caused By CI
## 2                          Open,Assignment,Status Change,Closed,Caused By CI
## 3                                        Open,Assignment,Closed,Caused By CI
## 4                      Open,Assignment,Status Change,Mail to Customer,Closed
## 5  Open,Assignment,Status Change,Closed,Caused By CI,Quality Indicator Fixed
## 6          Open,Assignment,Status Change,Operator Update,Closed,Caused By CI
## 7        Open,Assignment,Status Change,Closed,Caused By CI,Quality Indicator
## 8                           Open,Closed,Caused By CI,Quality Indicator Fixed
## 9         Open,Assignment,Status Change,Mail to Customer,Closed,Caused By CI
## 10                       Open,Assignment,Operator Update,Closed,Caused By CI
## ..                                                                       ...
##    absolute relative cum_sum
##       (int)    (dbl)   (dbl)
## 1       293  0.07325 0.07325
## 2       232  0.05800 0.13125
## 3       192  0.04800 0.17925
## 4        88  0.02200 0.20125
## 5        76  0.01900 0.22025
## 6        66  0.01650 0.23675
## 7        58  0.01450 0.25125
## 8        55  0.01375 0.26500
## 9        47  0.01175 0.27675
## 10       34  0.00850 0.28525
## ..      ...      ...     ...

The output shows that the 10 most common activity sequences together cover about 28% of the cases in the log. However, there are a total of 2430 different traces in the event log, so it is clear that there is a fair amount of unstructuredness. Moreover, it seems that the less frequent traces are long, in terms of the number of activity instances. The graph below shows that the activity sequences which occur more than once remain quite short, while there exist a lot of exceptional traces which get very long, i.e. up to 170 activity instances.

trace_length(BPIC14_incident_log, "trace") %>% ggplot(aes(relative_trace_frequency, absolute)) + geom_jitter() + scale_x_continuous(limits = c(0,0.01)) + ylab("Trace length") + xlab("Relative trace frequency")
## Warning: Removed 9 rows containing missing values (geom_point).


An interesting analysis would therefore be to define a performance vector for each case, in order to identify badly performing cases and examine them more closely, e.g. by looking at their case attributes. For the sake of simplicity, four predefined measures will be used to quantify performance: trace length, throughput time, number of self-loops and number of repetitions. However, any self-defined property of cases could be used as well.
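
For instance, a self-defined case property, such as the number of distinct assignment groups that worked on an incident, could be computed directly with dplyr. The snippet below is a minimal sketch of this idea, using the assignment_group column shown earlier; the name custom_property is purely illustrative.

# A self-defined case property: the number of distinct assignment groups per case
custom_property <- BPIC14_incident_log %>%
    group_by(incident_id) %>%
    summarise(n_assignment_groups = n_distinct(assignment_group))

The following piece of code computes the four predefined measures and combines them into one table.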

case_performance <- BPIC14_incident_log %>% throughput_time("case") %>% 
    left_join(BPIC14_incident_log %>% trace_length("case")) %>% 
    left_join(BPIC14_incident_log %>% repetitions("case") %>% select(incident_id, absolute) %>% rename(repetitions = absolute)) %>% 
    left_join(BPIC14_incident_log %>% number_of_selfloops("case") %>% select(incident_id, absolute) %>% rename(number_of_selfloops = absolute)) 
## Joining by: "incident_id"
## Joining by: "incident_id"
## Joining by: "incident_id"
case_performance %>% summary
##     incident_id   throughput_time     trace_length      repetitions     
##  IM0000013:   1   Min.   :  0.0002   Min.   :  2.000   Min.   :  0.000  
##  IM0000048:   1   1st Qu.:  0.0523   1st Qu.:  5.000   1st Qu.:  0.000  
##  IM0000049:   1   Median :  0.7648   Median :  7.000   Median :  0.000  
##  IM0000073:   1   Mean   :  5.1169   Mean   :  9.978   Mean   :  3.221  
##  IM0000074:   1   3rd Qu.:  3.9583   3rd Qu.: 11.000   3rd Qu.:  3.000  
##  IM0000105:   1   Max.   :363.0869   Max.   :170.000   Max.   :120.000  
##  (Other)  :3994                                                         
##  number_of_selfloops
##  Min.   : 0.0000    
##  1st Qu.: 0.0000    
##  Median : 0.0000    
##  Mean   : 0.3792    
##  3rd Qu.: 1.0000    
##  Max.   :35.0000    
## 

Cluster analysis

The overall summary shows that there is a wide diversity in each defined aspect of performance. Moreover, it is clear that all variables are right-skewed, due to the existence of a limited number of badly performing cases. A cluster analysis might be able to distinguish badly performing cases based on these values. However, since the data is highly skewed, the variables were first standardized. To decide upon the number of clusters, we performed several clusterings, each with a different number of clusters, and compared the SSE of each clustering. To control for the randomness in the selection of centres, 100 different runs were performed for each number of clusters. The graph shows the minimum SSE that was observed for each number of clusters.

input <- scale(case_performance[,2:5]) %>% as.data.frame()

clusters <- data.frame(i = 1:15)
for(i in 1:nrow(clusters)) {
    # keep the minimum SSE observed over 100 random initializations
    min_sse <- Inf
    for(j in 1:100) {
        cl <- kmeans(input, i, iter.max = 20)
        min_sse <- min(min_sse, cl$totss - cl$betweenss)
    }
    clusters$sse[i] <- min_sse
}
clusters %>% ggplot(aes(i, sse)) + geom_line() +
    xlab("Number of clusters") + ylab("SSE") + 
    scale_x_continuous(breaks = 1:15) + scale_y_continuous(breaks = seq(0,16000,2000))

It can be observed that the decrease in SSE is negligible when the number of clusters is higher than 5. Therefore, 5 seems to be a reasonable number of clusters. The output of the final clustering below shows that the 5 resulting clusters differ considerably in size: there is one major cluster, containing about 71% of the cases, and a second, smaller cluster containing another 18%. The remaining 12% of the cases are divided over three small clusters. However, keeping in mind the skewness of the data, this result is not that surprising.

set.seed(4)
cl <- kmeans(input,5, iter.max = 20)
case_performance <- case_performance %>% bind_cols(data.frame(cluster = factor(cl$cluster)))
cl %>% str
## List of 9
##  $ cluster     : int [1:4000] 4 4 4 4 4 4 4 4 4 4 ...
##  $ centers     : num [1:5, 1:4] 0.00192 1.3363 0.67602 -0.2048 10.0468 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:4] "throughput_time" "trace_length" "repetitions" "number_of_selfloops"
##  $ totss       : num 15996
##  $ withinss    : num [1:5] 539 1025 1427 503 2024
##  $ tot.withinss: num 5519
##  $ betweenss   : num 10477
##  $ size        : int [1:5] 701 78 373 2826 22
##  $ iter        : int 7
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
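
The relative cluster sizes mentioned above can be computed directly from the kmeans object:

# Relative size of each of the 5 clusters
round(cl$size / sum(cl$size), 3)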

The table and figures below show how the different clusters are characterised by the different variables. The main cluster, i.e. cluster 4, contains cases which score low on all four measures. On the other side of the spectrum, cluster 5 contains cases which on average have a very high value for all metrics. Note, however, that this is the smallest cluster, thereby really covering exceptional behaviour. Cluster 2 is more or less similar to cluster 5 concerning the number of repetitions and the trace length, but the cases in this cluster have fewer self-loops and remarkably lower throughput times. The two remaining clusters, 1 and 3, contain cases that score reasonably well on all aspects, though inferior to cluster 4. Of these two, cluster 1 seems to be the better performing one.

Explaining the clusters

We have now divided all cases into groups with similar performance characteristics. Subsequently, it would be interesting to see whether these performance characteristics are connected with other attributes related to the incidents themselves. Therefore, we connect the clustering output with the case attributes.
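
The per-cluster summary shown below can be obtained with dplyr along these lines (a sketch; the column names match the table that follows):

# Average value of each performance measure per cluster, together with the cluster size
case_performance %>%
    group_by(cluster) %>%
    summarise(freq = n(),
              mean_nr_of_repetitions = mean(repetitions),
              mean_nr_of_selfloops = mean(number_of_selfloops),
              mean_trace_length = mean(trace_length),
              mean_throughput_time = mean(throughput_time))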

## Source: local data frame [5 x 6]
## 
##   cluster  freq mean_nr_of_repetitions mean_nr_of_selfloops
##    (fctr) (int)                  (dbl)                (dbl)
## 1       1   701              2.3366619            1.2182596
## 2       2    78             40.1025641            2.8076923
## 3       3   373             13.4825737            0.9168901
## 4       4  2826              0.8262562            0.0000000
## 5       5    22             34.2727273            4.6363636
##   mean_trace_length mean_throughput_time
##               (dbl)                (dbl)
## 1         10.751783             5.150311
## 2         53.743590            28.387965
## 3         23.654155            16.889430
## 4          6.444444             1.550452
## 5         52.136364           180.077092

BPIC14_incident_case_attributes <- BPIC14_incident_case_attributes %>%
    merge(case_performance) 

The graph below shows that for some configuration item types, the proportion of cases which perform well on the selected measures is relatively lower than for others. For instance, of the incidents related to subapplications, about 80% belong to the high-performing cluster (4), while for network components this is only about 50%.

BPIC14_incident_case_attributes %>% ggplot(aes(reorder(ci_type_aff, as.numeric(cluster) == 4, FUN = "mean"), fill = cluster)) + geom_bar(position = "fill")  +
    scale_fill_brewer() + coord_flip() + xlab("ci_type_aff") + scale_y_continuous(breaks = seq(0,1,0.1))
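
The proportions read off this graph could also be checked numerically, for instance with the following sketch, which reuses the as.numeric(cluster) == 4 expression from the plot:

# Share of cases per configuration item type that falls into the high-performing cluster (4)
BPIC14_incident_case_attributes %>%
    group_by(ci_type_aff) %>%
    summarise(share_cluster_4 = mean(as.numeric(cluster) == 4)) %>%
    arrange(desc(share_cluster_4))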


The following graph shows that cases with bad performance scores (clusters 2 and 5) are typically reassigned many times, while cases with good performance levels (cluster 4) have zero or only a few reassignments.

BPIC14_incident_case_attributes %>% ggplot(aes(cluster, x_reassignments)) + geom_boxplot()  +
    scale_fill_brewer() + coord_flip() 
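
The same pattern could be inspected numerically, for instance (a sketch):

# Median and maximum number of reassignments per cluster
BPIC14_incident_case_attributes %>%
    group_by(cluster) %>%
    summarise(median_reassignments = median(x_reassignments),
              max_reassignments = max(x_reassignments))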


The closure code, stating in which way the incident was solved, also seems to be related to performance. Incidents where the user did not read the user manual often scored well on the performance measures, while cases which had to be referred typically performed badly.

BPIC14_incident_case_attributes %>% ggplot(aes(reorder(closure_code, as.numeric(cluster) == 4, FUN = "mean"), fill = cluster)) + geom_bar(position = "fill")  +
    scale_fill_brewer() + coord_flip() + xlab("closure_code") + scale_y_continuous(breaks = seq(0,1,0.1))


Subsetting the event log

Instead of looking at case attributes, it is also possible to extract the badly performing cases, for instance those in cluster 5, and inspect them separately.

cluster_5 <- BPIC14_incident_log %>% 
    merge(case_performance) %>%
    filter(cluster == 5) %>%
    eventlog(case_id(BPIC14_incident_log), 
             activity_id(BPIC14_incident_log),
             activity_instance_id(BPIC14_incident_log),
             lifecycle_id(BPIC14_incident_log),
             timestamp(BPIC14_incident_log))

When printing the filtered event log, it can be seen that all 22 cases have a different trace, i.e. activity sequence, which makes them unique in that sense. A visualization of the behaviour as a directed graph might be helpful to understand it better. The reader is referred to packages like igraph or markovchain for this visualization; a minimal sketch using igraph is given at the end of this section.

cluster_5 
## Event log consisting of:
## 1147 events
## 22 traces
## 22 cases
## 28 activities
## 1147 activity instances
## 
## Source: local data frame [1,147 x 13]
## 
##    incident_id          data_stamp incident_activity_number
##         (fctr)              (time)                   (fctr)
## 1    IM0000013 2013-02-13 12:03:35              001A3970723
## 2    IM0000013 2013-11-08 13:54:13              001A5891812
## 3    IM0000013 2013-11-08 13:50:18              001A5891796
## 4    IM0000013 2013-11-08 13:50:17              001A5891794
## 5    IM0000013 2013-11-08 13:50:18              001A5891795
## 6    IM0000013 2013-11-08 13:54:13              001A5891813
## 7    IM0000048 2013-02-06 13:10:52              001A3926349
## 8    IM0000048 2013-02-06 13:10:52              001A3926348
## 9    IM0000048 2013-02-06 15:15:03              001A3926671
## 10   IM0000048 2013-10-17 13:54:33              001A5709262
## ..         ...                 ...                      ...
## Variables not shown: incident_activity_type (fctr), assignment_group
##   (fctr), km_number (fctr), interaction_id (fctr), lifecycle (chr),
##   throughput_time (dbl), trace_length (int), repetitions (dbl),
##   number_of_selfloops (dbl), cluster (fctr)
cluster_5 %>% traces
## Source: local data frame [22 x 4]
## 
##                                                                          trace
##                                                                          (chr)
## 1  Open,Assignment,Operator Update,Status Change,Assignment,Operator Update,As
## 2  Open,Assignment,Reassignment,Assignment,Assignment,Status Change,Problem Wo
## 3  Open,Assignment,Reassignment,Assignment,Operator Update,Description Update,
## 4  Open,Assignment,Status Change,Operator Update,External Vendor Assignment,Ve
## 5  Open,Assignment,Status Change,Reassignment,Assignment,Operator Update,Reass
## 6  Open,Assignment,Status Change,Reassignment,Assignment,Operator Update,Updat
## 7  Open,Assignment,Status Change,Update,Reassignment,Assignment,Status Change,
## 8  Open,Open,Update,Assignment,Status Change,Reassignment,Assignment,Operator 
## 9  Open,Reassignment,Assignment,Description Update,Status Change,Update,Update
## 10 Open,Reassignment,Assignment,Reassignment,Assignment,Operator Update,Assign
## ..                                                                         ...
## Variables not shown: trace_id (dbl), absolute_frequency (int),
##   relative_frequency (dbl)
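
As an illustration of the igraph suggestion above, the sketch below derives the activity-to-activity transitions within the cluster 5 cases and plots them as a directed graph. It assumes that incident_activity_type is the activity classifier and data_stamp the timestamp, as suggested by the columns printed earlier; it is a sketch rather than a definitive visualization.

library(igraph)

# Count the consecutive activity pairs (transitions) per case
transitions <- cluster_5 %>%
    arrange(incident_id, data_stamp) %>%
    group_by(incident_id) %>%
    mutate(next_activity = lead(as.character(incident_activity_type))) %>%
    ungroup() %>%
    filter(!is.na(next_activity)) %>%
    count(incident_activity_type, next_activity)

# Build and plot a directed graph; the edge attribute n holds the transition counts
g <- graph_from_data_frame(transitions, directed = TRUE)
plot(g, edge.arrow.size = 0.3, vertex.label.cex = 0.7, edge.width = log1p(E(g)$n))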