Unleashing the power of R for event data

Gert Janssenswillen

2/12/2015

Not only does edeaR allow one to handle, describe and filter event data in a convenient way, it also gives the analyst access to the broad spectrum of analysis techniques that are already available within R. This vignette will highlight this by means of some simple though illustrative examples. The examples in this vignette will use the event log of the BPI Challenge 2014. More specifically, a sample of incident activities with their corresponding case attributes will be used. Both data sets are available within the package.

library(edeaR)
library(dplyr)
library(ggplot2)
data("BPIC14_incident_log")
data("BPIC14_incident_case_attributes")
BPIC14_incident_log %>% print
## Event log consisting of:
## 39911 events
## 2028 traces
## 4000 cases
## 36 activities
## 39911 activity instances
## 
## # A tibble: 39,911 × 8
##    incident_id          data_stamp incident_activity_number
##         <fctr>              <dttm>                   <fctr>
## 1    IM0000013 2013-02-13 12:03:35              001A3970723
## 2    IM0000013 2013-11-08 13:54:13              001A5891812
## 3    IM0000013 2013-11-08 13:50:18              001A5891796
## 4    IM0000013 2013-11-08 13:50:17              001A5891794
## 5    IM0000013 2013-11-08 13:50:18              001A5891795
## 6    IM0000013 2013-11-08 13:54:13              001A5891813
## 7    IM0000048 2013-02-06 13:10:52              001A3926349
## 8    IM0000048 2013-02-06 13:10:52              001A3926348
## 9    IM0000048 2013-02-06 15:15:03              001A3926671
## 10   IM0000048 2013-10-17 13:54:33              001A5709262
## # ... with 39,901 more rows, and 5 more variables:
## #   incident_activity_type <fctr>, assignment_group <fctr>,
## #   km_number <fctr>, interaction_id <fctr>, lifecycle <chr>
BPIC14_incident_case_attributes %>% print
## # A tibble: 4,000 × 28
##    ci_name_aff    ci_type_aff           ci_subtype_aff
##         <fctr>         <fctr>                   <fctr>
## 1    WBA000124    application    Web Based Application
## 2    SBA000662    application Server Based Application
## 3    SBA000607    application Server Based Application
## 4    SAP000005    application                      SAP
## 5    SUB000508 subapplication    Web Based Application
## 6    WBA000124    application    Web Based Application
## 7    WBA000124    application    Web Based Application
## 8    WBA000133    application    Web Based Application
## 9    WSR001424       computer           Windows Server
## 10   WBA000124    application    Web Based Application
## # ... with 3,990 more rows, and 25 more variables:
## #   service_component_wbs_aff <fctr>, incident_id <fctr>, status <fctr>,
## #   impact <ord>, urgency <ord>, priority <ord>, category <fctr>,
## #   km_number <fctr>, alert_status <fctr>, x_reassignments <int>,
## #   open_time <dttm>, reopen_time <dttm>, resolved_time <dttm>,
## #   close_time <dttm>, handle_time_hours <dbl>, closure_code <fctr>,
## #   x_related_interactions <int>, related_interaction <fctr>,
## #   x_related_incidents <int>, x_related_changes <int>,
## #   related_change <fctr>, ci_name_cby <fctr>, ci_type_cby <fctr>,
## #   ci_subtype_cby <fctr>, servicecomp_wbs_cby <fctr>

Suppose we are interested in inspecting the structuredness of this event log, and any possible relationship with performance. A first inspection is just to look at the number of different traces, i.e. variants in terms of activity sequences, which are present in the event log.

trace_coverage(BPIC14_incident_log, level_of_analysis = "trace") %>% print(width = Inf)
## # A tibble: 2,430 × 4
##                                                    trace absolute relative
##                                                    <chr>    <int>    <dbl>
## 1                               Open,Closed,Caused By CI      158  0.03950
## 2                               Open,Caused By CI,Closed      135  0.03375
## 3                    Open,Assignment,Closed,Caused By CI      118  0.02950
## 4                    Open,Assignment,Caused By CI,Closed       74  0.01850
## 5      Open,Status Change,Assignment,Closed,Caused By CI       61  0.01525
## 6      Open,Assignment,Status Change,Closed,Caused By CI       58  0.01450
## 7      Open,Status Change,Assignment,Caused By CI,Closed       57  0.01425
## 8      Open,Assignment,Status Change,Caused By CI,Closed       56  0.01400
## 9  Open,Status Change,Assignment,Mail to Customer,Closed       46  0.01150
## 10 Open,Assignment,Status Change,Mail to Customer,Closed       41  0.01025
##    cum_sum
##      <dbl>
## 1  0.03950
## 2  0.07325
## 3  0.10275
## 4  0.12125
## 5  0.13650
## 6  0.15100
## 7  0.16525
## 8  0.17925
## 9  0.19075
## 10 0.20100
## # ... with 2,420 more rows

The output shows that the 10 most common activity sequences cover about 20% of the cases in the log. However, there are a total of 2430 different traces in the event log. Thus, it is clear that there is a fair amount of unstructuredness. Moreover, it seems that less frequent traces are long, in terms of the number of activity instances. The graph below shows that the activity sequences which occur more than once remain quite short. However, there exist many exceptional traces which get very long, i.e. up to 170 activity instances.

trace_length(BPIC14_incident_log, "trace") %>% 
    ggplot(aes(relative_trace_frequency, absolute)) + geom_jitter() + 
    scale_x_continuous(limits = c(0,0.01)) + 
    ylab("Trace length") + xlab("Relative trace frequency")
## Warning: Removed 10 rows containing missing values (geom_point).
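
To make the coverage figures more concrete, the following minimal sketch recomputes the absolute, relative and cumulative frequencies of trace variants by hand with dplyr. The data frame `cases` and its contents are hypothetical toy data, not part of the BPI Challenge log, and the sketch is not meant to reproduce the internals of `trace_coverage`.

```r
library(dplyr)

# Hypothetical toy data: one row per case with its activity sequence (trace).
cases <- data.frame(
  case_id = 1:10,
  trace = c(rep("Open,Closed", 5),
            rep("Open,Assignment,Closed", 3),
            "Open,Update,Closed",
            "Open,Update,Update,Closed"),
  stringsAsFactors = FALSE
)

coverage <- cases %>%
  count(trace, name = "absolute") %>%          # number of cases per trace variant
  arrange(desc(absolute), trace) %>%           # most frequent variants first
  mutate(relative = absolute / sum(absolute),  # share of all cases
         cum_sum = cumsum(relative))           # cumulative coverage

print(coverage)
```

In this toy example the two most frequent variants together cover 80% of the cases, while the two variants that occur only once account for the remaining 20%.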

An interesting analysis would therefore be to define a performance vector for each case in order to identify bad performance cases and examine them more closely, e.g., by looking at their case attributes. For the sake of simplicity, predefined measures will be used to quantify performance. However, any self-defined property of cases can be used. The case properties used in this example are trace length, throughput time, number of self-loops and the number of repetitions. The following piece of code will compute these measures and combine them in one table.

case_performance <- BPIC14_incident_log %>% throughput_time("case") %>% 
    left_join(BPIC14_incident_log %>% trace_length("case")) %>% 
    left_join(BPIC14_incident_log %>% repetitions("case") %>% select(incident_id, absolute) %>% rename(repetitions = absolute)) %>% 
    left_join(BPIC14_incident_log %>% number_of_selfloops("case") %>% select(incident_id, absolute) %>% rename(number_of_selfloops = absolute)) 
## Joining, by = "incident_id"
## Joining, by = "incident_id"
## Joining, by = "incident_id"
case_performance %>% summary
##     incident_id   throughput_time     trace_length      repetitions     
##  IM0000013:   1   Min.   :  0.0002   Min.   :  2.000   Min.   :  0.000  
##  IM0000048:   1   1st Qu.:  0.0523   1st Qu.:  5.000   1st Qu.:  0.000  
##  IM0000049:   1   Median :  0.7648   Median :  7.000   Median :  0.000  
##  IM0000073:   1   Mean   :  5.1169   Mean   :  9.978   Mean   :  3.219  
##  IM0000074:   1   3rd Qu.:  3.9583   3rd Qu.: 11.000   3rd Qu.:  3.000  
##  IM0000105:   1   Max.   :363.0869   Max.   :170.000   Max.   :125.000  
##  (Other)  :3994                                                         
##  number_of_selfloops
##  Min.   : 0.000     
##  1st Qu.: 0.000     
##  Median : 0.000     
##  Mean   : 0.391     
##  3rd Qu.: 1.000     
##  Max.   :12.000     
## 
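
As noted above, any self-defined property of cases can be used instead of the predefined measures. The sketch below illustrates the idea on a hypothetical toy event log (the `events` data frame and its column names are assumptions for illustration only): it derives a throughput time, here expressed in days, and a trace length per case using plain dplyr.

```r
library(dplyr)

# Hypothetical toy event log: one row per activity instance.
events <- data.frame(
  case_id = c("A", "A", "A", "B", "B"),
  activity = c("Open", "Assignment", "Closed", "Open", "Closed"),
  timestamp = as.POSIXct(c("2013-02-01 09:00:00", "2013-02-01 15:00:00",
                           "2013-02-03 09:00:00", "2013-02-05 10:00:00",
                           "2013-02-05 22:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

case_measures <- events %>%
  group_by(case_id) %>%
  summarise(
    # time between the first and the last event of the case, in days
    throughput_time = as.numeric(difftime(max(timestamp), min(timestamp),
                                          units = "days")),
    # number of activity instances in the case
    trace_length = n()
  )

print(case_measures)
```

A table built this way has one row per case, so it can be joined to other case-level measures on the case identifier in the same way as above.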

Cluster analysis

The overall summary shows that there is a wide diversity on each defined aspect of performance. Moreover, it is clear that all variables are right skewed, due to the existence of a limited number of badly performing cases. A cluster analysis might be able to distinguish bad performance cases based on these values. However, since the data is highly skewed, the variables were first normalized. To decide upon the number of clusters, we performed several clusterings, each with a different number of clusters, and compared the SSE of each clustering. To control for the randomness in the selection of centres, 100 runs were performed for each number of clusters. The graph shows the minimum SSE that was seen for each number of clusters.

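# Standardise the four performance measures (columns 2 to 5)
input <- scale(case_performance[,2:5]) %>% as.data.frame()

clusters <- data.frame(i = 1:15, sse = NA_real_)
for(i in 1:nrow(clusters)) {
    min_sse <- Inf  # lowest SSE seen so far for i clusters
    for(j in 1:100) {
        cl <- kmeans(input,i, iter.max = 20)
        # keep the minimum over the 100 random initialisations
        min_sse <- min(min_sse, cl$totss - cl$betweenss)
    }
    clusters$sse[i] <- min_sse
}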
clusters %>% ggplot(aes(i, sse)) + geom_line() +
    xlab("Number of clusters") + ylab("SSE") + 
    scale_x_continuous(breaks = 1:15) + scale_y_continuous(breaks = seq(0,16000,2000))
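
As an alternative to an explicit inner loop, the `nstart` argument of `kmeans` runs several random initialisations and returns the best solution found. The sketch below shows this on simulated data; the matrix `toy` is a hypothetical stand-in for the scaled performance measures, and the choice of 25 starts is arbitrary.

```r
set.seed(1)

# Hypothetical stand-in for the scaled performance measures:
# two right-skewed variables, standardised with scale().
toy <- scale(cbind(x = rexp(200), y = rexp(200)))

# For each number of clusters k, kmeans() tries 25 random starts and keeps
# the best one; tot.withinss is the SSE of that best solution.
sse <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25, iter.max = 20)$tot.withinss
})

print(round(sse, 1))
```

With a single cluster the SSE equals the total sum of squares, which is why the curve always starts at its maximum and the interesting question is where the decrease levels off.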

It can be observed that the decrease in SSE is negligible when the number of clusters is higher than 5. Therefore, 5 seems to be a reasonable number of clusters. The output of the final clustering below shows that the 5 resulting clusters differ considerably in size: there is one major cluster, containing about 69% of the cases, and a smaller cluster containing another 19%. The remaining 12% of cases are divided into three small clusters. However, keeping in mind the skewness of the data, this result is not that surprising.

set.seed(4)
cl <- kmeans(input,5, iter.max = 20)
case_performance <- case_performance %>% bind_cols(data.frame(cluster = factor(cl$cluster)))
cl %>% str
## List of 9
##  $ cluster     : int [1:4000] 4 4 4 4 4 4 4 4 4 4 ...
##  $ centers     : num [1:5, 1:4] -0.0268 0.8137 0.8762 -0.2163 7.3392 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:4] "throughput_time" "trace_length" "repetitions" "number_of_selfloops"
##  $ totss       : num 15996
##  $ withinss    : num [1:5] 618 862 996 336 2057
##  $ tot.withinss: num 4868
##  $ betweenss   : num 11128
##  $ size        : int [1:5] 765 154 279 2768 34
##  $ iter        : int 7
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The table and figures below show how the different clusters are characterised by the different variables. The main cluster, i.e. cluster 4, contains cases which score low on all 4 measures. On the other side of the spectrum, cluster 5 contains cases which on average have a very high value for all metrics. Note however that this is the smallest cluster, thereby really covering exceptional behaviour. Cluster 2 is more or less similar to cluster 5 concerning the number of repetitions and the trace length. However, the cases in this cluster have fewer self-loops and remarkably lower throughput times. The two remaining clusters, 1 and 3, contain cases that score reasonably well on all aspects, though inferior to cluster 4. Among these two, cluster 1 seems to be superior.

Explaining the clusters

We have now divided all cases into groups with similar performance characteristics. Subsequently, it would be interesting to see whether these performance characteristics are connected with other attributes, related to the incidents themselves. Therefore, we connect the clustering output with the case attributes.

## # A tibble: 5 × 6
##   cluster  freq mean_nr_of_repetitions mean_nr_of_selfloops
##    <fctr> <int>                  <dbl>                <dbl>
## 1       1   765              3.3594771            1.2143791
## 2       2   154             25.6558442            2.3701299
## 3       3   279             10.6845878            0.5698925
## 4       4  2768              0.6195809            0.0000000
## 5       5    34             48.7941176            3.2647059
##   mean_trace_length mean_throughput_time
##               <dbl>                <dbl>
## 1         12.037908             4.650038
## 2         38.415584            19.286986
## 3         20.179211            20.374675
## 4          6.133309             1.349787
## 5         64.088235           132.924664
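
The code that produced the table above is not shown. A table of this shape could plausibly be obtained by grouping the case-level measures by cluster and averaging them, as in the following sketch. The data frame `toy_performance` is hypothetical toy data; only the column names are taken from the output above.

```r
library(dplyr)

# Hypothetical toy version of case_performance with a cluster label per case.
toy_performance <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 2)),
  repetitions = c(0, 2, 10, 12, 14),
  number_of_selfloops = c(0, 1, 2, 3, 4),
  trace_length = c(4, 6, 20, 30, 40),
  throughput_time = c(0.5, 1.5, 10, 20, 30)
)

cluster_summary <- toy_performance %>%
  group_by(cluster) %>%
  summarise(
    freq = n(),                                      # cases per cluster
    mean_nr_of_repetitions = mean(repetitions),
    mean_nr_of_selfloops = mean(number_of_selfloops),
    mean_trace_length = mean(trace_length),
    mean_throughput_time = mean(throughput_time)
  )

print(cluster_summary)
```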

BPIC14_incident_case_attributes <- BPIC14_incident_case_attributes %>%
    merge(case_performance) 

The graph below shows that for some configuration item types, the share of cases which perform well on the selected measures is relatively lower than for others. For instance, for incidents related to subapplications, about 80% belong to the high-performing cluster (4), while for network components this is only about 50%.

BPIC14_incident_case_attributes %>% ggplot(aes(reorder(ci_type_aff, as.numeric(cluster) == 4, FUN = "mean"), fill = cluster)) + geom_bar(position = "fill")  +
    scale_fill_brewer() + coord_flip() + xlab("ci_type_aff") + scale_y_continuous(breaks = seq(0,1,0.1))
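
The proportions displayed in such a filled bar chart can also be computed as a table, which makes it easier to read off exact shares. The sketch below does this for hypothetical toy data; the data frame `toy_cases` and its values are assumptions chosen only to illustrate the calculation.

```r
library(dplyr)

# Hypothetical toy data: configuration item type and cluster label per case.
toy_cases <- data.frame(
  ci_type_aff = c(rep("subapplication", 5), rep("networkcomponents", 4)),
  cluster = factor(c(4, 4, 4, 4, 1, 4, 4, 2, 5), levels = 1:5),
  stringsAsFactors = FALSE
)

share_cluster_4 <- toy_cases %>%
  group_by(ci_type_aff) %>%
  summarise(n_cases = n(),
            # fraction of cases of this type that fall in cluster 4
            share_cluster_4 = mean(cluster == 4)) %>%
  arrange(desc(share_cluster_4))

print(share_cluster_4)
```

Reporting the number of cases next to the share helps to judge how much weight a given proportion deserves, since some categories may contain only a handful of incidents.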

The following graph shows that cases with bad performance scores (clusters 2 and 5) are typically reassigned many times, while cases with good performance levels (cluster 4) have zero or only a few reassignments.

BPIC14_incident_case_attributes %>% ggplot(aes(cluster, x_reassignments)) + geom_boxplot()  +
    scale_fill_brewer() + coord_flip() 

The closure code, stating in which way the incident was solved, also seems to be related in a certain sense to performance. Incidents where the users didn’t read the user manual often scored well on performance levels, while cases which had to be referred typically had a bad performance.

BPIC14_incident_case_attributes %>% ggplot(aes(reorder(closure_code, as.numeric(cluster) == 4, FUN = "mean"), fill = cluster)) + geom_bar(position = "fill")  +
    scale_fill_brewer() + coord_flip() + xlab("closure_code") + scale_y_continuous(breaks = seq(0,1,0.1))

Subsetting the event log

Instead of looking at case attributes, it is also possible to filter out the badly performing cases, for instance cluster 5.

cluster_5 <- BPIC14_incident_log %>% 
    merge(case_performance) %>%
    filter(cluster == 5) %>%
    eventlog(case_id(BPIC14_incident_log), 
             activity_id(BPIC14_incident_log),
             activity_instance_id(BPIC14_incident_log),
             lifecycle_id(BPIC14_incident_log),
             timestamp(BPIC14_incident_log))
## Warning in eventlog(., case_id(BPIC14_incident_log),
## activity_id(BPIC14_incident_log), : No resource identifier provided nor
## found. Set to default: NA

When printing the filtered event log, it can be seen that all 34 cases have a different trace, i.e. activity sequence, which makes them unique in that sense. A visualization of the behaviour in a directed graph might be helpful to understand it better. The reader is referred to packages like igraph or markovchain for this visualization.

cluster_5 
## Event log consisting of:
## 2179 events
## 34 traces
## 34 cases
## 31 activities
## 2179 activity instances
## 
## # A tibble: 2,179 × 13
##    incident_id          data_stamp incident_activity_number
##         <fctr>              <dttm>                   <fctr>
## 1    IM0000013 2013-02-13 12:03:35              001A3970723
## 2    IM0000013 2013-11-08 13:54:13              001A5891812
## 3    IM0000013 2013-11-08 13:50:18              001A5891796
## 4    IM0000013 2013-11-08 13:50:17              001A5891794
## 5    IM0000013 2013-11-08 13:50:18              001A5891795
## 6    IM0000013 2013-11-08 13:54:13              001A5891813
## 7    IM0000048 2013-02-06 13:10:52              001A3926349
## 8    IM0000048 2013-02-06 13:10:52              001A3926348
## 9    IM0000048 2013-02-06 15:15:03              001A3926671
## 10   IM0000048 2013-10-17 13:54:33              001A5709262
## # ... with 2,169 more rows, and 10 more variables:
## #   incident_activity_type <fctr>, assignment_group <fctr>,
## #   km_number <fctr>, interaction_id <fctr>, lifecycle <chr>,
## #   throughput_time <dbl>, trace_length <int>, repetitions <dbl>,
## #   number_of_selfloops <dbl>, cluster <fctr>
cluster_5 %>% traces
## # A tibble: 34 × 4
##                                                                          trace
##                                                                          <chr>
## 1  Open,Status Change,Assignment,External Vendor Assignment,Operator Update,Ve
## 2  Open,Reassignment,Assignment,Description Update,Status Change,Update,Update
## 3  Update from customer,Reassignment,Update from customer,Assignment,Closed,Ca
## 4  Open,Reassignment,Assignment,Reassignment,Operator Update,Assignment,Assign
## 5  Open,Update from customer,Update from customer,Update from customer,Reassig
## 6  Open,Open,Update,Assignment,Status Change,Assignment,Reassignment,Operator 
## 7  Open,Update from customer,Status Change,Assignment,Update from customer,Ass
## 8  Open,Operator Update,Reassignment,Assignment,Update,Update,Reassignment,Ass
## 9  Open,Operator Update,Reassignment,Reassignment,Operator Update,Operator Upd
## 10 Open,Assignment,Reassignment,Assignment,Assignment,Problem Workaround,Statu
## # ... with 24 more rows, and 3 more variables: trace_id <dbl>,
## #   absolute_frequency <int>, relative_frequency <dbl>
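
As a first step towards a directed graph of this behaviour, one can count how often each activity is directly followed by another within a case. The following sketch builds such an edge list with dplyr for a hypothetical toy log; the `events` data frame, the `step` ordering column and the resulting `edges` table are assumptions for illustration and are not produced by edeaR.

```r
library(dplyr)

# Hypothetical toy event log; `step` gives the order of events within a case.
events <- data.frame(
  case_id = c("A", "A", "A", "A", "B", "B", "B"),
  activity = c("Open", "Assignment", "Assignment", "Closed",
               "Open", "Assignment", "Closed"),
  step = c(1, 2, 3, 4, 1, 2, 3),
  stringsAsFactors = FALSE
)

edges <- events %>%
  arrange(case_id, step) %>%
  group_by(case_id) %>%
  mutate(next_activity = lead(activity)) %>%   # successor within the same case
  ungroup() %>%
  filter(!is.na(next_activity)) %>%            # last event of a case has no successor
  count(activity, next_activity, name = "n")   # frequency of each directly-follows pair

print(edges)
```

Each row of `edges` is a weighted arc from one activity to the next, which is the kind of input a graph package typically expects.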