Working with data from INCA or Rockan can sometimes be a pain! Not only are some formats strange (such as Booleans and dates), but the formats sometimes also differ between data used internally in INCA and data exported for local work. The incadata package aims to streamline the process of reading and using RCC data (mostly from INCA, hence the name, but also from Rockan).
This vignette uses some example data, ex_data, found in the package:
suppressPackageStartupMessages(library(dplyr))
library(incadata)
##
## Attaching package: 'incadata'
## The following object is masked from 'package:dplyr':
##
## id
## The following object is masked from 'package:stats':
##
## filter
dim(ex_data)
## [1] 497 433
It’s a data set with many columns containing all types of synthetic INCA data (it is based on real INCA data, but everything is randomized and scrambled so as not to reveal any details about real patients, doctors, hospitals et cetera).
Let’s choose a subset of columns just for illustrative purposes:
x <-
  ex_data %>%
  select(
    a_lkf,
    a_inrappdatum,
    a_inrappsjh,
    a_inrappklk,
    a_kompl,
    a_rappSjHemSj_Beskrivning
  )
Now, how are these variables stored?
glimpse(x)
## Observations: 497
## Variables: 6
## $ a_lkf <chr> "148101", "078021", "058019", "14420...
## $ a_inrappdatum <chr> "1985-08-06", "1986-03-15", "1987-08...
## $ a_inrappsjh <chr> "51320", "65016", "521058", "42328",...
## $ a_inrappklk <chr> "571", "056", "951", "290", "009", "...
## $ a_kompl <fctr> , , , , , , , , , , , , , , , , , ,...
## $ a_rappSjHemSj_Beskrivning <fctr> Nej, Nej, Nej, Nej, Ja, Ja, Nej, Ne...
We can see that:
* a_inrappdatum looks like a date but is treated as character
* a_lkf, a_inrappsjh and a_inrappklk all look like numbers but are characters
* a_kompl looks like a Boolean but is a factor
* a_rappSjHemSj_Beskrivning looks like a factor and is … a factor :-)
We now want to change these formats to get something more natural.
The package has two main functions, and one of them is as.incadata. It can take either a single vector or a data frame, and it converts its input to a format more relevant for RCC data.
x2 <- as.incadata(x)
## Factors coerced to character: a_rappsjhemsj_beskrivning
## The following variables have new formats:
## * a_inrappdatum (character -> Date)
## * a_inrappsjh (character -> integer)
## * a_kompl (factor -> logical)
## Warning: a_lkf -> a_lkf_lan_beskrivning: transformed to match the keyvalue: Only the first 2 characters are used.
## Warning: a_lkf -> a_lkf_kommun_beskrivning: transformed to match the keyvalue: Only the first 4 characters are used.
## Warning: a_lkf -> a_lkf_hemort_beskrivning: Some codes could not be translated (41 cells)
## New decoded columns added:
## * a_lkf_lan_beskrivning
## * a_lkf_kommun_beskrivning
## * a_lkf_forsamling_beskrivning
## * a_lkf_hemort2_beskrivning
## * a_lkf_hemort_beskrivning
## rownames used as id!
The output message is quite verbose. This is intended since it is probably a good idea to check that all columns are coerced to reasonable formats.
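To make the reported conversions more concrete, here is a rough base R sketch of the kind of rules this vignette describes: digit strings become numbers unless some value has a leading zero, a Swedish decimal comma is read as a decimal point, and 0/1 or "True"/blank become logical. The helpers to_number and to_logical are hypothetical names made up for this illustration; they are not part of incadata and only approximate what the package does.

```r
# Hypothetical helpers for illustration only; NOT the incadata implementation.

# Convert digit strings to numbers, but keep vectors containing leading
# zeroes (e.g. "056") as character, since such values are codes.
to_number <- function(x) {
  x <- trimws(x)
  x[which(x == "")] <- NA                 # treat blanks as missing
  vals <- x[!is.na(x)]
  # A zero followed by another digit marks a code, not a number
  if (any(grepl("^0[0-9]", vals))) return(x)
  # Swedish decimal comma -> decimal point
  vals <- sub(",", ".", vals, fixed = TRUE)
  if (!all(grepl("^-?[0-9]+(\\.[0-9]+)?$", vals))) return(x)
  num <- as.numeric(sub(",", ".", x, fixed = TRUE))
  # Whole numbers become integers, everything else stays numeric
  if (all(num == round(num), na.rm = TRUE)) as.integer(num) else num
}

# Map "1"/"True" to TRUE and "0" to FALSE; blanks and anything else become NA.
to_logical <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(NA, length(x))
  out[x %in% c("1", "True")] <- TRUE
  out[x %in% "0"] <- FALSE
  out
}

to_number(c("51320", "65016"))  # integer vector
to_number(c("571", "056"))      # unchanged character vector (leading zero)
to_number(c("1,5", "2"))        # numeric vector with a decimal value
to_logical(c("True", ""))       # logical vector with NA for the blank
```

The sketch decides per vector rather than per value, so that a whole column gets a single type.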
Let’s have a closer look at the result:
glimpse(x2)
## Observations: 497
## Variables: 12
## $ a_lkf <chr> "148101", "078021", "058019", "14...
## $ a_inrappdatum <date> 1985-08-06, 1986-03-15, 1987-08-...
## $ a_inrappsjh <int> 51320, 65016, 521058, 42328, 4110...
## $ a_inrappklk <chr> "571", "056", "951", "290", "009"...
## $ a_kompl <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ a_rappsjhemsj_beskrivning <chr> "Nej", "Nej", "Nej", "Nej", "Ja",...
## $ a_lkf_lan_beskrivning <chr> "Västra Götalands län", "Kronober...
## $ a_lkf_kommun_beskrivning <chr> "Mölndal", "Växjö", "Linköping", ...
## $ a_lkf_forsamling_beskrivning <chr> "Fässberg", "Söraby", "Vreta klos...
## $ a_lkf_hemort2_beskrivning <chr> "Fässberg", "Söraby", "Vreta klos...
## $ a_lkf_hemort_beskrivning <chr> "Fässberg", "Söraby", "Vreta klos...
## $ id <chr> "1", "2", "3", "4", "5", "6", "7"...
Some things have happened:
* All variable names are now in lower case (a_rappSjHemSj_Beskrivning -> a_rappsjhemsj_beskrivning). If two (or more) variable names differ only with regard to case, this will be handled adequately.
* a_inrappdatum is now a date! Recognising dates, especially from Rockan but sometimes also from INCA, is a chapter of its own (think of days or months = “00”, and “Y-m-d” dates mixed with two-digit years combined with week numbers such as “7403”, et cetera). We dig deeper into that issue in a separate vignette, found by: vignette("rccdates").
* a_inrappsjh is now an integer, while a_lkf and a_inrappklk are still characters, since they contain values with leading zeroes (such as “056”). as.codedata knows about this and only treats numbers without leading zeroes as numeric (it also distinguishes between integers and decimal numbers, and it translates the Swedish decimal comma to an English decimal point, since some RCC variables are stored that way).
* a_kompl is now Boolean, and this will happen regardless of whether we work on INCA (where Booleans are stored as 0/1) or locally (where the same values are transformed to “True” or blanks).
* There is a new id column pointing to individual patients. This variable will be based on either personal identification number, patient id or a simple row number. The idea is that this variable has different names depending on the source (INCA/Rockan), and it is easier to always have an id column with the same name. Also, if a personal identification number is included in the data, it will be checked (see the sweidnumbr package for more info), while the id column will not.
* There are new columns named a_lkf_xxx_beskrivning. These are all based on the fact that a_lkf is a code variable recognized by the decoder package.
The other main function from the package is use_incadata. It could be thought of as read.incadata, but it is constructed to work also on INCA (where the data is already available in a data frame named “df” and therefore not read from disk).
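The idea of one script working in both environments can be sketched in base R as follows. This is only an illustration under the assumption stated above (on INCA, a data frame named "df" already exists); the helper get_data and its env argument are hypothetical and do not reflect how incadata is actually implemented.

```r
# Hypothetical helper for illustration only; NOT the incadata implementation.
get_data <- function(file = NULL, env = globalenv()) {
  # On INCA the data is assumed to already exist as a data frame named "df".
  # inherits = FALSE avoids picking up the function stats::df by mistake.
  if (exists("df", envir = env, inherits = FALSE) &&
      is.data.frame(get("df", envir = env))) {
    return(get("df", envir = env))
  }
  # Locally, fall back to reading an exported csv2 file from disk
  utils::read.csv2(file, stringsAsFactors = FALSE)
}

# Simulate the INCA situation with a separate environment holding "df"
inca_env <- new.env()
assign("df", data.frame(a = 1:2), envir = inca_env)
on_inca <- get_data(env = inca_env)

# Simulate the local situation with a csv2 file and no "df" available
local_file <- tempfile(fileext = ".csv2")
write.csv2(data.frame(a = 1:2), local_file, row.names = FALSE)
on_disk <- get_data(local_file, env = new.env())
```

Passing the environment as an argument keeps the sketch easy to try out without creating a global object named df.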
use_incadata has three main advantages:
* It can (unlike read.csv2 or similar) be used both locally and in INCA, so there is no need to have different scripts for development and production.
* It caches the converted data. Transforming data with as.incadata might be slow. By using use_incadata, instead of first loading the data and then always transforming it (with as.incadata), speed might increase. If the original data file is changed (a new export from INCA), the cache will be updated automatically after a comparison of MD5 checksums. (The whole caching mechanism is obviously ignored when working on INCA, where the data should always be fresh.)
* It is less noisy. as.incadata is quite verbose (for good reason), but when using the same data over and over again, it might not be meaningful to re-read the messages every time, so use_incadata does not repeat them.
Let’s use an example with the same data as above. We save the data to disk as a csv2 file to simulate an exported INCA file.
# Save data as csv2 in temp file
fl <- tempfile("ex_data", fileext = ".csv2")
write.csv2(incadata::ex_data, fl, row.names = FALSE)
Let us now use the data for the “first time”. The process will be verbose (but we omit it here just to save space). Here the cache file will be saved to our temporary directory, since that’s where we store our data. When working locally, the cache will be saved next to the original file (from where it can be copied or removed like a regular file). We also time the process, to compare the speed with a later attempt:
system.time(
  x <- use_incadata(fl)
)
## The following variables have new formats:
## * a_inrappdatum (character -> Date)
## * a_inrappsjh (character -> integer)
## * a_kompl (character -> logical)
## * age (character -> integer)
## * a_rappsjhemsj_varde (character -> integer)
## * sjukhuskod (character -> integer)
## * a_ifylldat (character -> Date)
## * a_diadat (character -> Date)
## * a_diagrund_varde (character -> integer)
## * a_prepnummer (character -> integer)
## * a_prepregar (character -> integer)
## * a_morfrefpat_varde (character -> integer)
## * a_patcytefterlabkod (character -> integer)
## * a_prepeftergraregar (character -> integer)
## * a_primop_varde (character -> integer)
## * a_enhprimopsjhkod (character -> integer)
## * a_remannansjukv_varde (character -> integer)
## * a_annansjvremstallds (character -> integer)
## * avgmark_value (character -> integer)
## * avliddat (character -> Date)
## * avregistreringsdatum (character -> Date)
## * fodelsedatum (character -> Date)
## * forsamling_value (character -> integer)
## * kommun_value (character -> integer)
## * landskod_nr (character -> integer)
## * lkf_value (character -> integer)
## * lopnr (character -> integer)
## * makulerad (character -> logical)
## * navetdatum (character -> Date)
## * p_inrappdat (character -> Date)
## * p_inrappenhsjhkod (character -> integer)
## * fakevar_datcervixbiop (character -> Date)
## * fakevar_datpadcervix (character -> Date)
## * fakevar_datkon (character -> Date)
## * fakevar_datpadsvarkon (character -> Date)
## * p_datpadprimkir (character -> Date)
## * p_kompl (character -> logical)
## * p_primkirurgi_varde (character -> integer)
## * fakevar_morfverposlym_varde (character -> integer)
## * fakevar_samplaorta_varde (character -> integer)
## * p_antpelvkort (character -> integer)
## * p_antpospelvkort (character -> integer)
## * p_ant_paraortkort (character -> integer)
## * p_antposparaort (character -> integer)
## * p_metastaser_varde (character -> integer)
## * p_metaparam (character -> logical)
## * p_metaadnexa (character -> logical)
## * p_metavaginaovre (character -> logical)
## * p_metavaginaned (character -> logical)
## * p_metapelvlymf (character -> logical)
## * p_metalever (character -> logical)
## * p_metalunga (character -> logical)
## * p_metahjarna (character -> logical)
## * p_metaskel (character -> logical)
## * p_metaandra (character -> logical)
## * p_diffgradwho_varde (character -> integer)
## * p_datbehbeslut (character -> Date)
## * p_petctkon_varde (character -> integer)
## * p_petctcystoskopi_varde (character -> integer)
## * p_petctlungrtg_varde (character -> integer)
## * p_petctctthorax_varde (character -> integer)
## * p_petcturografi_varde (character -> integer)
## * p_petctcturografi_varde (character -> integer)
## * p_petctctbuk_varde (character -> integer)
## * p_petctmr_varde (character -> integer)
## * p_petctpetct_varde (character -> integer)
## * p_cervixscredelt_varde (character -> integer)
## * p_primbehenlvpr_varde (character -> integer)
## * p_primbehklinstudie_varde (character -> integer)
## * p_mdk_varde (character -> integer)
## * fakevar_behint_varde (character -> integer)
## * fakevar_givenprimbeha_varde (character -> integer)
## * fakevar_sekvens_varde (character -> integer)
## * fakevar_sekvens0_varde (character -> integer)
## * fakevar_sekvens1_varde (character -> integer)
## * fakevar_sekvens2_varde (character -> integer)
## * fakevar_sekvens3_varde (character -> integer)
## * p_behmodannan_varde (character -> integer)
## * fakevar_sekvens4_varde (character -> integer)
## * p_datstartickekirbeh (character -> Date)
## * fakevar_funkstat_varde (character -> integer)
## * p_orsnedsfunstatinfk_varde (character -> integer)
## * p_sjkstatforebeh_varde (character -> integer)
## * p_datstartkt (character -> Date)
## * p_antcykneoadjkt (character -> integer)
## * p_mitosetop (character -> logical)
## * p_mitospakli (character -> logical)
## * p_mitosannan (character -> logical)
## * fakevar_bleom (character -> logical)
## * p_cytotepir (character -> logical)
## * p_ovrcisp (character -> logical)
## * p_ovrkarbo (character -> logical)
## * p_ktadmintraven (character -> logical)
## * p_ktplanregimfullf_varde (character -> integer)
## * p_ktorsakplanregejfu_varde (character -> integer)
## * fakevar_behevalneoadj_varde (character -> integer)
## * fakevar_datrespbed (character -> Date)
## * fakevar_tumorresponse_varde (character -> integer)
## * fakevar_palpnark (character -> logical)
## * fakevar_rontgen (character -> logical)
## * fakevar_ct (character -> logical)
## * fakevar_mr (character -> logical)
## * fakevar_pet (character -> logical)
## * fakevar_ktrt_varde (character -> integer)
## * fakevar_datstartkt (character -> Date)
## * p_krtantcykler (character -> integer)
## * fakevar_datavslkt (character -> Date)
## * fakevar_karboplatin (character -> logical)
## * fakevar_peros (character -> logical)
## * fakevar_intranervost (character -> logical)
## * fakevar_planregfull_varde (character -> integer)
## * fakevar_ktregimejfull_varde (character -> integer)
## * fakevar_datstartkt2 (character -> Date)
## * p_aktantalcykler (character -> integer)
## * fakevar_datavslkt2 (character -> Date)
## * fakevar_ifosfamid2 (character -> logical)
## * fakevar_etoposid2 (character -> logical)
## * fakevar_paklitaxel (character -> logical)
## * fakevar_epirubicin2 (character -> logical)
## * fakevar_cislpatin (character -> logical)
## * fakevar_karboplatinak (character -> logical)
## * fakevar_topotekan2 (character -> logical)
## * fakevar_intravenost2 (character -> logical)
## * fakevar_planregfullf_varde (character -> integer)
## * fakevar_ktregimejfull0_varde (character -> integer)
## * fakevar_datavslbt (character -> Date)
## * fakevar_malomrade_varde (character -> integer)
## * fakevar_applring (character -> logical)
## * cervex_ovoider (character -> logical)
## * fakevar_intrauterin (character -> logical)
## * fakevar_dosrat_varde (character -> integer)
## * fakevar_dosplbaspa_varde (character -> integer)
## * p_slutdosbttot (character -> integer)
## * p_slutdosbtantfrak (character -> integer)
## * p_btplanregfullf_varde (character -> integer)
## * p_btplanregejfullf_varde (character -> integer)
## * p_datstartrt (character -> Date)
## * fakevar_lymfkortelmet (character -> logical)
## * fakevar_tumorresekt (character -> logical)
## * fakevar_snavamarginal (character -> logical)
## * fakevar_karlinvaxt (character -> logical)
## * fakevar_aggresivhist (character -> logical)
## * p_rtindstr23 (character -> logical)
## * fakevar_stroma (character -> logical)
## * p_rtindts (character -> logical)
## * fakevar_annatind (character -> logical)
## * p_rtsnavmarginalbred (character -> integer)
## * fakevar_tumorpelstat (character -> logical)
## * fakevar_boost (character -> logical)
## * fakevar_boostpelvina (character -> logical)
## * fakevar_malomrade2 (character -> logical)
## * fakevar_paraaortala (character -> logical)
## * fakevar_annatmalomr (character -> logical)
## * fakevar_radmed_varde (character -> integer)
## * fakevar_fotoner (character -> logical)
## * fakevar_protoner (character -> logical)
## * fakevar_imrt (character -> logical)
## * p_slutdosmalomr1 (character -> numeric)
## * p_slutdosmalomr2 (character -> numeric)
## * p_slutdosmalomr3 (character -> numeric)
## * p_slutdosmalomr4 (character -> numeric)
## * p_slutdosparaaortlgl (character -> numeric)
## * fakevar_planstralfull_varde (character -> integer)
## * fakevar_planregejfull_varde (character -> integer)
## * p_annanprimbeh_varde (character -> integer)
## * p_behevalefter_varde (character -> integer)
## * p_datresponsbed (character -> Date)
## * fakevar_funkstat1_varde (character -> integer)
## * fakevar_evalavser_varde (character -> integer)
## * p_bedefterprimbeh_varde (character -> integer)
## * p_palputannark (character -> logical)
## * p_palpinark (character -> logical)
## * fakevar_responsbedct (character -> logical)
## * fakevar_respbed (character -> logical)
## * p_petct (character -> logical)
## * fakevar_datstartbt (character -> Date)
## * regdatum (character -> Date)
## * sekretess_value (character -> integer)
## * u_inrappdat (character -> Date)
## * u_inrappenhsjhkod (character -> integer)
## * u_kompl (character -> logical)
## * u_datuppf (character -> Date)
## * u_sjukdomsstat_varde (character -> integer)
## * u_nastaklinkont_varde (character -> integer)
## * u_enhnastaklinkontk (character -> integer)
## * u_enhnastaklinkontj (character -> integer)
## * u_omantalman (character -> integer)
## * u_dodsdat (character -> Date)
## * u_dodsorsak_varde (character -> integer)
## * u_datreci (character -> Date)
## * vitalstatus (character -> integer)
## * vitalstatusdatum (character -> Date)
## * vitalstatusdatum_estimat (character -> Date)
## * a_snomed (character -> integer)
## * a_gynonkenhremsjh (character -> integer)
## New decoded columns added:
## * a_lkf_forsamling_beskrivning
## * a_lkf_lan_beskrivning
## * a_lkf_kommun_beskrivning
## * a_lkf_hemort2_beskrivning
## * a_lkf_hemort_beskrivning
## * kon_value_kon_beskrivning
## * lan_value_lan_beskrivning
## * lan_value_hemort2_beskrivning
## * lkf_value_kommun_beskrivning
## * lkf_value_hemort_beskrivning
## * lkf_value_forsamling_beskrivning
## * lkf_value_hemort2_beskrivning
## * lkf_value_lan_beskrivning
## pat_id used as id!
## Cache file saved: /var/folders/dd/z0l8zzy51jv48ywmfrv8yqt40000gp/T//Rtmpmto6Gg/ex_data573530b795b0.csv2.rds
## user system elapsed
## 1.172 0.012 1.187
Now, let’s assume that we for some reason have to restart the process all over again (and let’s time it again for the sake of comparison):
system.time(
  x <- use_incadata(fl)
)
## Use cached file created 2017-07-28 14:39:37
## user system elapsed
## 0.033 0.000 0.035
Voilà! The data is already in a good format, and the process was therefore much faster than before (0.035 versus 1.187 seconds elapsed)!
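The caching behaviour seen above can be approximated in a few lines of base R. The following is a minimal sketch, assuming the cache is an .rds file stored next to the original file and that the MD5 checksum of the original is stored together with the data. The function read_cached and its reader argument are hypothetical names for this illustration; the actual mechanism inside use_incadata may differ.

```r
# Hypothetical sketch of an MD5-checked cache; NOT the incadata implementation.
read_cached <- function(file, reader = utils::read.csv2) {
  cache <- paste0(file, ".rds")            # cache stored next to the original
  md5 <- unname(tools::md5sum(file))       # checksum of the current file
  if (file.exists(cache)) {
    cached <- readRDS(cache)
    # Reuse the cache only if the original file has not changed
    if (identical(cached$md5, md5)) {
      message("Use cached file")
      return(cached$data)
    }
  }
  # First use, or the file has changed: read it and refresh the cache
  data <- reader(file)
  saveRDS(list(md5 = md5, data = data), cache)
  message("Cache file saved: ", cache)
  data
}

demo_file <- tempfile(fileext = ".csv2")
write.csv2(data.frame(a = 1:3), demo_file, row.names = FALSE)
d1 <- read_cached(demo_file)  # reads the file and writes the cache
d2 <- read_cached(demo_file)  # checksum unchanged, so the cache is reused

# A new "export" changes the checksum, so the cache is rebuilt
write.csv2(data.frame(a = 4:6), demo_file, row.names = FALSE)
d3 <- read_cached(demo_file)
```

In a real workflow, reader would be the slow step (reading plus conversion), which is what makes the cache worthwhile.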
Additional functions from the package are found by help(package = "incadata"). We do, for example, include some dplyr-verb wrappers just to maintain some attributes and to get pretty printing of objects.