Vignette 1 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.
Phenology is the study of cyclic and seasonal natural phenomena, especially in relation to climate and plant life cycles. In plant science, phenology focuses on the timing of recurring biological events such as:
These phenological phases are strongly temperature-dependent and respond sensitively to interannual climate variability as well as long-term climate change. As a result, phenological records provide some of the most direct biological evidence of climate impacts on terrestrial ecosystems.
The PEP725 Pan European Phenology Database is one of the world’s most comprehensive collections of plant phenological observations. It brings together long-term records from national phenological monitoring networks across Europe into a harmonized and openly accessible research infrastructure. Key characteristics of PEP725 include (Templ et al. 2018, 2026):
Observations in PEP725 are based on standardized phase definitions (largely aligned with BBCH-type schemes) and consistent observation protocols within national networks. This standardization enables comparative analyses across regions, species, and time periods, thus addresses key data integration and reusability challenges commonly encountered in large, heterogeneous phenological datasets (Templ 2025).
Working with phenological data presents several challenges:
The pep725 package addresses these challenges by providing:
data.tableBefore you begin, install the package:
Once installed, load the package at the start of your R session:
When the package loads, you’ll see a startup message with tips on how to get data.
The pep725 package offers five ways to obtain data. Understanding which option to use is important for your workflow:
| Option | Best For | Data Size | Internet Required? |
|---|---|---|---|
| 1. Download synthetic data | Learning, tutorials, reproducible examples | ~7.24M rows | First download only |
| 2. Seed dataset | Quick offline tests, minimal examples | ~1,300 rows | No |
| 3. Generate synthetic data | Creating shareable versions of your data | Variable | No |
| 4. Import real PEP725 data | Actual research and publications | ~7.24M rows | For initial download |
| 5. Use your own phenological data | Regional networks, field trials, citizen science datasets | Variable | No |
You might wonder why we use synthetic data instead of real observations. The original PEP725 database has usage restrictions that prevent redistribution - you must register on the pep725.eu website and accept the data use policy. This creates a problem for tutorials and examples that need to be reproducible.
Synthetic data solves this problem. It is generated to preserve the statistical properties of real data:
This means you can learn all the analysis techniques using synthetic data, then apply them confidently to real data when you’re ready for actual research.
To learn the package, download the pre-generated synthetic dataset:
What happens here?
You can check your cache status or clear it if needed:
# Check cache status - shows file location and size
pep_cache_info()
#> $path
#> [1] "/Users/matthias/Library/Caches/org.R-project.R/R/pep725/pep_synth.rda"
#>
#> $size_mb
#> [1] 111.41
#>
#> $modified
#> [1] "2026-03-04 15:00:46 CET"
#>
#> $exists
#> [1] TRUE
# If you need to clear the cache (e.g., to get an updated version):
# pep_cache_clear()For quick tests when you don’t need the full dataset, use the included seed data:
# Load the included seed dataset (small, for quick tests only)
data(pep_seed)
# This is a small subset (~1,300 rows) useful for:
# - Testing code quickly
# - Working offline
# - Running examples in documentationWhen to use this: If you’re developing code and want fast iteration without waiting for the full dataset to load, the seed dataset is perfect. However, it’s too small for meaningful statistical analyses.
If you have real PEP725 data and want to create a shareable synthetic
version (e.g., for teaching or supplementary materials), use
pep_simulate():
# Generate synthetic data based on your real data
# This preserves statistical properties while anonymizing observations
pep_synthetic <- pep_simulate(your_real_pep_data)
# You can then share pep_synthetic freelyUse case: You’ve done an analysis with real data for a paper, and you want to provide reproducible examples in supplementary materials without violating data sharing restrictions.
For actual research and publications, you’ll want real data. Here’s the process:
pep_import()# Import from downloaded PEP725 files
# Point to the folder containing your downloaded CSV files
pep <- pep_import("path/to/pep725_data/")The pep_import() function automatically cleans and
formats the data:
Although pep725 is designed around the structure and conventions of the PEP725 database, it is not limited to PEP725 data.
If your dataset consists of station-based phenological observations with:
you can use most of the functionality in pep725 with minimal adaptation.
The key requirement is a PEP725-like tabular structure — one row per observation. This structure can usually be achieved with straightforward preprocessing (e.g., renaming columns and ensuring consistent DOY formatting).
Once your data has the required columns, convert it to a
pep object:
# Prepare your data with the required columns
my_data <- data.frame(
s_id = c("MY001", "MY001", "MY002"),
lon = c(10.1, 10.1, 11.3),
lat = c(47.5, 47.5, 48.1),
genus = c("Malus", "Malus", "Malus"),
species = c("Malus domestica", "Malus domestica", "Malus domestica"),
phase_id = c(60, 65, 60),
year = c(2020, 2020, 2020),
day = c(110, 125, 108)
)
# Convert to a pep object — validates required columns automatically
my_pep <- as.pep(my_data)Important Note for Research
All examples in these vignettes use synthetic data. The patterns you see are realistic but not actual scientific observations.
For publishable research, always use real PEP725 data! After registering at pep725.eu, simply replace
pep_download()withpep_import("your/data/path/"). All the analysis techniques work identically.
Now that we have data loaded, let’s understand what we’re working
with. The pep object is a special type of data frame
optimized for large datasets.
# View the data - the print method shows a summary
print(pep)
#> PEP725 Phenological Data
#> --------------------------------------------------
#> Observations: 7,240,883
#> Stations: 10,927
#> Species: 41
#> Phases: 73 (BBCH codes: 0, 1, 3, 7, 9, ...)
#> Years: 1775 - 2025
#> Extent: lon [-10.24, 48.35], lat [14.44, 68.44]
#> Countries: 34 (Austria, Belgium, Bosnia and Herz., Bulgaria, Croatia, ...)
#> --------------------------------------------------
#> First 5 rows:
#> s_id lon lat genus species phase_id year day
#> <fctr> <num> <num> <fctr> <fctr> <int> <int> <int>
#> 1: 26675 16.3264 48.3036 Vitis Vitis vinifera 100 1775 297
#> 2: 26675 16.3264 48.3036 Vitis Vitis vinifera 100 1776 281
#> 3: 26675 16.3264 48.3036 Vitis Vitis vinifera 100 1777 288
#> 4: 26675 16.3264 48.3036 Vitis Vitis vinifera 100 1778 290
#> 5: 26675 16.3264 48.3036 Vitis Vitis vinifera 100 1779 288Let’s get a quick visual overview. The plot() method for
pep objects provides three built-in views — a station map,
a time series, and a DOY histogram:
The output shows you:
Each row in the dataset represents one phenological observation - a record of when a specific plant at a specific location reached a particular developmental stage.
| Column | Description | Example |
|---|---|---|
s_id |
Station identifier - A unique code for each monitoring location | “AT0001” |
lon, lat |
Geographic coordinates - Longitude and latitude in decimal degrees | 15.5, 48.2 |
alt |
Altitude - Elevation above sea level in meters | 350 |
genus |
Plant genus - The genus name (first part of scientific name) | “Malus” |
species |
Full species name - Genus + specific epithet | “Malus domestica” |
phase_id |
BBCH code - Standardized phenological stage (explained below) | 60 |
year |
Observation year | 2015 |
day |
Day of year (DOY) - The day when the phase was observed (1-365) | 145 |
country |
Country name | “Austria” |
Phenological data uses Day of Year (DOY) rather than calendar dates. This simplifies calculations and comparisons:
Why DOY? It makes it easy to calculate things like “flowering advanced by 10 days” without dealing with month boundaries.
pep ClassThe package provides a custom class system that extends
data.table with phenology-specific methods. Let’s verify
what class our data has:
You’ll see three classes:
pep - Our custom class with phenology-specific
methodsdata.table - Fast data manipulation (from the
data.table package)data.frame - Base R compatibilityThis means you can use:
The summary() function is your first tool for
understanding a phenological dataset. It provides different views
depending on the by parameter:
# Default summary - shows top species by observation count
summary(pep)
#> PEP725 Phenological Data Summary
#> ==================================================
#>
#> OVERVIEW
#> Total observations: 7,240,883
#> Unique stations: 10,927
#> Year range: 1775 - 2025
#> DOY range: -210 - 578
#>
#> TOP 10 SPECIES (by observation count)
#> genus species n_obs n_stations year_min year_max median_doy
#> <fctr> <fctr> <int> <int> <int> <int> <int>
#> 1: Hordeum Hordeum vulgare 930924 7996 1870 2025 175
#> 2: Malus Malus domestica 899311 8403 1896 2025 140
#> 3: Triticum Triticum aestivum 668330 7504 1870 2024 220
#> 4: Avena Avena sativa 642098 7737 1870 2025 150
#> 5: Secale Secale cereale 605981 7837 1870 2025 213
#> 6: Solanum Solanum tuberosum 593638 7866 1870 2025 176
#> 7: Prunus Prunus cerasus 483615 7083 1873 2025 130
#> 8: Prunus Prunus domestica 429971 7617 1868 2025 134
#> 9: Zea Zea mays 417009 5692 1941 2025 193
#> 10: Beta Beta vulgaris 403413 6219 1926 2024 130What this tells you: Which species have the most data, how many stations observe each species, and the year range of observations.
# Summary by phenological phase
summary(pep, by = "phase")
#> PEP725 Phenological Data Summary
#> ==================================================
#>
#> OVERVIEW
#> Total observations: 7,240,883
#> Unique stations: 10,927
#> Year range: 1775 - 2025
#> DOY range: -210 - 578
#>
#> OBSERVATIONS BY BBCH PHASE
#> phase_id phase_name n_obs n_species
#> <int> <char> <int> <int>
#> 1: 0 Dry seed / Dormancy 751405 21
#> 2: 1 Seed imbibition 18041 8
#> 3: 3 Other 1099 6
#> 4: 7 Coleoptile emerged 17758 9
#> 5: 9 Emergence 230 8
#> 6: 10 First leaf through coleoptile 899415 19
#> 7: 11 First leaf unfolded 22210 22
#> 8: 12 2 leaves unfolded 61 5
#> 9: 13 3 leaves unfolded 328 2
#> 10: 14 Other 142 1
#> 11: 15 True leaves 3382 11
#> 12: 16 Other 1 1
#> 13: 17 Other 2 1
#> 14: 19 9+ leaves unfolded 45 6
#> 15: 21 Beginning of tillering 1 1
#> 16: 31 First node detectable 414453 9
#> 17: 33 Other 1 1
#> 18: 35 Other 1947 2
#> 19: 48 Other 41 1
#> 20: 49 Late boot 120 1
#> 21: 51 Beginning of heading 417161 8
#> 22: 55 Middle of heading 10054 6
#> 23: 56 Other 3 1
#> 24: 57 Other 79 2
#> 25: 59 End of heading 12005 9
#> 26: 60 Beginning of flowering / Heading complete 834172 33
#> 27: 61 Beginning of flowering 80103 29
#> 28: 63 Other 628 15
#> 29: 65 Full flowering / Anthesis 475922 33
#> 30: 67 Other 207 15
#> 31: 69 End of flowering 442901 20
#> 32: 75 Medium milk 29851 5
#> 33: 77 Late milk 1 1
#> 34: 80 Dough development 115 1
#> 35: 81 Beginning of ripening or fruit colouration 9340 20
#> 36: 83 Early dough 414 2
#> 37: 85 Soft dough 299417 14
#> 38: 87 Hard dough 894688 22
#> 39: 89 Fully ripe 15497 18
#> 40: 91 Other 660 10
#> 41: 93 Other 1070 12
#> 42: 95 50% of leaves fallen 93066 20
#> 43: 97 Plant dead 2677 13
#> 44: 100 Harvest 923474 34
#> 45: 109 Other 16802 8
#> 46: 111 First cut for silage 241233 3
#> 47: 131 First cut for hay 127007 1
#> 48: 135 Other 253 1
#> 49: 139 Other 235 1
#> 50: 140 Other 7638 1
#> 51: 151 Start of harvest for silage (corn, grass) 60999 1
#> 52: 161 Other 4796 1
#> 53: 182 Other 35771 1
#> 54: 200 Other 103 4
#> 55: 201 Other 975 12
#> 56: 202 Other 2 1
#> 57: 203 Other 2190 6
#> 58: 204 Other 35 1
#> 59: 205 Autumn coloring >= 50% 53709 17
#> 60: 206 Other 1 1
#> 61: 207 Other 3 1
#> 62: 209 Other 2412 7
#> 63: 210 Other 27 2
#> 64: 213 Other 51 4
#> 65: 223 Other 1166 3
#> 66: 250 Other 5960 1
#> 67: 380 Other 1982 16
#> 68: 381 Other 222 11
#> 69: 383 Other 1 1
#> 70: 385 Other 3119 9
#> 71: 386 Other 2 1
#> 72: 387 Other 1 1
#> 73: 389 Other 1 1
#> phase_id phase_name n_obs n_species
#> <int> <char> <int> <int>
#> median_doy iqr_doy
#> <int> <int>
#> 1: 127 170
#> 2: 138 38
#> 3: 115 60
#> 4: 117 18
#> 5: 117 37
#> 6: 141 157
#> 7: 118 22
#> 8: 107 39
#> 9: 165 185
#> 10: 106 178
#> 11: 111 27
#> 12: 123 0
#> 13: 126 3
#> 14: 140 19
#> 15: 130 0
#> 16: 129 29
#> 17: 130 0
#> 18: 163 20
#> 19: 207 28
#> 20: 216 52
#> 21: 161 25
#> 22: 160 21
#> 23: 69 4
#> 24: 96 11
#> 25: 182 29
#> 26: 124 30
#> 27: 157 21
#> 28: 126 105
#> 29: 124 18
#> 30: 113 45
#> 31: 132 17
#> 32: 228 17
#> 33: 109 0
#> 34: 278 20
#> 35: 239 28
#> 36: 214 16
#> 37: 206 27
#> 38: 234 64
#> 39: 217 28
#> 40: 274 38
#> 41: 292 26
#> 42: 301 18
#> 43: 314 16
#> 44: 226 49
#> 45: 242 35
#> 46: 171 101
#> 47: 155 15
#> 48: 169 14
#> 49: 181 17
#> 50: 232 21
#> 51: 269 20
#> 52: 270 22
#> 53: 81 22
#> 54: 263 24
#> 55: 281 26
#> 56: 296 5
#> 57: 286 15
#> 58: 282 24
#> 59: 286 20
#> 60: 301 0
#> 61: 320 6
#> 62: 295 15
#> 63: 273 20
#> 64: 285 19
#> 65: 120 14
#> 66: 85 25
#> 67: 222 84
#> 68: 263 112
#> 69: 301 0
#> 70: 258 69
#> 71: 276 15
#> 72: 309 0
#> 73: 309 0
#> median_doy iqr_doy
#> <int> <int>What this tells you: Which developmental stages are recorded, how many observations of each, and the typical DOY (useful for understanding the seasonal distribution of your data).
# Summary by country (showing top 5)
summary(pep, by = "country", top = 5)
#> PEP725 Phenological Data Summary
#> ==================================================
#>
#> OVERVIEW
#> Total observations: 7,240,883
#> Unique stations: 10,927
#> Year range: 1775 - 2025
#> DOY range: -210 - 578
#>
#> TOP 5 COUNTRIES (by observation count)
#> country n_obs n_stations n_species year_min year_max
#> <char> <int> <int> <int> <int> <int>
#> 1: Germany-North 4181984 4685 17 1951 2023
#> 2: Germany-South 2333531 1987 18 1930 2023
#> 3: Austria 366048 1434 22 1775 2025
#> 4: Sweden 82331 908 22 1870 2024
#> 5: Slovakia 53052 76 15 1950 2024What this tells you: Geographic coverage - which countries contribute most data, how many stations they have, and species diversity.
# Temporal coverage summary
summary(pep, by = "year")
#> PEP725 Phenological Data Summary
#> ==================================================
#>
#> OVERVIEW
#> Total observations: 7,240,883
#> Unique stations: 10,927
#> Year range: 1775 - 2025
#> DOY range: -210 - 578
#>
#> TEMPORAL COVERAGE
#> Years with data: 247
#> Median obs/year: 1,243
#> Peak year: 1968 (151,997 obs)
#>
#> Last 10 years:
#> year n_obs n_stations n_species
#> <int> <int> <int> <int>
#> 1: 2016 47812 1654 32
#> 2: 2017 43068 1571 29
#> 3: 2018 41041 1536 33
#> 4: 2019 42209 1522 29
#> 5: 2020 40181 1500 32
#> 6: 2021 39085 1393 31
#> 7: 2022 36813 1376 28
#> 8: 2023 35133 1351 28
#> 9: 2024 4315 369 27
#> 10: 2025 649 85 19What this tells you: How observations are distributed across years - are there gaps? Is coverage increasing or decreasing over time?
You’ll often want to focus on specific subsets of the data. The
package preserves the pep class when you subset, so all
methods continue to work:
# Filter to apple data using data.table syntax
# The syntax is: dataset[row_filter, column_selection]
apple <- pep[species == "Malus domestica"]
print(apple)
#> PEP725 Phenological Data
#> --------------------------------------------------
#> Observations: 899,311
#> Stations: 8,403
#> Species: 1
#> Phases: 31 (BBCH codes: 0, 7, 10, 11, 15, ...)
#> Years: 1896 - 2025
#> Extent: lon [-8.37, 48.35], lat [14.44, 66.90]
#> Countries: 24 (Austria, Bosnia and Herz., Croatia, Czechia, France, ...)
#> --------------------------------------------------
#> First 5 rows:
#> s_id lon lat genus species phase_id year day
#> <fctr> <num> <num> <fctr> <fctr> <int> <int> <int>
#> 1: 29529 2.49 48.8 Malus Malus domestica 65 1896 103
#> 2: 29529 2.49 48.8 Malus Malus domestica 65 1897 118
#> 3: 29529 2.49 48.8 Malus Malus domestica 65 1898 118
#> 4: 29529 2.49 48.8 Malus Malus domestica 65 1900 130
#> 5: 29529 2.49 48.8 Malus Malus domestica 65 1901 116
# Filter by year range
recent <- pep[year >= 2000 & year <= 2015]
print(recent)
#> PEP725 Phenological Data
#> --------------------------------------------------
#> Observations: 985,849
#> Stations: 3,576
#> Species: 33
#> Phases: 56 (BBCH codes: 0, 3, 7, 9, 10, ...)
#> Years: 2000 - 2015
#> Extent: lon [-10.24, 28.40], lat [37.22, 64.55]
#> Countries: 28 (Austria, Belgium, Bosnia and Herz., Bulgaria, Croatia, ...)
#> --------------------------------------------------
#> First 5 rows:
#> s_id lon lat genus species phase_id year day
#> <fctr> <num> <num> <fctr> <fctr> <int> <int> <int>
#> 1: 136 9.18333 54.2500 perm_grass <NA> 182 2000 5
#> 2: 6047 12.10000 50.8333 perm_grass <NA> 182 2000 9
#> 3: 1304 7.10000 51.1000 perm_grass <NA> 182 2000 10
#> 4: 6532 14.08330 47.4833 Secale Secale cereale 0 2000 15
#> 5: 206 9.66667 54.3167 perm_grass <NA> 182 2000 17
# The summary method works on subsets too
summary(recent)
#> PEP725 Phenological Data Summary
#> ==================================================
#>
#> OVERVIEW
#> Total observations: 985,849
#> Unique stations: 3,576
#> Year range: 2000 - 2015
#> DOY range: -128 - 523
#>
#> TOP 10 SPECIES (by observation count)
#> genus species n_obs n_stations year_min year_max median_doy
#> <fctr> <fctr> <int> <int> <int> <int> <int>
#> 1: Malus Malus domestica 179208 2734 2000 2015 133
#> 2: Zea Zea mays 107126 1829 2000 2015 198
#> 3: Hordeum Hordeum vulgare 95636 1906 2000 2015 181
#> 4: Triticum Triticum aestivum 90484 1848 2000 2015 207
#> 5: Prunus Prunus avium 86023 2820 2000 2015 122
#> 6: Secale Secale cereale 68568 1491 2000 2015 202
#> 7: Avena Avena sativa 67242 1558 2000 2015 150
#> 8: Pyrus Pyrus communis 65482 2063 2000 2015 121
#> 9: Prunus Prunus cerasus 61866 1835 2000 2015 123
#> 10: perm_grass <NA> 50086 2328 2000 2015 135Tip for beginners: The data.table
syntax pep[condition] is similar to base R’s
subset() function but much faster for large datasets. You
can combine multiple conditions with & (and) or
| (or).
The BBCH scale [from Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie; Meier (2018)] is a standardized system for describing plant developmental stages. It’s used worldwide, ensuring that “flowering” means the same thing whether observed in Germany or Spain.
BBCH codes are two-digit numbers where:
| First Digit | Principal Growth Stage |
|---|---|
| 0 | Germination / sprouting |
| 1 | Leaf development |
| 2 | Formation of side shoots / tillering |
| 3 | Stem elongation |
| 4 | Development of harvestable vegetative parts |
| 5 | Inflorescence emergence / heading |
| 6 | Flowering |
| 7 | Development of fruit |
| 8 | Ripening of fruit and seed |
| 9 | Senescence / dormancy |
The bbch_description() function translates codes to
human-readable descriptions:
# Get descriptions for BBCH codes present in your data
codes <- unique(pep$phase_id)
bbch_description(codes)
#> phase_id description
#> 1 0 Dry seed / Dormancy
#> 2 1 Seed imbibition
#> 3 7 Coleoptile emerged
#> 4 9 Emergence
#> 5 10 First leaf through coleoptile
#> 6 11 First leaf unfolded
#> 7 12 2 leaves unfolded
#> 8 13 3 leaves unfolded
#> 9 15 True leaves
#> 10 19 9+ leaves unfolded
#> 11 21 Beginning of tillering
#> 12 31 First node detectable
#> 13 49 Late boot
#> 14 51 Beginning of heading
#> 15 55 Middle of heading
#> 16 59 End of heading
#> 17 60 Beginning of flowering / Heading complete
#> 18 61 Beginning of flowering
#> 19 65 Full flowering / Anthesis
#> 20 69 End of flowering
#> 21 75 Medium milk
#> 22 77 Late milk
#> 23 80 Dough development
#> 24 81 Beginning of ripening or fruit colouration
#> 25 83 Early dough
#> 26 85 Soft dough
#> 27 87 Hard dough
#> 28 89 Fully ripe
#> 29 95 50% of leaves fallen
#> 30 97 Plant dead
#> 31 100 Harvest
#> 32 111 First cut for silage
#> 33 131 First cut for hay
#> 34 151 Start of harvest for silage (corn, grass)
#> 35 205 Autumn coloring >= 50%| Code | Stage | What to Look For |
|---|---|---|
| 10 | First leaves | Cotyledons or first true leaves visible |
| 11 | First leaf unfolded | First true leaf fully unfolded |
| 60 | First flowers open | Beginning of flowering (anthesis) |
| 65 | Full flowering | 50% of flowers open |
| 69 | End of flowering | All petals fallen |
| 85 | Fruit coloring | Softening begins |
| 87 | Fruit ripe | Ready for harvest |
| 89 | Full ripeness | Fruit fully mature |
| 91 | Shoot growth complete | Leaves fully colored |
| 95 | Leaf fall | 50% of leaves fallen |
When analyzing phenology, you typically focus on specific phenophases. Here’s how:
# Get all flowering observations (BBCH 60-69)
flowering <- pep[phase_id >= 60 & phase_id <= 69]
cat("Flowering observations:", nrow(flowering), "\n")
#> Flowering observations: 1833933
# Get only full flowering (BBCH 65)
full_flowering <- pep[phase_id == 65]
cat("Full flowering observations:", nrow(full_flowering), "\n")
#> Full flowering observations: 475922
# Get harvest-related observations
harvest <- pep[phase_id %in% c(87, 89)]
cat("Harvest observations:", nrow(harvest), "\n")
#> Harvest observations: 910185Before diving into analysis, it’s crucial to understand what your
dataset contains. The pep_coverage() function provides a
comprehensive overview.
# Get a complete coverage report
pep_coverage(pep)
#> PEP725 Data Coverage Report
#> ==================================================
#>
#> TEMPORAL COVERAGE
#> ------------------------------
#> Year range: 1775 - 2025 (247 years)
#> Missing years: 4 gaps
#> (1786, 1805, 1820, 1829)
#> Median obs/year: 1,243
#> Peak year: 1968 (151,997 obs)
#>
#> GEOGRAPHICAL COVERAGE
#> ------------------------------
#> Countries: 35
#> Stations: 10,927
#> Longitude: -10.24 to 48.35
#> Latitude: 14.44 to 68.44
#> Altitude: 0 to 1933 m
#>
#> Top countries:
#> country n_obs n_stations n_species
#> <char> <int> <int> <int>
#> 1: Germany-North 4181984 4685 17
#> 2: Germany-South 2333531 1987 18
#> 3: Austria 366048 1434 22
#> 4: Sweden 82331 908 22
#> 5: Slovakia 53052 76 15
#> 6: Switzerland 51620 186 19
#> 7: <NA> 32117 98 22
#> 8: France 20487 706 22
#> 9: Croatia 20130 10 15
#> 10: Czechia 20008 146 14
#>
#> SPECIES COVERAGE
#> ------------------------------
#> Genera: 30
#> Species: 41
#> Subspecies: 129
#>
#> Top genera:
#> genus n_obs n_species n_stations
#> <fctr> <int> <int> <int>
#> 1: Prunus 1357252 7 10012
#> 2: Hordeum 930924 1 7996
#> 3: Malus 899311 1 8403
#> 4: Triticum 668330 1 7504
#> 5: Avena 642098 1 7737
#> 6: Secale 605981 1 7837
#> 7: Solanum 593638 1 7866
#> 8: Zea 417009 1 5692
#> 9: Beta 403413 1 6219
#> 10: Pyrus 308320 1 7749
#>
#> Top species:
#> genus species n_obs n_stations year_min year_max
#> <fctr> <fctr> <int> <int> <int> <int>
#> 1: Hordeum Hordeum vulgare 930924 7996 1870 2025
#> 2: Malus Malus domestica 899311 8403 1896 2025
#> 3: Triticum Triticum aestivum 668330 7504 1870 2024
#> 4: Avena Avena sativa 642098 7737 1870 2025
#> 5: Secale Secale cereale 605981 7837 1870 2025
#> 6: Solanum Solanum tuberosum 593638 7866 1870 2025
#> 7: Prunus Prunus cerasus 483615 7083 1873 2025
#> 8: Prunus Prunus domestica 429971 7617 1868 2025
#> 9: Zea Zea mays 417009 5692 1941 2025
#> 10: Beta Beta vulgaris 403413 6219 1926 2024This report shows you three dimensions of coverage:
You can focus on specific aspects and optionally generate plots:
#> PEP725 Data Coverage Report
#> ==================================================
#>
#> TEMPORAL COVERAGE
#> ------------------------------
#> Year range: 1775 - 2025 (247 years)
#> Missing years: 4 gaps
#> (1786, 1805, 1820, 1829)
#> Median obs/year: 1,243
#> Peak year: 1968 (151,997 obs)
The temporal plot shows observation density over time. Look for:
# Geographical coverage - which countries have most data
pep_coverage(pep, kind = "geographical", top = 5)
#> PEP725 Data Coverage Report
#> ==================================================
#>
#> GEOGRAPHICAL COVERAGE
#> ------------------------------
#> Countries: 35
#> Stations: 10,927
#> Longitude: -10.24 to 48.35
#> Latitude: 14.44 to 68.44
#> Altitude: 0 to 1933 m
#>
#> Top countries:
#> country n_obs n_stations n_species
#> <char> <int> <int> <int>
#> 1: Germany-North 4181984 4685 17
#> 2: Germany-South 2333531 1987 18
#> 3: Austria 366048 1434 22
#> 4: Sweden 82331 908 22
#> 5: Slovakia 53052 76 15Geographic coverage tells you where your data comes from. This is important because:
# Species coverage - which species are best represented
pep_coverage(pep, kind = "species", top = 5)
#> PEP725 Data Coverage Report
#> ==================================================
#>
#> SPECIES COVERAGE
#> ------------------------------
#> Genera: 30
#> Species: 41
#> Subspecies: 129
#>
#> Top genera:
#> genus n_obs n_species n_stations
#> <fctr> <int> <int> <int>
#> 1: Prunus 1357252 7 10012
#> 2: Hordeum 930924 1 7996
#> 3: Malus 899311 1 8403
#> 4: Triticum 668330 1 7504
#> 5: Avena 642098 1 7737
#>
#> Top species:
#> genus species n_obs n_stations year_min year_max
#> <fctr> <fctr> <int> <int> <int> <int>
#> 1: Hordeum Hordeum vulgare 930924 7996 1870 2025
#> 2: Malus Malus domestica 899311 8403 1896 2025
#> 3: Triticum Triticum aestivum 668330 7504 1870 2024
#> 4: Avena Avena sativa 642098 7737 1870 2025
#> 5: Secale Secale cereale 605981 7837 1870 2025Species coverage shows which plants have enough data for robust analyses. As a rule of thumb:
You can break down coverage by grouping variables:
# Temporal coverage broken down by country
cov_by_country <- pep_coverage(pep, kind = "temporal", by = "country")
# Show observations by group
cov_by_country$temporal$obs_by_group
#> country n_obs n_stations year_min year_max n_years
#> <char> <int> <int> <int> <int> <int>
#> 1: Germany-North 4181984 4685 1951 2023 73
#> 2: Germany-South 2333531 1987 1930 2023 80
#> 3: Austria 366048 1434 1775 2025 193
#> 4: Sweden 82331 908 1870 2024 115
#> 5: Slovakia 53052 76 1950 2024 55
#> 6: Switzerland 51620 186 1951 2024 74
#> 7: <NA> 32117 98 1873 2025 121
#> 8: France 20487 706 1872 2025 151
#> 9: Croatia 20130 10 1961 2024 63
#> 10: Czechia 20008 146 1930 2023 76
#> 11: Slovenia 11507 15 1961 2020 60
#> 12: Lithuania 11449 28 1960 2004 45
#> 13: Poland 9461 14 1951 2024 74
#> 14: Montenegro 9386 7 1951 2021 63
#> 15: Spain 7446 79 1946 2025 77
#> 16: Latvia 6595 102 1962 2018 52
#> 17: Bosnia and Herz. 5278 7 1971 2024 42
#> 18: Hungary 3757 9 1947 2024 67
#> 19: Romania 3700 70 1982 2005 24
#> 20: Netherlands 3195 246 1868 2009 109
#> 21: Russia 2395 33 1873 2019 68
#> 22: Belgium 1860 64 1944 2020 62
#> 23: Italy 650 12 1984 2024 40
#> 24: Luxembourg 639 2 1955 2004 47
#> 25: Norway 637 5 1927 2005 79
#> 26: Ireland 453 5 1968 2024 57
#> 27: Bulgaria 385 8 1973 2005 33
#> 28: Liechtenstein 309 1 1973 2024 49
#> 29: Serbia 294 6 1971 2002 29
#> 30: Denmark 133 4 1972 2000 29
#> 31: United Kingdom 23 1 1971 1981 11
#> 32: Portugal 12 1 1973 1981 8
#> 33: Greece 6 2 1971 1999 2
#> 34: Finland 4 1 1976 1984 3
#> 35: Yemen 1 1 2025 2025 1
#> country n_obs n_stations year_min year_max n_years
#> <char> <int> <int> <int> <int> <int>This helps identify which countries have continuous records versus sporadic observations.
Let’s put everything together with a simple but meaningful analysis: tracking how apple flowering dates have changed over time.
# Step 1: Filter to apple flowering (BBCH 60 = first flowers)
apple_flowering <- pep[species == "Malus domestica" & phase_id == 60]
cat("Apple flowering observations:", nrow(apple_flowering), "\n")
#> Apple flowering observations: 205403
cat("Year range:", min(apple_flowering$year), "-", max(apple_flowering$year), "\n")
#> Year range: 1926 - 2025
cat("Countries:", length(unique(apple_flowering$country)), "\n")
#> Countries: 24
# Step 2: Calculate annual mean DOY
# The data.table syntax here means:
# - Group by year
# - Calculate mean DOY and count observations
# - Order by year
annual_mean <- apple_flowering[, .(
mean_doy = mean(day, na.rm = TRUE),
n_obs = .N
), by = year][order(year)]
# Step 3: Look at the results
print(annual_mean)
#> year mean_doy n_obs
#> <int> <num> <int>
#> 1: 1926 119.4658 73
#> 2: 1927 126.0164 61
#> 3: 1928 126.5750 80
#> 4: 1929 133.8764 89
#> 5: 1930 126.1333 90
#> 6: 1931 132.2447 94
#> 7: 1932 130.5000 2
#> 8: 1933 135.5000 2
#> 9: 1934 138.0000 2
#> 10: 1935 139.0000 1
#> 11: 1936 135.0000 1
#> 12: 1937 136.0000 1
#> 13: 1938 141.0000 1
#> 14: 1939 145.0000 1
#> 15: 1940 138.0000 1
#> 16: 1941 128.0000 1
#> 17: 1942 144.0000 1
#> 18: 1943 136.0000 3
#> 19: 1944 120.0000 4
#> 20: 1945 128.5000 2
#> 21: 1946 112.7351 268
#> 22: 1947 118.6537 283
#> 23: 1948 116.9475 305
#> 24: 1949 117.0677 310
#> 25: 1950 118.7792 308
#> 26: 1951 125.8272 2245
#> 27: 1952 122.3525 2369
#> 28: 1953 121.1412 2471
#> 29: 1954 128.4589 2702
#> 30: 1955 129.3909 2937
#> 31: 1956 130.0572 3060
#> 32: 1957 124.0544 2905
#> 33: 1958 128.6096 3256
#> 34: 1959 121.6012 3054
#> 35: 1960 125.2657 3161
#> 36: 1961 120.4273 2612
#> 37: 1962 127.8020 2646
#> 38: 1963 127.8143 2703
#> 39: 1964 126.9062 2591
#> 40: 1965 129.0643 2659
#> 41: 1966 125.9031 3167
#> 42: 1967 126.7869 3248
#> 43: 1968 125.0423 3306
#> 44: 1969 128.3654 3270
#> 45: 1970 130.1720 3139
#> 46: 1971 127.7643 3250
#> 47: 1972 127.3223 3267
#> 48: 1973 129.3266 3295
#> 49: 1974 125.8274 3180
#> 50: 1975 128.0957 3157
#> 51: 1976 127.9840 3122
#> 52: 1977 128.3329 3046
#> 53: 1978 128.7715 3064
#> 54: 1979 129.8692 3042
#> 55: 1980 129.7322 3193
#> 56: 1981 127.6371 2987
#> 57: 1982 128.9987 2982
#> 58: 1983 128.0494 2917
#> 59: 1984 129.0000 2894
#> 60: 1985 128.3925 2838
#> 61: 1986 127.8025 2795
#> 62: 1987 127.2105 2717
#> 63: 1988 125.6127 2703
#> 64: 1989 123.4611 2611
#> 65: 1990 121.7045 2474
#> 66: 1991 122.6691 3545
#> 67: 1992 122.7809 4212
#> 68: 1993 122.0902 3782
#> 69: 1994 122.3447 3797
#> 70: 1995 122.1057 3585
#> 71: 1996 121.9156 3921
#> 72: 1997 121.0351 3535
#> 73: 1998 120.4800 3421
#> 74: 1999 119.8520 3290
#> 75: 2000 119.3216 3265
#> 76: 2001 119.8012 2736
#> 77: 2002 118.9894 2634
#> 78: 2003 118.9929 2552
#> 79: 2004 118.8181 2524
#> 80: 2005 118.3861 2362
#> 81: 2006 118.8295 2369
#> 82: 2007 117.2291 2270
#> 83: 2008 116.9619 2284
#> 84: 2009 116.6252 2276
#> 85: 2010 116.6667 2292
#> 86: 2011 115.7146 2267
#> 87: 2012 115.8378 2324
#> 88: 2013 116.8630 2175
#> 89: 2014 114.3385 2077
#> 90: 2015 115.7285 2081
#> 91: 2016 115.5584 2013
#> 92: 2017 114.5264 1894
#> 93: 2018 114.5914 1833
#> 94: 2019 113.8795 1817
#> 95: 2020 113.5284 1830
#> 96: 2021 116.5177 1756
#> 97: 2022 115.4621 1755
#> 98: 2023 116.7220 1676
#> 99: 2024 106.9771 218
#> 100: 2025 109.4375 16
#> year mean_doy n_obs
#> <int> <num> <int>Now let’s visualize the trend:
# Step 4: Plot the trend with ggplot2
library(ggplot2)
p <- ggplot(annual_mean, aes(x = year, y = mean_doy)) +
geom_point(aes(size = n_obs), color = "steelblue", alpha = 0.6) +
geom_line(color = "steelblue", alpha = 0.4) +
geom_smooth(method = "lm", color = "red", linetype = 2, se = FALSE) +
labs(
title = "Apple Flowering Date Over Time",
x = "Year", y = "Mean Day of Year",
size = "Observations"
) +
theme_minimal()
print(p)
#> `geom_smooth()` using formula = 'y ~ x'
# Step 5: Quantify the trend
trend <- lm(mean_doy ~ year, data = annual_mean)
slope <- coef(trend)[2]
cat("\nTrend:", round(slope * 10, 2), "days per decade\n")
#>
#> Trend: -1.77 days per decade
if (slope < 0) {
cat("Interpretation: Flowering is getting EARLIER\n")
} else {
cat("Interpretation: Flowering is getting LATER\n")
}
#> Interpretation: Flowering is getting EARLIERInterpreting the results: A negative slope means flowering is occurring earlier in the year - a common finding for spring phenology in a warming climate. A shift of -2 days per decade means that over 50 years, flowering has advanced by about 10 days.
Let’s compare apple with grapevine (Vitis vinifera), which has the longest time series in the database (records back to 1775 in some regions):
# Step 1: Filter to grapevine flowering (BBCH 65 = full flowering)
vine_flowering <- pep[species == "Vitis vinifera" & phase_id == 65]
cat("Grapevine flowering observations:", nrow(vine_flowering), "\n")
#> Grapevine flowering observations: 17749
cat("Year range:", min(vine_flowering$year), "-", max(vine_flowering$year), "\n")
#> Year range: 1951 - 2024
cat("Countries:", length(unique(vine_flowering$country)), "\n")
#> Countries: 15
# Step 2: Calculate annual mean DOY
vine_annual <- vine_flowering[, .(
mean_doy = mean(day, na.rm = TRUE),
n_obs = .N
), by = year][order(year)]
# Step 3: Plot the trend
p <- ggplot(vine_annual, aes(x = year, y = mean_doy)) +
geom_point(aes(size = n_obs), color = "purple", alpha = 0.6) +
geom_line(color = "purple", alpha = 0.4) +
geom_smooth(method = "lm", color = "red", linetype = 2, se = FALSE) +
labs(
title = "Grapevine Flowering Date Over Time",
x = "Year", y = "Mean Day of Year",
size = "Observations"
) +
theme_minimal()
print(p)
#> `geom_smooth()` using formula = 'y ~ x'
trend <- lm(mean_doy ~ year, data = vine_annual)
slope <- coef(trend)[2]
cat("\nTrend:", round(slope * 10, 2), "days per decade\n")
#>
#> Trend: -1.95 days per decadeWhy grapevine? Vitis vinifera is economically important (wine production), and its phenology is closely monitored. The long historical records make it ideal for studying long-term climate trends.
As you start working with phenological data, keep these points in mind:
Now that you understand the basics, explore the other vignettes for more advanced analyses:
Learn to calculate phenological normals (long-term averages), detect anomalies (unusual years), and analyze trends using:
pheno_normals() - Calculate climatological
baselinespheno_anomaly() - Identify deviations from normalpheno_trend_turning() - Detect changes in trend
directionAnalyze how phenology varies across landscapes:
pheno_gradient() - Quantify elevation and latitude
effectspheno_synchrony() - Measure spatial coherencepheno_map() - Create maps of phenological patternsIf you encounter problems:
?function_name (e.g., ?pheno_normals)For reproducibility, here’s the R environment used for this vignette:
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.0.1
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Zurich
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pep725_1.0.0 ggplot2_4.0.1 data.table_1.18.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4 tidyr_1.3.2
#> [4] class_7.3-23 robustbase_0.99-7 KernSmooth_2.23-26
#> [7] lattice_0.22-7 digest_0.6.39 magrittr_2.0.4
#> [10] evaluate_1.0.5 grid_4.5.2 RColorBrewer_1.1-3
#> [13] rnaturalearth_1.2.0 fastmap_1.2.0 Matrix_1.7-4
#> [16] jsonlite_2.0.0 e1071_1.7-17 DBI_1.2.3
#> [19] mgcv_1.9-3 purrr_1.2.1 viridisLite_0.4.2
#> [22] scales_1.4.0 jquerylib_0.1.4 cli_3.6.5
#> [25] rlang_1.1.7 units_1.0-0 splines_4.5.2
#> [28] withr_3.0.2 cachem_1.1.0 yaml_2.3.12
#> [31] otel_0.2.0 tools_4.5.2 dplyr_1.2.0
#> [34] vctrs_0.7.1 R6_2.6.1 proxy_0.4-29
#> [37] lifecycle_1.0.5 classInt_0.4-11 pkgconfig_2.0.3
#> [40] pillar_1.11.1 bslib_0.10.0 gtable_0.3.6
#> [43] glue_1.8.0 Rcpp_1.1.1 sf_1.0-24
#> [46] DEoptimR_1.1-4 xfun_0.56 tibble_3.3.1
#> [49] tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51
#> [52] farver_2.1.2 nlme_3.1-168 htmltools_0.5.9
#> [55] patchwork_1.3.2 rmarkdown_2.30 labeling_0.4.3
#> [58] compiler_4.5.2 S7_0.2.1