Getting Started with pep725

Vignette 1 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.

What is Phenology?

Phenology is the study of cyclic and seasonal natural phenomena, especially in relation to climate and plant life cycles. In plant science, phenology focuses on the timing of recurring biological events such as:

Budburst: When dormant buds begin to open in spring
Leaf unfolding: The emergence of new leaves
Flowering: When flowers open (anthesis)
Fruit ripening: The maturation of fruits
Leaf senescence: Autumn coloring and leaf fall

These phenological phases are strongly temperature-dependent and respond sensitively to interannual climate variability as well as long-term climate change. As a result, phenological records provide some of the most direct biological evidence of climate impacts on terrestrial ecosystems.

The PEP725 Database

The PEP725 Pan European Phenology Database is one of the world’s most comprehensive collections of plant phenological observations. It brings together long-term records from national phenological monitoring networks across Europe into a harmonized and openly accessible research infrastructure. Key characteristics of PEP725 include (Templ et al. 2018, 2026):

Over 12 million observations from across over 30 European countries
Records spanning 250+ years (some stations have data back to the 1750s)
Thousands of monitoring stations spanning a wide range of elevations
Broad taxonomic coverage including agricultural crops, fruit trees, wild plant and forest tree species

Observations in PEP725 are based on standardized phase definitions (largely aligned with BBCH-type schemes) and consistent observation protocols within national networks. This standardization enables comparative analyses across regions, species, and time periods, thus addresses key data integration and reusability challenges commonly encountered in large, heterogeneous phenological datasets (Templ 2025).

Why Use the pep725 R Package?

Working with phenological data presents several challenges:

Data volume: Millions of observations require efficient data handling
Data quality: Observations may contain errors or outliers
Standardization: Different countries use different formats
Analysis complexity: Calculating trends, normals, and anomalies requires specialized methods

The pep725 package addresses these challenges by providing:

Efficient data import and handling using data.table
Quality assessment and outlier detection tools
Functions for calculating phenological normals and anomalies
Spatial analysis (gradients, synchrony)
Publication-ready visualizations

Installation

Before you begin, install the package:

# Install from CRAN (when available)
install.packages("pep725")

# Or install development version from GitHub
# First install devtools if you don't have it:
# install.packages("devtools")
devtools::install_github("matthias-da/pep725")

Loading the Package

Once installed, load the package at the start of your R session:

library(pep725)

When the package loads, you’ll see a startup message with tips on how to get data.

Getting Phenological Data

The pep725 package offers five ways to obtain data. Understanding which option to use is important for your workflow:

Option	Best For	Data Size	Internet Required?
1. Download synthetic data	Learning, tutorials, reproducible examples	~7.24M rows	First download only
2. Seed dataset	Quick offline tests, minimal examples	~1,300 rows	No
3. Generate synthetic data	Creating shareable versions of your data	Variable	No
4. Import real PEP725 data	Actual research and publications	~7.24M rows	For initial download
5. Use your own phenological data	Regional networks, field trials, citizen science datasets	Variable	No

Why Synthetic Data?

You might wonder why we use synthetic data instead of real observations. The original PEP725 database has usage restrictions that prevent redistribution - you must register on the pep725.eu website and accept the data use policy. This creates a problem for tutorials and examples that need to be reproducible.

Synthetic data solves this problem. It is generated to preserve the statistical properties of real data:

Same distributions of phenological events across seasons
Similar correlations between variables (e.g. latitude and flowering date)
Realistic spatial patterns across Europe
Comparable year-to-year variability

This means you can learn all the analysis techniques using synthetic data, then apply them confidently to real data when you’re ready for actual research.

Option 1: Download Synthetic Data (Recommended for Learning)

To learn the package, download the pre-generated synthetic dataset:

# Download synthetic data (cached locally after first download)
pep <- pep_download()

What happens here?

The first time you run this, the package downloads a ~64 MB file from GitHub
The file is cached in a local directory (typically in your user folder)
Subsequent calls load from the cache instantly - no re-download needed

You can check your cache status or clear it if needed:

# Check cache status - shows file location and size
pep_cache_info()
#> $path
#> [1] "/Users/matthias/Library/Caches/org.R-project.R/R/pep725/pep_synth.rda"
#> 
#> $size_mb
#> [1] 111.41
#> 
#> $modified
#> [1] "2026-03-04 15:00:46 CET"
#> 
#> $exists
#> [1] TRUE

# If you need to clear the cache (e.g., to get an updated version):
# pep_cache_clear()

Option 2: Use the Small Seed Dataset

For quick tests when you don’t need the full dataset, use the included seed data:

# Load the included seed dataset (small, for quick tests only)
data(pep_seed)

# This is a small subset (~1,300 rows) useful for:
# - Testing code quickly
# - Working offline
# - Running examples in documentation

When to use this: If you’re developing code and want fast iteration without waiting for the full dataset to load, the seed dataset is perfect. However, it’s too small for meaningful statistical analyses.

Option 3: Generate Your Own Synthetic Data

If you have real PEP725 data and want to create a shareable synthetic version (e.g., for teaching or supplementary materials), use pep_simulate():

# Generate synthetic data based on your real data
# This preserves statistical properties while anonymizing observations
pep_synthetic <- pep_simulate(your_real_pep_data)

# You can then share pep_synthetic freely

Use case: You’ve done an analysis with real data for a paper, and you want to provide reproducible examples in supplementary materials without violating data sharing restrictions.

Option 4: Import Real PEP725 Data (For Research)

For actual research and publications, you’ll want real data. Here’s the process:

Register at pep725.eu (free for research)
Download data for your species/regions of interest
Import using pep_import()

# Import from downloaded PEP725 files
# Point to the folder containing your downloaded CSV files
pep <- pep_import("path/to/pep725_data/")

The pep_import() function automatically cleans and formats the data:

Converts columns to appropriate types (factors, dates, etc.)
Handles missing values (e.g., -9999 altitude codes become NA)
Removes problematic columns
Optionally adds country names based on coordinates

Option 5: Use Your Own Plant Phenological Data

Although pep725 is designed around the structure and conventions of the PEP725 database, it is not limited to PEP725 data.

If your dataset consists of station-based phenological observations with:

spatial coordinates (longitude and latitude),
observation year,
species identifiers,
phenophase identifiers,
and day-of-year (DOY) information,

you can use most of the functionality in pep725 with minimal adaptation.

The key requirement is a PEP725-like tabular structure — one row per observation. This structure can usually be achieved with straightforward preprocessing (e.g., renaming columns and ensuring consistent DOY formatting).

Once your data has the required columns, convert it to a pep object:

# Prepare your data with the required columns
my_data <- data.frame(
  s_id = c("MY001", "MY001", "MY002"),
  lon = c(10.1, 10.1, 11.3),
  lat = c(47.5, 47.5, 48.1),
  genus = c("Malus", "Malus", "Malus"),
  species = c("Malus domestica", "Malus domestica", "Malus domestica"),
  phase_id = c(60, 65, 60),
  year = c(2020, 2020, 2020),
  day = c(110, 125, 108)
)

# Convert to a pep object — validates required columns automatically
my_pep <- as.pep(my_data)

Important Note for Research

All examples in these vignettes use synthetic data. The patterns you see are realistic but not actual scientific observations.

For publishable research, always use real PEP725 data! After registering at pep725.eu, simply replace pep_download() with pep_import("your/data/path/"). All the analysis techniques work identically.

Understanding the Data Structure

Now that we have data loaded, let’s understand what we’re working with. The pep object is a special type of data frame optimized for large datasets.

# View the data - the print method shows a summary
print(pep)
#> PEP725 Phenological Data
#> --------------------------------------------------
#> Observations: 7,240,883
#> Stations:     10,927
#> Species:      41
#> Phases:       73 (BBCH codes: 0, 1, 3, 7, 9, ...)
#> Years:        1775 - 2025
#> Extent:       lon [-10.24, 48.35], lat [14.44, 68.44]
#> Countries:    34 (Austria, Belgium, Bosnia and Herz., Bulgaria, Croatia, ...)
#> --------------------------------------------------
#> First 5 rows:
#>      s_id     lon     lat  genus        species phase_id  year   day
#>    <fctr>   <num>   <num> <fctr>         <fctr>    <int> <int> <int>
#> 1:  26675 16.3264 48.3036  Vitis Vitis vinifera      100  1775   297
#> 2:  26675 16.3264 48.3036  Vitis Vitis vinifera      100  1776   281
#> 3:  26675 16.3264 48.3036  Vitis Vitis vinifera      100  1777   288
#> 4:  26675 16.3264 48.3036  Vitis Vitis vinifera      100  1778   290
#> 5:  26675 16.3264 48.3036  Vitis Vitis vinifera      100  1779   288

Let’s get a quick visual overview. The plot() method for pep objects provides three built-in views — a station map, a time series, and a DOY histogram:

# Where are the stations?
plot(pep, type = "map")


# How are phenological events distributed across the year?
plot(pep, type = "histogram")

What You See in the Output

The output shows you:

Dimensions: Number of rows (observations) and columns
Column names and types: What variables are available
First few rows: A preview of actual values

Key Columns Explained

Each row in the dataset represents one phenological observation - a record of when a specific plant at a specific location reached a particular developmental stage.

Column	Description	Example
`s_id`	Station identifier - A unique code for each monitoring location	“AT0001”
`lon`, `lat`	Geographic coordinates - Longitude and latitude in decimal degrees	15.5, 48.2
`alt`	Altitude - Elevation above sea level in meters	350
`genus`	Plant genus - The genus name (first part of scientific name)	“Malus”
`species`	Full species name - Genus + specific epithet	“Malus domestica”
`phase_id`	BBCH code - Standardized phenological stage (explained below)	60
`year`	Observation year	2015
`day`	Day of year (DOY) - The day when the phase was observed (1-365)	145
`country`	Country name	“Austria”

Understanding Day of Year (DOY)

Phenological data uses Day of Year (DOY) rather than calendar dates. This simplifies calculations and comparisons:

DOY 1 = January 1
DOY 91 = April 1 (approximately)
DOY 182 = July 1 (approximately)
DOY 274 = October 1 (approximately)
DOY 365/366 = December 31

Why DOY? It makes it easy to calculate things like “flowering advanced by 10 days” without dealing with month boundaries.

The `pep` Class

The package provides a custom class system that extends data.table with phenology-specific methods. Let’s verify what class our data has:

# Check the class hierarchy
class(pep)
#> [1] "pep"        "data.table" "data.frame"

You’ll see three classes:

pep - Our custom class with phenology-specific methods
data.table - Fast data manipulation (from the data.table package)
data.frame - Base R compatibility

This means you can use:

pep725 functions designed for phenological analysis
data.table syntax for fast filtering and aggregation
Base R functions that work with data frames

The Summary Method

The summary() function is your first tool for understanding a phenological dataset. It provides different views depending on the by parameter:

# Default summary - shows top species by observation count
summary(pep)
#> PEP725 Phenological Data Summary
#> ==================================================
#> 
#> OVERVIEW
#>   Total observations: 7,240,883
#>   Unique stations:    10,927
#>   Year range:         1775 - 2025
#>   DOY range:          -210 - 578
#> 
#> TOP 10 SPECIES (by observation count)
#>        genus           species  n_obs n_stations year_min year_max median_doy
#>       <fctr>            <fctr>  <int>      <int>    <int>    <int>      <int>
#>  1:  Hordeum   Hordeum vulgare 930924       7996     1870     2025        175
#>  2:    Malus   Malus domestica 899311       8403     1896     2025        140
#>  3: Triticum Triticum aestivum 668330       7504     1870     2024        220
#>  4:    Avena      Avena sativa 642098       7737     1870     2025        150
#>  5:   Secale    Secale cereale 605981       7837     1870     2025        213
#>  6:  Solanum Solanum tuberosum 593638       7866     1870     2025        176
#>  7:   Prunus    Prunus cerasus 483615       7083     1873     2025        130
#>  8:   Prunus  Prunus domestica 429971       7617     1868     2025        134
#>  9:      Zea          Zea mays 417009       5692     1941     2025        193
#> 10:     Beta     Beta vulgaris 403413       6219     1926     2024        130

What this tells you: Which species have the most data, how many stations observe each species, and the year range of observations.

# Summary by phenological phase
summary(pep, by = "phase")
#> PEP725 Phenological Data Summary
#> ==================================================
#> 
#> OVERVIEW
#>   Total observations: 7,240,883
#>   Unique stations:    10,927
#>   Year range:         1775 - 2025
#>   DOY range:          -210 - 578
#> 
#> OBSERVATIONS BY BBCH PHASE
#>     phase_id                                 phase_name  n_obs n_species
#>        <int>                                     <char>  <int>     <int>
#>  1:        0                        Dry seed / Dormancy 751405        21
#>  2:        1                            Seed imbibition  18041         8
#>  3:        3                                      Other   1099         6
#>  4:        7                         Coleoptile emerged  17758         9
#>  5:        9                                  Emergence    230         8
#>  6:       10              First leaf through coleoptile 899415        19
#>  7:       11                        First leaf unfolded  22210        22
#>  8:       12                          2 leaves unfolded     61         5
#>  9:       13                          3 leaves unfolded    328         2
#> 10:       14                                      Other    142         1
#> 11:       15                                True leaves   3382        11
#> 12:       16                                      Other      1         1
#> 13:       17                                      Other      2         1
#> 14:       19                         9+ leaves unfolded     45         6
#> 15:       21                     Beginning of tillering      1         1
#> 16:       31                      First node detectable 414453         9
#> 17:       33                                      Other      1         1
#> 18:       35                                      Other   1947         2
#> 19:       48                                      Other     41         1
#> 20:       49                                  Late boot    120         1
#> 21:       51                       Beginning of heading 417161         8
#> 22:       55                          Middle of heading  10054         6
#> 23:       56                                      Other      3         1
#> 24:       57                                      Other     79         2
#> 25:       59                             End of heading  12005         9
#> 26:       60  Beginning of flowering / Heading complete 834172        33
#> 27:       61                     Beginning of flowering  80103        29
#> 28:       63                                      Other    628        15
#> 29:       65                  Full flowering / Anthesis 475922        33
#> 30:       67                                      Other    207        15
#> 31:       69                           End of flowering 442901        20
#> 32:       75                                Medium milk  29851         5
#> 33:       77                                  Late milk      1         1
#> 34:       80                          Dough development    115         1
#> 35:       81 Beginning of ripening or fruit colouration   9340        20
#> 36:       83                                Early dough    414         2
#> 37:       85                                 Soft dough 299417        14
#> 38:       87                                 Hard dough 894688        22
#> 39:       89                                 Fully ripe  15497        18
#> 40:       91                                      Other    660        10
#> 41:       93                                      Other   1070        12
#> 42:       95                       50% of leaves fallen  93066        20
#> 43:       97                                 Plant dead   2677        13
#> 44:      100                                    Harvest 923474        34
#> 45:      109                                      Other  16802         8
#> 46:      111                       First cut for silage 241233         3
#> 47:      131                          First cut for hay 127007         1
#> 48:      135                                      Other    253         1
#> 49:      139                                      Other    235         1
#> 50:      140                                      Other   7638         1
#> 51:      151  Start of harvest for silage (corn, grass)  60999         1
#> 52:      161                                      Other   4796         1
#> 53:      182                                      Other  35771         1
#> 54:      200                                      Other    103         4
#> 55:      201                                      Other    975        12
#> 56:      202                                      Other      2         1
#> 57:      203                                      Other   2190         6
#> 58:      204                                      Other     35         1
#> 59:      205                     Autumn coloring >= 50%  53709        17
#> 60:      206                                      Other      1         1
#> 61:      207                                      Other      3         1
#> 62:      209                                      Other   2412         7
#> 63:      210                                      Other     27         2
#> 64:      213                                      Other     51         4
#> 65:      223                                      Other   1166         3
#> 66:      250                                      Other   5960         1
#> 67:      380                                      Other   1982        16
#> 68:      381                                      Other    222        11
#> 69:      383                                      Other      1         1
#> 70:      385                                      Other   3119         9
#> 71:      386                                      Other      2         1
#> 72:      387                                      Other      1         1
#> 73:      389                                      Other      1         1
#>     phase_id                                 phase_name  n_obs n_species
#>        <int>                                     <char>  <int>     <int>
#>     median_doy iqr_doy
#>          <int>   <int>
#>  1:        127     170
#>  2:        138      38
#>  3:        115      60
#>  4:        117      18
#>  5:        117      37
#>  6:        141     157
#>  7:        118      22
#>  8:        107      39
#>  9:        165     185
#> 10:        106     178
#> 11:        111      27
#> 12:        123       0
#> 13:        126       3
#> 14:        140      19
#> 15:        130       0
#> 16:        129      29
#> 17:        130       0
#> 18:        163      20
#> 19:        207      28
#> 20:        216      52
#> 21:        161      25
#> 22:        160      21
#> 23:         69       4
#> 24:         96      11
#> 25:        182      29
#> 26:        124      30
#> 27:        157      21
#> 28:        126     105
#> 29:        124      18
#> 30:        113      45
#> 31:        132      17
#> 32:        228      17
#> 33:        109       0
#> 34:        278      20
#> 35:        239      28
#> 36:        214      16
#> 37:        206      27
#> 38:        234      64
#> 39:        217      28
#> 40:        274      38
#> 41:        292      26
#> 42:        301      18
#> 43:        314      16
#> 44:        226      49
#> 45:        242      35
#> 46:        171     101
#> 47:        155      15
#> 48:        169      14
#> 49:        181      17
#> 50:        232      21
#> 51:        269      20
#> 52:        270      22
#> 53:         81      22
#> 54:        263      24
#> 55:        281      26
#> 56:        296       5
#> 57:        286      15
#> 58:        282      24
#> 59:        286      20
#> 60:        301       0
#> 61:        320       6
#> 62:        295      15
#> 63:        273      20
#> 64:        285      19
#> 65:        120      14
#> 66:         85      25
#> 67:        222      84
#> 68:        263     112
#> 69:        301       0
#> 70:        258      69
#> 71:        276      15
#> 72:        309       0
#> 73:        309       0
#>     median_doy iqr_doy
#>          <int>   <int>

What this tells you: Which developmental stages are recorded, how many observations of each, and the typical DOY (useful for understanding the seasonal distribution of your data).

# Summary by country (showing top 5)
summary(pep, by = "country", top = 5)
#> PEP725 Phenological Data Summary
#> ==================================================
#> 
#> OVERVIEW
#>   Total observations: 7,240,883
#>   Unique stations:    10,927
#>   Year range:         1775 - 2025
#>   DOY range:          -210 - 578
#> 
#> TOP 5 COUNTRIES (by observation count)
#>          country   n_obs n_stations n_species year_min year_max
#>           <char>   <int>      <int>     <int>    <int>    <int>
#> 1: Germany-North 4181984       4685        17     1951     2023
#> 2: Germany-South 2333531       1987        18     1930     2023
#> 3:       Austria  366048       1434        22     1775     2025
#> 4:        Sweden   82331        908        22     1870     2024
#> 5:      Slovakia   53052         76        15     1950     2024

What this tells you: Geographic coverage - which countries contribute most data, how many stations they have, and species diversity.

# Temporal coverage summary
summary(pep, by = "year")
#> PEP725 Phenological Data Summary
#> ==================================================
#> 
#> OVERVIEW
#>   Total observations: 7,240,883
#>   Unique stations:    10,927
#>   Year range:         1775 - 2025
#>   DOY range:          -210 - 578
#> 
#> TEMPORAL COVERAGE
#>   Years with data:    247
#>   Median obs/year:    1,243
#>   Peak year:          1968 (151,997 obs)
#> 
#>   Last 10 years:
#>      year n_obs n_stations n_species
#>     <int> <int>      <int>     <int>
#>  1:  2016 47812       1654        32
#>  2:  2017 43068       1571        29
#>  3:  2018 41041       1536        33
#>  4:  2019 42209       1522        29
#>  5:  2020 40181       1500        32
#>  6:  2021 39085       1393        31
#>  7:  2022 36813       1376        28
#>  8:  2023 35133       1351        28
#>  9:  2024  4315        369        27
#> 10:  2025   649         85        19

What this tells you: How observations are distributed across years - are there gaps? Is coverage increasing or decreasing over time?

Subsetting the Data

You’ll often want to focus on specific subsets of the data. The package preserves the pep class when you subset, so all methods continue to work:

# Filter to apple data using data.table syntax
# The syntax is: dataset[row_filter, column_selection]
apple <- pep[species == "Malus domestica"]
print(apple)
#> PEP725 Phenological Data
#> --------------------------------------------------
#> Observations: 899,311
#> Stations:     8,403
#> Species:      1
#> Phases:       31 (BBCH codes: 0, 7, 10, 11, 15, ...)
#> Years:        1896 - 2025
#> Extent:       lon [-8.37, 48.35], lat [14.44, 66.90]
#> Countries:    24 (Austria, Bosnia and Herz., Croatia, Czechia, France, ...)
#> --------------------------------------------------
#> First 5 rows:
#>      s_id   lon   lat  genus         species phase_id  year   day
#>    <fctr> <num> <num> <fctr>          <fctr>    <int> <int> <int>
#> 1:  29529  2.49  48.8  Malus Malus domestica       65  1896   103
#> 2:  29529  2.49  48.8  Malus Malus domestica       65  1897   118
#> 3:  29529  2.49  48.8  Malus Malus domestica       65  1898   118
#> 4:  29529  2.49  48.8  Malus Malus domestica       65  1900   130
#> 5:  29529  2.49  48.8  Malus Malus domestica       65  1901   116

# Filter by year range
recent <- pep[year >= 2000 & year <= 2015]
print(recent)
#> PEP725 Phenological Data
#> --------------------------------------------------
#> Observations: 985,849
#> Stations:     3,576
#> Species:      33
#> Phases:       56 (BBCH codes: 0, 3, 7, 9, 10, ...)
#> Years:        2000 - 2015
#> Extent:       lon [-10.24, 28.40], lat [37.22, 64.55]
#> Countries:    28 (Austria, Belgium, Bosnia and Herz., Bulgaria, Croatia, ...)
#> --------------------------------------------------
#> First 5 rows:
#>      s_id      lon     lat      genus        species phase_id  year   day
#>    <fctr>    <num>   <num>     <fctr>         <fctr>    <int> <int> <int>
#> 1:    136  9.18333 54.2500 perm_grass           <NA>      182  2000     5
#> 2:   6047 12.10000 50.8333 perm_grass           <NA>      182  2000     9
#> 3:   1304  7.10000 51.1000 perm_grass           <NA>      182  2000    10
#> 4:   6532 14.08330 47.4833     Secale Secale cereale        0  2000    15
#> 5:    206  9.66667 54.3167 perm_grass           <NA>      182  2000    17

# The summary method works on subsets too
summary(recent)
#> PEP725 Phenological Data Summary
#> ==================================================
#> 
#> OVERVIEW
#>   Total observations: 985,849
#>   Unique stations:    3,576
#>   Year range:         2000 - 2015
#>   DOY range:          -128 - 523
#> 
#> TOP 10 SPECIES (by observation count)
#>          genus           species  n_obs n_stations year_min year_max median_doy
#>         <fctr>            <fctr>  <int>      <int>    <int>    <int>      <int>
#>  1:      Malus   Malus domestica 179208       2734     2000     2015        133
#>  2:        Zea          Zea mays 107126       1829     2000     2015        198
#>  3:    Hordeum   Hordeum vulgare  95636       1906     2000     2015        181
#>  4:   Triticum Triticum aestivum  90484       1848     2000     2015        207
#>  5:     Prunus      Prunus avium  86023       2820     2000     2015        122
#>  6:     Secale    Secale cereale  68568       1491     2000     2015        202
#>  7:      Avena      Avena sativa  67242       1558     2000     2015        150
#>  8:      Pyrus    Pyrus communis  65482       2063     2000     2015        121
#>  9:     Prunus    Prunus cerasus  61866       1835     2000     2015        123
#> 10: perm_grass              <NA>  50086       2328     2000     2015        135

Tip for beginners: The data.table syntax pep[condition] is similar to base R’s subset() function but much faster for large datasets. You can combine multiple conditions with & (and) or | (or).

Understanding BBCH Codes

The BBCH scale [from Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie; Meier (2018)] is a standardized system for describing plant developmental stages. It’s used worldwide, ensuring that “flowering” means the same thing whether observed in Germany or Spain.

How BBCH Codes Work

BBCH codes are two-digit numbers where:

The first digit (0-9) indicates the principal growth stage
The second digit indicates the secondary stage within that principal stage

First Digit	Principal Growth Stage
0	Germination / sprouting
1	Leaf development
2	Formation of side shoots / tillering
3	Stem elongation
4	Development of harvestable vegetative parts
5	Inflorescence emergence / heading
6	Flowering
7	Development of fruit
8	Ripening of fruit and seed
9	Senescence / dormancy

Looking Up BBCH Codes

The bbch_description() function translates codes to human-readable descriptions:

# Get descriptions for BBCH codes present in your data
codes <- unique(pep$phase_id)
bbch_description(codes)
#>    phase_id                                description
#> 1         0                        Dry seed / Dormancy
#> 2         1                            Seed imbibition
#> 3         7                         Coleoptile emerged
#> 4         9                                  Emergence
#> 5        10              First leaf through coleoptile
#> 6        11                        First leaf unfolded
#> 7        12                          2 leaves unfolded
#> 8        13                          3 leaves unfolded
#> 9        15                                True leaves
#> 10       19                         9+ leaves unfolded
#> 11       21                     Beginning of tillering
#> 12       31                      First node detectable
#> 13       49                                  Late boot
#> 14       51                       Beginning of heading
#> 15       55                          Middle of heading
#> 16       59                             End of heading
#> 17       60  Beginning of flowering / Heading complete
#> 18       61                     Beginning of flowering
#> 19       65                  Full flowering / Anthesis
#> 20       69                           End of flowering
#> 21       75                                Medium milk
#> 22       77                                  Late milk
#> 23       80                          Dough development
#> 24       81 Beginning of ripening or fruit colouration
#> 25       83                                Early dough
#> 26       85                                 Soft dough
#> 27       87                                 Hard dough
#> 28       89                                 Fully ripe
#> 29       95                       50% of leaves fallen
#> 30       97                                 Plant dead
#> 31      100                                    Harvest
#> 32      111                       First cut for silage
#> 33      131                          First cut for hay
#> 34      151  Start of harvest for silage (corn, grass)
#> 35      205                     Autumn coloring >= 50%

Commonly Used BBCH Codes in Phenological Research

Code	Stage	What to Look For
10	First leaves	Cotyledons or first true leaves visible
11	First leaf unfolded	First true leaf fully unfolded
60	First flowers open	Beginning of flowering (anthesis)
65	Full flowering	50% of flowers open
69	End of flowering	All petals fallen
85	Fruit coloring	Softening begins
87	Fruit ripe	Ready for harvest
89	Full ripeness	Fruit fully mature
91	Shoot growth complete	Leaves fully colored
95	Leaf fall	50% of leaves fallen

Filtering by Phenological Phase

When analyzing phenology, you typically focus on specific phenophases. Here’s how:

# Get all flowering observations (BBCH 60-69)
flowering <- pep[phase_id >= 60 & phase_id <= 69]
cat("Flowering observations:", nrow(flowering), "\n")
#> Flowering observations: 1833933

# Get only full flowering (BBCH 65)
full_flowering <- pep[phase_id == 65]
cat("Full flowering observations:", nrow(full_flowering), "\n")
#> Full flowering observations: 475922

# Get harvest-related observations
harvest <- pep[phase_id %in% c(87, 89)]
cat("Harvest observations:", nrow(harvest), "\n")
#> Harvest observations: 910185

Exploring Data Coverage

Before diving into analysis, it’s crucial to understand what your dataset contains. The pep_coverage() function provides a comprehensive overview.

Full Coverage Report

# Get a complete coverage report
pep_coverage(pep)
#> PEP725 Data Coverage Report
#> ==================================================
#> 
#> TEMPORAL COVERAGE
#> ------------------------------
#>   Year range:      1775 - 2025 (247 years)
#>   Missing years:   4 gaps
#>                    (1786, 1805, 1820, 1829)
#>   Median obs/year: 1,243
#>   Peak year:       1968 (151,997 obs)
#> 
#> GEOGRAPHICAL COVERAGE
#> ------------------------------
#>   Countries:       35
#>   Stations:        10,927
#>   Longitude:       -10.24 to 48.35
#>   Latitude:        14.44 to 68.44
#>   Altitude:        0 to 1933 m
#> 
#>   Top countries:
#>           country   n_obs n_stations n_species
#>            <char>   <int>      <int>     <int>
#>  1: Germany-North 4181984       4685        17
#>  2: Germany-South 2333531       1987        18
#>  3:       Austria  366048       1434        22
#>  4:        Sweden   82331        908        22
#>  5:      Slovakia   53052         76        15
#>  6:   Switzerland   51620        186        19
#>  7:          <NA>   32117         98        22
#>  8:        France   20487        706        22
#>  9:       Croatia   20130         10        15
#> 10:       Czechia   20008        146        14
#> 
#> SPECIES COVERAGE
#> ------------------------------
#>   Genera:          30
#>   Species:         41
#>   Subspecies:      129
#> 
#>   Top genera:
#>        genus   n_obs n_species n_stations
#>       <fctr>   <int>     <int>      <int>
#>  1:   Prunus 1357252         7      10012
#>  2:  Hordeum  930924         1       7996
#>  3:    Malus  899311         1       8403
#>  4: Triticum  668330         1       7504
#>  5:    Avena  642098         1       7737
#>  6:   Secale  605981         1       7837
#>  7:  Solanum  593638         1       7866
#>  8:      Zea  417009         1       5692
#>  9:     Beta  403413         1       6219
#> 10:    Pyrus  308320         1       7749
#> 
#>   Top species:
#>        genus           species  n_obs n_stations year_min year_max
#>       <fctr>            <fctr>  <int>      <int>    <int>    <int>
#>  1:  Hordeum   Hordeum vulgare 930924       7996     1870     2025
#>  2:    Malus   Malus domestica 899311       8403     1896     2025
#>  3: Triticum Triticum aestivum 668330       7504     1870     2024
#>  4:    Avena      Avena sativa 642098       7737     1870     2025
#>  5:   Secale    Secale cereale 605981       7837     1870     2025
#>  6:  Solanum Solanum tuberosum 593638       7866     1870     2025
#>  7:   Prunus    Prunus cerasus 483615       7083     1873     2025
#>  8:   Prunus  Prunus domestica 429971       7617     1868     2025
#>  9:      Zea          Zea mays 417009       5692     1941     2025
#> 10:     Beta     Beta vulgaris 403413       6219     1926     2024

This report shows you three dimensions of coverage:

Temporal: What years are covered? Any gaps?
Geographical: Which countries/regions? How many stations?
Species: Which plants are observed? How many observations each?

Focused Coverage Analysis

You can focus on specific aspects and optionally generate plots:

# Temporal coverage with a visualization
pep_coverage(pep, kind = "temporal", plot = TRUE)

#> PEP725 Data Coverage Report
#> ==================================================
#> 
#> TEMPORAL COVERAGE
#> ------------------------------
#>   Year range:      1775 - 2025 (247 years)
#>   Missing years:   4 gaps
#>                    (1786, 1805, 1820, 1829)
#>   Median obs/year: 1,243
#>   Peak year:       1968 (151,997 obs)

The temporal plot shows observation density over time. Look for:

Gaps: Years with few or no observations
Trends: Is data coverage increasing or decreasing over time?
Breakpoints: Sudden changes (e.g. due to methodology changes)

# Geographical coverage - which countries have most data
pep_coverage(pep, kind = "geographical", top = 5)
#> PEP725 Data Coverage Report
#> ==================================================
#> 
#> GEOGRAPHICAL COVERAGE
#> ------------------------------
#>   Countries:       35
#>   Stations:        10,927
#>   Longitude:       -10.24 to 48.35
#>   Latitude:        14.44 to 68.44
#>   Altitude:        0 to 1933 m
#> 
#>   Top countries:
#>          country   n_obs n_stations n_species
#>           <char>   <int>      <int>     <int>
#> 1: Germany-North 4181984       4685        17
#> 2: Germany-South 2333531       1987        18
#> 3:       Austria  366048       1434        22
#> 4:        Sweden   82331        908        22
#> 5:      Slovakia   53052         76        15

Geographic coverage tells you where your data comes from. This is important because:

Different regions have different climates
Some countries have longer observation histories
Network density affects spatial analyses

# Species coverage - which species are best represented
pep_coverage(pep, kind = "species", top = 5)
#> PEP725 Data Coverage Report
#> ==================================================
#> 
#> SPECIES COVERAGE
#> ------------------------------
#>   Genera:          30
#>   Species:         41
#>   Subspecies:      129
#> 
#>   Top genera:
#>       genus   n_obs n_species n_stations
#>      <fctr>   <int>     <int>      <int>
#> 1:   Prunus 1357252         7      10012
#> 2:  Hordeum  930924         1       7996
#> 3:    Malus  899311         1       8403
#> 4: Triticum  668330         1       7504
#> 5:    Avena  642098         1       7737
#> 
#>   Top species:
#>       genus           species  n_obs n_stations year_min year_max
#>      <fctr>            <fctr>  <int>      <int>    <int>    <int>
#> 1:  Hordeum   Hordeum vulgare 930924       7996     1870     2025
#> 2:    Malus   Malus domestica 899311       8403     1896     2025
#> 3: Triticum Triticum aestivum 668330       7504     1870     2024
#> 4:    Avena      Avena sativa 642098       7737     1870     2025
#> 5:   Secale    Secale cereale 605981       7837     1870     2025

Species coverage shows which plants have enough data for robust analyses. As a rule of thumb:

>10,000 observations: Excellent for most analyses
1,000-10,000 observations: Good for national/regional analyses
<1,000 observations: Limited; may need to combine with related species

Coverage by Groups

You can break down coverage by grouping variables:

# Temporal coverage broken down by country
cov_by_country <- pep_coverage(pep, kind = "temporal", by = "country")
# Show observations by group
cov_by_country$temporal$obs_by_group
#>              country   n_obs n_stations year_min year_max n_years
#>               <char>   <int>      <int>    <int>    <int>   <int>
#>  1:    Germany-North 4181984       4685     1951     2023      73
#>  2:    Germany-South 2333531       1987     1930     2023      80
#>  3:          Austria  366048       1434     1775     2025     193
#>  4:           Sweden   82331        908     1870     2024     115
#>  5:         Slovakia   53052         76     1950     2024      55
#>  6:      Switzerland   51620        186     1951     2024      74
#>  7:             <NA>   32117         98     1873     2025     121
#>  8:           France   20487        706     1872     2025     151
#>  9:          Croatia   20130         10     1961     2024      63
#> 10:          Czechia   20008        146     1930     2023      76
#> 11:         Slovenia   11507         15     1961     2020      60
#> 12:        Lithuania   11449         28     1960     2004      45
#> 13:           Poland    9461         14     1951     2024      74
#> 14:       Montenegro    9386          7     1951     2021      63
#> 15:            Spain    7446         79     1946     2025      77
#> 16:           Latvia    6595        102     1962     2018      52
#> 17: Bosnia and Herz.    5278          7     1971     2024      42
#> 18:          Hungary    3757          9     1947     2024      67
#> 19:          Romania    3700         70     1982     2005      24
#> 20:      Netherlands    3195        246     1868     2009     109
#> 21:           Russia    2395         33     1873     2019      68
#> 22:          Belgium    1860         64     1944     2020      62
#> 23:            Italy     650         12     1984     2024      40
#> 24:       Luxembourg     639          2     1955     2004      47
#> 25:           Norway     637          5     1927     2005      79
#> 26:          Ireland     453          5     1968     2024      57
#> 27:         Bulgaria     385          8     1973     2005      33
#> 28:    Liechtenstein     309          1     1973     2024      49
#> 29:           Serbia     294          6     1971     2002      29
#> 30:          Denmark     133          4     1972     2000      29
#> 31:   United Kingdom      23          1     1971     1981      11
#> 32:         Portugal      12          1     1973     1981       8
#> 33:           Greece       6          2     1971     1999       2
#> 34:          Finland       4          1     1976     1984       3
#> 35:            Yemen       1          1     2025     2025       1
#>              country   n_obs n_stations year_min year_max n_years
#>               <char>   <int>      <int>    <int>    <int>   <int>

This helps identify which countries have continuous records versus sporadic observations.

Your First Analysis: Tracking Flowering Trends

Let’s put everything together with a simple but meaningful analysis: tracking how apple flowering dates have changed over time.

# Step 1: Filter to apple flowering (BBCH 60 = first flowers)
apple_flowering <- pep[species == "Malus domestica" & phase_id == 60]

cat("Apple flowering observations:", nrow(apple_flowering), "\n")
#> Apple flowering observations: 205403
cat("Year range:", min(apple_flowering$year), "-", max(apple_flowering$year), "\n")
#> Year range: 1926 - 2025
cat("Countries:", length(unique(apple_flowering$country)), "\n")
#> Countries: 24

# Step 2: Calculate annual mean DOY
# The data.table syntax here means:
#   - Group by year
#   - Calculate mean DOY and count observations
#   - Order by year
annual_mean <- apple_flowering[, .(
  mean_doy = mean(day, na.rm = TRUE),
  n_obs = .N
), by = year][order(year)]

# Step 3: Look at the results
print(annual_mean)
#>       year mean_doy n_obs
#>      <int>    <num> <int>
#>   1:  1926 119.4658    73
#>   2:  1927 126.0164    61
#>   3:  1928 126.5750    80
#>   4:  1929 133.8764    89
#>   5:  1930 126.1333    90
#>   6:  1931 132.2447    94
#>   7:  1932 130.5000     2
#>   8:  1933 135.5000     2
#>   9:  1934 138.0000     2
#>  10:  1935 139.0000     1
#>  11:  1936 135.0000     1
#>  12:  1937 136.0000     1
#>  13:  1938 141.0000     1
#>  14:  1939 145.0000     1
#>  15:  1940 138.0000     1
#>  16:  1941 128.0000     1
#>  17:  1942 144.0000     1
#>  18:  1943 136.0000     3
#>  19:  1944 120.0000     4
#>  20:  1945 128.5000     2
#>  21:  1946 112.7351   268
#>  22:  1947 118.6537   283
#>  23:  1948 116.9475   305
#>  24:  1949 117.0677   310
#>  25:  1950 118.7792   308
#>  26:  1951 125.8272  2245
#>  27:  1952 122.3525  2369
#>  28:  1953 121.1412  2471
#>  29:  1954 128.4589  2702
#>  30:  1955 129.3909  2937
#>  31:  1956 130.0572  3060
#>  32:  1957 124.0544  2905
#>  33:  1958 128.6096  3256
#>  34:  1959 121.6012  3054
#>  35:  1960 125.2657  3161
#>  36:  1961 120.4273  2612
#>  37:  1962 127.8020  2646
#>  38:  1963 127.8143  2703
#>  39:  1964 126.9062  2591
#>  40:  1965 129.0643  2659
#>  41:  1966 125.9031  3167
#>  42:  1967 126.7869  3248
#>  43:  1968 125.0423  3306
#>  44:  1969 128.3654  3270
#>  45:  1970 130.1720  3139
#>  46:  1971 127.7643  3250
#>  47:  1972 127.3223  3267
#>  48:  1973 129.3266  3295
#>  49:  1974 125.8274  3180
#>  50:  1975 128.0957  3157
#>  51:  1976 127.9840  3122
#>  52:  1977 128.3329  3046
#>  53:  1978 128.7715  3064
#>  54:  1979 129.8692  3042
#>  55:  1980 129.7322  3193
#>  56:  1981 127.6371  2987
#>  57:  1982 128.9987  2982
#>  58:  1983 128.0494  2917
#>  59:  1984 129.0000  2894
#>  60:  1985 128.3925  2838
#>  61:  1986 127.8025  2795
#>  62:  1987 127.2105  2717
#>  63:  1988 125.6127  2703
#>  64:  1989 123.4611  2611
#>  65:  1990 121.7045  2474
#>  66:  1991 122.6691  3545
#>  67:  1992 122.7809  4212
#>  68:  1993 122.0902  3782
#>  69:  1994 122.3447  3797
#>  70:  1995 122.1057  3585
#>  71:  1996 121.9156  3921
#>  72:  1997 121.0351  3535
#>  73:  1998 120.4800  3421
#>  74:  1999 119.8520  3290
#>  75:  2000 119.3216  3265
#>  76:  2001 119.8012  2736
#>  77:  2002 118.9894  2634
#>  78:  2003 118.9929  2552
#>  79:  2004 118.8181  2524
#>  80:  2005 118.3861  2362
#>  81:  2006 118.8295  2369
#>  82:  2007 117.2291  2270
#>  83:  2008 116.9619  2284
#>  84:  2009 116.6252  2276
#>  85:  2010 116.6667  2292
#>  86:  2011 115.7146  2267
#>  87:  2012 115.8378  2324
#>  88:  2013 116.8630  2175
#>  89:  2014 114.3385  2077
#>  90:  2015 115.7285  2081
#>  91:  2016 115.5584  2013
#>  92:  2017 114.5264  1894
#>  93:  2018 114.5914  1833
#>  94:  2019 113.8795  1817
#>  95:  2020 113.5284  1830
#>  96:  2021 116.5177  1756
#>  97:  2022 115.4621  1755
#>  98:  2023 116.7220  1676
#>  99:  2024 106.9771   218
#> 100:  2025 109.4375    16
#>       year mean_doy n_obs
#>      <int>    <num> <int>

Now let’s visualize the trend:

# Step 4: Plot the trend with ggplot2
library(ggplot2)

p <- ggplot(annual_mean, aes(x = year, y = mean_doy)) +
  geom_point(aes(size = n_obs), color = "steelblue", alpha = 0.6) +
  geom_line(color = "steelblue", alpha = 0.4) +
  geom_smooth(method = "lm", color = "red", linetype = 2, se = FALSE) +
  labs(
    title = "Apple Flowering Date Over Time",
    x = "Year", y = "Mean Day of Year",
    size = "Observations"
  ) +
  theme_minimal()
print(p)
#> `geom_smooth()` using formula = 'y ~ x'


# Step 5: Quantify the trend
trend <- lm(mean_doy ~ year, data = annual_mean)
slope <- coef(trend)[2]
cat("\nTrend:", round(slope * 10, 2), "days per decade\n")
#> 
#> Trend: -1.77 days per decade
if (slope < 0) {
  cat("Interpretation: Flowering is getting EARLIER\n")
} else {
  cat("Interpretation: Flowering is getting LATER\n")
}
#> Interpretation: Flowering is getting EARLIER

Interpreting the results: A negative slope means flowering is occurring earlier in the year - a common finding for spring phenology in a warming climate. A shift of -2 days per decade means that over 50 years, flowering has advanced by about 10 days.

Comparing with Another Species: Grapevine

Let’s compare apple with grapevine (Vitis vinifera), which has the longest time series in the database (records back to 1775 in some regions):

# Step 1: Filter to grapevine flowering (BBCH 65 = full flowering)
vine_flowering <- pep[species == "Vitis vinifera" & phase_id == 65]

cat("Grapevine flowering observations:", nrow(vine_flowering), "\n")
#> Grapevine flowering observations: 17749
cat("Year range:", min(vine_flowering$year), "-", max(vine_flowering$year), "\n")
#> Year range: 1951 - 2024
cat("Countries:", length(unique(vine_flowering$country)), "\n")
#> Countries: 15

# Step 2: Calculate annual mean DOY
vine_annual <- vine_flowering[, .(
  mean_doy = mean(day, na.rm = TRUE),
  n_obs = .N
), by = year][order(year)]

# Step 3: Plot the trend
p <- ggplot(vine_annual, aes(x = year, y = mean_doy)) +
  geom_point(aes(size = n_obs), color = "purple", alpha = 0.6) +
  geom_line(color = "purple", alpha = 0.4) +
  geom_smooth(method = "lm", color = "red", linetype = 2, se = FALSE) +
  labs(
    title = "Grapevine Flowering Date Over Time",
    x = "Year", y = "Mean Day of Year",
    size = "Observations"
  ) +
  theme_minimal()
print(p)
#> `geom_smooth()` using formula = 'y ~ x'


trend <- lm(mean_doy ~ year, data = vine_annual)
slope <- coef(trend)[2]
cat("\nTrend:", round(slope * 10, 2), "days per decade\n")
#> 
#> Trend: -1.95 days per decade

Why grapevine? Vitis vinifera is economically important (wine production), and its phenology is closely monitored. The long historical records make it ideal for studying long-term climate trends.

University of Applied Sciences Northwestern Switzerland (FHNW)↩︎
Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)↩︎

Getting Started with pep725

Matthias Templ1

Barbara Templ2

2026-04-23

What is Phenology?

The PEP725 Database

Why Use the pep725 R Package?

Installation

Loading the Package

Getting Phenological Data

Why Synthetic Data?

Option 1: Download Synthetic Data (Recommended for Learning)

Option 2: Use the Small Seed Dataset

Option 3: Generate Your Own Synthetic Data

Option 4: Import Real PEP725 Data (For Research)

Option 5: Use Your Own Plant Phenological Data

Understanding the Data Structure

What You See in the Output

Key Columns Explained

Understanding Day of Year (DOY)

The pep Class

The Summary Method

Subsetting the Data

Understanding BBCH Codes

How BBCH Codes Work

Looking Up BBCH Codes

Commonly Used BBCH Codes in Phenological Research

Filtering by Phenological Phase

Exploring Data Coverage

Full Coverage Report

Focused Coverage Analysis

Coverage by Groups

Your First Analysis: Tracking Flowering Trends

Comparing with Another Species: Grapevine

Common Pitfalls and Tips

Data Quality Matters

Species Considerations

Next Steps

Phenological Analysis Vignette

Spatial Phenological Patterns Vignette

Data Quality Assessment Vignette

Getting Help

Session Info

References

Matthias Templ¹

Barbara Templ²

The `pep` Class