| Title: | Flags Spatial Errors in Biological Collection Data Using Specialists' Information |
| Version: | 1.0.1 |
| BugReports: | https://github.com/wevertonbio/RuHere/issues |
| Description: | Automatically flags common spatial errors in biological collection data using metadata and specialists' information. RuHere implements a workflow to manage occurrence data through five steps: dataset merging, metadata flagging, validation against expert-derived distribution maps, visualization of flagged records, and sampling bias exploration. It specifically integrates specialist-curated range information to identify geographic errors and introductions that often escape standard automated validation procedures. For details on the methodology, see: Trindade & Caron (2026) <doi:10.64898/2026.02.02.703373>. |
| Imports: | Rcpp, terra, data.table, faunabr (≥ 1.0.0), florabr (≥ 1.3.1), jsonlite, rgbif, rredlist, stringi, BIEN, ridigbio, fields, ggplot2, mapview, sf, ggnewscale |
| Suggests: | pbapply, knitr, R.utils, rmarkdown, CoordinateCleaner |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Depends: | R (≥ 3.5) |
| LazyData: | true |
| LinkingTo: | Rcpp (≥ 1.1.0), RcppArmadillo (≥ 15.0.2.2) |
| VignetteBuilder: | knitr |
| URL: | https://wevertonbio.github.io/RuHere/ |
| NeedsCompilation: | yes |
| Packaged: | 2026-02-12 14:54:08 UTC; wever |
| Author: | Weverton C. F. Trindade |
| Maintainer: | Weverton C. F. Trindade <wevertonf1993@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-17 15:30:02 UTC |
Check the available distribution datasets for a set of species
Description
This function checks which datasets contain distributional information for a given set of species, based on expert-curated sources. It searches the selected datasets and reports whether each species has available distribution data.
Usage
available_datasets(
data_dir,
species,
datasets = "all",
return_distribution = FALSE
)
Arguments
data_dir |
(character) directory path where the datasets were saved. See Details for more information. |
species |
(character) vector with the species names to be checked for the availability of distributional information. |
datasets |
(character) vector indicating which datasets to search.
Options are |
return_distribution |
(logical) whether to return the spatial objects
( |
Details
The distribution datasets can be obtained using the functions
florabr_here(), wcvp_here(), bien_here(), and faunabr_here(),
which download and prepare the corresponding sources for use in RuHere.
Value
If return_distribution = FALSE, a data.frame containing the species names
and the datasets where distributional information is available.
If return_distribution = TRUE, it also returns a list containing the
SpatVector objects representing the species ranges.
Examples
# Set directory where datasets were saved
# Here, we'll use the directory where the example datasets are stored
datadir <- system.file("extdata", "datasets", package = "RuHere")
# Check available datasets
d <- available_datasets(data_dir = datadir,
species = c("Araucaria angustifolia",
"Handroanthus serratifolius",
"Cyanocorax caeruleus"))
# Check available datasets and return distribution
d2 <- available_datasets(data_dir = datadir,
species = c("Araucaria angustifolia",
"Handroanthus serratifolius",
"Cyanocorax caeruleus"),
return_distribution = TRUE)
Download species distribution information from BIEN
Description
This function downloads distribution information from the BIEN database,
required for filtering occurrence records using specialists' information via
the flag_bien() function.
Usage
bien_here(
data_dir,
species,
synonyms = NULL,
overwrite = TRUE,
progress_bar = FALSE,
verbose = TRUE
)
Arguments
data_dir |
(character) directory to save the data downloaded from BIEN. |
species |
(character) a vector of species names for which to retrieve distribution information. |
synonyms |
(data.frame) an optional data.frame containing synonyms of
the target species. The first column must contain the target species names,
and the second column their corresponding synonyms. Default is NULL. |
overwrite |
(logical) whether to overwrite existing files. Default is
TRUE. |
progress_bar |
(logical) whether to display a progress bar during processing.
If TRUE, the 'pbapply' package must be installed. Default is FALSE. |
verbose |
(logical) whether to display progress messages. Default is
TRUE. |
Details
This function uses the BIEN::BIEN_ranges_load_species() function to
retrieve polygons representing the distribution ranges of species available
in the BIEN database.
Because taxonomic information in BIEN may be outdated, you can optionally
provide a table of synonyms to broaden the search. The synonyms data.frame
should have the accepted species in the first column and their synonyms in
the second. See RuHere::synonys for an example.
Value
A data frame indicating whether the polygon(s) representing the species range
are available in BIEN.
If the range is available, a GeoPackage file (.gpkg) is saved in
data_dir/bien. The file name corresponds to the species name, with an
underscore (“_”) replacing the space between the genus and the specific
epithet.
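The naming rule above can be sketched in base R. This is only an illustration of the described convention, not the package's internal code; the helper name bien_file_path is hypothetical.

```r
# Hypothetical helper: builds the path where a BIEN range file would be
# expected, following the naming rule described above.
bien_file_path <- function(data_dir, species) {
  # Replace the space between genus and epithet with an underscore
  file_name <- paste0(gsub(" ", "_", species, fixed = TRUE), ".gpkg")
  file.path(data_dir, "bien", file_name)
}

p <- bien_file_path(tempdir(), "Handroanthus serratifolius")
basename(p)
file.exists(p)  # TRUE only if a range was previously saved at this path
```

A check like file.exists(p) may help confirm that a range was saved before calling flag_bien().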
Examples
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download species distribution information from BIEN
bien_here(data_dir = data_dir, species = "Handroanthus serratifolius")
Bind occurrences after standardizing columns
Description
Combines multiple occurrence data frames (for example, from GBIF,
SpeciesLink, BIEN, or iDigBio) into a single standardized dataset. This is
particularly useful after using format_columns() to ensure column
compatibility across data sources.
Usage
bind_here(..., fill = FALSE)
Arguments
... |
(data.frame) two or more data frames with occurrence records to combine. |
fill |
(logical) whether to fill missing columns with NA. Default is FALSE. |
Details
When fill = TRUE, columns not shared among the input data frames are added
and filled with NA, ensuring that all columns align before binding.
Internally, this function uses data.table::rbindlist() for efficient row
binding.
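As a minimal sketch of the fill behavior described above, the snippet below calls data.table::rbindlist() directly on two toy tables. The tables and their values are made up for illustration, and bind_here() may do additional work that is not shown here.

```r
library(data.table)

# Two toy occurrence tables that share only the "species" column
a <- data.frame(species = "Araucaria angustifolia", year = 2001)
b <- data.frame(species = "Cyanocorax caeruleus", habitat = "forest")

# fill = TRUE adds the columns missing from each table and fills them with NA
combined <- as.data.frame(rbindlist(list(a, b), fill = TRUE))
combined
```

Without fill = TRUE, rbindlist() stops with an error when the inputs have different columns, which is why standardizing columns with format_columns() first is recommended.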
Value
A data.frame containing all occurrence records combined.
Examples
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") #Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
# Import and standardize SpeciesLink
data("occ_splink", package = "RuHere") #Import data example
splink_standardized <- format_columns(occ_splink, metadata = "specieslink")
# Import and standardize BIEN
data("occ_bien", package = "RuHere") #Import data example
bien_standardized <- format_columns(occ_bien, metadata = "bien")
# Import and standardize iDigBio
data("occ_idig", package = "RuHere") #Import data example
idig_standardized <- format_columns(occ_idig, metadata = "idigbio")
# Merge all
all_occ <- bind_here(gbif_standardized, splink_standardized,
bien_standardized, idig_standardized)
Check if the records fall in the country assigned in the metadata
Description
Check if the records fall in the country assigned in the metadata
Usage
check_countries(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
country_column,
distance = 5,
try_to_fix = FALSE,
progress_bar = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably with
country information standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
country_column |
(character) column name containing the country information. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the country assigned in the |
try_to_fix |
(logical) whether to check if coordinates are inverted or
transposed (see |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress.
Default is TRUE. |
Value
The original occ data.frame with an additional column (correct_country)
indicating whether each record falls within the country specified in the
metadata (TRUE) or not (FALSE).
Examples
# Load example data
data("occurrences", package = "RuHere") #Import data example
# Standardize country names
occ_country <- standardize_countries(occ = occurrences,
return_dictionary = FALSE)
# Check whether records fall within assigned countries
occ_country_checked <- check_countries(occ = occ_country,
country_column = "country_suggested")
Check if the records fall in the state assigned in the metadata
Description
Check if the records fall in the state assigned in the metadata
Usage
check_states(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
state_column,
distance = 5,
try_to_fix = FALSE,
progress_bar = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably with
country information standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
state_column |
(character) column name containing the state information. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the state assigned in the |
try_to_fix |
(logical) whether to check if coordinates are inverted or
transposed (see |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress.
Default is TRUE. |
Value
The original occ data.frame with an additional column (correct_state)
indicating whether each record falls within the state specified in the
metadata (TRUE) or not (FALSE).
Examples
# Load example data
data("occurrences", package = "RuHere") #Import data example
# Subset occurrences for Araucaria angustifolia
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Standardize country names
occ_country <- standardize_countries(occ = occ,
return_dictionary = FALSE)
# Standardize state names
occ_state <- standardize_states(occ = occ_country,
country_column = "country_suggested",
return_dictionary = FALSE)
# Check whether records fall within assigned states
occ_state_checked <- check_states(occ = occ_state,
state_column = "state_suggested")
Country dictionary for standardizing country names and codes
Description
country_dictionary provides a set of lookup tables used to standardize
country names and country codes in occurrence datasets.
The dictionary is built from rnaturalearthdata::map_units110
and consolidates a wide variety of country name variants (in several
languages and formats), as well as multiple coding systems, into a single
suggested standardized name.
This object is used internally by functions that clean or harmonize
country fields, ensuring that country names in occurrence datasets (e.g.,
"Brasil", "brasil", "BR", "BRA", "République Française") are all
mapped consistently to a single standardized form ("brazil", "france",
etc.).
Usage
country_dictionary
Format
A named list of two data frames:
- country_name: a data frame with two columns:
  - country_name: Character. Lowercased and accent-stripped country name variants (from multiple rnaturalearthdata fields such as name, name_long, abbrev, formal_en, and alternative names in several languages).
  - country_suggested: Character. The standardized country name, derived from the name column of map_units110, also lowercased and accent-stripped.
- country_code: a data frame with two columns:
  - country_code: Character. Country codes from several systems, including ISO-2, ISO-3, FIPS, postal codes, and others, after filtering invalid or ambiguous codes.
  - country_suggested: Character. The standardized country name corresponding to each code.
Details
The dictionary is generated by:
- extracting multiple name and code fields from rnaturalearthdata::map_units110,
- converting names to lowercase and removing accents,
- converting codes to uppercase,
- removing invalid or ambiguous codes (e.g., -99, "J", various country mismatches), and
- ensuring uniqueness across all entries.
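The name-cleaning step listed above can be sketched as follows. The lookup table is a toy stand-in with the same two columns as country_dictionary$country_name (its values are illustrative, not copied from the package), and the helper clean_name is hypothetical. Accent removal uses stringi::stri_trans_general(), which may differ from what the package does internally.

```r
library(stringi)

# Toy lookup table (illustrative values only)
toy_dictionary <- data.frame(
  country_name = c("brasil", "brazil", "republique francaise"),
  country_suggested = c("brazil", "brazil", "france")
)

# Hypothetical helper: lowercase and strip accents
clean_name <- function(x) {
  tolower(stri_trans_general(x, "Latin-ASCII"))
}

raw <- c("Brasil", "République Française")

# Look up the cleaned variants to get the suggested standardized name
suggested <- toy_dictionary$country_suggested[
  match(clean_name(raw), toy_dictionary$country_name)
]
suggested
```

Cleaning both the dictionary and the input in the same way means the lookup does not depend on capitalization or accents.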
Examples
data(country_dictionary)
head(country_dictionary$country_name)
head(country_dictionary$country_code)
Extract country from coordinates
Description
Extracts the country for each occurrence record based on coordinates.
Usage
country_from_coords(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
country_column = NULL,
from = "all",
output_column = "country_xy",
append_source = FALSE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
country_column |
(character) the column name containing the country.
Only applicable if |
from |
(character) whether to extract the country for all records ('all') or only for records missing country information ('na_only'). If 'na_only', you must provide the name of the column with country information. Default is 'all'. |
output_column |
(character) column name created in |
append_source |
(logical) whether to create a new column in |
Details
The countries are extracted from coordinates using a map retrieved from
rnaturalearthdata::map_units110.
Value
The original occ data.frame with an additional column containing the
countries extracted from coordinates.
Examples
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") #Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
gbif_countries <- country_from_coords(occ = gbif_standardized)
Create metadata template
Description
This function creates a metadata template to be used in format_columns()
for formatting and standardizing column names and classes in occurrence
datasets.
All column names specified as arguments must be present in the occ dataset.
If you obtained data from GBIF, SpeciesLink, BIEN or iDigBio using the functions provided in the RuHere package, you do not need to use this function, as the package already includes metadata templates for these datasets.
Usage
create_metadata(
occ,
scientificName,
decimalLongitude,
decimalLatitude,
collectionCode = NA,
catalogNumber = NA,
coordinateUncertaintyInMeters = NA,
elevation = NA,
country = NA,
stateProvince = NA,
municipality = NA,
locality = NA,
year = NA,
eventDate = NA,
recordedBy = NA,
identifiedBy = NA,
basisOfRecord = NA,
occurrenceRemarks = NA,
habitat = NA,
datasetName = NA,
datasetKey = NA,
key = NA
)
Arguments
occ |
(data.frame or data.table) a dataset with occurrence records to be standardized. |
scientificName |
(character) column name in |
decimalLongitude |
(character) column name in |
decimalLatitude |
(character) column name in |
collectionCode |
(character) an optional column name in |
catalogNumber |
(character) an optional column name in |
coordinateUncertaintyInMeters |
(character) an optional column name with the coordinate uncertainty in meters. |
elevation |
(character) an optional column name with the elevation information. |
country |
(character) an optional column name with the country of the record. |
stateProvince |
(character) an optional column name with the state or province of the record. |
municipality |
(character) an optional column name with the municipality of the record. |
locality |
(character) an optional column name with the locality description. |
year |
(character) an optional column name with the year when the occurrence was recorded. |
eventDate |
(character) an optional column name with the event date. |
recordedBy |
(character) an optional column name with the name of the collector or recorder. |
identifiedBy |
(character) an optional column name with the name of the identifier. |
basisOfRecord |
(character) an optional column name with the basis of record. |
occurrenceRemarks |
(character) an optional column name with remarks about the occurrence. |
habitat |
(character) an optional column name with the habitat description. |
datasetName |
(character) an optional column name with the dataset name. |
datasetKey |
(character) an optional column name with the dataset key. |
key |
(character) an optional column name with the unique occurrence identifier. |
Value
A data.frame containing a metadata template that can be directly used in
the format_columns() function.
Examples
# Load data example
# Occurrences of Puma concolor from the atlanticr R package
data("puma_atlanticr", package = "RuHere")
# Create metadata to standardize the occurrences
puma_metadata <- create_metadata(occ = puma_atlanticr,
scientificName = "actual_species_name",
decimalLongitude = "longitude",
decimalLatitude = "latitude",
elevation = "altitude",
country = "country",
stateProvince = "state",
municipality = "municipality",
locality = "study_location",
year = "year_finish",
habitat = "vegetation_type",
datasetName = "reference")
# Now, we can use this metadata to standardize the columns
puma_occ <- format_columns(occ = puma_atlanticr, metadata = puma_metadata,
binomial_from = "actual_species_name",
data_source = "atlanticr")
Dictionary of terms used to flag cultivated individuals
Description
cultivated is a list of character vectors containing keywords used to
identify whether an occurrence record refers to cultivated or
non-cultivated individuals.
This object is used internally by flag_cultivated() to scan occurrence
fields (such as notes, habitat descriptions, or remarks) and classify
records as cultivated or not cultivated based on textual patterns.
The list combines terms from plantR (plantR:::cultivated and
plantR:::notCultivated) with additional multilingual variants commonly
found in herbarium metadata.
Usage
cultivated
Format
A named list with two elements:
- cultivated: Character vector. Terms that indicate an individual is cultivated. Imported from plantR:::cultivated.
- not_cultivated: Character vector. Terms suggesting an individual is not cultivated (e.g., “not cultivated”, “not planted”, “no plantada”, “no cultivada”), including terms from plantR:::notCultivated.
Details
These terms are matched case-insensitively after text cleaning (e.g., lowercasing and accent removal).
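A minimal sketch of this kind of matching is shown below. The term lists are small illustrative subsets, the helpers clean_text and has_any are hypothetical, and the rule that a "not cultivated" term overrides a "cultivated" term is an assumption rather than documented behavior of flag_cultivated().

```r
library(stringi)

# Illustrative term lists (the real ones are stored in `cultivated`)
cultivated_terms <- c("cultivated", "planted", "plantada")
not_cultivated_terms <- c("not cultivated", "not planted", "no plantada")

# Lowercase and strip accents before matching
clean_text <- function(x) tolower(stri_trans_general(x, "Latin-ASCII"))

# TRUE for each text that contains at least one of the terms
has_any <- function(text, terms) {
  vapply(text, function(t) any(stri_detect_fixed(t, terms)), logical(1),
         USE.NAMES = FALSE)
}

remarks <- c("Cultivated in the botanical garden",
             "Native forest, NOT CULTIVATED",
             "Árvore em floresta")
txt <- clean_text(remarks)

# Assumed rule: a "not cultivated" term overrides a "cultivated" term
is_cultivated <- has_any(txt, cultivated_terms) &
  !has_any(txt, not_cultivated_terms)
is_cultivated
```

The negative terms are checked separately because "not cultivated" also contains the string "cultivated".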
References
de Lima, Renato AF, et al. plantR: An R package and workflow for managing species records from biological collections. Methods in Ecology and Evolution, 14.2 (2023): 332-339.
See Also
flag_cultivated
Examples
data(cultivated)
cultivated$cultivated
cultivated$not_cultivated
Fake occurrence data for testing coordinate validation functions
Description
fake_data is a synthetic dataset created for testing functions that validate
and correct country- or state-level geographic coordinates.
Controlled coordinate errors were introduced (e.g., inverted signs, swapped values, combinations of swaps and inversions) to simulate common georeferencing mistakes.
This dataset is intended for automated testing of functions such as
check_countries() and check_states().
Usage
fake_data
Format
A data frame with the same structure as all_occ, containing
occurrence records with intentionally manipulated coordinates.
An additional column data_source = "fake_data" identifies these records.
Details
The coordinate errors include:
- Inverted longitude: multiplying longitude by -1.
- Inverted latitude: multiplying latitude by -1.
- Both coordinates inverted.
- Swapped coordinates: (lon, lat) → (lat, lon).
- Swapped + inverted in four combinations: swapped only, swapped + inverted longitude, swapped + inverted latitude, swapped + both inverted.
Examples
data(fake_data)
Download the latest version of the Fauna do Brazil (Taxonomic Catalog of the Brazilian Fauna)
Description
This function downloads the Taxonomic Catalog of the Brazilian Fauna
database, which is required for filtering occurrence records using
specialists' information via the flag_faunabr() function.
Usage
faunabr_here(
data_dir,
data_version = "latest",
solve_discrepancy = TRUE,
overwrite = TRUE,
remove_files = TRUE,
verbose = TRUE
)
Arguments
data_dir |
(character) a directory to save the data downloaded from Fauna do Brazil. |
data_version |
(character) version of the Fauna do Brazil database to download. Use "latest" to get the most recent version, which is updated frequently. Alternatively, specify an older version (e.g., data_version="1.2"). Default value is "latest". |
solve_discrepancy |
(logical) whether to resolve inconsistencies between species and subspecies information. When set to TRUE (default), species information is updated based on unique data from subspecies. For example, if a subspecies occurs in a certain state, it implies that the species also occurs in that state. |
overwrite |
(logical) If TRUE, data is overwritten. Default is TRUE. |
remove_files |
(logical) whether to remove the downloaded files used in building the final dataset. Default is TRUE. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
Value
A message indicating that the data were successfully saved in the directory
specified by data_dir.
Examples
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download the latest version of the Taxonomic Catalog of the Brazilian Fauna
faunabr_here(data_dir = data_dir)
Identify and correct coordinates based on country information
Description
This function identifies and corrects inverted and transposed coordinates based on country information.
Usage
fix_countries(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
country_column,
correct_country = "correct_country",
distance = 5,
progress_bar = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably with
country information checked using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
country_column |
(character) name of the column containing the country information. |
correct_country |
(character) name of the column with logical value indicating whether each record falls within the country specified in the metadata. Default is 'correct_country'. See details. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the country assigned in the |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress.
Default is TRUE. |
Details
The function checks and corrects coordinate errors in occurrence records
by testing whether each point falls within the expected country polygon
(from RuHere’s internal world map).
The input occurrence data must contain a column (specified in the
correct_country argument) with logical values indicating which records to
check and fix — only those marked as FALSE will be processed. This column
can be obtained by running the check_countries() function.
It runs a series of seven tests to detect common issues such as inverted signs or swapped latitude/longitude values. Inverted coordinates have their signs flipped (e.g., -45 instead of 45), placing the point in the opposite hemisphere, while swapped coordinates have latitude and longitude values exchanged (e.g., -47, -15 instead of -15, -47).
For each test, country borders are buffered by distance km to account for
minor positional errors.
The type of issue (or "correct") is recorded in a new column,
country_issues. Records that match their assigned country after any
correction are updated accordingly, while remaining mismatches are labeled
"incorrect".
This function can be used internally by check_countries() to automatically
identify and fix common coordinate errors.
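The idea of testing candidate corrections can be sketched in base R. The seven transformations below are an assumption about what the seven tests cover, the issue labels are hypothetical, and a rough bounding box around Brazil stands in for the buffered country polygon that the function actually uses.

```r
# Seven candidate corrections for a recorded (lon, lat) pair
# (assumed to mirror the seven tests; labels are hypothetical)
candidates <- function(lon, lat) {
  data.frame(
    issue = c("inverted_long", "inverted_lat", "inverted_both",
              "swapped", "swapped_inverted_long", "swapped_inverted_lat",
              "swapped_inverted_both"),
    lon = c(-lon, lon, -lon, lat, -lat, lat, -lat),
    lat = c(lat, -lat, -lat, lon, lon, -lon, -lon)
  )
}

# Stand-in for the polygon test: a rough bounding box around Brazil
in_box <- function(lon, lat) {
  lon >= -74 & lon <= -34 & lat >= -34 & lat <= 6
}

# A record from central Brazil stored with longitude and latitude swapped
rec <- c(lon = -15.8, lat = -47.9)
in_box(rec[["lon"]], rec[["lat"]])  # the recorded position fails the test

# Try every candidate and keep those that fall inside the box
cand <- candidates(rec[["lon"]], rec[["lat"]])
hit <- cand[in_box(cand$lon, cand$lat), ]
hit
```

Only records already marked FALSE in the correct_country column would go through this kind of search, which keeps valid coordinates untouched.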
Value
The original occ data.frame with the coordinates in the long and lat
columns corrected, and an additional column (country_issues) indicating
whether the coordinates are:
- correct: the record falls within the assigned country;
- inverted: longitude and/or latitude have reversed signs;
- swapped: longitude and latitude are transposed (i.e., each appears in the other's column);
- incorrect: the record falls outside the assigned country and could not be corrected.
Examples
# Load example data
data("occurrences", package = "RuHere") # Import example data
# Standardize country names
occ_country <- standardize_countries(occ = occurrences,
return_dictionary = FALSE)
# Check whether records fall within the assigned countries
occ_country_checked <- check_countries(occ = occ_country,
country_column = "country_suggested")
# Fix records with incorrect or misassigned countries
occ_country_fixed <- fix_countries(occ = occ_country_checked,
country_column = "country_suggested")
Identify and correct coordinates based on state information
Description
This function identifies and corrects inverted and transposed coordinates based on state information.
Usage
fix_states(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
state_column,
correct_state = "correct_state",
distance = 5,
progress_bar = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably with
state information checked using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
state_column |
(character) name of the column containing the state information. |
correct_state |
(character) name of the column with logical value indicating whether each record falls within the state specified in the metadata. Default is 'correct_state'. See details. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the state assigned in the |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress. Default is TRUE. |
Details
The function checks and corrects coordinate errors in occurrence records
by testing whether each point falls within the expected state polygon
(from RuHere’s internal world map).
The input occurrence data must contain a column (specified in the
correct_state argument) with logical values indicating which records to
check and fix — only those marked as FALSE will be processed. This column
can be obtained by running the check_states() function.
It runs a series of seven tests to detect common issues such as inverted signs or swapped latitude/longitude values. Inverted coordinates have their signs flipped (e.g., -45 instead of 45), placing the point in the opposite hemisphere, while swapped coordinates have latitude and longitude values exchanged (e.g., -47, -15 instead of -15, -47).
For each test, state borders are buffered by distance km to account for
minor positional errors.
The type of issue (or "correct") is recorded in a new column,
state_issues. Records that match their assigned state after any
correction are updated accordingly, while remaining mismatches are labeled
"incorrect".
This function can be used internally by check_states() to automatically
identify and fix common coordinate errors.
Value
The original occ data.frame with the coordinates in the long and lat
columns corrected, and an additional column (state_issues) indicating
whether the coordinates are:
- correct: the record falls within the assigned state;
- inverted: longitude and/or latitude have reversed signs;
- swapped: longitude and latitude are transposed (i.e., each appears in the other's column);
- incorrect: the record falls outside the assigned state and could not be corrected.
Examples
# Load example data
data("occurrences", package = "RuHere") # Import example data
# Subset records of Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Standardize country names
occ_country <- standardize_countries(occ = occ,
return_dictionary = FALSE)
# Standardize state names
occ_state <- standardize_states(occ = occ_country,
country_column = "country_suggested",
return_dictionary = FALSE)
# Check whether records fall within the assigned states
occ_states_checked <- check_states(occ = occ_state,
state_column = "state_suggested")
# Fix records with incorrect or misassigned states
occ_states_fixed <- fix_states(occ = occ_states_checked,
state_column = "state_suggested")
Identify records outside natural ranges according to BIEN
Description
Flags (validates) occurrence records based on known distribution data
from the Botanical Information and Ecology Network (BIEN) data. This function
checks if an occurrence point for a given species falls within its documented
distribution, allowing for user-defined buffers around the region. Records
are flagged as valid (TRUE) if they fall inside the documented
distribution (plus optional buffer) for the species in the BIEN dataset.
Usage
flag_bien(
data_dir,
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
buffer = 10,
progress_bar = FALSE,
verbose = TRUE
)
Arguments
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
buffer |
(numeric) buffer distance (in kilometers) to be applied around the region of distribution. Default is 10 km. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) if |
Value
A data.frame that is the original occ data frame
augmented with a new column named bien_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the BIEN
data. Records for species not found in the BIEN data will have
NA in the bien_flag column.
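The logic behind a range-plus-buffer flag can be sketched with terra. The square polygon and the three points are made-up toy data, not BIEN ranges, and the steps are a simplified assumption about the approach rather than the internals of flag_bien().

```r
library(terra)

# Toy range: a 1 x 1 degree square near the equator (not real BIEN data)
range_poly <- vect(cbind(c(0, 1, 1, 0), c(0, 0, 1, 1)),
                   type = "polygons", crs = "EPSG:4326")

# Three records: inside the range, roughly 5.5 km east of it,
# and roughly 111 km east of it
pts <- vect(cbind(c(0.5, 1.05, 2), c(0.5, 0.5, 0.5)),
            type = "points", crs = "EPSG:4326")

# For longitude/latitude data, buffer() takes the width in meters
range_buffered <- buffer(range_poly, width = 10 * 1000)  # 10 km

# TRUE when a point intersects the buffered range
inside_flag <- is.related(pts, range_buffered, relation = "intersects")
inside_flag
```

The second point shows why the buffer matters: it lies just outside the range polygon but within the tolerance, so it is not treated as an error.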
Examples
# Load example data
data("occurrences", package = "RuHere")
# Filter occurrences for golden trumpet tree
occ <- occurrences[occurrences$species == "Handroanthus serratifolius", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'bien_here()' beforehand to download the necessary data files
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using BIEN specialist information
occ_bien <- flag_bien(data_dir = dataset_dir, occ = occ)
Color palette for flagged records
Description
flag_colors is a named character vector defining the default colors used to
plot occurrence records flagged with mapview_here().
Usage
flag_colors
Format
A named character vector where:
- names: flag labels corresponding to categories generated by the various flag_* and checking functions.
- values: hex color codes or standard R color names used for plotting.
See Also
mapview_here
Examples
data(flag_colors)
# View all flag categories and their colors
flag_colors
Get consensus across multiple flags
Description
This function creates a new column representing the consensus across multiple flag columns. The consensus can be computed in two ways:
- "all_true": A record is considered valid (TRUE) only if all specified flags are valid (TRUE).
- "any_true": A record is considered valid (TRUE) if at least one specified flag is valid (TRUE).
Usage
flag_consensus(
occ,
flags,
consensus_rule = "all_true",
flag_name = "consensus_flag",
remove_flag_columns = FALSE
)
Arguments
occ |
(data.frame or data.table) a dataset with occurrence records that has been processed by two or more flagging functions. |
flags |
(character) a string vector with the names of the flags to be used in the consensus evaluation. See Details for the available options. |
consensus_rule |
(character) a string specifying how the consensus
should be computed. Options are "all_true" (default) and "any_true". |
flag_name |
(character) name of the column that will store the
consensus result. Default is "consensus_flag". |
remove_flag_columns |
(logical) whether to remove the original flag
columns specified in |
Details
The following flags are available: correct_country, correct_state, cultivated, fossil, inaturalist, faunabr, florabr, wcvp, iucn, duplicated, thin_geo, thin_env, year, .val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf, .inst, and .aohi.
Value
The original occ with an additional logical column defined by
flag_name, indicating the consensus result based on the selected
consensus_rule.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Get consensus using florabr, wcvp, and iucn flags
# Valid (TRUE) only when all flags are TRUE
occ_consensus_all <- flag_consensus(occ = occ_flagged,
flags = c("florabr", "wcvp", "iucn"),
consensus_rule = "all_true")
# Valid (TRUE) when at least one flag is TRUE
occ_consensus_any <- flag_consensus(occ = occ_flagged,
flags = c("florabr", "wcvp", "iucn"),
consensus_rule = "any_true")
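The two consensus rules amount to row-wise logical reductions over the selected flag columns. The base-R sketch below illustrates the idea on a toy data frame. The values, the `*_flag` column names, and the absence of NA flags are assumptions made for illustration; this is not the package's implementation.

```r
# Toy flag columns (hypothetical values; no NAs assumed)
toy <- data.frame(
  florabr_flag = c(TRUE, TRUE, FALSE, FALSE),
  wcvp_flag    = c(TRUE, FALSE, TRUE, FALSE),
  iucn_flag    = c(TRUE, TRUE, TRUE, FALSE)
)
flag_cols <- c("florabr_flag", "wcvp_flag", "iucn_flag")

# Number of TRUE flags per record
n_true <- rowSums(toy[, flag_cols])

# "all_true": valid only when every selected flag is TRUE
all_true <- n_true == length(flag_cols)
# "any_true": valid when at least one selected flag is TRUE
any_true <- n_true > 0

data.frame(toy, all_true, any_true)
```

With real data, flags may contain NA (for example, species absent from a distribution dataset), so it is worth checking how those records are treated before relying on the consensus column.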
Flag occurrence records of cultivated individuals
Description
This function identifies records of cultivated individuals based on record description.
Usage
flag_cultivated(
occ,
columns = c("occurrenceRemarks", "habitat", "locality"),
cultivated_terms = NULL,
not_cultivated_terms = NULL
)
Arguments
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
columns |
(character) vector of column names in |
cultivated_terms |
(character) optional vector of additional terms that
indicate a cultivated individual. Default is NULL, meaning it will use the
cultivated-related expressions available in |
not_cultivated_terms |
(character) optional vector of additional terms
that indicate a non-cultivated individual. Default is NULL, meaning it will
use the non cultivated-related expressions available in
|
Value
A data.frame that is the original occ data frame augmented with
a new column named cultivated_flag. Records identified as cultivated
receive FALSE, while all other records receive TRUE.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Flag records of cultivated individuals
occ_cultivated <- flag_cultivated(occ = occurrences)
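As a rough illustration of term-based flagging, the sketch below searches a text column for cultivation-related terms while letting negating expressions override a match. The terms, the toy records, and the helper `has_term()` are hypothetical; the package's built-in dictionary and matching rules may differ.

```r
# Hypothetical records and terms (not the package's built-in dictionary)
toy <- data.frame(
  occurrenceRemarks = c("Cultivated in botanical garden", "Forest edge",
                        "Not cultivated, wild population", NA),
  stringsAsFactors = FALSE
)
cultivated_terms <- c("cultivated", "planted")
not_cultivated_terms <- c("not cultivated")

# Case-insensitive search for any of the terms in a text vector
has_term <- function(x, terms) {
  grepl(paste(terms, collapse = "|"), x, ignore.case = TRUE)
}

# A record counts as cultivated if it matches a cultivated term
# and does not match a negating expression
is_cult <- has_term(toy$occurrenceRemarks, cultivated_terms) &
  !has_term(toy$occurrenceRemarks, not_cultivated_terms)

# Cultivated records receive FALSE; all others receive TRUE
toy$cultivated_flag <- !is_cult
toy
```

In this sketch, records with missing text are left unflagged (TRUE), because `grepl()` returns FALSE for NA input.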
Flag duplicated records
Description
This function identifies duplicated records based on species name and coordinates, as well as user-defined additional columns or raster cells. Among duplicated records, the function keeps only one unflagged record, chosen according to a continuous variable (e.g., keeping the most recent), a categorical variable (e.g., prioritizing a specific data source), or randomly.
Usage
flag_duplicates(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
additional_groups = NULL,
continuous_variable = NULL,
decreasing = TRUE,
categorical_variable = NULL,
priority_categories = NULL,
by_cell = FALSE,
raster_variable = NULL
)
Arguments
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
species |
(character) the name of the column containing species names. Default is "species". |
long |
(character) the name of the column containing longitude values.
Default is "decimalLongitude". |
lat |
(character) the name of the column containing latitude values.
Default is "decimalLatitude". |
additional_groups |
(character) optional vector of additional column
names to consider when identifying duplicates. For example, if |
continuous_variable |
(character) optional name of a numeric column used
to sort duplicated records and select one to remain unflagged. Default is
NULL. |
decreasing |
(logical) whether to sort records in decreasing order using
the |
categorical_variable |
(character) optional name of a
categorical column used to sort duplicated records and select one to remain
unflagged. If provided, the order of priority must be specified through
|
priority_categories |
(character) vector of categories, in the desired
order of priority, present in the column specified in |
by_cell |
(logical) whether to use raster cells instead of raw
coordinates to identify duplicates (i.e., all records inside the same raster
cell are treated as duplicates). If |
raster_variable |
(SpatRaster) a |
Value
A data.frame that is the original occ data frame augmented with
a new column named duplicated_flag. Records identified as duplicated
receive FALSE, while all unique retained records receive TRUE.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Duplicate some records as example
occurrences <- rbind(occurrences[1:1000, ], occurrences[1:100,])
# Flag duplicates
occ_dup <- flag_duplicates(occ = occurrences)
sum(!occ_dup$duplicated_flag) #Number of duplicated records
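The idea of keeping one unflagged record per duplicate group, chosen by a continuous variable, can be sketched in base R as follows. The toy data and the use of `year` as the sorting variable are assumptions for illustration only, not the package's internal code.

```r
# Toy records: rows 1 and 2 share species and coordinates
toy <- data.frame(
  species = c("A", "A", "A", "B"),
  decimalLongitude = c(-50, -50, -51, -50),
  decimalLatitude  = c(-25, -25, -25, -25),
  year = c(1990, 2015, 2000, 1990)
)

# Sort so that the preferred record (most recent year) comes first
ord <- order(toy$year, decreasing = TRUE)
sorted <- toy[ord, ]

# duplicated() marks every repeat after the first occurrence of a key
key_cols <- c("species", "decimalLongitude", "decimalLatitude")
is_dup <- duplicated(sorted[, key_cols])

# Map back to the original row order: retained TRUE, duplicates FALSE
toy$duplicated_flag <- NA
toy$duplicated_flag[ord] <- !is_dup
toy
```

Here the 1990 record at (-50, -25) is flagged because a 2015 record exists for the same species and coordinates.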
Select Environmentally Thinned Occurrences Using Moran's I Autocorrelation
Description
This function evaluates multiple environmentally thinned datasets (produced using different numbers of bins) and selects the one that best balances low spatial autocorrelation and number of retained records.
For each number of bins provided in n_bins, the function computes Moran's I
for the selected environmental variables and summarizes autocorrelation using
a chosen statistic (mean, median, minimum, or maximum). The best thinning
level is then selected according to criteria described in Details.
Usage
flag_env_moran(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
env_layers,
n_bins,
distance = "haversine",
moran_summary = "mean",
min_records = 10,
min_imoran = 0.1,
prioritary_column = NULL,
decreasing = TRUE,
do_pca = FALSE,
mask = NULL,
pca_buffer = 1000,
flag_for_NA = FALSE,
return_all = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
env_layers |
(SpatRaster) object containing environmental variables for
splitting in |
n_bins |
(numeric) vector of number of bins into which each environmental variable will be divided (e.g., c(5, 10, 15, 20)). |
distance |
(character) distance metric used to compute the weight matrix
for Moran's I. One of |
moran_summary |
(character) summary statistic used to select the best
number of bins. One of |
min_records |
(numeric) minimum number of records required for a dataset
to be considered. Default: 10. |
min_imoran |
(numeric) minimum Moran's I required to avoid selecting
datasets with extremely low spatial autocorrelation. Default: 0.1. |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
do_pca |
(logical) whether environmental variables should be summarized
using PCA before computing Moran's I. Default: FALSE. |
mask |
(SpatVector or SpatExtent) optional spatial object to mask the
|
pca_buffer |
(numeric) buffer width (km) used when PCA is computed from
the convex hull of records. Ignored if |
flag_for_NA |
(logical) whether to treat records falling in |
return_all |
(logical) whether to return the full list of all thinned
datasets. Default is FALSE. |
verbose |
(logical) whether to print messages about the progress.
Default is TRUE. |
Details
This function is inspired by the approach used in Velazco et al. (2020), extending the procedure by allowing:
- prioritization of records based on a user-defined variable (e.g., year);
- optional PCA transformation of environmental layers;
- selection rules that prevent datasets with too few records or extremely low Moran's I from being chosen.
Procedure overview
1. For each number of bins in n_bins, generate an environmentally thinned dataset using the thin_env() function.
2. Extract environmental values for the retained records.
3. Compute Moran's I for each environmental variable.
4. Summarize autocorrelation per dataset (mean, median, min, or max).
5. Apply the selection criteria:
   - Keep only datasets with at least min_records records.
   - Keep only datasets with Moran's I greater than or equal to min_imoran.
   - Round Moran's I to two decimal places and select the dataset with the 25th lowest autocorrelation.
   - If more than one dataset is selected, choose the dataset retaining more records.
   - If still tied, choose the dataset with the largest number of bins.
Distance matrix for Moran's I
Moran's I requires a weight matrix derived from pairwise distances among records. Two distance types are available:
- "haversine": geographic distance computed with fields::rdist.earth() (default; recommended for longitude/latitude coordinates)
- "euclidean": Euclidean distance computed with stats::dist()
Environmental PCA (optional)
If do_pca = TRUE, the environmental layers are summarized using PCA before
Moran's I is computed.
- If mask is provided, PCA is computed on the masked layers.
- Otherwise, a convex hull around the records is buffered by pca_buffer kilometers to define the PCA area.
- The axes that together explain more than 90% of the variation are selected.
Value
A list with:
- occ: the selected thinned occurrence dataset, with the column thin_env_flag indicating whether each record is retained (TRUE) or flagged as redundant in environmental space (FALSE).
- imoran: a table summarizing Moran's I for each number of bins
- n_bins: the number of bins that produced the selected dataset
- moran_summary: the summary statistic used to select the dataset
- all_thined: (optional) list of thinned datasets for all bin numbers. Only returned if return_all was set to TRUE
Examples
# Load example data
data("occurrences", package = "RuHere")
# Subset occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Select thinned occurrences
occ_env_moran <- flag_env_moran(occ = occ,
n_bins = c(5, 10, 20, 30, 40, 50),
env_layers = r)
# Selected number of bins
occ_env_moran$n_bins
# Number of flagged and unflagged records
sum(occ_env_moran$occ$thin_env_flag) #Retained
sum(!occ_env_moran$occ$thin_env_flag) #Flagged for thinning out
# Results of the spatial autocorrelation analysis
occ_env_moran$imoran
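The selection criteria listed in Details can be pictured with a small base-R sketch on a hypothetical summary table. The table, its column names (`n_records`, `moran`), and the reading of "25th lowest autocorrelation" as "at or below the 25th percentile of the rounded Moran's I" are assumptions made for illustration; the function's internal logic may differ.

```r
# Hypothetical summary: one row per candidate number of bins
imoran <- data.frame(
  n_bins    = c(5, 10, 20, 30, 40),
  n_records = c(8, 25, 40, 40, 60),
  moran     = c(0.12, 0.151, 0.149, 0.154, 0.40)
)
min_records <- 10
min_imoran  <- 0.1

# 1) Drop datasets with too few records or too little autocorrelation
cand <- imoran[imoran$n_records >= min_records &
                 imoran$moran >= min_imoran, ]

# 2) Round Moran's I and keep values at or below the 25th percentile
#    (assumed interpretation; small tolerance guards against
#    floating-point noise)
cand$moran <- round(cand$moran, 2)
q25 <- stats::quantile(cand$moran, 0.25)
cand <- cand[cand$moran <= q25 + sqrt(.Machine$double.eps), ]

# 3) Break ties by records retained, then by the larger number of bins
cand <- cand[order(-cand$n_records, -cand$n_bins), ]
selected <- cand[1, ]
selected
```

In this toy table, three candidates tie at a rounded Moran's I of 0.15; the two retaining 40 records beat the one retaining 25, and the tie between them is resolved in favor of 30 bins.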
Identify records outside natural ranges according to Fauna do Brasil
Description
Flags (validates) occurrence records based on known distribution data
from the Catálogo Taxonômico da Fauna do Brasil (faunabr) data. This function
checks if an occurrence point for a given species falls within its documented
distribution, allowing for user-defined buffers around Brazilian states,
or the entire country. Records are flagged as valid (TRUE) if they fall
within the specified range for the distribution information available in the
faunabr data.
Usage
flag_faunabr(
data_dir,
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
origin = NULL,
by_state = TRUE,
buffer_state = 20,
by_country = TRUE,
buffer_country = 20,
keep_columns = TRUE,
spat_state = NULL,
spat_country = NULL,
progress_bar = FALSE,
verbose = FALSE
)
Arguments
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character) filter the |
by_state |
(logical) if |
buffer_state |
(numeric) buffer distance (in kilometers) to be applied around the known state distribution boundaries. Records within this distance are considered valid. Default is 20 km. |
by_country |
(logical) if |
buffer_country |
(numeric) buffer distance (in kilometers) to be applied around the country boundaries. Records within this distance are considered valid. Default is 20 km. |
keep_columns |
(logical) if |
spat_state |
(SpatVector) a SpatVector of the Brazilian states. By default, it uses the SpatVector provided by geobr::read_state(). It can be another SpatVector, but the structure must be identical to 'faunabr::states', with a column called "abbrev_state" identifying the state codes. |
spat_country |
(SpatVector) a SpatVector of the world countries. By default, it uses the SpatVector provided by rnaturalearth::ne_countries. It can be another SpatVector, but the structure must be identical to 'faunabr::world_fauna', with a column called "country_code" identifying the country codes. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) if |
Value
A data.frame that is the original occ data frame
augmented with a new column named faunabr_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the faunabr
data. Records for species not found in the faunabr data will have
NA in the faunabr_flag column.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Azure Jay
occ <- occurrences[occurrences$species == "Cyanocorax caeruleus", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'faunabr_here()' beforehand to download the necessary data files for your species
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using faunabr specialist information
occ_fauna <- flag_faunabr(data_dir = dataset_dir, occ = occ)
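Because the flag column can hold three states (TRUE, FALSE, and NA for species without distribution information), it helps to tabulate and filter it explicitly. The sketch below uses a made-up data frame that mimics the documented `faunabr_flag` column; the records and the decision to keep NA records are assumptions for illustration.

```r
# Toy result mimicking the documented faunabr_flag column (values made up)
occ_fauna_toy <- data.frame(
  species = c("Cyanocorax caeruleus", "Cyanocorax caeruleus",
              "Unlisted species"),
  faunabr_flag = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Count valid (TRUE), flagged (FALSE), and unevaluated (NA) records
table(occ_fauna_toy$faunabr_flag, useNA = "ifany")

# Drop only records explicitly flagged as outside the range;
# records without distribution information (NA) are kept here by choice
kept <- occ_fauna_toy[!(occ_fauna_toy$faunabr_flag %in% FALSE), ]
nrow(kept)
```

Using `%in% FALSE` instead of `== FALSE` avoids NA propagating into the row index, which would otherwise produce rows filled with NA.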
Identify records outside natural ranges according to Flora e Funga do Brasil
Description
Flags (validates) occurrence records based on known distribution data
from the Flora e Funga do Brasil (florabr) data. This function checks if an
occurrence point for a given species falls within its documented distribution,
allowing for user-defined buffers around Brazilian states, biomes, or the
entire country. Records are flagged as valid (TRUE) if they fall within
the specified range for the distribution information available in the
florabr data.
Usage
flag_florabr(
data_dir,
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
origin = NULL,
by_state = TRUE,
buffer_state = 20,
by_biome = TRUE,
buffer_biome = 20,
by_endemism = TRUE,
buffer_brazil = 20,
state_vect = NULL,
state_column = NULL,
biome_vect = NULL,
biome_column = NULL,
br_vect = NULL,
keep_columns = TRUE,
progress_bar = FALSE,
verbose = FALSE
)
Arguments
data_dir |
(character) directory path where the |
occ |
(data.frame) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character or NULL) filter the |
by_state |
(logical) if |
buffer_state |
(numeric) buffer distance (in kilometers) to be applied around the known state distribution boundaries. Records within this distance are considered valid. Default is 20 km. |
by_biome |
(logical) if |
buffer_biome |
(numeric) buffer distance (in kilometers) to be applied around the known biome distribution boundaries. Records within this distance are considered valid. Default is 20 km. |
by_endemism |
(logical) if |
buffer_brazil |
(numeric) buffer distance (in kilometers) to be applied around the entire Brazilian boundary. Default is 20 km. |
state_vect |
(SpatVector) an optional custom simple features
( |
state_column |
(character) the name of the column in |
biome_vect |
(SpatVector) an optional custom simple features ( |
biome_column |
(character) the name of the column in |
br_vect |
(SpatVector) an optional custom simple features ( |
keep_columns |
(logical) if |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) if |
Value
A data.frame that is the original occ data frame
augmented with a new column named florabr_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the florabr
data. Records for species not found in the florabr data will have
NA in the florabr_flag column.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'florabr_here()' beforehand to download the necessary data files for your species
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using specialist information from Flora do Brasil
occ_flora <- flag_florabr(data_dir = dataset_dir, occ = occ)
Flag fossil records
Description
This function identifies occurrence records that correspond to fossils, based on specific search terms found in selected columns.
Usage
flag_fossil(
occ,
columns = c("basisOfRecord", "occurrenceRemarks"),
fossil_terms = NULL
)
Arguments
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
columns |
(character) vector of column names in |
fossil_terms |
(character) optional vector of additional terms that
indicate a fossil record (e.g., |
Value
A data.frame that is the original occ data frame augmented with
a new column named fossil_flag. Records identified as fossils receive
FALSE, while all other records receive TRUE.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Flag fossil records
occ_fossil <- flag_fossil(occ = occurrences)
Select Spatially Thinned Occurrences Using Moran's I Autocorrelation
Description
This function evaluates multiple geographically thinned datasets (produced using different thinning distances) and selects the one that best balances low spatial autocorrelation and number of retained records.
For each thinning distance provided in d, the function computes Moran's I
for the selected environmental variables and summarizes autocorrelation using
a chosen statistic (mean, median, minimum, or maximum). The best thinning
level is then selected according to criteria described in Details.
Usage
flag_geo_moran(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
d,
distance = "haversine",
moran_summary = "mean",
min_records = 10,
min_imoran = 0.1,
prioritary_column = NULL,
decreasing = TRUE,
env_layers,
do_pca = FALSE,
mask = NULL,
pca_buffer = 1000,
return_all = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
d |
(numeric) vector of thinning distances in kilometers (e.g., c(5, 10, 15, 20)). |
distance |
(character) distance metric used to compute the weight matrix
for Moran's I. One of |
moran_summary |
(character) summary statistic used to select the best
thinning distance. One of |
min_records |
(numeric) minimum number of records required for a dataset
to be considered. Default: 10. |
min_imoran |
(numeric) minimum Moran's I required to avoid selecting
datasets with extremely low spatial autocorrelation. Default: 0.1. |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
env_layers |
(SpatRaster) object containing environmental variables for computing Moran's I. |
do_pca |
(logical) whether environmental variables should be summarized
using PCA before computing Moran's I. Default: FALSE. |
mask |
(SpatVector or SpatExtent) optional spatial object to mask the
|
pca_buffer |
(numeric) buffer width (km) used when PCA is computed from
the convex hull of records. Ignored if |
return_all |
(logical) whether to return the full list of all thinned
datasets. Default: FALSE. |
verbose |
(logical) whether to print messages about the progress.
Default is TRUE. |
Details
This function is inspired by the approach used in Velazco et al. (2021), extending the procedure by allowing:
- prioritization of records based on a user-defined variable (e.g., year);
- optional PCA transformation of environmental layers;
- selection rules that prevent datasets with too few records or extremely low Moran's I from being chosen.
Procedure overview
1. For each distance in d, generate a spatially thinned dataset using the thin_geo() function.
2. Extract environmental values for the retained records.
3. Compute Moran's I for each environmental variable.
4. Summarize autocorrelation per dataset (mean, median, min, or max).
5. Apply the selection criteria:
   - Keep only datasets with at least min_records records.
   - Keep only datasets with Moran's I higher than min_imoran.
   - Round Moran's I to two decimal places and select the dataset with the 25th lowest autocorrelation.
   - If more than one dataset is selected, choose the dataset retaining more records.
   - If still tied, choose the dataset with the smallest thinning distance.
Distance matrix for Moran's I
Moran's I requires a weight matrix derived from pairwise distances among records. Two distance types are available:
- "haversine": geographic distance computed with fields::rdist.earth() (default; recommended for longitude/latitude coordinates)
- "euclidean": Euclidean distance computed with stats::dist()
Environmental PCA (optional)
If do_pca = TRUE, the environmental layers are summarized using PCA before
Moran's I is computed.
- If mask is provided, PCA is computed on the masked layers.
- Otherwise, a convex hull around the records is buffered by pca_buffer kilometers to define the PCA area.
- The axes that together explain more than 90% of the variation are selected.
Value
A list with:
- occ: the selected thinned occurrence dataset, with the column thin_geo_flag indicating whether each record is retained (TRUE) or flagged (FALSE).
- imoran: a table summarizing Moran's I for each thinning distance
- distance: the thinning distance that produced the selected dataset
- moran_summary: the summary statistic used to select the dataset
- all_thined: (optional) list of thinned datasets for all distances. Only returned if return_all was set to TRUE
References
Velazco, S. J. E., Svenning, J. C., Ribeiro, B. R., & Laureto, L. M. O. (2021). On opportunities and threats to conserve the phylogenetic diversity of Neotropical palms. Diversity and Distributions, 27(3), 512–523. https://doi.org/10.1111/ddi.13215
Examples
# Load example data
data("occurrences", package = "RuHere")
# Subset occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Select thinned occurrences
occ_geo_moran <- flag_geo_moran(occ = occ, d = c(5, 10, 20, 30),
env_layers = r)
# Selected distance
occ_geo_moran$distance
# Number of flagged and unflagged records
sum(occ_geo_moran$occ$thin_geo_flag) #Retained
sum(!occ_geo_moran$occ$thin_geo_flag) #Flagged for thinning out
# Results of the spatial autocorrelation analysis
occ_geo_moran$imoran
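For readers unfamiliar with Moran's I, the following base-R sketch computes it from a distance-derived weight matrix. It assumes inverse-distance weights and Euclidean distances from `stats::dist()`; the weighting scheme actually used by the function is not stated here, and the helper `moran_i()` and the toy coordinates are hypothetical.

```r
# Moran's I with inverse-distance weights (assumed weighting scheme)
moran_i <- function(values, coords) {
  d <- as.matrix(stats::dist(coords))  # pairwise Euclidean distances
  w <- 1 / d                           # closer records get larger weights
  diag(w) <- 0                         # a record is not its own neighbor
  z <- values - mean(values)           # deviations from the mean
  n <- length(values)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Four hypothetical records forming two tight pairs far apart
coords <- cbind(x = c(0, 1, 10, 11), y = 0)
clustered   <- moran_i(c(1, 1, 5, 5), coords)  # neighbors share values
alternating <- moran_i(c(1, 5, 1, 5), coords)  # neighbors differ
c(clustered = clustered, alternating = alternating)
```

Values near 1 indicate that nearby records have similar environmental values, while negative values indicate that neighbors tend to differ. Records with identical coordinates would give infinite weights in this sketch, so they would need to be handled before use.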
Flag occurrence records sourced from iNaturalist
Description
This function identifies and flags occurrence records sourced from iNaturalist. It can flag all iNaturalist records or only those that do not have Research Grade status.
Usage
flag_inaturalist(occ, columns = "datasetName", research_grade = FALSE)
Arguments
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
columns |
(character) column name in |
research_grade |
(logical) whether to flag all records from
iNaturalist, including those with Research Grade status. Default is FALSE. |
Details
According to iNaturalist, observations become Research Grade when:
- the iNaturalist community agrees on a species-level ID or lower, i.e., when more than 2/3 of identifiers agree on a taxon;
- the community taxon and the observation taxon agree;
- or the community agrees on an ID between family and species and votes that the community taxon is as good as it can be.
Value
A data.frame that is the original occ data frame augmented with
a new column named inaturalist_flag. Flagged records receive
FALSE, while all other records receive TRUE.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Flag only iNaturalist records without Research Grade
occ_inat <- flag_inaturalist(occ = occurrences, research_grade = FALSE)
table(occ_inat$inaturalist_flag) # Number of records flagged (FALSE)
# Flag all iNaturalist records (including Research Grade)
occ_inat <- flag_inaturalist(occ = occurrences, research_grade = TRUE)
table(occ_inat$inaturalist_flag) # Number of records flagged (FALSE)
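The effect of `research_grade` can be pictured with a small stand-alone sketch. The `datasetName` values, the text patterns, and the helper `flag_inat_sketch()` are hypothetical; how the package actually recognizes iNaturalist and Research Grade records is not described here.

```r
# Hypothetical dataset names
toy <- data.frame(
  datasetName = c("iNaturalist research-grade observations",
                  "iNaturalist observations",
                  "Herbarium collection"),
  stringsAsFactors = FALSE
)

flag_inat_sketch <- function(dataset_name, research_grade = FALSE) {
  is_inat <- grepl("inaturalist", dataset_name, ignore.case = TRUE)
  is_rg   <- grepl("research-grade", dataset_name, ignore.case = TRUE)
  # research_grade = TRUE: flag every iNaturalist record
  # research_grade = FALSE: flag only those lacking Research Grade
  flagged <- if (research_grade) is_inat else is_inat & !is_rg
  !flagged  # flagged records receive FALSE
}

only_non_rg <- flag_inat_sketch(toy$datasetName, research_grade = FALSE)
all_inat    <- flag_inat_sketch(toy$datasetName, research_grade = TRUE)
data.frame(toy, only_non_rg, all_inat)
```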
Identify records outside natural ranges according to the IUCN
Description
Flags (validates) occurrence records based on known distribution data
from the International Union for Conservation of Nature (IUCN) data. This
function checks if an occurrence point for a given species falls within its
documented distribution, allowing for user-defined buffers around the region.
Records are flagged as valid (TRUE) if they fall inside the documented
distribution (plus optional buffer) for the species in the IUCN dataset.
Usage
flag_iucn(
data_dir,
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
origin = "native",
presence = "all",
buffer = 20,
progress_bar = FALSE,
verbose = FALSE
)
Arguments
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character) vector specifying which origin categories should
be considered as part of the species' range. Options are: |
presence |
(character) vector specifying which presence type should
be considered as part of the species' range. Options are: |
buffer |
(numeric) buffer distance (in kilometers) to be applied around the region of distribution. Default is 20 km. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) if |
Value
A data.frame that is the original occ data frame
augmented with a new column named iucn_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the IUCN
data. Records for species not found in the IUCN data will have
NA in the iucn_flag column.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'iucn_here()' beforehand to download the necessary data files
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using IUCN specialist information
occ_iucn <- flag_iucn(data_dir = dataset_dir, occ = occ)
Flag name dictionary
Description
A named character vector used to convert internal flag column names (produced by the package's flagging functions) into human-readable labels.
Usage
flag_names
Format
A named character vector of length 25.
The names correspond to the original flag codes (e.g., "correct_country",
"duplicated_flag", ".cen", "consensus_flag"), and the values are the
cleaned, human-readable labels (e.g., "Wrong country", "Duplicated",
"Country/Province centroid", "consensus").
Details
This object is used internally by functions such as mapview_here() and
remove_flagged() to display more intuitive flag names to users.
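A named character vector of this kind works as a lookup table: indexing it by flag code returns the matching label. The sketch below rebuilds only the four pairs quoted in the Format section into a hypothetical `flag_names_toy` object; it is not the full 25-element dictionary.

```r
# Excerpt built from the examples in the Format section; not the full object
flag_names_toy <- c(
  correct_country = "Wrong country",
  duplicated_flag = "Duplicated",
  .cen            = "Country/Province centroid",
  consensus_flag  = "consensus"
)

# Translate flag codes into human-readable labels
cols <- c("duplicated_flag", ".cen")
labels <- unname(flag_names_toy[cols])
labels
```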
Identify records outside natural ranges according to the World Checklist of Vascular Plants
Description
Flags (validates) occurrence records based on known distribution data
from the World Checklist of Vascular Plants (WCVP) data. This function checks
if an occurrence point for a given species falls within its documented
distribution, allowing for user-defined buffers around the region. Records
are flagged as valid (TRUE) if they fall inside the documented
distribution (plus optional buffer) for the species in the WCVP dataset.
Usage
flag_wcvp(
data_dir,
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
origin = "native",
buffer = 20,
progress_bar = FALSE,
verbose = FALSE
)
Arguments
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character) vector specifying which origin categories should
be considered as part of the species' range. Options are: |
buffer |
(numeric) buffer distance (in kilometers) to be applied around the region of distribution. Default is 20 km. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) if |
Value
A data.frame that is the original occ data frame
augmented with a new column named wcvp_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the WCVP
data. Records for species not found in the WCVP data will have
NA in the wcvp_flag column.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Filter occurrences for Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'wcvp_here()' beforehand to download the necessary data files
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using WCVP specialist information
occ_wcvp <- flag_wcvp(data_dir = dataset_dir, occ = occ)
Flag records outside a year range
Description
This function identifies occurrence records collected before or after user-specified years.
Usage
flag_year(
occ,
year_column = "year",
lower_limit = NULL,
upper_limit = NULL,
flag_NA = FALSE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
year_column |
(character) name of the column containing the year in which the occurrence was recorded. This column must be numeric. |
lower_limit |
(numeric) the minimum acceptable year. Records collected
before this value will be flagged. Default is NULL. |
upper_limit |
(numeric) the maximum acceptable year. Records collected
after this value will be flagged. Default is NULL. |
flag_NA |
(logical) whether to flag records with missing year
information. Default is FALSE. |
Value
A data.frame identical to occ but with an additional column named
year_flag. Records collected outside the year range specified are assigned
FALSE.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Flag records collected before 1980 and after 2010
occ_year <- flag_year(occ = occurrences, lower_limit = 1980,
upper_limit = 2010)
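The year-range logic, including the role of `flag_NA`, can be sketched in a few lines of base R. The helper `flag_year_sketch()` is hypothetical and assumes the limits are inclusive, so that only records strictly before `lower_limit` or strictly after `upper_limit` are flagged.

```r
flag_year_sketch <- function(year, lower_limit = NULL, upper_limit = NULL,
                             flag_NA = FALSE) {
  ok <- rep(TRUE, length(year))
  # Records before the lower limit or after the upper limit get FALSE;
  # missing years pass these checks
  if (!is.null(lower_limit)) ok <- ok & (is.na(year) | year >= lower_limit)
  if (!is.null(upper_limit)) ok <- ok & (is.na(year) | year <= upper_limit)
  # Missing years are flagged only on request
  if (flag_NA) ok[is.na(year)] <- FALSE
  ok
}

years <- c(1975, 1980, 2005, 2010, 2020, NA)
keep_na <- flag_year_sketch(years, lower_limit = 1980, upper_limit = 2010)
drop_na <- flag_year_sketch(years, lower_limit = 1980, upper_limit = 2010,
                            flag_NA = TRUE)
data.frame(years, keep_na, drop_na)
```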
Download the latest version of Flora e Funga do Brasil database
Description
This function downloads the Flora e Funga do Brasil database, which is
required for filtering occurrence records using specialists' information
via the flag_florabr() function.
Usage
florabr_here(
data_dir,
data_version = "latest",
solve_discrepancy = TRUE,
overwrite = TRUE,
remove_files = TRUE,
verbose = TRUE
)
Arguments
data_dir |
(character) a directory to save the data downloaded from Flora e Funga do Brasil. |
data_version |
(character) version of the Flora e Funga do Brasil database to download. Use "latest" to get the most recent version, updated weekly. Alternatively, specify an older version (e.g., data_version="393.319"). Default value is "latest". |
solve_discrepancy |
(logical) whether to resolve discrepancies between species and subspecies/varieties information. When set to TRUE, species information is updated based on unique data from varieties and subspecies. For example, if a subspecies occurs in a certain biome, it implies that the species also occurs in that biome. Default is TRUE. |
overwrite |
(logical) if TRUE, data is overwritten. Default = TRUE. |
remove_files |
(logical) whether to remove the downloaded files used in building the final dataset. Default is TRUE. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
Value
A message indicating that the data were successfully saved in the directory
specified by data_dir.
Examples
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download the latest version of the Flora e Funga do Brasil database
florabr_here(data_dir = data_dir)
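The idea behind solve_discrepancy can be illustrated with a toy table in base R. This is only a sketch under stated assumptions: the column names (species, taxon_rank, biome), the ";" separator, and the helper merge_biomes are hypothetical and do not reflect the actual structure of the Flora e Funga do Brasil data.

```r
# Toy table: one species plus two infraspecific taxa (hypothetical layout)
taxa <- data.frame(
  species = c("Genus epithet", "Genus epithet", "Genus epithet"),
  taxon_rank = c("species", "subspecies", "variety"),
  biome = c("Cerrado", "Cerrado;Caatinga", "Amazon"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: union of all biomes recorded for a species
# and its subspecies/varieties
merge_biomes <- function(b) {
  parts <- unlist(strsplit(b, ";", fixed = TRUE))
  paste(sort(unique(parts)), collapse = ";")
}

# Species-level information updated from the infraspecific records
species_biomes <- vapply(split(taxa$biome, taxa$species),
                         merge_biomes, character(1))
species_biomes
```

Here the species-level entry ends up listing every biome in which the species or any of its infraspecific taxa occurs.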
Format and standardize column names and data types of an occurrence dataset
Description
Format and standardize column names and data types of an occurrence dataset
Usage
format_columns(
occ,
metadata,
extract_binomial = TRUE,
binomial_from = NULL,
include_subspecies = FALSE,
include_variety = FALSE,
check_numeric = TRUE,
numeric_columns = NULL,
check_encoding = TRUE,
data_source = NULL,
progress_bar = FALSE,
verbose = FALSE
)
Arguments
occ |
(data.frame or data.table) a dataset with occurrence records,
preferably obtained from |
metadata |
(character or data.frame) if a character, one of 'gbif',
'specieslink', 'bien', or 'idigbio', specifying which metadata template to
use (the corresponding data frames are available in
|
extract_binomial |
(logical) whether to create a column with the binomial name of the species. If FALSE, it will create a column "species" with the exact name stored in the scientificName column. Default is TRUE. |
binomial_from |
(character) the column name in metadata from which to
extract the binomial name. Only applicable if |
include_subspecies |
(logical) whether to include subspecies in the
binomial name. Only applicable if |
include_variety |
(logical) whether to include variety in the binomial
name. Only applicable if |
check_numeric |
(logical) whether to check and coerce the columns
specified in |
numeric_columns |
(character) a vector of column names that must be
numeric. Default is NULL, meaning that if |
check_encoding |
(logical) whether to check and fix the encoding of columns that typically contain special characters (see Details). Default is TRUE. |
data_source |
(character) the source of the occurrence records. Default
is NULL, meaning it will use the same string provided in |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about the progress. Default is FALSE. |
Details
If a user-defined metadata data.frame is provided, it must include the following 21 columns: 'scientificName', 'collectionCode', 'catalogNumber', 'decimalLongitude', 'decimalLatitude', 'coordinateUncertaintyInMeters', 'elevation', 'country', 'stateProvince', 'municipality', 'locality', 'year', 'eventDate', 'recordedBy', 'identifiedBy', 'basisOfRecord', 'occurrenceRemarks', 'habitat', 'datasetName', 'datasetKey', and 'key'.
If check_encoding = TRUE, the function will inspect and, if necessary, fix
the encoding of these columns:
'collectionCode', 'catalogNumber', 'country', 'stateProvince',
'municipality', 'locality', 'eventDate', 'recordedBy', 'identifiedBy',
'basisOfRecord', and 'datasetName'.
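Before passing a user-defined metadata data.frame, it may help to check that all 21 required columns are present. The sketch below uses only base R; the helper missing_metadata_cols is hypothetical, and the cells of the toy template are left as NA because the expected cell values are not described here.

```r
# The 21 column names listed in Details
required_cols <- c(
  "scientificName", "collectionCode", "catalogNumber", "decimalLongitude",
  "decimalLatitude", "coordinateUncertaintyInMeters", "elevation", "country",
  "stateProvince", "municipality", "locality", "year", "eventDate",
  "recordedBy", "identifiedBy", "basisOfRecord", "occurrenceRemarks",
  "habitat", "datasetName", "datasetKey", "key"
)

# Hypothetical helper: report which required columns are absent
missing_metadata_cols <- function(metadata) {
  setdiff(required_cols, names(metadata))
}

# Toy one-row template holding every required column
my_metadata <- as.data.frame(
  matrix(NA_character_, nrow = 1, ncol = length(required_cols),
         dimnames = list(NULL, required_cols)),
  stringsAsFactors = FALSE
)

# A deliberately incomplete template, to show what the check reports
incomplete <- my_metadata[, !(names(my_metadata) %in% c("habitat", "key"))]

missing_metadata_cols(my_metadata)
missing_metadata_cols(incomplete)
```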
Value
A data.frame with standardized column names and data types according to the specified metadata.
Examples
# Example with GBIF
data("occ_gbif", package = "RuHere") #Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
# Example with SpeciesLink
data("occ_splink", package = "RuHere") #Import data example
splink_standardized <- format_columns(occ_splink, metadata = "specieslink")
# Example with BIEN
data("occ_bien", package = "RuHere") #Import data example
bien_standardized <- format_columns(occ_bien, metadata = "bien")
# Example with iDigBio
data("occ_idig", package = "RuHere") #Import data example
idig_standardized <- format_columns(occ_idig, metadata = "idigbio")
Download occurrence records from BIEN
Description
Wrapper function to access and download occurrence records from the Botanical Information and Ecology Network (BIEN) database. It provides a unified interface to query BIEN data by species, genus, family, or by geographic or political boundaries.
Usage
get_bien(by = "species", cultivated = FALSE,
new.world = NULL, all.taxonomy = FALSE, native.status = FALSE,
natives.only = TRUE, observation.type = FALSE, political.boundaries = TRUE,
collection.info = TRUE, only.geovalid = TRUE, min.lat = NULL, max.lat = NULL,
min.long = NULL, max.long = NULL, species = NULL, genus = NULL,
country = NULL, country.code = NULL, state = NULL, county = NULL,
state.code = NULL, county.code = NULL, family = NULL, sf = NULL, dir,
filename = "bien_output", file.format = "csv", compress = FALSE,
save = FALSE, verbose = TRUE, ...)
Arguments
by |
(character) type of query to perform ( |
cultivated |
(logical) whether to include cultivated records or exclude
them. Default is FALSE. |
new.world |
(logical) if |
all.taxonomy |
(logical) if |
native.status |
(logical) if |
natives.only |
(logical) if |
observation.type |
(logical) if |
political.boundaries |
(logical) if |
collection.info |
(logical) if |
only.geovalid |
(logical) if |
min.lat |
(numeric) the minimum latitude (in decimal degrees) for a
bounding-box query when |
max.lat |
(numeric) the maximum latitude (in decimal degrees) for a
bounding-box query when |
min.long |
(numeric) the minimum longitude (in decimal degrees) for a
bounding-box query when |
max.long |
(numeric) the maximum longitude (in decimal degrees) for a
bounding-box query when |
species |
(character) species name(s) to query when |
genus |
(character) genus name(s) to query when |
country |
(character) country name when |
country.code |
(character) two-letter ISO country code corresponding
to |
state |
(character) state or province name when |
county |
(character) county or equivalent subdivision name
when |
state.code |
(character) state or province code corresponding
to |
county.code |
(character) county or equivalent subdivision code
corresponding to |
family |
(character) family name(s) to query when |
sf |
(object of class |
dir |
(character) directory path where the file will be saved.
Required if |
filename |
(character) name of the output file without extension.
Default is "bien_output". |
file.format |
(character) file format for saving output ( |
compress |
(logical) if |
save |
(logical) if |
verbose |
(logical) if |
... |
additional arguments passed to the underlying BIEN function. |
Value
A data.frame containing BIEN occurrence records that match
the specified query. The structure and available columns depend on the chosen
by value and the corresponding BIEN function.
Examples
# Example: download occurrence records for a single species
res_test <- get_bien(
by = "species",
species = "Paubrasilia echinata",
cultivated = TRUE,
native.status = TRUE,
observation.type = TRUE,
only.geovalid = TRUE
)
Identify Environmental Blocks and Group Nearby Records in Environmental Space
Description
This function creates a multidimensional grid in environmental space by
splitting each environmental variable into n_bins equally sized intervals.
It then assigns each occurrence record to an environmental block (bin
combination) and identifies records that fall into the same block (i.e.,
records that are close to each other in environmental space).
The results can be visualized using the plot_env_bins() function.
Usage
get_env_bins(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
env_layers,
n_bins = 5
)
Arguments
occ |
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
env_layers |
(SpatRaster) object containing environmental variables. |
n_bins |
(numeric) number of bins into which each environmental variable will be divided. |
Value
A list with:
- data: a data frame including extracted environmental values, bin indices, and a unique block_id for each record.
- breaks: a named list of numeric vectors containing the break points for each variable (used by plot_env_bins()).
Examples
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Get bins
b <- get_env_bins(occ = occ, env_layers = r, n_bins = 5)
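The binning idea described above can be sketched in base R without the package. This is an illustration only: the variable names and simulated values are made up, and it assumes the break points span the range of the values extracted at the records, which may differ from what get_env_bins() does internally.

```r
set.seed(1)
# Toy environmental values for 20 records (simulated, not real data)
env <- data.frame(bio_1 = runif(20, 10, 25),
                  bio_12 = runif(20, 800, 2200))
n_bins <- 5

# Equal-width break points for each variable
breaks <- lapply(env, function(v) seq(min(v), max(v), length.out = n_bins + 1))

# Bin index (1..n_bins) of each record for each variable
bins <- mapply(function(v, b) {
  findInterval(v, b, rightmost.closed = TRUE, all.inside = TRUE)
}, env, breaks)

# A block is one combination of bin indices across variables
block_id <- apply(bins, 1, paste, collapse = "_")

# Records sharing a block are close in environmental space
records_per_block <- table(block_id)
records_per_block
```

With two variables and n_bins = 5 there are at most 25 blocks; blocks holding more than one record identify groups of environmentally similar occurrences.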
get_idigbio
Description
Downloads species occurrence records from the iDigBio (Integrated Digitized Biocollections) database with flexible taxonomic and geographic filtering options.
Usage
get_idigbio(species = NULL, fields = "all", genus = NULL,
family = NULL, order = NULL, phylum = NULL, kingdom = NULL, country = NULL,
county = NULL, limit = NULL, offset = NULL, dir, filename = "idigbio_output",
save = FALSE, compress = FALSE, file.format = "csv", verbose = TRUE, ...)
Arguments
species |
(character) scientific name(s) of species to search for.
Default is NULL. |
fields |
(character) fields to retrieve from iDigBio. Default is "all". |
genus |
(character) genus name for filtering results. Default is NULL. |
family |
(character) family name for filtering results. Default is NULL. |
order |
(character) order name for filtering results. Default is NULL. |
phylum |
(character) phylum name for filtering results. Default is NULL. |
kingdom |
(character) kingdom name for filtering results. Default is NULL. |
country |
(character) country name for geographic filtering. Default is NULL. |
county |
(character) county name for geographic filtering. Default is NULL. |
limit |
(numeric) maximum number of records to retrieve. Default is
NULL. |
offset |
(numeric) number of records to skip before starting retrieval.
Default is NULL. |
dir |
(character) directory path where the file will be saved.
Required if |
filename |
(character) name of the output file without extension.
Default is "idigbio_output". |
save |
(logical) if |
compress |
(logical) if |
file.format |
(character) file format for saving output ( |
verbose |
(logical) if |
... |
additional arguments passed to |
Value
A data.frame containing occurrence records from iDigBio with the requested
fields.
Examples
## basic search by taxon name
records_basic <- get_idigbio(species = "Arecaceae")
## search for a species, limiting the result to 100 records
records_multiple <- get_idigbio(
species = c("Araucaria angustifolia"),
limit = 100)
## save results as a compressed RDS file
records_saved_rds <- get_idigbio(
species = "Anacardiaceae",
limit = 50,
dir = tempdir(),
filename = "anacardiaceae_records",
save = TRUE,
compress = TRUE,
file.format = "rds")
Download occurrence records from SpeciesLink
Description
Retrieves occurrence data from the speciesLink network using user-defined filters. The function allows querying by taxonomic, geographic, and collection-related parameters.
Usage
get_specieslink(species = NULL, key = NULL, dir,
filename = "specieslink_output", save = FALSE,
basisOfRecord = NULL, family = NULL, institutionCode = NULL,
collectionID = NULL, catalogNumber = NULL,
kingdom = NULL, phylum = NULL, class = NULL,
order = NULL, genus = NULL, specificEpithet = NULL,
infraspecificEpithet = NULL, collectionCode = NULL,
identifiedBy = NULL, yearIdentified = NULL,
country = NULL, stateProvince = NULL, county = NULL,
typeStatus = NULL, recordedBy = NULL,
recordNumber = NULL, yearCollected = NULL,
locality = NULL, occurrenceRemarks = NULL,
barcode = NULL, bbox = NULL, landuse_1 = NULL,
landuse_year_1 = NULL, landuse_2 = NULL,
landuse_year_2 = NULL, phonetic = FALSE,
coordinates = NULL, scope = NULL, synonyms = NULL,
typus = FALSE, images = FALSE, redlist = NULL,
limit = NULL, file.format = "csv",
compress = FALSE, verbose = TRUE)
Arguments
species |
(character) species name. Default is NULL. |
key |
(character) API key or authentication token if required. Default
is NULL. |
dir |
(character) directory where files will be saved (if |
filename |
(character) name of the output file without extension.
Default is "specieslink_output". |
save |
(logical) whether to save the results to file. Default is FALSE. |
basisOfRecord |
(character) filter by basis of record. Default is NULL. |
family |
(character) family name. Default is NULL. |
institutionCode |
(character) code of the institution that holds the
specimen. Default is NULL. |
collectionID |
(character) unique identifier for the collection.
Default is NULL. |
catalogNumber |
(character) catalog number of the specimen or record.
Default is NULL. |
kingdom |
(character) kingdom name. Default is NULL. |
phylum |
(character) phylum name. Default is NULL. |
class |
(character) class name. Default is NULL. |
order |
(character) order name. Default is NULL. |
genus |
(character) genus name. Default is NULL. |
specificEpithet |
(character) specific epithet of the species. Default
is NULL. |
infraspecificEpithet |
(character) infraspecific epithet. Default
is NULL. |
collectionCode |
(character) code identifying the collection within an
institution. Default is NULL. |
identifiedBy |
(character) name of the person who identified the
specimen. Default is NULL. |
yearIdentified |
(numeric) year of identification. Default is NULL. |
country |
(character) country name. Default is NULL. |
stateProvince |
(character) state or province name. Default is NULL. |
county |
(character) county or municipality name. Default is NULL. |
typeStatus |
(character) type status. Default is NULL. |
recordedBy |
(character) collector name. Default is NULL. |
recordNumber |
(numeric) collector’s record number. Default is NULL. |
yearCollected |
(numeric) year of collection. Default is NULL. |
locality |
(character) locality description. Default is NULL. |
occurrenceRemarks |
(character) text field for remarks about the
occurrence. Default is NULL. |
barcode |
(character) barcode or unique specimen identifier. Default is
NULL. |
bbox |
(character) bounding box coordinates in the format
|
landuse_1 |
(character) land use category for the first year.
Default is NULL. |
landuse_year_1 |
(numeric) year corresponding to |
landuse_2 |
(character) land use category for the second year.
Default is NULL. |
landuse_year_2 |
(numeric) year corresponding to |
phonetic |
(logical) whether to use phonetic matching for taxon names.
Default is FALSE. |
coordinates |
(character) whether to include only records with
geographic coordinates ( |
scope |
(character) scope of the query ( |
synonyms |
(character) whether to include synonyms of the specified
taxon ( |
typus |
(logical) whether to filter only type specimens. Default is
FALSE. |
images |
(logical) whether to restrict to records with associated
images. Default is FALSE. |
redlist |
(character) filter by IUCN Red List category. Default is
NULL. |
limit |
(numeric) maximum number of records to return. Default is
NULL. |
file.format |
(character) file format for saving output ( |
compress |
(logical) whether to compress the output file into |
verbose |
(logical) if |
Details
The speciesLink API key can be set permanently using:
set_specieslink_credentials("your_api_key")
Value
A data.frame containing the occurrence data fields returned
by speciesLink.
Examples
## Not run:
# Retrieve records for Arecaceae in São Paulo
res <- get_specieslink(
family = "Arecaceae",
country = "Brazil",
stateProvince = "São Paulo",
basisOfRecord = "PreservedSpecimen",
limit = 10
)
# Save results as compressed CSV
get_specieslink(
family = "Arecaceae",
country = "Brazil",
save = TRUE,
dir = tempdir(),
filename = "arecaceae_sp",
compress = TRUE
)
## End(Not run)
Static Visualization of Occurrence Flags with ggplot
Description
This function creates a static map of occurrence records using ggplot2, highlighting which points were flagged by data-validation functions. This visualization helps users quickly inspect spatial patterns of flagged and unflagged records and diagnose potential data-quality issues.
The function can also be used to plot the heatmap generated by the
spatial_kde() function.
Usage
ggmap_here(
occ,
species = NULL,
long = "decimalLongitude",
lat = "decimalLatitude",
flags = "all",
additional_flags = NULL,
names_additional_flags = NULL,
col_additional_flags = NULL,
show_no_flagged = TRUE,
col_points = NULL,
size_points = 1,
heatmap = NULL,
low_color = "blue",
mid_color = "yellow",
high_color = "red",
midpoint = 0.5,
alpha_heatmap = 0.5,
continent = NULL,
continent_fill = "gray70",
continent_linewidth = 0.3,
continent_border = "white",
ocean_fill = "aliceblue",
extension = NULL,
facet_wrap = FALSE,
theme_plot = ggplot2::theme_minimal(),
...
)
Arguments
occ |
(data.frame or data.table) a dataset containing occurrence records that has been processed by one or more flagging functions. See Details for available flag types. |
species |
(character) name of the species to subset and plot. Default is
NULL. |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
flags |
(character) the flags to be used for coloring the records. Use
|
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is NULL. |
names_additional_flags |
(character) an optional different name to the
flag provided in |
col_additional_flags |
(character) if |
show_no_flagged |
(logical) whether to display records that did not receive any flag. Default is TRUE. |
col_points |
(character) A named vector assigning colors to each
flag. If |
size_points |
(numeric) point size for plotting occurrences. Default is 1. |
heatmap |
(SpatRaster) an optional heatmap containing the estimated density
of occurrence records, typically generated by the |
low_color |
(character) color used for the lowest density values in the heatmap. Only applicable if a heatmap is provided. Default is "blue". |
mid_color |
(character) color used for the midpoint of the heatmap gradient. Default is "yellow". |
high_color |
(character) color used for the highest density values in the heatmap. Default is "red". |
midpoint |
(numeric) the central value of the heatmap gradient,
corresponding to |
alpha_heatmap |
(numeric) Alpha transparency applied to the heatmap layer, ranging from 0 (fully transparent) to 1 (fully opaque). Default is 0.5. |
continent |
(SpatVector) optional polygon layer representing continent
boundaries. If |
continent_fill |
(character) fill color for the continent polygons. Default is "gray70". |
continent_linewidth |
(numeric) line width for continent boundaries. Default is 0.3. |
continent_border |
(character) color of the continent polygon borders. Default is "white". |
ocean_fill |
(character) background color used to represent the ocean. Default is "aliceblue". |
extension |
(SpatExtent or numeric) optional map extent specified as a
|
facet_wrap |
(logical) whether to plot each flag in a separate panel
using |
theme_plot |
(theme) a |
... |
other arguments passed to |
Details
This function expects an occurrence dataset that has already been processed
by one or more flagging routines from RuHere or related packages such as
CoordinateCleaner. Any logical column in occ can be used as a flag.
The following built-in flag names are recognized:
From RuHere:
correct_country, correct_state, cultivated, florabr, faunabr,
wcvp, iucn, bien, duplicated, thin_geo, thin_env, consensus
From CoordinateCleaner:
.val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf,
.inst, .aohi
Users may also supply additional logical columns using
additional_flags, optionally providing alternative display names
(names_additional_flags) and colors (col_additional_flags).
If continent is not provided, the background map is a simplified world
polygon included with the package (a modified version of
rnaturalearthdata::map_units110). To inspect this object, run:
terra::unwrap(getExportedValue("RuHere", "world"))
When facet_wrap = TRUE, each flag is plotted in a separate panel,
allowing direct comparison among different types of data issues.
Value
A ggplot object displaying flagged and optionally unflagged occurrence records.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Visualize all flags with ggplot
ggmap_here(occ = occ_flagged)
# Visualize each flag in a separate panel
ggmap_here(occ = occ_flagged, facet_wrap = TRUE)
Static Visualization of Richness and Trait Maps
Description
This function is the dedicated plotting tool for outputs from richness_here().
It automatically handles single-layer rasters (e.g., species richness) and
multi-layer rasters (e.g., multiple biological traits or flags), creating
a standardized visual using ggplot2.
Usage
ggrid_here(
raster,
low_color = "blue",
mid_color = "yellow",
high_color = "red",
alpha = 0.8,
continent = NULL,
continent_fill = "gray70",
continent_linewidth = 0.3,
continent_border = "white",
ocean_fill = "aliceblue",
extension = NULL,
theme_plot = ggplot2::theme_minimal(),
...
)
Arguments
raster |
(SpatRaster) A raster object generated by |
low_color |
(character) color for the lowest values. Default is "blue". |
mid_color |
(character) color for the midpoint. Default is "yellow". |
high_color |
(character) color for the highest values. Default is "red". |
alpha |
(numeric) transparency of the grid (0-1). Default is 0.8. |
continent |
(SpatVector) optional polygon layer for boundaries. |
continent_fill |
(character) fill color for continents. Default is "gray70". |
continent_linewidth |
(numeric) line width for continent boundaries. Default is 0.3. |
continent_border |
(character) color of the continent polygon borders. Default is "white". |
ocean_fill |
(character) background color for the ocean. Default is "aliceblue". |
extension |
(SpatExtent or numeric) optional map extent. |
theme_plot |
(theme) a |
... |
other arguments passed to |
Value
A ggplot object.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Simple richness map
r_records <- richness_here(occ_flagged, summary = "records", res = 2)
ggrid_here(r_records)
# Density of specific flags
# Let's see where 'florabr' flags are concentrated
r_flags <- richness_here(occ_flagged, summary = "records",
field = "florabr_flag",
field_name = "Records flagged by florabr",
fun = function(x, ...) sum(!x, na.rm = TRUE),
res = 2)
ggrid_here(r_flags)
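The custom fun in the last example counts flagged records per cell. It relies on the convention that flagged records carry FALSE in the flag column (as described for year_flag in flag_year()), so negating the values and summing gives the number of flagged records. A minimal base-R sketch with made-up values, where count_flagged is a hypothetical name for the same function:

```r
# Same function as passed to `fun` above
count_flagged <- function(x, ...) sum(!x, na.rm = TRUE)

# Toy flag values for the records falling in one grid cell
# (TRUE = not flagged, FALSE = flagged, NA = not evaluated)
cell_flags <- c(TRUE, FALSE, FALSE, NA, TRUE)

n_flagged <- count_flagged(cell_flags)
n_flagged
```

Records with NA are ignored because of na.rm = TRUE, so they count as neither flagged nor unflagged.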
Import a download requested from GBIF
Description
This function imports a dataset downloaded from GBIF using a request key
generated by the request_gbif() function. It optionally allows saving
the imported occurrences to disk in CSV or GZIP format.
Usage
import_gbif(
request_key,
write_file = FALSE,
output_dir = NULL,
file.format = "gz",
select_columns = TRUE,
columns_to_import = NULL,
overwrite = FALSE,
...
)
Arguments
request_key |
an object of class 'request_key' returned by the
|
write_file |
(logical) whether to save the downloaded occurrences to disk.
Default is FALSE. If TRUE, you must specify the |
output_dir |
(character) a directory to save the data downloaded from
GBIF. Only applicable if |
file.format |
(character) the format to save the file. Options available
are 'csv' (comma-separated values) and 'gz' (compressed GZIP). Only
applicable if |
select_columns |
(logical) whether to import only specific columns (TRUE) or all columns (FALSE) from the occurrence table. Default is TRUE. |
columns_to_import |
(character) vector of column names to import.
Default is NULL, meaning it will import the column names specified in
|
overwrite |
(logical) whether to overwrite the file in the 'output_dir' if it already exists. Default is FALSE. |
... |
other arguments passed to |
Value
A data frame containing the GBIF occurrence records. If write_file = TRUE,
the function also saves the dataset to disk in the specified format.
Note
This function requires an active internet connection.
Examples
## Not run:
# Prepare data to request GBIF download
gbif_prepared <- prepare_gbif_download(species = "Araucaria angustifolia")
# Submit a request to download occurrences
gbif_requested <- request_gbif(gbif_info = gbif_prepared)
# Check progress
rgbif::occ_download_wait(gbif_requested)
# After the download succeeds, import the data
occ_gbif <- import_gbif(request_key = gbif_requested)
## End(Not run)
Download species distribution information from IUCN
Description
This function downloads information on species distributions from the IUCN
Red List, required for filtering occurrence records using specialists'
information via the flag_iucn() function.
Usage
iucn_here(
data_dir,
species,
synonyms = NULL,
iucn_credential = NULL,
overwrite = FALSE,
progress_bar = FALSE,
verbose = FALSE,
return_data = TRUE
)
Arguments
data_dir |
(character) directory to save the data downloaded from IUCN. |
species |
(character) a vector of species names for which to retrieve distribution information. |
synonyms |
(data.frame) an optional data.frame containing synonyms of
the target species. The first column must contain the target species names,
and the second column their corresponding synonyms. Default is NULL. |
iucn_credential |
(character) your IUCN API key. Default is NULL. |
overwrite |
(logical) whether to overwrite existing files. Default is
FALSE. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to display progress messages. Default is
FALSE. |
return_data |
(logical) whether to return a data frame containing the
species distribution information downloaded from IUCN. Default is TRUE. |
Details
This function uses the rredlist::rl_species() function to retrieve
distribution data from the IUCN Red List. The data include information at
the country and regional levels, following the World Geographical Scheme for
Recording Plant Distributions (WGSRPD) — but applicable to both plants and
animals.
Unfortunately, the range polygons available at https://www.iucnredlist.org/resources/spatial-data-download cannot be accessed automatically.
Because taxonomic information in IUCN may be outdated, you can optionally
provide a table of synonyms to broaden the search. The synonyms data.frame
should have the accepted species in the first column and their synonyms in
the second. See RuHere::synonys for an example.
The function also downloads the WGSRPD map used to represent distribution regions.
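A synonyms table with the layout described above can be built in base R. The sketch below is illustrative only: the column names are hypothetical (only the column order matters according to the description), and the listed names serve as an example rather than a checked synonymy.

```r
# First column: target (accepted) species; second column: synonyms
synonyms <- data.frame(
  accepted_name = c("Araucaria angustifolia", "Araucaria angustifolia"),
  synonym = c("Araucaria brasiliana", "Columbea angustifolia"),
  stringsAsFactors = FALSE
)

species <- "Araucaria angustifolia"

# Names that a broadened search would cover: the target species plus
# every synonym listed for it
search_names <- unique(c(species,
                         synonyms[synonyms[[1]] %in% species, 2]))
search_names
```

A table like this would then be supplied through the synonyms argument of iucn_here().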
Value
A message indicating that the data were successfully saved in the directory
specified by data_dir.
If return_data = TRUE, the function additionally returns a data frame
containing the species distribution information retrieved from IUCN.
Examples
## Not run:
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download species distribution information from IUCN
iucn_here(data_dir = data_dir, species = "Araucaria angustifolia")
## End(Not run)
Interactive Visualization of Occurrence Flags with mapview
Description
This function creates an interactive map of occurrence records using mapview, visually highlighting flags. This tool helps users explore which records were flagged by one or more validation functions and inspect them directly on the map.
Usage
map_here(
occ,
species = NULL,
long = "decimalLongitude",
lat = "decimalLatitude",
flags = "all",
additional_flags = NULL,
names_additional_flags = NULL,
col_additional_flags = NULL,
show_no_flagged = TRUE,
cex = 6,
lwd = 2,
col_points = NULL,
label = NULL,
...
)
Arguments
occ |
(data.frame or data.table) a dataset containing occurrence records that has been processed by one or more flagging functions. See Details for available flag types. |
species |
(character) name of the species to subset and plot. Default is
NULL. |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
flags |
(character) the flags to be used for coloring the records. Use
|
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is NULL. |
names_additional_flags |
(character) an optional different name to the
flag provided in |
col_additional_flags |
(character) if |
show_no_flagged |
(logical) whether to display records that did not receive any flag. Default is TRUE. |
cex |
(numeric) point size for plotting occurrences. Default is 6. |
lwd |
(numeric) line width for point borders. Default is 2. |
col_points |
(character) A named vector assigning colors to each
flag. If |
label |
(character) column name in |
... |
additional arguments passed to |
Details
The following flags are available: correct_country, correct_state, cultivated, fossil, inaturalist, faunabr, florabr, wcvp, iucn, duplicated, thin_geo, thin_env, .val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf, .inst, and .aohi.
These flags are typically generated by functions in the RuHere or
CoordinateCleaner workflow to identify potential data-quality issues in
occurrence records.
Users may also supply additional logical columns using
additional_flags, optionally providing alternative display names
(names_additional_flags) and colors (col_additional_flags).
Value
An interactive mapview object displaying flagged and optionally unflagged occurrence records.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Visualize flags interactively
map_here(occ = occ_flagged, label = "record_id")
Fast Moran's I Autocorrelation Index
Description
This function computes Moran's I autocorrelation coefficient for a numeric
vector x using a matrix of weights. The method follows Gittleman and Kot
(1990). It reimplements ape::Moran.I() in C++ to be substantially faster and
more memory-efficient.
Usage
moranfast(
x,
weight,
na_rm = TRUE,
scaled = FALSE,
alternative = c("two.sided")
)
Arguments
x |
(numeric) A numeric vector (e.g., environmental values extracted from occurrence records). |
weight |
(matrix) A matrix of spatial weights (e.g., a distance or
inverse-distance matrix). The number of rows must be equal to the length of
|
na_rm |
(logical) whether to remove missing values from |
scaled |
(logical) whether to scale Moran's I so that it ranges between
–1 and +1. Default is FALSE. |
alternative |
(character) The alternative hypothesis tested against
the null hypothesis of no autocorrelation. Must be one of |
Value
A list with the following components:
-
observed – The observed Moran's I.
-
expected – The expected value of Moran's I under the null hypothesis.
-
sd – The standard deviation of Moran's I under the null hypothesis.
-
p.value – The p-value of the test based on the chosen
alternative.
References
Gittleman, J. L., & Kot, M. (1990). Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology, 39(3), 227–241.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Filter occurrences of Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Extract values for bio_1
bio_1 <- terra::extract(r$bio_1,
occ[, c("decimalLongitude", "decimalLatitude")],
ID = FALSE, xy = TRUE)
#Remove NAs
bio_1 <- na.omit(bio_1)
# Convert values to numeric
v <- as.numeric(bio_1$bio_1)
# Compute geographic distance matrix
d <- fields::rdist.earth(x1 = as.matrix(bio_1[, c("x", "y")]), miles = FALSE)
# Inverse-distance weights
d <- 1/d
# Fill diagonal with 0
diag(d) <- 0
# Replace infinite values (from zero distances) with 0
d[is.infinite(d)] <- 0
# Compute Moran's I
m <- moranfast(x = v, weight = d, scaled = TRUE)
# Print results
m
Occurrence records of Yellow Trumpet Tree from BIEN
Description
A cleaned dataset of occurrence records for Yellow Trumpet Tree
(Handroanthus serratifolius) retrieved from the BIEN database.
The raw data were downloaded using get_bien().
The dataset was subsequently processed with the package’s internal
flagging workflow (flag_duplicates() and remove_flagged()) to remove
duplicated records.
Usage
occ_bien
Format
A data frame containing spatial coordinates, taxonomic information, and metadata returned by BIEN, after cleaning. Columns include (but may not be limited to):
-
scrubbed_species_binomial: Cleaned species name
-
longitude, latitude: Geographic coordinates
-
country, state_province, and other political boundary fields
See Also
get_bien()
Examples
# View dataset
head(occ_bien)
# Number of records
nrow(occ_bien)
Flagged occurrence records of Araucaria angustifolia
Description
A dataset containing the occurrence records of Araucaria angustifolia after applying several of the package’s flagging and data-quality assessment functions.
Usage
occ_flagged
Format
A data frame where each row corresponds to a georeferenced occurrence of A. angustifolia.
See Also
occurrences,
standardize_countries(), standardize_states(),
flag_florabr(), flag_wcvp(), flag_iucn(),
flag_cultivated(), flag_inaturalist(),
flag_duplicates(), mapview_here()
Examples
# First rows
head(occ_flagged)
# Count flagged vs. unflagged records
table(occ_flagged$correct_country)
Occurrence records of Araucaria angustifolia from GBIF
Description
A cleaned dataset of occurrence records for Araucaria angustifolia (Parana pine) retrieved from GBIF.
Records were downloaded using the package’s GBIF workflow
(prepare_gbif_download(), request_gbif(), import_gbif()), and then
cleaned using the internal flagging workflow (duplicate detection and
removal).
Usage
occ_gbif
Format
A data frame containing georeferenced GBIF occurrence records for A. angustifolia after all cleaning steps.
See Also
prepare_gbif_download(), request_gbif(), import_gbif(),
flag_duplicates(), remove_flagged()
Examples
# Preview dataset
head(occ_gbif)
# Number of cleaned records
nrow(occ_gbif)
Occurrence records of azure jay from iDigBio
Description
A cleaned dataset of occurrence records for azure jay (Cyanocorax caeruleus)
retrieved from iDigBio using get_idigbio().
Records were cleaned using the package's internal duplicate-flagging workflow.
Usage
occ_idig
Format
A data frame containing georeferenced iDigBio occurrence records for C. caeruleus after all cleaning steps.
See Also
get_idigbio(), flag_duplicates(), remove_flagged()
Examples
# First rows
head(occ_idig)
# Number of cleaned records
nrow(occ_idig)
Occurrence records of azure jay from SpeciesLink
Description
A cleaned dataset of occurrence records for azure jay (Cyanocorax caeruleus)
retrieved from SpeciesLink using get_specieslink().
Records were cleaned using the package's internal duplicate-flagging workflow.
Usage
occ_splink
Format
A data frame containing georeferenced SpeciesLink occurrence records for C. caeruleus after all cleaning steps.
See Also
get_specieslink(), flag_duplicates(), remove_flagged()
Examples
# First rows
head(occ_splink)
# Number of cleaned records
nrow(occ_splink)
Integrated occurrence dataset for three example species
Description
A harmonized, multi-source occurrence dataset containing cleaned georeferenced records for three species:
-
Araucaria angustifolia (Parana pine)
-
Cyanocorax caeruleus (Azure jay)
-
Handroanthus serratifolius (Yellow trumpet tree)
Records were retrieved from GBIF, speciesLink, BIEN, and iDigBio, standardized through the package workflow, merged, and cleaned to remove duplicates.
Usage
occurrences
Format
A data frame where each row represents a georeferenced occurrence record for one of the three species.
Columns correspond to the standardized output of
format_columns(), including:
-
species: Cleaned binomial species name
-
decimalLongitude, decimalLatitude: Coordinates
-
year: Year of collection/observation
-
Various taxonomic, temporal, locality, and metadata fields
-
Source identifiers added by format_columns() (e.g., data_source)
See Also
format_columns(), bind_here(), flag_duplicates(), remove_flagged()
Examples
# Show the first rows
head(occurrences)
# Number of occurrences per species
table(occurrences$species)
Plot Environmental Bins (2D Projection)
Description
Visualize the output of get_env_bins() by plotting environmental blocks
(bins) along two selected environmental variables. Each block is shown as
a colored rectangle, and points falling inside the same rectangle share the
same block_id.
Usage
plot_env_bins(
env_bins,
x_var,
y_var,
alpha_blocks = 0.3,
color_points = "black",
size_points = 2,
alpha_points = 0.5,
stroke_points = 1,
xlab = NULL,
ylab = NULL,
theme_plot = ggplot2::theme_minimal()
)
Arguments
env_bins |
(list) output list from get_env_bins(). |
x_var |
(character) name of the environmental variable used on the x-axis. |
y_var |
(character) name of the environmental variable used on the y-axis. |
alpha_blocks |
(numeric) transparency level of the block rectangles. Must be between 0 and 1. Default is 0.3. |
color_points |
(character) color of the points representing occurrence
records. Default is "black". |
size_points |
(numeric) size of the points representing occurrence records. Default is 2. |
alpha_points |
(numeric) transparency level of the points. Must be between 0 and 1. Default is 0.5. |
stroke_points |
(numeric) size of the border of the points. Default is 1. |
xlab |
(character) label for the x-axis. Default is NULL. |
ylab |
(character) label for the y-axis. Default is NULL. |
theme_plot |
(theme) a ggplot2 theme object. Default is ggplot2::theme_minimal(). |
Value
A ggplot object showing the environmental blocks (colored rectangles) and the occurrence records in the selected environmental space.
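As a rough illustration of how records can come to share a block_id in two-dimensional environmental space, the sketch below bins two simulated variables into equal-width intervals. It is not the get_env_bins() implementation; the data, the equal-width binning, and the ID format are assumptions made for the example.

```r
# Hypothetical sketch, not the get_env_bins() implementation
set.seed(1)
env <- data.frame(bio_1 = runif(20, 10, 25),      # simulated temperature
                  bio_12 = runif(20, 800, 2000))  # simulated precipitation
n_bins <- 4
# Split each variable into n_bins equal-width intervals
bin_x <- cut(env$bio_1, breaks = n_bins, labels = FALSE)
bin_y <- cut(env$bio_12, breaks = n_bins, labels = FALSE)
# Records sharing both interval indices fall inside the same rectangle
env$block_id <- paste(bin_x, bin_y, sep = "_")
# Number of records per occupied block
table(env$block_id)
```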
Examples
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Get bins
b <- get_env_bins(occ = occ, env_layers = r, n_bins = 10)
# Plot
plot_env_bins(b, x_var = "bio_1", y_var = "bio_12",
xlab = "Temperature", ylab = "Precipitation")
Prepare data to request GBIF download
Description
Prepare data to request GBIF download
Usage
prepare_gbif_download(
species,
rank = NULL,
kingdom = NULL,
phylum = NULL,
class = NULL,
order = NULL,
family = NULL,
genus = NULL,
strict = FALSE,
progress_bar = FALSE,
...
)
Arguments
species |
(character) a vector of species name(s). |
rank |
(character) optional taxonomic rank (for example, 'species' or 'genus'). Default is NULL, meaning it will return species matched across all ranks. |
kingdom |
(character) optional taxonomic kingdom (for example, 'Plantae' or 'Animalia'). Default is NULL, meaning it will return species matched across all kingdoms. |
phylum |
(character) optional taxonomic phylum. Default is NULL, meaning it will return species matched across all phyla. |
class |
(character) optional taxonomic class. Default is NULL, meaning it will return species matched across all classes. |
order |
(character) optional taxonomic order. Default is NULL, meaning it will return species matched across all orders. |
family |
(character) optional taxonomic family. Default is NULL, meaning it will return species matched across all families. |
genus |
(character) optional taxonomic genus. Default is NULL, meaning it will return species matched across all genera. |
strict |
(logical) If TRUE, it (fuzzy) matches only the given name, but never a taxon in the upper classification. Default is FALSE. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
... |
other parameters passed to |
Value
A data.frame with species information, including the number of occurrences and other related details.
Note
This function requires an active internet connection to access GBIF data.
Examples
gbif_prepared <- prepare_gbif_download(species = "Araucaria angustifolia")
Metadata templates used internally by format_columns()
Description
A named list of data frames containing metadata templates for the main biodiversity data providers supported by the package (GBIF, SpeciesLink, iDigBio, and BIEN).
These templates are used internally by format_columns() to harmonize
columns.
Usage
prepared_metadata
Format
A named list of four data frames:
-
$gbif: template for GBIF dataset.
-
$specieslink: template for SpeciesLink dataset.
-
$idigbio: template for iDigBio dataset.
-
$bien: template for BIEN dataset.
Details
Each element of prepared_metadata is a single-row data frame where:
-
column names correspond to the package’s standardized output fields
-
values in the row represent the original column names used by each data provider
These mappings allow format_columns() to:
-
rename fields (e.g., scientificname → scientificName)
-
identify which variables are missing or provider-specific
-
coerce classes consistently (e.g., dates, coordinates)
-
ensure compatibility when combining datasets from different sources
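The renaming step can be sketched in base R with a one-row template of the kind described above. The template, the raw data, and the column names here are made up for illustration; format_columns() itself may work differently and does more (class coercion, missing fields).

```r
# Hypothetical one-row template: names are standardized fields,
# values are the provider's original column names
template <- data.frame(species = "scientificname",
                       decimalLongitude = "lon",
                       decimalLatitude = "lat",
                       stringsAsFactors = FALSE)
# Hypothetical raw records from a provider
raw <- data.frame(scientificname = "Puma concolor",
                  lon = -49.2, lat = -25.4, extra = "x",
                  stringsAsFactors = FALSE)
# Named vector: standardized name -> original name
original <- vapply(template, as.character, character(1))
# Keep only the mapped columns that exist in the raw data
found <- original[original %in% names(raw)]
out <- raw[, found, drop = FALSE]
# Rename them to the standardized fields
names(out) <- names(found)
out
```

Columns not listed in the template (here, extra) are dropped, which is one simple way to keep outputs from different providers compatible.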
See Also
format_columns()
Examples
# View template for GBIF records
prepared_metadata$gbif
Occurrence records of Puma concolor from AtlanticR
Description
A subset of Atlantic mammals records obtained from the
atlanticr::atlantic_mammals dataset, containing occurrences of
Puma concolor.
This dataset is provided as an example to illustrate how to create
user-defined metadata templates for occurrence records from external
sources using the package’s create_metadata() function.
Usage
puma_atlanticr
Format
A data frame where each row represents a single occurrence record of
Puma concolor. Columns include species name, location, and other
relevant metadata fields provided by the atlantic_mammals dataset.
See Also
create_metadata(),
format_columns()
Examples
# Preview first rows
head(puma_atlanticr)
# Count occurrences per year
table(puma_atlanticr$year)
Relocate a column in a data frame
Description
These functions move one column to a new position in a data frame,
either immediately after or before another column, while preserving
the order of all remaining columns. They are lightweight base-R utilities
equivalent to dplyr::relocate(), but without external dependencies.
Usage
relocate_after(df, col, after)
relocate_before(df, col, before)
Arguments
df |
(data.frame) a data.frame whose columns will be reordered. |
col |
(character) the name of the column to move. |
after |
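(character) for relocate_after(), the name of the column after which col will be placed. |
before |
(character) for relocate_before(), the name of the column before which col will be placed. |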
Value
A data.frame with columns reordered.
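The reordering idea can be sketched in base R as follows. move_after() is a hypothetical stand-in written for this example, not the package's relocate_after() source.

```r
# Hypothetical base-R helper illustrating the idea behind relocate_after()
move_after <- function(df, col, after) {
  stopifnot(col %in% names(df), after %in% names(df), col != after)
  # Column order without the column being moved
  others <- setdiff(names(df), col)
  # Insert the moved column right after the reference column
  new_order <- append(others, col, after = match(after, others))
  df[, new_order, drop = FALSE]
}

df <- data.frame(a = 1, b = 2, c = 3, d = 4)
names(move_after(df, col = "a", after = "c"))  # "b" "c" "a" "d"
```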
Remove accents and special characters from strings
Description
This function removes accents and replaces special characters from strings, returning a plain-text version suitable for data cleaning or standardization.
Usage
remove_accent(s)
Arguments
s |
(character) a character vector containing the strings to process. |
Value
A character vector without accents or special characters.
Examples
remove_accent(c("Colômbia", "São Paulo"))
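For readers who want to see the kind of substitution involved, here is a minimal base-R sketch using chartr(). strip_accents() is a hypothetical helper covering only a handful of Latin characters; it is not the package's implementation, which may rely on a more general transliteration.

```r
# Hypothetical helper: replace a fixed set of accented Latin letters
strip_accents <- function(s) {
  from <- "áàâãäéèêëíìîïóòôõöúùûüçñÁÀÂÃÄÉÈÊËÍÌÎÏÓÒÔÕÖÚÙÛÜÇÑ"
  to   <- "aaaaaeeeeiiiiooooouuuucnAAAAAEEEEIIIIOOOOOUUUUCN"
  # chartr() maps the i-th character of 'from' to the i-th of 'to'
  chartr(from, to, s)
}

strip_accents(c("Colômbia", "São Paulo"))  # "Colombia" "Sao Paulo"
```

A fixed lookup like this misses characters outside the list, so a transliteration such as stringi::stri_trans_general(s, "Latin-ASCII") is usually the more robust choice.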
Remove flagged records
Description
This function removes occurrence records flagged as invalid by one or more flagging functions. Additional manual control is available to force keeping or removing specific records, regardless of their flag values.
Usage
remove_flagged(
occ,
flags = "all",
additional_flags = NULL,
force_keep = NULL,
force_remove = NULL,
remove_NA = FALSE,
column_id = "record_id",
save_flagged = FALSE,
output_dir = NULL,
overwrite = FALSE,
output_format = ".gz"
)
Arguments
occ |
(data.frame or data.table) a dataset with occurrence records that has been processed by two or more flagging functions. See details. |
flags |
(character) a character vector with the names of the flag columns to be used for filtering records. See details for the available options. Default is "all". |
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is NULL. |
force_keep |
(character) an optional character vector with the IDs of
records that were flagged but should still be kept. Default is NULL. |
force_remove |
(character) an optional character vector with the IDs of
records that were not flagged but should still be removed. Default is NULL. |
remove_NA |
(logical) whether to remove records that have NA in the flags specified. Default is FALSE. |
column_id |
(character) the name of the column containing unique record
IDs. Required if |
save_flagged |
(logical) whether to save the flagged (removed) records.
If |
output_dir |
(character) path to an existing directory where removed
flagged records will be saved. Only used when save_flagged = TRUE. Default is NULL. |
overwrite |
(logical) whether to overwrite existing files in
output_dir. Default is FALSE. |
output_format |
(character) output format for saving removed records.
Options are |
Details
The following flags are available: correct_country, correct_state, cultivated, fossil, inaturalist, faunabr, florabr, wcvp, iucn, duplicated, thin_geo, thin_env, .val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf, .inst, and .aohi.
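The filtering logic, including the manual overrides, can be sketched in base R. This is only an illustration: filter_flags(), the flag_a/flag_b columns, and the convention that TRUE marks a record that passed a check are assumptions, not the package's code.

```r
# Hypothetical sketch; assumes TRUE marks a record that passed a check
filter_flags <- function(occ, flags, force_keep = NULL, force_remove = NULL,
                         column_id = "record_id") {
  m <- as.matrix(occ[, flags, drop = FALSE])
  # Keep a record only if no selected flag is FALSE (NA is ignored here)
  keep <- rowSums(!m, na.rm = TRUE) == 0
  ids <- occ[[column_id]]
  keep[ids %in% force_keep] <- TRUE     # manual override: always keep
  keep[ids %in% force_remove] <- FALSE  # manual override: always remove
  occ[keep, , drop = FALSE]
}

occ <- data.frame(record_id = c("r1", "r2", "r3", "r4"),
                  flag_a = c(TRUE, FALSE, TRUE, NA),
                  flag_b = c(TRUE, TRUE, FALSE, TRUE))
kept <- filter_flags(occ, flags = c("flag_a", "flag_b"),
                     force_keep = "r2", force_remove = "r1")
kept$record_id  # "r2" "r4"
```

Here r2 is retained despite a failed flag, r1 is dropped despite passing, and r4 survives because its NA is not treated as a failure.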
Value
A data.frame containing only the valid (kept) records according to the
flags and additional criteria.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Remove all flagged records
occ_valid <- remove_flagged(occ = occ_flagged)
# Remove flagged records and force removal of some unflagged records
to_remove <- c("gbif_5987", "specieslink_2301", "gbif_18761")
occ_valid2 <- remove_flagged(occ = occ_flagged,
force_remove = to_remove)
# Remove flagged records but keep some flagged ones
to_keep <- c("gbif_14501", "gbif_12002", "gbif_5168")
occ_valid3 <- remove_flagged(occ = occ_flagged,
force_keep = to_keep)
Identify and remove invalid coordinates
Description
This function identifies and removes invalid geographic coordinates, including non-numeric values, NA or empty values, and coordinates outside the valid range for Earth (latitude > 90 or < -90, and longitude > 180 or < -180).
Usage
remove_invalid_coordinates(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
return_invalid = TRUE,
save_invalid = FALSE,
output_dir = NULL,
overwrite = FALSE,
output_format = ".gz",
verbose = FALSE
)
Arguments
occ |
(data.frame or data.table) a dataset with occurrence records. |
long |
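(character) column name in occ with the longitude data. Default is "decimalLongitude". |
lat |
(character) column name in occ with the latitude data. Default is "decimalLatitude". |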
return_invalid |
(logical) whether to return a list containing the valid and invalid coordinates. Default is TRUE. |
save_invalid |
(logical) whether to save the invalid (removed) records.
If |
output_dir |
(character) path to an existing directory where records with
invalid coordinates will be saved. Only used when save_invalid = TRUE. Default is NULL. |
overwrite |
(logical) whether to overwrite existing files in
output_dir. Default is FALSE. |
output_format |
(character) output format for saving removed records.
Options are |
verbose |
(logical) whether to print messages about function progress.
Default is FALSE. |
Value
If return_invalid = FALSE, returns the occurrence dataset containing only
valid coordinates.
If return_invalid = TRUE (default), returns a list with two elements:
-
valid – the dataset with valid coordinates.
-
invalid – the records with invalid coordinates that were removed.
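The validity rule stated in the Description can be written as a short base-R check. is_valid_coord() is a hypothetical helper for illustration; the package function also handles saving and reporting.

```r
# Hypothetical helper: TRUE when both coordinates are usable
is_valid_coord <- function(long, lat) {
  # Non-numeric text becomes NA and is then treated as invalid
  long <- suppressWarnings(as.numeric(long))
  lat <- suppressWarnings(as.numeric(lat))
  !is.na(long) & !is.na(lat) & abs(long) <= 180 & abs(lat) <= 90
}

occ <- data.frame(species = "spp",
                  decimalLongitude = c(10, -190, 20, 50, NA),
                  decimalLatitude = c(20, 20, 240, 50, NA))
ok <- is_valid_coord(occ$decimalLongitude, occ$decimalLatitude)
# Split into the two elements described above
split_occ <- list(valid = occ[ok, ], invalid = occ[!ok, ])
nrow(split_occ$valid)    # 2
nrow(split_occ$invalid)  # 3
```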
Examples
# Create fake data example
occ <- data.frame("species" = "spp",
"decimalLongitude" = c(10, -190, 20, 50, NA),
"decimalLatitude" = c(20, 20, 240, 50, NA))
# Split valid and invalid coordinates
occ_valid <- remove_invalid_coordinates(occ)
Submit a request to download occurrence data from GBIF.
Description
Submit a request to download occurrence data from GBIF.
Usage
request_gbif(gbif_info, hasCoordinate = TRUE,
hasGeospatialIssue = FALSE, format = "DWCA",
gbif_user = NULL, gbif_pwd = NULL, gbif_email = NULL,
additional_predicates = NULL)
Arguments
gbif_info |
an object of class 'gbif_info' returned by
prepare_gbif_download(). |
hasCoordinate |
(logical) whether to retrieve only records with coordinates. Default is TRUE. |
hasGeospatialIssue |
(logical) whether to retrieve records identified with geospatial issue. Default is FALSE. |
format |
(character) the download format. Options available are 'DWCA', 'SIMPLE_CSV', or 'SPECIES_LIST'. Default is 'DWCA'. |
gbif_user |
(character) user name within GBIF's website. Default is
NULL, meaning it will try to obtain this information from the R environment.
(check |
gbif_pwd |
(character) user password within GBIF's website. Default is NULL, meaning it will try to obtain this information from the R environment. |
gbif_email |
(character) user email within GBIF's website. Default is NULL, meaning it will try to obtain this information from the R environment. |
additional_predicates |
(character or occ_predicate) additional
supported predicates that can be combined to build more complex download requests. See
|
Details
You can use the object returned by this function to check the download
request progress with rgbif::occ_download_wait().
Value
A download request key returned by the GBIF API, which can be used to monitor or retrieve the download.
Note
This function requires an active internet connection and valid GBIF credentials.
Examples
## Not run:
# Prepare data to request GBIF download
gbif_prepared <- prepare_gbif_download(species = "Araucaria angustifolia")
# Submit a request to download occurrences
gbif_requested <- request_gbif(gbif_info = gbif_prepared)
# Check progress
rgbif::occ_download_wait(gbif_requested)
## End(Not run)
Species Richness and Occurrence Summary Mapping
Description
This function generates spatial grids (rasters) of species richness, record density, or summarized biological traits from occurrence data. It supports custom resolutions, masking, and automatic coordinate reprojection to match reference rasters.
Usage
richness_here(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
records = "record_id",
raster_base = NULL,
res = NULL,
crs = "epsg:4326",
mask = NULL,
summary = "records",
field = NULL,
field_name = NULL,
fun = mean,
verbose = TRUE
)
Arguments
occ |
(data.frame) a dataset containing occurrence records. Must include columns for species names and geographic coordinates. |
species |
(character) the name of the column in occ containing the species names. Default is "species". |
long |
(character) the name of the column in occ containing the longitude values. Default is "decimalLongitude". |
lat |
(character) the name of the column in occ containing the latitude values. Default is "decimalLatitude". |
records |
(character) the name of the column in occ containing the record IDs. Default is "record_id". |
raster_base |
(SpatRaster) an optional reference raster. If provided,
the output will match its resolution, extent, and CRS. Default is NULL. |
res |
(numeric) the desired resolution (in decimal degrees if WGS84)
for the output grid. Only used if raster_base is NULL. Default is NULL. |
crs |
(character) the coordinate reference system of the raster
(see ?terra::crs). Default is "epsg:4326". Only applicable if raster_base is NULL. |
mask |
(SpatRaster or SpatVector) an optional layer to mask the
final output. Default is NULL. |
summary |
(character) the type of summary to calculate.
Either |
field |
(character or named vector) column in |
field_name |
(character) a custom name used to build the legend when
plotting the result with ggrid_here(). Default is NULL. |
fun |
(function) the function to aggregate |
verbose |
(logical) whether to print messages about the progress.
Default is TRUE. |
Value
A SpatRaster object representing the calculated richness,
density, or trait summary.
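The difference between record density and species richness per grid cell can be shown with a small base-R sketch that skips rasters entirely. The data and the cell-indexing scheme are made up for the example; richness_here() builds a SpatRaster and is not implemented this way.

```r
# Hypothetical sketch, not the richness_here() implementation
occ <- data.frame(species = c("A", "A", "B", "C", "C"),
                  decimalLongitude = c(-50.1, -50.3, -50.2, -48.9, -48.7),
                  decimalLatitude = c(-25.1, -25.2, -25.4, -24.1, -24.3))
res <- 0.5  # cell size in decimal degrees
# Index of the grid cell containing each record
cell <- paste(floor(occ$decimalLongitude / res),
              floor(occ$decimalLatitude / res), sep = "_")
# Record density: number of records per cell
n_records <- tapply(occ$species, cell, length)
# Richness: number of distinct species per cell
richness <- tapply(occ$species, cell, function(s) length(unique(s)))
n_records
richness
```

The first cell holds three records from two species, so its density (3) and richness (2) differ, while the second cell holds two records of a single species.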
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Mapping the density of records
r_density <- richness_here(occ_flagged, summary = "records", res = 0.5)
ggrid_here(r_density)
# We can also summarize key features:
# 1. Identifying problematic regions by summing error flags
# We create a variable to store the sum of logical flags (TRUE = 1, FALSE = 0)
total_flags <- occ_flagged$florabr_flag +
occ_flagged$wcvp_flag +
occ_flagged$iucn_flag +
occ_flagged$cultivated_flag +
occ_flagged$inaturalist_flag +
occ_flagged$duplicated_flag
names(total_flags) <- occ_flagged$record_id
# Use summary = "records" with fun = mean to see the average
# accumulation of flags per cell
r_flags <- richness_here(occ_flagged, summary = "records",
field = total_flags,
field_name = "Number of flags",
fun = mean, res = 0.5)
ggrid_here(r_flags)
# 2. Or we can summarize organism traits spatially
# Simulating a trait (e.g., mass) for each unique record
spp <- unique(occ_flagged$record_id)
sim_mass <- setNames(runif(length(spp), 10, 50), spp)
r_trait <- richness_here(occ_flagged, summary = "records",
field = sim_mass, field_name = "Mass",
fun = mean, res = 0.5)
ggrid_here(r_trait)
Store GBIF credentials
Description
This function sets GBIF credentials (username, email and password) as environment variables in the R environment. These credentials are required to retrieve occurrence records from GBIF.
Usage
set_gbif_credentials(
gbif_username,
gbif_email,
gbif_password,
permanently = FALSE,
overwrite = FALSE,
open_Renviron = FALSE,
verbose = TRUE
)
Arguments
gbif_username |
(character) your GBIF username. |
gbif_email |
(character) your GBIF email address. |
gbif_password |
(character) your GBIF password. |
permanently |
(logical) whether to add the GBIF credentials permanently
to the R environment. Default is FALSE. |
overwrite |
(logical) whether to overwrite GBIF credentials if they
already exist. Only applicable if permanently is set to TRUE. Default is FALSE. |
open_Renviron |
(logical) whether to open the .Renviron file after
saving the credentials. Only applicable if permanently is set to TRUE. Default is FALSE. |
verbose |
(logical) if |
Value
If permanently and open_Renviron are set to TRUE, it opens the .Renviron
file. Otherwise, the credentials are saved silently.
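To check whether credentials are visible to the current R session, a small base-R helper such as the one below can be used. The variable names GBIF_USER, GBIF_PWD, and GBIF_EMAIL are the ones rgbif conventionally reads; whether this function stores the credentials under exactly those names is an assumption, and missing_credentials() is a hypothetical helper.

```r
# Assumed variable names: the ones rgbif conventionally reads
gbif_vars <- c("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL")

# Hypothetical helper: names of variables that are unset or empty
missing_credentials <- function(vars = gbif_vars) {
  vals <- Sys.getenv(vars, unset = NA)
  vars[is.na(vals) | vals == ""]
}

# Returns character(0) when all three variables are set
missing_credentials()
```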
Examples
## Not run:
set_gbif_credentials(gbif_username = "my_username",
gbif_email = "my_email@example.com",
gbif_password = "my_password")
## End(Not run)
Store IUCN credential
Description
This function sets the IUCN API key as an environment variable in the R environment. This key is required to obtain distributional data from IUCN.
Usage
set_iucn_credentials(
iucn_key,
permanently = FALSE,
overwrite = FALSE,
open_Renviron = FALSE,
verbose = TRUE
)
Arguments
iucn_key |
(character) your IUCN API key. See Details. |
permanently |
(logical) whether to add the IUCN API key
permanently to the R environment. Default is FALSE. |
overwrite |
(logical) whether to overwrite IUCN credential if it
already exists. Only applicable if permanently = TRUE. Default is FALSE. |
open_Renviron |
(logical) whether to open the .Renviron file after
saving the credential. Only applicable if permanently = TRUE. Default is FALSE. |
verbose |
(logical) if |
Details
To check your API key, visit: https://api.iucnredlist.org/users/edit.
Value
If permanently and open_Renviron are set to TRUE, it opens the .Renviron
file. Otherwise, the credentials are saved silently.
Examples
## Not run:
set_iucn_credentials(iucn_key = "my_key")
## End(Not run)
Store SpeciesLink credential
Description
This function sets the SpeciesLink API key as an environment variable in the R environment. This API key is required to retrieve occurrence records from SpeciesLink.
Usage
set_specieslink_credentials(
specieslink_key,
permanently = FALSE,
overwrite = FALSE,
open_Renviron = FALSE,
verbose = TRUE
)
Arguments
specieslink_key |
(character) your SpeciesLink API key. |
permanently |
(logical) whether to add the SpeciesLink API key
permanently to the R environment. Default is FALSE. |
overwrite |
(logical) whether to overwrite SpeciesLink credential if it
already exists. Only applicable if permanently = TRUE. Default is FALSE. |
open_Renviron |
(logical) whether to open the .Renviron file after
saving the credential. Only applicable if permanently = TRUE. Default is FALSE. |
verbose |
(logical) if |
Details
To check your API key, visit: https://specieslink.net/aut/profile/apikeys.
Value
If permanently and open_Renviron are set to TRUE, it opens the .Renviron
file. Otherwise, the credentials are saved silently.
Examples
## Not run:
set_specieslink_credentials(specieslink_key = "my_key")
## End(Not run)
Kernel Density Estimation (Heatmap) for occurrence data
Description
This function creates density heatmaps using kernel density estimation. The algorithm is inspired by the SpatialKDE R package and the "Heatmap" tool from QGIS. Each occurrence contributes to the density surface within a circular neighborhood defined by a specified radius.
Usage
spatial_kde(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
radius = 0.2,
resolution = NULL,
buffer_extent = 500,
crs = "epsg:4326",
raster_ref = NULL,
kernel = "quartic",
scaled = TRUE,
decay = 1,
mask = NULL,
zero_as_NA = FALSE,
weights = NULL
)
Arguments
occ |
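(data.frame, data.table, or SpatVector) a data frame or SpatVector containing the occurrences. Must contain columns with longitude and latitude. |
long |
(character) the name of the column in occ containing the longitude values. Default is "decimalLongitude". |
lat |
(character) the name of the column in occ containing the latitude values. Default is "decimalLatitude". |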
radius |
(numeric) a positive numeric value specifying the smoothing radius for the kernel density estimate. This parameter determines the circular neighborhood around each point where that point will have an influence. See details. Default is 0.2. |
resolution |
(numeric) a positive numeric value specifying the resolution
(in degrees or meters, depending on the |
buffer_extent |
(numeric) width of the buffer (in kilometers) to draw around the occurrences to define the area for computing the heatmap. Default is 500. |
crs |
(character) the coordinate reference system of the raster heatmap
(see ?terra::crs). Default is "epsg:4326". Only applicable if raster_ref is NULL. |
raster_ref |
(SpatRaster) an optional raster to use as reference for resolution, CRS, and extent. Default is NULL. |
kernel |
(character) type of kernel to use. Available kernels are "uniform", "quartic", "triweight", "epanechnikov", or "triangular". Default is "quartic". |
scaled |
(logical) whether to scale output values to vary between 0 and
1. Default is TRUE. |
decay |
(numeric) decay parameter for "triangular" kernel. Only
applicable if kernel = "triangular". Default is 1. |
mask |
(SpatRaster or SpatExtent) optional spatial object to define the
extent of the area for the heatmap. Default is NULL, in which case the
extent is derived from occ and buffer_extent. |
zero_as_NA |
(logical) whether to convert regions with value 0 to NA. Default is FALSE. |
weights |
(numeric) optional vector of weights for individual points.
Must be the same length as the number of occurrences in occ. Default is NULL. |
Details
The radius parameter controls how far the influence of each observation
extends. Smaller values produce fine-grained peaks; larger values produce
smoother, more spread-out heatmaps. Units depend on the CRS: degrees for
geographic coordinates (default), meters for projected coordinates.
If raster_ref is not provided, the extent is calculated from the convex
hull of occ plus buffer_extent.
Kernels define the weight decay of points:
"uniform" = constant, "quartic"/"triweight"/"epanechnikov" = smooth, and
"triangular" = linear decay (using decay parameter).
Value
A SpatRaster containing the kernel density values.
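The role of radius and of the kernel shape described in Details can be illustrated for a single location with the textbook quartic form, (1 - u^2)^2 with u = distance / radius. This is a sketch under that assumption: normalizing constants are omitted, quartic_weight() is a hypothetical helper, and the package may scale values differently.

```r
# Hypothetical helper: unnormalized quartic kernel weight
quartic_weight <- function(d, radius) {
  u <- d / radius
  # Points at or beyond the radius contribute nothing
  ifelse(u < 1, (1 - u^2)^2, 0)
}

# Three points and one target location (planar distances for simplicity)
pts <- cbind(x = c(0, 0.1, 1), y = c(0, 0, 1))
target <- c(0, 0)
d <- sqrt((pts[, "x"] - target[1])^2 + (pts[, "y"] - target[2])^2)
# Density at the target is the sum of the individual contributions
density_at_target <- sum(quartic_weight(d, radius = 0.2))
density_at_target  # 1.5625
```

The point at the target contributes 1, the point at half the radius contributes 0.5625, and the distant point contributes 0, which is why a larger radius yields a smoother surface.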
References
Hart, T., & Zandbergen, P. (2014). Kernel density estimation and hotspot mapping: Examining the influence of interpolation method, grid cell size, and radius on crime forecasting. Policing: An International Journal of Police Strategies & Management, 37(2), 305-323.
Nelson, T. A., & Boots, B. (2008). Detecting spatial hot spots in landscape ecology. Ecography, 31(5), 556-566.
Chainey, S., Tompson, L., & Uhlig, S. (2008). The utility of hotspot mapping for predicting spatial patterns of crime. Security journal, 21(1), 4-28.
Caha J (2023). SpatialKDE: Kernel Density Estimation for Spatial Data. https://jancaha.github.io/SpatialKDE/index.html.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Remove flagged records
occ <- remove_flagged(occ_flagged)
# Generate heatmap
heatmap <- spatial_kde(occ = occ, resolution = 0.25, buffer_extent = 50,
radius = 2)
# Plot heatmap with terra
terra::plot(heatmap)
# Plot heatmap with ggplot
ggmap_here(occ = occ, heatmap = heatmap)
Spatialize occurrence records
Description
Convert a data.frame (or data.table) of occurrence records into a SpatVector object.
Usage
spatialize(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
crs = "epsg:4326",
force_numeric = TRUE
)
Arguments
occ |
(data.frame or data.table) a data frame containing the occurrence records to be spatialized. Must contain columns for longitude and latitude. |
long |
(character) the name of the column in occ containing the longitude values. Default is "decimalLongitude". |
lat |
(character) the name of the column in occ containing the latitude values. Default is "decimalLatitude". |
crs |
(character) the coordinate reference system (see ?terra::crs). Default is "epsg:4326". |
force_numeric |
(logical) whether to coerce the longitude and latitude
columns to numeric if they are not already. Default is TRUE. |
Value
A SpatVector object containing the spatialized occurrence records.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Spatialize the occurrence records
pts <- spatialize(occurrences)
# Plot the resulting SpatVector
terra::plot(pts)
Standardize country names
Description
This function standardizes country names using both names and codes present in a specified column.
Usage
standardize_countries(
occ,
country_column = "country",
max_distance = 0.1,
user_dictionary = NULL,
lookup_na_country = FALSE,
long = "decimalLongitude",
lat = "decimalLatitude",
return_dictionary = TRUE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using format_columns(). |
country_column |
(character) the column name containing the country information. |
max_distance |
(numeric) maximum allowed distance (as a fraction) when
searching for suggestions for misspelled country names. Can be any value
between 0 and 1. Higher values return more suggestions. See |
user_dictionary |
(data.frame) optional data.frame with two columns:
'country_name' and 'country_suggested'. If provided, this dictionary will be
combined with the package’s default country dictionary
( |
lookup_na_country |
(logical) whether to extract the country from coordinates when the country column has missing values. If TRUE, longitude and latitude columns must be provided. Default is FALSE. |
long |
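(character) column name with longitude. Only applicable if
lookup_na_country = TRUE. Default is "decimalLongitude". |
lat |
(character) column name with latitude. Only applicable if
lookup_na_country = TRUE. Default is "decimalLatitude". |
return_dictionary |
(logical) whether to return the dictionary of countries that were (fuzzy) matched. Default is TRUE. |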
Details
Country names are first standardized by exact matching against a list of
country names in several languages from rnaturalearthdata::map_units110.
Any unmatched names are then processed using a fuzzy matching algorithm to
find potential candidates for misspelled country names. If unmatched names
remain and lookup_na_country = TRUE, the country is extracted from
coordinates using a map retrieved from rnaturalearthdata::map_units110.
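The two-stage matching described above (exact match first, then fuzzy suggestions bounded by max_distance) can be sketched in base R. suggest_country() and its three-name dictionary are hypothetical; the fraction used here (edit distance divided by the longer name's length) is an assumption and may not match how the package measures distance.

```r
# Hypothetical helper: exact match first, then closest fuzzy match
suggest_country <- function(x, dictionary, max_distance = 0.1) {
  vapply(x, function(name) {
    if (is.na(name)) return(NA_character_)
    # Stage 1: case-insensitive exact match
    hit <- match(tolower(name), tolower(dictionary))
    if (!is.na(hit)) return(dictionary[hit])
    # Stage 2: edit distance expressed as a fraction of name length
    d <- adist(name, dictionary, ignore.case = TRUE)[1, ]
    frac <- d / pmax(nchar(name), nchar(dictionary))
    best <- which.min(frac)
    if (frac[best] <= max_distance) dictionary[best] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

dict <- c("Brazil", "Argentina", "Paraguay")
suggest_country(c("brazil", "Argentinna", "Xyz"), dict, max_distance = 0.2)
# "Brazil" "Argentina" NA
```

Raising max_distance accepts more distant candidates, which recovers more misspellings at the cost of more false matches.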
Value
A list with two elements:
data |
The original |
dictionary |
If |
Examples
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") #Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
# Import and standardize SpeciesLink
data("occ_splink", package = "RuHere") #Import data example
splink_standardized <- format_columns(occ_splink, metadata = "specieslink")
# Import and standardize BIEN
data("occ_bien", package = "RuHere") #Import data example
bien_standardized <- format_columns(occ_bien, metadata = "bien")
# Import and standardize idigbio
data("occ_idig", package = "RuHere") #Import data example
idig_standardized <- format_columns(occ_idig, metadata = "idigbio")
# Merge all
all_occ <- bind_here(gbif_standardized, splink_standardized,
bien_standardized, idig_standardized)
# Standardize countries
occ_standardized <- standardize_countries(occ = all_occ)
Standardize state names
Description
This function standardizes state names using both names and codes present in a specified column.
Usage
standardize_states(
occ,
state_column = "stateProvince",
country_column = "country_suggested",
max_distance = 0.1,
lookup_na_state = FALSE,
long = "decimalLongitude",
lat = "decimalLatitude",
return_dictionary = TRUE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
state_column |
(character) the column name containing the state information. |
country_column |
(character) the column name containing the country information. |
max_distance |
(numeric) maximum allowed distance (as a fraction) when
searching for suggestions for misspelled state names. Can be any value
between 0 and 1. Higher values return more suggestions. See |
lookup_na_state |
(logical) whether to extract the state from coordinates when the state column has missing values. If TRUE, longitude and latitude columns must be provided. Default is FALSE. |
long |
(character) column name with longitude. Only applicable if
|
lat |
(character) column name with latitude. Only applicable if
|
return_dictionary |
(logical) whether to return the dictionary of states that were (fuzzy) matched. |
Details
State names are first standardized by exact matching against a list of
state names in several languages from rnaturalearthdata::states50.
Any unmatched names are then processed using a fuzzy matching algorithm to
find potential candidates for misspelled state names. If unmatched names
remain and lookup_na_state = TRUE, the state is extracted from
coordinates using a map retrieved from rnaturalearthdata::states50.
Value
A list with two elements:
data |
The original |
dictionary |
If |
Examples
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") #Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
# Import and standardize SpeciesLink
data("occ_splink", package = "RuHere") #Import data example
splink_standardized <- format_columns(occ_splink, metadata = "specieslink")
# Import and standardize BIEN
data("occ_bien", package = "RuHere") #Import data example
bien_standardized <- format_columns(occ_bien, metadata = "bien")
# Import and standardize idigbio
data("occ_idig", package = "RuHere") #Import data example
idig_standardized <- format_columns(occ_idig, metadata = "idigbio")
# Merge all
all_occ <- bind_here(gbif_standardized, splink_standardized,
bien_standardized, idig_standardized)
# Standardize countries
occ_standardized <- standardize_countries(occ = all_occ)
# Standardize states
occ_standardized2 <- standardize_states(occ = occ_standardized$occ)
Administrative Units (States, Provinces, and Regions)
Description
A simplified PackedSpatVector containing state-level polygons (e.g.,
provinces, departments, regions) for countries worldwide. Names and parent
countries (geonunit) were cleaned (lowercase, accents removed).
Usage
states
Format
A PackedSpatVector object with polygons of administrative divisions
and one attribute:
- name
State/province/region name.
Details
The dataset was generated from rnaturalearth::ne_states(). The following
processing steps were applied:
kept only administrative types: "Province", "State", "Department", "Region", "Federal District";
selected only "name" and "geonunit" columns;
both fields were cleaned via tolower() and remove_accent();
records where state name = country name were removed;
geometries were simplified using terra::simplifyGeom(tolerance = 0.05);
wrapped with terra::wrap() for internal storage.
Source
Natural Earth data, via rnaturalearth.
Examples
data(states)
states <- terra::unwrap(states)
terra::plot(states)
States dictionary for standardizing state and province names and codes
Description
Provides lookup tables used to standardize subnational administrative units (states and provinces) in occurrence datasets.
Generated from rnaturalearth::ne_states(), it includes a wide range of
name variants (in multiple languages, transliterations, and common
abbreviations), as well as postal codes for each unit.
This dictionary allows consistent mapping of user-provided names such as
"são paulo", "sao paulo", "SP", "illinois", "ill.", "bayern",
"bavaria" to a single standardized state or province name.
Usage
states_dictionary
Format
A named list with two data frames:
- states_name
A data frame with columns:
- state_name
Character. Name variants of states or provinces from ne_states(), lowercased and accent-stripped.
- state_suggested
Character. Standardized state/province name, also lowercased and accent-stripped.
- country
Character. Country associated with the state/province, lowercased and accent-stripped.
- states_code
A data frame with columns:
- state_code
Character. Postal codes from ne_states(), cleaned and converted to uppercase.
- state_suggested
Character. Standardized state/province name corresponding to the code.
- country
Character. Country associated with the code.
Details
The dictionary is constructed by:
selecting administrative units of type "State" or "Province";
extracting multiple name fields, including alternative names and multilingual fields;
normalizing names to lowercase and removing accents;
normalizing codes to uppercase;
removing duplicates and ambiguous entries;
removing rows with missing names or codes.
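To show how a dictionary of this shape can be used, here is a minimal sketch with two hypothetical miniature tables that mimic states_name and states_code. The helper names normalize() and lookup_state(), the table contents, and the use of stringi::stri_trans_general() as a stand-in for the package's remove_accent() are all assumptions made for illustration.

```r
library(stringi)

# Hypothetical miniature lookup tables (the real ones are in states_dictionary)
states_name <- data.frame(
  state_name = c("sao paulo", "bavaria", "bayern"),
  state_suggested = c("sao paulo", "bayern", "bayern"),
  stringsAsFactors = FALSE
)
states_code <- data.frame(
  state_code = c("SP", "BY"),
  state_suggested = c("sao paulo", "bayern"),
  stringsAsFactors = FALSE
)

# Lowercase and strip accents (stand-in for the package's own cleaning)
normalize <- function(x) tolower(stri_trans_general(trimws(x), "Latin-ASCII"))

# Try the name table first, then fall back to the code table
lookup_state <- function(x) {
  by_name <- states_name$state_suggested[match(normalize(x), states_name$state_name)]
  by_code <- states_code$state_suggested[match(toupper(trimws(x)), states_code$state_code)]
  ifelse(is.na(by_name), by_code, by_name)
}

result <- lookup_state(c("S\u00e3o Paulo", "SP", "Bavaria", "unknown"))
result
```

Normalizing names to lowercase without accents and codes to uppercase before matching means that each variant needs to be stored only once in the dictionary.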
Examples
data(states_dictionary)
head(states_dictionary$states_name)
head(states_dictionary$states_code)
Extract state from coordinates
Description
Extracts the state for each occurrence record based on coordinates.
Usage
states_from_coords(
occ,
long = "decimalLongitude",
lat = "decimalLatitude",
from = "all",
state_column = "stateProvince",
output_column = "state_xy",
append_source = FALSE
)
Arguments
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
from |
(character) whether to extract the state for all records ('all') or only for records missing state information ('na_only'). If 'na_only', you must provide the name of the column with state information. Default is 'all'. |
state_column |
(character) the column name containing the state. Only
applicable if |
output_column |
(character) column name created in |
append_source |
(logical) whether to create a new column in |
Details
The states are extracted from coordinates using a map retrieved from
rnaturalearthdata::states50.
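The general idea of assigning a state from coordinates is a point-in-polygon lookup. The sketch below uses terra with two hypothetical square polygons instead of the Natural Earth map; the polygon names and coordinates are invented for illustration, and the package may perform the lookup differently.

```r
library(terra)

# Two hypothetical square "states" (illustrative polygons, not Natural Earth)
p1 <- as.polygons(ext(0, 10, 0, 10), crs = "EPSG:4326")
p2 <- as.polygons(ext(10, 20, 0, 10), crs = "EPSG:4326")
polys <- rbind(p1, p2)
polys$name <- c("state a", "state b")

# Hypothetical occurrence records; the third falls outside both polygons
occ <- data.frame(decimalLongitude = c(5, 15, 40),
                  decimalLatitude = c(5, 5, 5))
pts <- vect(occ, geom = c("decimalLongitude", "decimalLatitude"),
            crs = "EPSG:4326")

# Logical matrix: rows are points, columns are polygons
inside <- relate(pts, polys, "intersects")

# Index of the first polygon containing each point (NA if none)
idx <- apply(inside, 1, function(r) which(r)[1])
occ$state_xy <- polys$name[idx]
occ
```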
Value
The original occ data.frame with an additional column containing the
states extracted from coordinates.
Examples
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") #Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
gbif_states <- states_from_coords(occ = gbif_standardized)
Summarize flags
Description
This function returns a data frame and a bar plot summarizing the number of records flagged by each flagging function.
Usage
summarize_flags(
occ = NULL,
flagged_dir = NULL,
output_format = ".gz",
flags = "all",
additional_flags = NULL,
names_additional_flags = NULL,
plot = TRUE,
show_unflagged = TRUE,
occ_unflagged = NULL,
fill = "#0072B2",
sort = TRUE,
decreasing = TRUE,
add_n = TRUE,
size_n = 3.5,
theme_plot = ggplot2::theme_minimal(),
...
)
Arguments
occ |
(data.frame or data.table) a dataset containing occurrence records that has been processed by one or more flagging functions. See Details for available flag types. |
flagged_dir |
(character) optional path to a directory containing files
with flagged records saved using the |
output_format |
(character) output format used to read the removed records.
Options are |
flags |
(character) the flags to be summarized. Use |
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is |
names_additional_flags |
(character) an optional different name to the
flag provided in |
plot |
(logical) whether to return a |
show_unflagged |
(logical) whether to include the number of unflagged
records in the plot. Default is |
occ_unflagged |
(data.frame or data.table) an optional dataset
containing unflagged occurrence records. Only applicable if |
fill |
(character) fill color for the bar plot. Default is |
sort |
(logical) whether to sort bars according to the number of records.
Default is |
decreasing |
(logical) whether to sort bars in decreasing order (flags
with more records appear at the top of the plot). Default is |
add_n |
(logical) whether to display the number of flagged records on
the bars. Default is |
size_n |
(numeric) size of the text showing the number of records. Only
used when |
theme_plot |
(theme) a |
... |
additional arguments passed to |
Details
This function expects an occurrence dataset that has already been processed
by one or more flagging routines from RuHere or related packages such as
CoordinateCleaner. Any logical column in occ can be used as a flag.
The following built-in flag names are recognized:
From RuHere:
correct_country, correct_state, cultivated, florabr, faunabr,
wcvp, iucn, bien, duplicated, thin_geo, thin_env, consensus
From CoordinateCleaner:
.val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf,
.inst, .aohi
Users may also supply additional logical columns using
additional_flags, optionally providing alternative display names
(names_additional_flags) and colors (col_additional_flags).
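As a rough illustration of what such a summary contains, the counts can be reproduced in base R for any set of logical flag columns. The sketch assumes the convention used by the flagging functions documented here (TRUE = record passed, FALSE = record flagged); the example data are invented, and this is not the function's actual code.

```r
# Hypothetical flagged dataset: TRUE = passed, FALSE = flagged (assumed convention)
occ <- data.frame(
  correct_country = c(TRUE, TRUE, FALSE, TRUE),
  cultivated      = c(TRUE, FALSE, FALSE, TRUE),
  .sea            = c(TRUE, TRUE, TRUE, TRUE)
)

# Any logical column can act as a flag
flag_cols <- names(occ)[vapply(occ, is.logical, logical(1))]

# Number of flagged (FALSE) records per flag
n_flagged <- vapply(occ[flag_cols], function(x) sum(!x, na.rm = TRUE), integer(1))
df_summary <- data.frame(flag = names(n_flagged), n = unname(n_flagged))
df_summary <- df_summary[order(df_summary$n, decreasing = TRUE), ]

# Records that pass every flag (what show_unflagged would count)
n_unflagged <- sum(Reduce(`&`, occ[flag_cols]))

df_summary
n_unflagged
```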
Value
If plot = TRUE, a list with two elements:
- df_summary
A data frame summarizing the number of records per flag.
- plot_summary
A ggplot2 object showing the summary as a bar plot.
If plot = FALSE, only the summary data frame is returned.
Examples
# Load example data
data("occ_flagged", package = "RuHere")
# Summarize flags
sum_flags <- summarize_flags(occ = occ_flagged)
# Plot
sum_flags$plot_summary
Flag records that are close to each other in the environmental space
Description
Flags occurrence records for thinning by keeping only one record per species within the same environmental block/bin.
Usage
thin_env(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
env_layers,
n_bins = 5,
prioritary_column = NULL,
decreasing = TRUE,
flag_for_NA = FALSE
)
Arguments
occ |
(data.frame or data.table) a data frame containing the occurrence records. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
env_layers |
(SpatRaster) object containing environmental variables. |
n_bins |
(numeric) number of bins into which each environmental variable will be divided. |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
flag_for_NA |
(logical) whether to treat records falling in |
Details
This function uses get_env_bins() to create a multidimensional grid in
environmental space by splitting each environmental variable into n_bins
equally sized intervals. Records falling into the same environmental bin are
considered redundant; only one is kept (based on retention priority when
provided), and the remaining records are flagged.
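The binning idea can be sketched in base R without any raster data. The example assumes that environmental values have already been extracted at each record and uses cut() for equal-width intervals; the data and the helper bin_one() are hypothetical, and get_env_bins() may differ in its details.

```r
# Hypothetical environmental values already extracted at each record
env <- data.frame(temp = c(10, 11, 25, 26, 12),
                  prec = c(100, 120, 900, 880, 850))
n_bins <- 2

# Assign each value to one of n equal-width intervals of its variable
bin_one <- function(v, n) as.integer(cut(v, breaks = n, include.lowest = TRUE))
bins <- vapply(env, bin_one, integer(nrow(env)), n = n_bins)

# Combine per-variable bins into one multidimensional bin ID per record
bin_id <- apply(bins, 1, paste, collapse = "_")

# Keep the first record in each bin (TRUE); flag the rest as redundant (FALSE)
thin_flag <- !duplicated(bin_id)

data.frame(env, bin = bin_id, thin_env_flag = thin_flag)
```

To honor a retention priority, the records could be sorted by the priority column before calling duplicated(), so that the preferred record comes first within each bin.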
Value
The original occ data frame with two additional columns:
- thin_env_flag: logical indicating whether each record is retained (TRUE) or flagged as redundant (FALSE).
- bin: environmental bin ID assigned to each record. Each component of the ID corresponds to the bin of one environmental variable.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Flag records that are close to each other in the environmental space
occ_env_thin <- thin_env(occ = occ, env_layers = r)
# Number of flagged (redundant) records
sum(!occ_env_thin$thin_env_flag)
Flag records that are close to each other in the geographic space
Description
Marks occurrence records for thinning by keeping only one record per species within a radius of 'd' kilometers.
Usage
thin_geo(
occ,
species = "species",
long = "decimalLongitude",
lat = "decimalLatitude",
d,
prioritary_column = NULL,
decreasing = TRUE,
remove_invalid = TRUE,
optimize_memory = FALSE,
verbose = TRUE
)
Arguments
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
d |
(numeric) thinning distance in kilometers (e.g., 10 for 10km). |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
remove_invalid |
(logical) whether to remove invalid coordinates.
Default is |
optimize_memory |
(logical) whether to compute the distance matrix using a C++ implementation that reduces memory usage at the cost of increased computation time. Recommended for large datasets (> 10,000 records). Default is FALSE. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
Details
This function is similar to the thin() function from the spThin package,
but with an important difference: it allows specifying a priority order for
retaining records.
When a thinning distance is provided (e.g., 10 km), the function identifies
clusters of records within this distance. Within each cluster, it keeps the
record with the highest priority according to the column defined in
prioritary_column (for example, keeping the most recent record if
prioritary_column = "year"), and flags the remaining nearby records for
removal.
If prioritary_column is NULL, the priority follows the original order of
rows in the input occ data.frame.
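Priority-based thinning can be illustrated with a simple greedy pass: visit records in priority order and keep a record only if no already-kept record lies within d kilometers. This is a simplified variant written for illustration, not necessarily the clustering that thin_geo() performs; the helpers haversine_km() and greedy_thin() and the example data are hypothetical.

```r
# Great-circle distance in km (haversine formula, spherical Earth, radius 6371 km)
haversine_km <- function(lon1, lat1, lon2, lat2) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Greedy thinning: TRUE = retained, FALSE = flagged as too close
greedy_thin <- function(lon, lat, d, priority = NULL, decreasing = TRUE) {
  n <- length(lon)
  # Without a priority, follow the original row order
  ord <- if (is.null(priority)) seq_len(n) else order(priority, decreasing = decreasing)
  keep <- logical(n)
  for (i in ord) {
    kept <- which(keep)
    too_close <- length(kept) > 0 &&
      any(haversine_km(lon[i], lat[i], lon[kept], lat[kept]) < d)
    keep[i] <- !too_close
  }
  keep
}

# Hypothetical records: the first two are about 1 km apart
occ <- data.frame(
  decimalLongitude = c(-50.00, -50.01, -49.00),
  decimalLatitude  = c(-25.00, -25.00, -25.00),
  year             = c(1990, 2020, 2005)
)

flag_default <- greedy_thin(occ$decimalLongitude, occ$decimalLatitude, d = 10)
flag_recent  <- greedy_thin(occ$decimalLongitude, occ$decimalLatitude, d = 10,
                            priority = occ$year)
```

With the default row order the 1990 record is kept and its 2020 neighbor is flagged; with year as the priority the outcome is reversed, which is the behavior the prioritary_column argument is meant to control.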
Value
The original occ data frame augmented with a new logical column named
thin_geo_flag. Records that are retained after thinning receive
TRUE, while records identified as too close to a higher-priority
record receive FALSE.
Examples
# Load example data
data("occurrences", package = "RuHere")
# Subset occurrences for Araucaria angustifolia
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Thin records using a 10 km distance threshold
occ_thin <- thin_geo(occ = occ, d = 10)
sum(!occ_thin$thin_geo_flag) # Number of records flagged for removal
# Prioritizing more recent records within each cluster
occ_thin_recent <- thin_geo(occ = occ, d = 10, prioritary_column = "year")
sum(!occ_thin_recent$thin_geo_flag) # Number of records flagged for removal
Download distribution data from the World Checklist of Vascular Plants (WCVP)
Description
This function downloads the World Checklist of Vascular Plants database,
which is required for filtering occurrence records using specialists'
information via the flag_wcvp() function.
Usage
wcvp_here(
data_dir,
overwrite = TRUE,
remove_files = TRUE,
timeout = 300,
verbose = TRUE
)
Arguments
data_dir |
(character) a directory to save the data downloaded from WCVP. |
overwrite |
(logical) If TRUE, data is overwritten. Default is TRUE. |
remove_files |
(logical) whether to remove the downloaded files used in building the final dataset. Default is TRUE. |
timeout |
(numeric) maximum time (in seconds) allowed for downloading. Default is 300. Slower internet connections may require higher values. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
Value
A message indicating that the data were successfully saved in the directory
specified by data_dir.
Examples
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download the WCVP database
wcvp_here(data_dir = data_dir)
World Countries
Description
A "PackedSpatVector" containing country polygons from Natural Earth,
processed and cleaned for use within the package. Country names were
converted to lowercase and had accents removed.
Usage
world
Format
A PackedSpatVector object with country polygons and one attribute:
- name
Country name.
Details
The dataset is sourced from rnaturalearthdata::map_units110, then:
converted to a SpatVector using terra,
attribute "name" cleaned (tolower(), remove_accent()),
wrapped using terra::wrap() for robust internal storage.
Source
Natural Earth data, via rnaturalearthdata.
Examples
data(world)
world <- terra::unwrap(world)
terra::plot(world)
Bioclimatic Variables from WorldClim (bio_1, bio_7, bio_12)
Description
A PackedSpatRaster containing three bioclimatic variables from
WorldClim, cropped to a region of interest in South America.
Usage
worldclim
Format
A SpatRaster with 3 layers and the following characteristics:
- Dimensions
151 rows × 183 columns
- Resolution
0.08333333° × 0.08333333°
- Extent
xmin = -57.08333, xmax = -41.83333, ymin = -32.08333, ymax = -19.5
- CRS
WGS84 (EPSG:4326)
- Layers
- bio_1
Mean Annual Temperature (°C × 10)
- bio_7
Temperature Annual Range (°C × 10)
- bio_12
Annual Precipitation (mm)
Details
This raster corresponds to three standard bioclimatic variables from the WorldClim 2.1 dataset.
Source
Examples
data(worldclim)
bioclim <- terra::unwrap(worldclim)
terra::plot(bioclim)