6. tidyped Class Structure and Extension Notes
This document describes the structural contract of the
tidyped class in visPedigree 1.8.0. It is intended for
maintenance and extension work.
1. Class identity
tidyped is an S3 class layered on top of
data.table.
Expected class vector:
c("tidyped", "data.table", "data.frame")
The class is created through new_tidyped() (internal
constructor) and checked with is_tidyped().
2. Core design goals
tidyped is designed to be:
- safe for C++: integer pedigree indices (IndNum, SireNum, DamNum) are
always aligned with row order, so C++ routines can index directly
without translation;
- fast for large pedigrees: the fast path skips redundant validation
when the input is already a tidyped;
- compatible with data.table: in-place modification via := and set()
preserves class and metadata without copying;
- explicit about structural degradation: row subsets that break
pedigree completeness are downgraded to plain data.table with a
warning.
3. The head invariant: IndNum == row index
The single most important structural rule in visPedigree:
IndNum[i] must equal i for every
row.
This means SireNum and DamNum are direct
row pointers: the sire of individual i lives at row
SireNum[i], and 0L encodes a missing
parent.
Every C++ function in visPedigree — inbreeding coefficients,
relationship matrices, BFS tracing, topological sorting — relies on this
invariant. If it breaks, C++ will read wrong parents.
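To make the row-pointer idea concrete, here is a minimal base-R sketch on a toy four-animal pedigree. The data frame and the use of match() to derive the integer columns are illustrative assumptions, not the package's internal code.

```r
# Toy pedigree (hypothetical data); founders have NA parents.
ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)

# Derive integer indices so that IndNum equals the row number
# and 0L encodes a missing parent.
ped$IndNum  <- seq_len(nrow(ped))
ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
ped$DamNum  <- match(ped$Dam,  ped$Ind, nomatch = 0L)

# Head invariant: IndNum[i] == i for every row.
stopifnot(identical(ped$IndNum, seq_len(nrow(ped))))

# Because SireNum is a row pointer, the sire of row i is read directly,
# with no ID lookup.
i <- 4L
sire_of_d <- if (ped$SireNum[i] > 0L) ped$Ind[ped$SireNum[i]] else NA_character_
print(sire_of_d)
```

If the rows were reordered without rebuilding the integer columns, the same lookup would silently return a different animal, which is the failure mode described above.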
This invariant is enforced at three levels:
- tidyped(): builds indices from scratch during construction.
- [.tidyped: rebuilds indices in-place after valid row subsets.
- ensure_tidyped() / ensure_complete_tidyped(): detect and repair stale
indices when the class was accidentally dropped.
4. Column contract
4.1 Minimal structural columns
These four columns define a valid tidyped:
| Column | Type | Description |
|--------|------|-------------|
| Ind | character | Unique individual ID |
| Sire | character | Sire ID, NA for unknown |
| Dam | character | Dam ID, NA for unknown |
| Sex | character | "male", "female", or "unknown" |
Checked by validate_tidyped().
4.2 Integer pedigree columns
| Column | Type | Description |
|--------|------|-------------|
| IndNum | integer | Row index (== row number, see §3) |
| SireNum | integer | Row index of sire, 0L for missing |
| DamNum | integer | Row index of dam, 0L for missing |
These exist whenever tidyped() is called with
addnum = TRUE (default). They are the interface between R
and C++.
4.3 Other common columns
| Column | Description |
|--------|-------------|
| Gen | Generation number |
| Family | Family group code |
| FamilySize | Number of offspring in the family |
| Cand | TRUE for candidate individuals |
| f | Inbreeding coefficient (added by inbreed()) |
4.4 Column naming convention
All data columns use PascalCase (Ind,
SireNum, MeanF, ECG), matching
the core column style. Note that the inbreeding column f (§4.3) is
lowercase.
6. Structural invariants
The following invariants must hold for a valid
tidyped:
- IndNum == row index (see §3).
- Ind is unique: no duplicate individual IDs.
- Completeness: every non-NA Sire and Dam appears in Ind.
- Acyclicity: no individual is its own ancestor.
- SireNum / DamNum consistency: 0L for missing parents, valid row
indices otherwise.
- ped_meta is the sole metadata container: no scattered attributes.
The first five invariants are established by tidyped() and guarded
by [.tidyped. The sixth is a development convention.
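The structural invariants (all but the ped_meta convention) can be expressed as plain checks. The following base-R sketch is an illustration only: add_nums_sketch() and check_invariants_sketch() are hypothetical names, and the acyclicity test uses a simple peeling loop rather than whatever the package does internally.

```r
# Hypothetical helper: derive integer columns from the ID columns.
add_nums_sketch <- function(ped) {
  ped$IndNum  <- seq_len(nrow(ped))
  ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
  ped$DamNum  <- match(ped$Dam,  ped$Ind, nomatch = 0L)
  ped
}

# Hypothetical checker; returns one named logical per invariant.
check_invariants_sketch <- function(ped) {
  n <- nrow(ped)
  parents <- c(ped$Sire, ped$Dam)
  parents <- parents[!is.na(parents)]

  # Acyclicity by peeling: a row is "placed" once both parents are
  # unknown or already placed. Rows on a cycle are never placed.
  placed <- logical(n)
  repeat {
    sire_ok <- ped$SireNum == 0L | placed[pmax(ped$SireNum, 1L)]
    dam_ok  <- ped$DamNum  == 0L | placed[pmax(ped$DamNum, 1L)]
    ready <- !placed & sire_ok & dam_ok
    ready[is.na(ready)] <- FALSE  # out-of-range pointers never become ready
    if (!any(ready)) break
    placed[ready] <- TRUE
  }

  c(
    indnum_is_row = identical(ped$IndNum, seq_len(n)),
    ind_unique    = !anyDuplicated(ped$Ind),
    complete      = all(parents %in% ped$Ind),
    acyclic       = all(placed),
    nums_match    =
      identical(ped$SireNum, match(ped$Sire, ped$Dam[0], nomatch = 0L)[0]) ||
      (identical(ped$SireNum, match(ped$Sire, ped$Ind, nomatch = 0L)) &&
       identical(ped$DamNum,  match(ped$Dam,  ped$Ind, nomatch = 0L)))
  )
}

good <- add_nums_sketch(data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
))

# A two-animal loop: X is recorded as the sire of Y and Y as the sire of X.
loop <- add_nums_sketch(data.frame(
  Ind  = c("X", "Y"),
  Sire = c("Y", "X"),
  Dam  = NA_character_,
  stringsAsFactors = FALSE
))

res_good <- check_invariants_sketch(good)
res_loop <- check_invariants_sketch(loop)
```

In this sketch the loop pedigree passes every check except acyclic, which shows why acyclicity has to be tested separately from completeness.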
7. Constructor pipeline
tidyped() currently has two distinct tracing paths:
- Raw-input path (data.frame / data.table): uses igraph for loop
detection, candidate tracing, and topological sorting before integer
indices are finalized.
- Fast path (tidyped + cand): skips graph rebuilding and uses C++ for
candidate tracing and topological sorting on existing integer pedigree
indices.
7.2 Fast path: tidyped(tp, cand = ids)
When the input is already a tidyped and cand is supplied:
- Skipped: ID validation, loop detection, sex inference, founder
injection.
- Executed: C++ BFS tracing → C++ topo sort → C++ generation
assignment → rebuild indices → new_tidyped() + ped_meta.
The fast path is the preferred workflow for repeated local tracing
from a previously validated master pedigree:
tp_master <- tidyped(raw_ped)
tp_local <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)
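The breadth-first upward trace that the fast path runs on integer indices can be sketched in plain R. This is only an illustration of the idea: trace_up_sketch() is a hypothetical function, max_gen is assumed to play the role of tracegen (number of generations to follow), and the real tracing is done in C++.

```r
# Hypothetical BFS over SireNum / DamNum row pointers.
# Returns the row indices of the start animals and their ancestors.
trace_up_sketch <- function(sire_num, dam_num, start, max_gen = Inf) {
  keep <- logical(length(sire_num))
  frontier <- unique(start)
  gen <- 0
  while (length(frontier) > 0L) {
    keep[frontier] <- TRUE
    if (gen >= max_gen) break
    # Parents of the current frontier; 0L means "unknown" and is dropped.
    parents <- c(sire_num[frontier], dam_num[frontier])
    parents <- unique(parents[parents > 0L])
    # Only visit rows that have not been kept yet.
    frontier <- parents[!keep[parents]]
    gen <- gen + 1
  }
  which(keep)
}

# Toy indices for animals A, B, C, D, E in rows 1..5:
# C = A x B, D = C x B, E is an unrelated founder.
sire_num <- c(0L, 0L, 1L, 3L, 0L)
dam_num  <- c(0L, 0L, 2L, 2L, 0L)

all_anc <- trace_up_sketch(sire_num, dam_num, start = 4L)
one_gen <- trace_up_sketch(sire_num, dam_num, start = 4L, max_gen = 1)
```

Because the trace works on row pointers only, no ID matching or graph object is needed, which is the reason the fast path can skip graph rebuilding.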
7.3 new_tidyped() — internal constructor
new_tidyped() attaches the "tidyped" class
via setattr() (no copy) and clears data.table’s invisible
flag via x[]. It does not attach
ped_meta — that is the caller’s responsibility. It should
only be called when the caller has already ensured structural
validity.
8. Three-tier guard system
Analysis functions must guard their inputs. visPedigree provides
three guard levels, chosen based on what each function needs.
8.1 validate_tidyped() — visualization guard
- Attempts silent class recovery via ensure_tidyped().
- Checks only that Ind, Sire, Dam, Sex exist.
- Does not require pedigree completeness.
- Used by: visped(), plot.tidyped(), summary.tidyped().
8.2 ensure_tidyped() — structure-light guard
- If already tidyped: returns as-is.
- If the class was dropped but the 8 core columns (Ind, Sire, Dam, Sex,
Gen, IndNum, SireNum, DamNum) are present: rebuilds IndNum if stale,
restores class, emits a message.
- Does not check pedigree completeness.
- Used by: pedsubpop(), splitped(), pedne(method = "demographic"),
pedstats(ecg = FALSE, genint = FALSE), pedfclass() (when the f column
already exists).
8.3 ensure_complete_tidyped(): complete-pedigree guard
- Everything ensure_tidyped() does, plus:
- Calls require_complete_pedigree(), which verifies that every non-NA
Sire/Dam is present in Ind and stops with an error if not.
- Required by any function that recurses through pedigree structure in
C++.
- Used by: inbreed(), pedecg(), pedgenint(), pedrel(),
pedne(method = "inbreeding" | "coancestry"), pedcontrib(),
pedancestry(), pedfclass() (when f must be computed), pedpartial(),
pediv(), pedmat(), pedhalflife().
8.4 Choosing the right guard
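| Guard | Class recovery | Completeness check | Typical use |
|-------|----------------|--------------------|-------------|
| validate_tidyped() | yes | no | Visualization |
| ensure_tidyped() | yes | no | Summaries on existing columns |
| ensure_complete_tidyped() | yes | yes | Pedigree recursion in C++ |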
Some functions are conditionally guarded: they use
ensure_tidyped() by default but escalate to
ensure_complete_tidyped() when a parameter triggers
pedigree recursion (for example pedstats(ecg = TRUE),
pedne(method = "coancestry")).
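The escalation pattern can be sketched with two stand-in guards in base R. All function names below are hypothetical, class recovery is left out for brevity, and the error messages are invented for the example; the sketch only shows how a parameter can switch a function from the light check to the complete-pedigree check.

```r
core_cols <- c("Ind", "Sire", "Dam", "Sex")

# Light guard (hypothetical): only the structural columns must exist.
guard_columns_sketch <- function(x) {
  missing_cols <- setdiff(core_cols, names(x))
  if (length(missing_cols) > 0L) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  x
}

# Complete guard (hypothetical): every known parent must also be in Ind.
guard_complete_sketch <- function(x) {
  x <- guard_columns_sketch(x)
  parents <- c(x$Sire, x$Dam)
  absent <- setdiff(parents[!is.na(parents)], x$Ind)
  if (length(absent) > 0L) {
    stop("incomplete pedigree, parents not in Ind: ",
         paste(absent, collapse = ", "))
  }
  x
}

# Conditionally guarded function: escalates only when recursion is requested.
summary_sketch <- function(x, recurse = FALSE) {
  x <- if (recurse) guard_complete_sketch(x) else guard_columns_sketch(x)
  nrow(x)
}

# A fragment whose parents A and B are not present as rows.
ped <- data.frame(
  Ind  = c("C", "D"),
  Sire = c("A", "C"),
  Dam  = c("B", "B"),
  Sex  = c("male", "female"),
  stringsAsFactors = FALSE
)

light  <- summary_sketch(ped)  # passes the light guard
strict <- tryCatch(
  summary_sketch(ped, recurse = TRUE),
  error = function(e) conditionMessage(e)
)
```

The same input is accepted for a column-level summary and rejected once recursion is requested, which mirrors the conditional guarding described above.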
9. Safe subsetting contract
[.tidyped is the key protection layer.
9.1 := operations
Modify-by-reference is passed through safely. Class and metadata are
preserved via setattr(). No copy occurs.
9.2 Column-only selections
If the selection removes core pedigree columns, the result is
returned as a plain data.table without warning.
9.3 Row subsets
After row subsetting, [.tidyped checks pedigree
completeness:
- Complete subset (all referenced parents still present): IndNum,
SireNum, DamNum are rebuilt in-place, class and ped_meta are preserved.
- Incomplete subset (parent records missing): result is downgraded to
plain data.table with a warning guiding the user to
tidyped(tp, cand = ids, trace = "up").
This downgrade is deliberate. It prevents stale integer indices from
reaching C++ routines.
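The row-subset rule can be illustrated with a small base-R stand-in. Everything here is an assumption made for the example: subset_rows_sketch() is hypothetical, a data.frame with an extra class "tidyped_sketch" plays the role of tidyped, and metadata handling is omitted.

```r
# Hypothetical row-subset helper following the complete / incomplete rule.
subset_rows_sketch <- function(ped, rows) {
  sub <- as.data.frame(unclass(ped), stringsAsFactors = FALSE)[rows, , drop = FALSE]
  rownames(sub) <- NULL

  parents <- c(sub$Sire, sub$Dam)
  complete <- all(parents[!is.na(parents)] %in% sub$Ind)

  if (!complete) {
    # Downgrade: keep the data, drop the pedigree class.
    warning("subset is incomplete; returning a plain data.frame")
    return(sub)
  }

  # Complete subset: rebuild the integer columns for the new row order.
  sub$IndNum  <- seq_len(nrow(sub))
  sub$SireNum <- match(sub$Sire, sub$Ind, nomatch = 0L)
  sub$DamNum  <- match(sub$Dam,  sub$Ind, nomatch = 0L)
  class(sub) <- c("tidyped_sketch", "data.frame")
  sub
}

ped <- data.frame(
  Ind  = c("A", "B", "C", "D", "E"),
  Sire = c(NA, NA, "A", "C", NA),
  Dam  = c(NA, NA, "B", "B", NA),
  stringsAsFactors = FALSE
)
class(ped) <- c("tidyped_sketch", "data.frame")

# Dropping D keeps every referenced parent, so the class survives.
ok <- subset_rows_sketch(ped, c(1, 2, 3, 5))

# Keeping only C and D loses parents A and B, so the result is downgraded.
bad <- suppressWarnings(subset_rows_sketch(ped, c(3, 4)))
```

Rebuilding the indices only in the complete branch is the point of the design: an incomplete result never carries a class that promises valid row pointers.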
10. Computational boundaries: C++ vs igraph
visPedigree delegates heavy pedigree recursion to C++ and uses igraph
where a graph object is still the simplest representation.
10.1 C++ — core computation path
| Task | C++ functions |
|------|---------------|
| Ancestry / descendant tracing | cpp_trace_ancestors, cpp_trace_descendants |
| Topological sorting | cpp_topo_order |
| Generation assignment | cpp_assign_generations_top, cpp_assign_generations_bottom |
| Inbreeding coefficients | cpp_calculate_inbreeding (Meuwissen & Luo) |
| Relationship matrices | cpp_addmat, cpp_dommat, cpp_aamat, cpp_ainv |
All C++ functions consume SireNum / DamNum
integer vectors and assume the head invariant (§3).
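As an illustration of why direct indexing matters, the sketch below assigns generations in a single forward pass over the integer vectors. It is written in R for readability and is not the package's C++ code: assign_gen_sketch() is a hypothetical name, and the rule used here (founders get 1, offspring get one more than their deepest parent) is an assumed definition chosen for the example.

```r
# Hypothetical single-pass generation assignment on row pointers.
assign_gen_sketch <- function(sire_num, dam_num) {
  idx <- seq_along(sire_num)
  # Assumption of this sketch: parents appear in earlier rows than offspring.
  stopifnot(all(sire_num < idx), all(dam_num < idx))

  gen <- integer(length(sire_num))
  for (i in idx) {
    p <- c(sire_num[i], dam_num[i])
    p <- p[p > 0L]  # drop unknown parents (encoded as 0L)
    gen[i] <- if (length(p) == 0L) 1L else max(gen[p]) + 1L
  }
  gen
}

# Rows 1..5 are A, B, C = A x B, D = C x B, and E (unrelated founder).
gen <- assign_gen_sketch(
  sire_num = c(0L, 0L, 1L, 3L, 0L),
  dam_num  = c(0L, 0L, 2L, 2L, 0L)
)
```

Each parent lookup is a plain vector index, so a routine of this shape produces wrong results without any error if the stored integers no longer match the row order.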
10.2 igraph — graph-specific tasks
| Task | Where used | igraph functions |
|------|------------|------------------|
| Pedigree visualization | visped() pipeline | graph_from_data_frame, layout_with_sugiyama, plot.igraph |
| Connected components | splitped() | graph_from_edgelist, components |
| Loop detection | tidyped() raw-input path | graph_from_edgelist, is_dag |
| Loop diagnosis | tidyped() error path | which_loop, shortest_paths, neighbors, components |
| Candidate tracing | tidyped() raw-input path | neighborhood |
| Topological sorting | tidyped() raw-input path | topo_sort |
igraph is not used in the core numerical pedigree analysis routines
such as inbreed(), pedmat(),
pedecg(), or pedrel(), but it is still part of
the preprocessing and visualization stack.
11. Extension rules
When extending the class, follow these rules.
11.1 Do not add new pedigree-level attributes
Prefer adding fields to ped_meta instead of scattering
new standalone attributes.
11.2 Keep computed state derivable
If a column can be rebuilt from pedigree structure, prefer derivation
over storing opaque cached state.
11.3 Preserve data.table semantics
Use :=, set(), and setattr()
carefully. Avoid patterns that trigger full copies unless
unavoidable.
11.4 Respect downgrade semantics
Any future method that subsets rows must preserve the current rule:
valid complete subset -> tidyped; incomplete subset -> plain data.table.
11.5 Document C++ assumptions
Any feature using IndNum, SireNum, or
DamNum should document whether it requires:
- topologically ordered rows,
- dense consecutive indices,
- 0L encoding for missing parents.
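One way to make such assumptions explicit is a small checker that reports each of them separately. The function below is a hypothetical base-R sketch, not part of the package; "topologically ordered" is taken here to mean that every known parent sits in an earlier row than its offspring.

```r
# Hypothetical report of the three index assumptions listed above.
index_assumptions_sketch <- function(ind_num, sire_num, dam_num) {
  n <- length(ind_num)
  par <- c(sire_num, dam_num)
  c(
    # Parents (or 0L) always point to an earlier row.
    topo_ordered      = isTRUE(all(sire_num < ind_num)) &&
                        isTRUE(all(dam_num < ind_num)),
    # IndNum is exactly 1..n with no gaps.
    dense_consecutive = identical(ind_num, seq_len(n)),
    # Missing parents are 0L, never NA, and pointers stay within 0..n.
    zero_for_missing  = !anyNA(par) && all(par >= 0L) && all(par <= n)
  )
}

# Parents listed before offspring.
ordered <- index_assumptions_sketch(
  ind_num  = 1:4,
  sire_num = c(0L, 0L, 1L, 3L),
  dam_num  = c(0L, 0L, 2L, 2L)
)

# Offspring in row 1, its parents in rows 2 and 3.
unordered <- index_assumptions_sketch(
  ind_num  = 1:3,
  sire_num = c(2L, 0L, 0L),
  dam_num  = c(3L, 0L, 0L)
)
```

Reporting the assumptions individually makes it clear which one a given feature depends on: the second pedigree has valid pointers and dense indices but is not topologically ordered.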
12. User-facing inspection helpers
| Helper | Returns |
|--------|---------|
| is_tidyped(x) | TRUE if class is present |
| is_complete_pedigree(x) | TRUE if all Sire/Dam are in Ind |
| pedmeta(x) | The ped_meta named list |
| has_inbreeding(x) | TRUE if f column exists |
| has_candidates(x) | TRUE if Cand column exists |
Future extensions should prefer helper functions over direct
attribute access.
13. Maintenance checklist
Before merging a structural change to tidyped,
check:
- Does class identity remain c("tidyped", "data.table", "data.frame")?
- Is the head invariant IndNum == row index preserved after every code
path?
- Are ped_meta fields preserved correctly?
- Does [.tidyped still handle := without copy issues?
- Do incomplete row subsets still downgrade with warning?
- Are integer pedigree columns rebuilt whenever a subset remains valid?
- Does tidyped(tp_master, cand = ...) match the full path result?
- After setorder() or merge(), are indices rebuilt before reaching C++?
- Do package tests and vignettes build cleanly?
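The reordering item in the checklist is easy to reproduce with a toy example. The base-R sketch below uses plain row indexing to stand in for setorder() or merge(), and rebuild_nums_sketch() is a hypothetical helper; it shows how stale indices point at the wrong animal and how rebuilding fixes the pointers.

```r
ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)
ped$IndNum  <- seq_len(nrow(ped))
ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
ped$DamNum  <- match(ped$Dam,  ped$Ind, nomatch = 0L)

# Reverse the row order without touching the integer columns.
shuffled <- ped[c(4, 3, 2, 1), ]
rownames(shuffled) <- NULL

# Detection: IndNum no longer equals the row number.
stale <- !identical(shuffled$IndNum, seq_len(nrow(shuffled)))

# Row 1 is now D. Its stored SireNum still says 3, which is now B, not C.
wrong_sire <- shuffled$Ind[shuffled$SireNum[1]]

# Hypothetical repair: rebuild all three integer columns from the IDs.
# This only repairs the pointers; it does not re-sort the rows.
rebuild_nums_sketch <- function(x) {
  x$IndNum  <- seq_len(nrow(x))
  x$SireNum <- match(x$Sire, x$Ind, nomatch = 0L)
  x$DamNum  <- match(x$Dam,  x$Ind, nomatch = 0L)
  x
}

fixed <- rebuild_nums_sketch(shuffled)
right_sire <- fixed$Ind[fixed$SireNum[1]]
```

A regression test built along these lines fails loudly when a code path reorders rows and forgets to rebuild the indices.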
14. Recommended workflow
For large pedigrees, the intended usage pattern is:
# build one validated master pedigree
tp_master <- tidyped(raw_ped)
# reuse it for repeated local tracing (fast path)
tp_local <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)
# modify analysis columns in place
tp_master[, phenotype := pheno]
# split only when disconnected components matter
parts <- splitped(tp_master)
This keeps workflows explicit, fast, and safe.