6. tidyped Class Structure and Extension Notes

This document describes the structural contract of the tidyped class in visPedigree 1.8.0. It is intended for maintenance and extension work.

1. Class identity

tidyped is an S3 class layered on top of data.table.

Expected class vector:

c("tidyped", "data.table", "data.frame")

The class is created through new_tidyped() (internal constructor) and checked with is_tidyped().

2. Core design goals

tidyped is designed to be:

  1. safe for C++: integer pedigree indices (IndNum, SireNum, DamNum) are always aligned with row order, so C++ routines can index directly without translation;
  2. fast for large pedigrees: the fast path skips redundant validation when the input is already a tidyped;
  3. compatible with data.table: in-place modification via := and set() preserves class and metadata without copying;
  4. explicit about structural degradation: row subsets that break pedigree completeness are downgraded to plain data.table with a warning.

3. The head invariant: IndNum == row index

The single most important structural rule in visPedigree:

IndNum[i] must equal i for every row.

This means SireNum and DamNum are direct row pointers: the sire of individual i lives at row SireNum[i], and 0L encodes a missing parent.

Every C++ function in visPedigree — inbreeding coefficients, relationship matrices, BFS tracing, topological sorting — relies on this invariant. If it breaks, C++ will read wrong parents.
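A minimal base R sketch of what the invariant provides. The four-row pedigree and the helper name parent_ids() are made up for illustration and are not part of visPedigree.

```r
# Toy pedigree, already ordered so that IndNum equals the row number.
ped <- data.frame(
  Ind     = c("A", "B", "C", "D"),
  IndNum  = 1:4,
  SireNum = c(0L, 0L, 1L, 1L),  # 0L = missing sire
  DamNum  = c(0L, 0L, 2L, 3L),  # 0L = missing dam
  stringsAsFactors = FALSE
)

# The invariant itself: IndNum[i] == i for every row.
invariant_ok <- identical(ped$IndNum, seq_len(nrow(ped)))

# Hypothetical helper: resolve parent IDs by direct row indexing.
parent_ids <- function(ped, i) {
  s <- ped$SireNum[i]
  d <- ped$DamNum[i]
  c(sire = if (s > 0L) ped$Ind[s] else NA_character_,
    dam  = if (d > 0L) ped$Ind[d] else NA_character_)
}

parent_ids(ped, 4L)       # sire "A", dam "C"

# Reordering rows without rebuilding the indices breaks the pointers:
shuffled <- ped[c(2, 1, 3, 4), ]
parent_ids(shuffled, 3L)  # sire "B", dam "A": wrong parents for "C"
```

The second call shows the failure mode described above: the integer columns still hold the old row numbers, so direct indexing silently returns the wrong individuals.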

This invariant is enforced at three levels:

4. Column contract

4.1 Minimal structural columns

These four columns define a valid tidyped:

| Column | Type | Description |
|---|---|---|
| Ind | character | Unique individual ID |
| Sire | character | Sire ID, NA for unknown |
| Dam | character | Dam ID, NA for unknown |
| Sex | character | "male", "female", or "unknown" |

Checked by validate_tidyped().

4.2 Integer pedigree columns

| Column | Type | Description |
|---|---|---|
| IndNum | integer | Row index (== row number, see §3) |
| SireNum | integer | Row index of sire, 0L for missing |
| DamNum | integer | Row index of dam, 0L for missing |

These exist whenever tidyped() is called with addnum = TRUE (default). They are the interface between R and C++.
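One way such integer columns can be derived from the character IDs is base R's match(). This is only a sketch under the assumption that the rows are already in their final order; it is not taken from the visPedigree source.

```r
# Character IDs in final row order (toy data).
ind  <- c("A", "B", "C", "D")
sire <- c(NA, NA, "A", "A")
dam  <- c(NA, NA, "B", "C")

# IndNum is simply the row number.
ind_num <- seq_along(ind)

# match() returns the row of each parent; nomatch = 0L encodes a
# missing parent, which is the convention used in the table above.
sire_num <- match(sire, ind, nomatch = 0L)
dam_num  <- match(dam,  ind, nomatch = 0L)

data.frame(Ind = ind, IndNum = ind_num, SireNum = sire_num, DamNum = dam_num)
```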

4.3 Other common columns

| Column | Description |
|---|---|
| Gen | Generation number |
| Family | Family group code |
| FamilySize | Number of offspring in the family |
| Cand | TRUE for candidate individuals |
| f | Inbreeding coefficient (added by inbreed()) |

4.4 Column naming convention

All data columns use PascalCase (Ind, SireNum, MeanF, ECG), matching the core column style.

5. Metadata layer

Pedigree-level metadata is stored in a single attribute:

attr(x, "ped_meta")

Built by build_ped_meta(), accessed by pedmeta().

| Field | Type | Description |
|---|---|---|
| selfing | logical | Whether self-fertilization mode was used |
| bisexual_parents | character | IDs appearing as both sire and dam |
| genmethod | character | "top" or "bottom" generation numbering |

No other pedigree-level attributes should be added outside ped_meta.
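The single-container idea can be sketched in base R as follows. The accessor get_meta() is a hypothetical stand-in for pedmeta(), and the extra field source_file is invented purely to show where a new field would go.

```r
# Toy pedigree carrying all pedigree-level metadata in one attribute.
ped <- data.frame(
  Ind  = c("A", "B", "C"),
  Sire = c(NA, NA, "A"),
  Dam  = c(NA, NA, "B"),
  stringsAsFactors = FALSE
)
attr(ped, "ped_meta") <- list(
  selfing          = FALSE,
  bisexual_parents = character(0),
  genmethod        = "top"
)

# Hypothetical accessor, standing in for pedmeta().
get_meta <- function(x) attr(x, "ped_meta", exact = TRUE)

# Extending the metadata: add a field to the list, not a new attribute.
meta <- get_meta(ped)
meta$source_file <- "example.csv"  # invented field, illustration only
attr(ped, "ped_meta") <- meta

names(get_meta(ped))
```

Keeping everything inside one list means a method that copies or restores metadata only has to handle a single attribute.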

6. Structural invariants

The following invariants must hold for a valid tidyped:

  1. IndNum == row index (see §3).
  2. Ind is unique — no duplicate individual IDs.
  3. Completeness — every non-NA Sire and Dam appears in Ind.
  4. Acyclicity — no individual is its own ancestor.
  5. SireNum / DamNum consistency: 0L for missing parents, valid row indices otherwise.
  6. ped_meta is the sole metadata container — no scattered attributes.

Invariants 1–5 are established by tidyped() and guarded by [.tidyped. Invariant 6 is a development convention.
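Invariants 1, 2, 3 and 5 can be checked with a few vectorised comparisons. The function below is a hypothetical sketch in base R, not a visPedigree function, and it deliberately leaves out the acyclicity check (invariant 4).

```r
# Hypothetical checker: returns one logical flag per invariant.
check_invariants_sketch <- function(x) {
  parents <- c(x$Sire, x$Dam)
  parents <- parents[!is.na(parents)]
  c(
    indnum_is_row   = identical(as.integer(x$IndNum), seq_len(nrow(x))),
    ind_unique      = anyDuplicated(x$Ind) == 0L,
    complete        = all(parents %in% x$Ind),
    nums_consistent =
      identical(as.integer(x$SireNum), match(x$Sire, x$Ind, nomatch = 0L)) &&
      identical(as.integer(x$DamNum),  match(x$Dam,  x$Ind, nomatch = 0L))
  )
}

ped <- data.frame(
  Ind = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "A"), Dam = c(NA, NA, "B", "C"),
  IndNum = 1:4, SireNum = c(0L, 0L, 1L, 1L), DamNum = c(0L, 0L, 2L, 3L),
  stringsAsFactors = FALSE
)

check_invariants_sketch(ped)        # all TRUE
check_invariants_sketch(ped[-1, ])  # dropping founder "A" breaks three checks
```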

7. Constructor pipeline

tidyped() currently has two distinct tracing paths:

7.1 Full path: tidyped(raw_input)

When the input is a raw data.frame or data.table:

  1. validate_and_prepare_ped() — normalize IDs, detect duplicates and bisexual parents, inject missing founders.
  2. Loop detection — igraph builds a directed graph and checks is_dag(); which_loop() and shortest_paths() are used only on the error path to report informative loop diagnostics.
  3. Candidate tracing — if cand is supplied, igraph neighborhood search is used on the raw-input path.
  4. Topological sort — igraph topo_sort() on the raw-input path.
  5. Generation assignment — C++ (cpp_assign_generations_top / cpp_assign_generations_bottom) using the pedigree implied by the sorted rows.
  6. Sex inference — resolve unknowns from parental roles.
  7. Build integer indices — IndNum, SireNum, DamNum.
  8. new_tidyped() + attach ped_meta.
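Steps 2 and 4 rely on igraph in the package. To illustrate the underlying idea without igraph, here is a base R sketch of Kahn's algorithm, which is a different technique from the one the pipeline uses: it yields a parents-before-offspring order and fails when a loop prevents some individuals from ever being placed. The function name and the error message are assumptions.

```r
# Kahn's algorithm on a pedigree: repeatedly place individuals whose
# known parents have all been placed already.
topo_order_sketch <- function(ind, sire, dam) {
  s <- match(sire, ind, nomatch = 0L)
  d <- match(dam,  ind, nomatch = 0L)
  pending <- (s > 0L) + (d > 0L)   # known parents not yet placed
  queue <- which(pending == 0L)    # founders start the queue
  ord <- integer(0)
  while (length(queue) > 0L) {
    v <- queue[1L]
    queue <- queue[-1L]
    ord <- c(ord, v)
    # Offspring of v, once via the sire link and once via the dam link.
    for (k in c(which(s == v), which(d == v))) {
      pending[k] <- pending[k] - 1L
      if (pending[k] == 0L) queue <- c(queue, k)
    }
  }
  if (length(ord) < length(ind)) {
    stop("pedigree loop; unresolved individuals: ",
         paste(ind[pending > 0L], collapse = ", "))
  }
  ind[ord]
}

topo_order_sketch(
  ind  = c("C", "A", "B", "D"),
  sire = c("A", NA, NA, "A"),
  dam  = c("B", NA, NA, "C")
)
```

On this input the founders "A" and "B" are placed first, then "C", then "D". Individuals left unplaced at the end are those on a loop or descended from one.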

7.2 Fast path: tidyped(tp, cand = ids)

When the input is already a tidyped and cand is supplied:

The fast path is the preferred workflow for repeated local tracing from a previously validated master pedigree:

tp_master <- tidyped(raw_ped)
tp_local  <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)
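Upward tracing on integer indices can be pictured as a breadth-first walk over SireNum and DamNum. The base R sketch below is not the package's C++ routine; it assumes that tracegen counts parental generations above the candidates, and the function name is hypothetical.

```r
# Breadth-first ancestor tracing on row-pointer columns.
trace_up_sketch <- function(sire_num, dam_num, cand, tracegen = Inf) {
  keep <- logical(length(sire_num))
  frontier <- unique(cand)
  keep[frontier] <- TRUE
  gen <- 0
  while (length(frontier) > 0L && gen < tracegen) {
    parents <- unique(c(sire_num[frontier], dam_num[frontier]))
    parents <- parents[parents > 0L]    # drop missing parents (0L)
    parents <- parents[!keep[parents]]  # drop rows already visited
    keep[parents] <- TRUE
    frontier <- parents
    gen <- gen + 1
  }
  which(keep)  # row numbers of candidates plus traced ancestors
}

# Rows: A, B founders; C = A x B; D = A x C; E = D x B.
sire_num <- c(0L, 0L, 1L, 1L, 4L)
dam_num  <- c(0L, 0L, 2L, 3L, 2L)

trace_up_sketch(sire_num, dam_num, cand = 5L, tracegen = 1)  # rows 2, 4, 5
trace_up_sketch(sire_num, dam_num, cand = 5L)                # rows 1 to 5
```

The returned row numbers refer to the master pedigree; any subset built from them needs its own IndNum, SireNum and DamNum before the invariant in §3 holds again.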

7.3 new_tidyped() — internal constructor

new_tidyped() attaches the "tidyped" class via setattr() (no copy) and clears data.table’s invisible flag via x[]. It does not attach ped_meta — that is the caller’s responsibility. It should only be called when the caller has already ensured structural validity.
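The by-reference class attachment can be sketched with data.table's setattr(). The function below is a hypothetical simplification, not the internal constructor: it performs no validation and omits the x[] step.

```r
library(data.table)

# Hypothetical, simplified constructor: attach the class by reference.
new_tidyped_sketch <- function(x) {
  stopifnot(is.data.table(x))
  setattr(x, "class", c("tidyped", "data.table", "data.frame"))
  invisible(x)
}

raw <- data.table(
  Ind  = c("A", "B", "C"),
  Sire = c(NA, NA, "A"),
  Dam  = c(NA, NA, "B")
)
tp <- new_tidyped_sketch(raw)

class(tp)                    # "tidyped" "data.table" "data.frame"
is.null(attr(tp, "ped_meta"))  # metadata is left to the caller
```

Because setattr() modifies its argument in place, raw and tp refer to the same object after the call, which is the point of avoiding a copy on large pedigrees.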

8. Three-tier guard system

Analysis functions must guard their inputs. visPedigree provides three guard levels, chosen based on what each function needs.

8.1 validate_tidyped() — visualization guard

8.2 ensure_tidyped() — structure-light guard

8.3 ensure_complete_tidyped() — complete-pedigree guard

8.4 Choosing the right guard

| Guard | Recovers class? | Requires completeness? | When to use |
|---|---|---|---|
| validate_tidyped() | yes | no | Visualization |
| ensure_tidyped() | yes | no | Summaries on existing columns |
| ensure_complete_tidyped() | yes | yes | Pedigree recursion in C++ |

Some functions are conditionally guarded: they use ensure_tidyped() by default but escalate to ensure_complete_tidyped() when a parameter triggers pedigree recursion (for example pedstats(ecg = TRUE), pedne(method = "coancestry")).
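The conditional pattern can be sketched as follows. Both guards and the analysis function are hypothetical stand-ins written in base R; the real guards do more than these checks, and their exact behavior is not reproduced here.

```r
# Stand-in for a structure-light guard: only the columns must exist.
guard_light <- function(x) {
  stopifnot(is.data.frame(x), all(c("Ind", "Sire", "Dam") %in% names(x)))
  x
}

# Stand-in for a complete-pedigree guard: every known parent must be a row.
guard_complete <- function(x) {
  x <- guard_light(x)
  parents <- c(x$Sire, x$Dam)
  absent <- setdiff(parents[!is.na(parents)], x$Ind)
  if (length(absent) > 0L) {
    stop("incomplete pedigree; missing parents: ", paste(absent, collapse = ", "))
  }
  x
}

# Hypothetical analysis function: escalate only when recursion is requested.
summarise_sketch <- function(x, recurse = FALSE) {
  x <- if (recurse) guard_complete(x) else guard_light(x)
  list(n = nrow(x), guard = if (recurse) "complete" else "light")
}

incomplete <- data.frame(
  Ind = c("C", "D"), Sire = c("A", "A"), Dam = c("B", "C"),
  stringsAsFactors = FALSE
)

summarise_sketch(incomplete)                     # passes the light guard
try(summarise_sketch(incomplete, recurse = TRUE))  # stops: "A" and "B" absent
```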

9. Safe subsetting contract

[.tidyped is the key protection layer.

9.1 := operations

Modify-by-reference is passed through safely. Class and metadata are preserved via setattr(). No copy occurs.

9.2 Column-only selections

If the selection removes core pedigree columns, the result is returned as a plain data.table without warning.

9.3 Row subsets

After row subsetting, [.tidyped checks pedigree completeness:

This downgrade is deliberate. It prevents stale integer indices from reaching C++ routines.
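The downgrade rule can be illustrated with a plain data.frame. This is a sketch of the rule only, not the [.tidyped method: the class name tidyped_sketch, the function name, and the warning text are invented.

```r
# Row subset that keeps the class only when the result is still complete.
subset_rows_sketch <- function(x, rows) {
  out <- as.data.frame(x)[rows, , drop = FALSE]
  rownames(out) <- NULL
  parents <- c(out$Sire, out$Dam)
  parents <- parents[!is.na(parents)]
  if (all(parents %in% out$Ind)) {
    # Complete subset: rebuild the row pointers for the new row order.
    out$IndNum  <- seq_len(nrow(out))
    out$SireNum <- match(out$Sire, out$Ind, nomatch = 0L)
    out$DamNum  <- match(out$Dam,  out$Ind, nomatch = 0L)
    class(out) <- c("tidyped_sketch", "data.frame")
  } else {
    # Incomplete subset: warn and return a plain table.
    warning("subset breaks pedigree completeness; class dropped")
    class(out) <- "data.frame"
  }
  out
}

ped <- data.frame(
  Ind = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "A"), Dam = c(NA, NA, "B", "C"),
  IndNum = 1:4, SireNum = c(0L, 0L, 1L, 1L), DamNum = c(0L, 0L, 2L, 3L),
  stringsAsFactors = FALSE
)

good <- subset_rows_sketch(ped, 1:3)                   # A, B, C: still complete
bad  <- suppressWarnings(subset_rows_sketch(ped, 3:4))  # C, D: parents A, B gone
inherits(good, "tidyped_sketch")
inherits(bad, "tidyped_sketch")
```

In the incomplete case the old SireNum and DamNum values are still present in the returned table, but without the class they no longer pass a class check in downstream code.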

10. Computational boundaries: C++ vs igraph

visPedigree delegates heavy pedigree recursion to C++ and uses igraph where a graph object is still the simplest representation.

10.1 C++ — core computation path

| Task | C++ function |
|---|---|
| Ancestry / descendant tracing | cpp_trace_ancestors, cpp_trace_descendants |
| Topological sorting | cpp_topo_order |
| Generation assignment | cpp_assign_generations_top, cpp_assign_generations_bottom |
| Inbreeding coefficients | cpp_calculate_inbreeding (Meuwissen & Luo) |
| Relationship matrices | cpp_addmat, cpp_dommat, cpp_aamat, cpp_ainv |

All C++ functions consume SireNum / DamNum integer vectors and assume the head invariant (§3).
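To show how such a routine consumes the integer vectors, here is the classic tabular method for the additive relationship matrix in base R. It is a textbook technique used for illustration, not the package's C++ code, and it assumes that every parent sits in an earlier row than its offspring.

```r
# Tabular method: fill A row by row using the parents' rows.
addmat_sketch <- function(sire_num, dam_num) {
  n <- length(sire_num)
  A <- matrix(0, n, n)
  for (i in seq_len(n)) {
    s <- sire_num[i]
    d <- dam_num[i]
    # Diagonal: 1 + inbreeding, i.e. half the parents' relationship.
    A[i, i] <- 1 + if (s > 0L && d > 0L) 0.5 * A[s, d] else 0
    # Off-diagonal: average of the relationships with each known parent.
    for (j in seq_len(i - 1L)) {
      a <- 0
      if (s > 0L) a <- a + 0.5 * A[j, s]
      if (d > 0L) a <- a + 0.5 * A[j, d]
      A[i, j] <- A[j, i] <- a
    }
  }
  A
}

# Rows: A, B founders; C = A x B; D = A x C (sire mated to his daughter).
A <- addmat_sketch(sire_num = c(0L, 0L, 1L, 1L), dam_num = c(0L, 0L, 2L, 3L))
diag(A) - 1  # inbreeding coefficients: 0, 0, 0, 0.25
```

Every lookup such as A[j, s] indexes by SireNum or DamNum directly, which is why stale indices produce wrong numbers rather than an error.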

10.2 igraph — graph-specific tasks

| Task | Where | igraph functions |
|---|---|---|
| Pedigree visualization | visped() pipeline | graph_from_data_frame, layout_with_sugiyama, plot.igraph |
| Connected components | splitped() | graph_from_edgelist, components |
| Loop detection | tidyped() raw-input path | graph_from_edgelist, is_dag |
| Loop diagnosis | tidyped() error path | which_loop, shortest_paths, neighbors, components |
| Candidate tracing | tidyped() raw-input path | neighborhood |
| Topological sorting | tidyped() raw-input path | topo_sort |

igraph is not used in the core numerical pedigree analysis routines such as inbreed(), pedmat(), pedecg(), or pedrel(), but it is still part of the preprocessing and visualization stack.

11. Extension rules

When extending the class, follow these rules.

11.1 Do not add new pedigree-level attributes

Prefer adding fields to ped_meta instead of scattering new standalone attributes.

11.2 Keep computed state derivable

If a column can be rebuilt from pedigree structure, prefer derivation over storing opaque cached state.

11.3 Preserve data.table semantics

Use :=, set(), and setattr() carefully. Avoid patterns that trigger full copies unless unavoidable.

11.4 Respect downgrade semantics

Any future method that subsets rows must preserve the current rule:

valid complete subset -> tidyped; incomplete subset -> plain data.table.

11.5 Document C++ assumptions

Any feature using IndNum, SireNum, or DamNum should document whether it requires:

12. User-facing inspection helpers

| Function | Returns |
|---|---|
| is_tidyped(x) | TRUE if class is present |
| is_complete_pedigree(x) | TRUE if all Sire/Dam are in Ind |
| pedmeta(x) | The ped_meta named list |
| has_inbreeding(x) | TRUE if f column exists |
| has_candidates(x) | TRUE if Cand column exists |

Future extensions should prefer helper functions over direct attribute access.

13. Maintenance checklist

Before merging a structural change to tidyped, check:

  1. Does class identity remain c("tidyped", "data.table", "data.frame")?
  2. Is the head invariant IndNum == row index preserved after every code path?
  3. Are ped_meta fields preserved correctly?
  4. Does [.tidyped still handle := without copy issues?
  5. Do incomplete row subsets still downgrade with warning?
  6. Are integer pedigree columns rebuilt whenever a subset remains valid?
  7. Does tidyped(tp_master, cand = ...) match the full path result?
  8. After setorder() or merge(), are indices rebuilt before reaching C++?
  9. Do package tests and vignettes build cleanly?