gc_play_by_play() and wsc_play_by_play()
return the same cleaned public play-by-play schema, but they do not
start from the same raw feed. gc_play_by_play() starts from
the GameCenter play-by-play feed. wsc_play_by_play() starts
from the World Showcase feed and uses GameCenter metadata to keep the
output aligned with the same roster, team, and HTML report context. The
pipeline is intentionally source-aware:
- The API play-by-play is the authoritative event log.
- The HTML play-by-play report supplies the on-ice player IDs.
- The intended manpower state comes from situationCode.

shift_chart() is not used to populate on-ice player IDs
inside these functions. It is a separate downstream source for shift
timing via add_shift_times().

This article walks through the process step by step.
For a single game, the functions fetch the raw API play-by-play plus
the HTML play-by-play report. wsc_play_by_play() also
fetches the WSC play-by-play feed itself. These requests are issued
in parallel because network latency contributes far more to total
runtime than the in-memory cleaning steps do.
The HTML report is fetched for every game because it is the only source that consistently exposes the full on-ice player sets.
Once the raw feed is downloaded, the package standardizes the columns before doing any enrichment. That includes:
- gameId into the table
- periodNumber, eventTypeCode, and eventTypeDescKey
- shootingPlayerId values on goal rows when the scorer is known

The aim here is to make the downstream logic operate on one internal structure, even when the upstream feed formats differ.
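
A minimal sketch of this standardization step, assuming the raw feed has already been flattened into a table. The raw-side column names, the helper standardize_pbp(), and the player IDs are illustrative assumptions, not the package's internals.

```r
# Hypothetical flattened raw rows; raw column names and player IDs are made up.
raw <- data.frame(
  periodDescriptor.number  = c(1L, 1L, 1L),
  typeCode                 = c(520L, 506L, 505L),
  typeDescKey              = c("period-start", "shot-on-goal", "goal"),
  details.shootingPlayerId = c(NA, 9000002L, NA),
  details.scoringPlayerId  = c(NA, NA, 9000003L)
)

standardize_pbp <- function(raw, game_id) {
  out <- data.frame(
    gameId           = rep(game_id, nrow(raw)),
    periodNumber     = raw[["periodDescriptor.number"]],
    eventTypeCode    = raw[["typeCode"]],
    eventTypeDescKey = raw[["typeDescKey"]],
    shootingPlayerId = raw[["details.shootingPlayerId"]]
  )
  # On goal rows the scorer is the shooter; fill only where the scorer is known.
  scorer <- raw[["details.scoringPlayerId"]]
  fill <- out$eventTypeDescKey == "goal" &
    is.na(out$shootingPlayerId) & !is.na(scorer)
  out$shootingPlayerId[fill] <- scorer[fill]
  out
}

pbp <- standardize_pbp(raw, 2023030417)
```

After this step every row carries the game ID, and goal rows expose the scorer through the same shootingPlayerId column that other shot rows use.
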
The API play-by-play remains authoritative, but not every upstream
sortOrder is logically consistent. Before any HTML matching
happens, the pipeline repairs clear boundary mistakes. The guiding
principle is conservative: only fix sequences that are plainly
impossible from the game clock and event context.
Examples:
- period-start and the opening faceoff

This matters because HTML matching becomes much more reliable once the API timeline itself is internally coherent.
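
The conservative rule can be sketched for the period-start boundary alone. The helper repair_period_start(), the secondsElapsed column, and the sample rows are hypothetical. The sketch moves period-start to the front of its period only when every row sorted ahead of it sits at 0 elapsed seconds.

```r
# Hypothetical event rows: the opening faceoff is sorted ahead of period-start.
events <- data.frame(
  periodNumber     = c(1L, 1L, 1L, 1L),
  secondsElapsed   = c(0L, 0L, 14L, 31L),
  eventTypeDescKey = c("faceoff", "period-start", "hit", "shot-on-goal"),
  sortOrder        = c(8L, 9L, 12L, 15L)
)

repair_period_start <- function(events) {
  events <- events[order(events$periodNumber, events$sortOrder), ]
  so  <- events$sortOrder          # sortOrder values by position, reused below
  pos <- seq_len(nrow(events))
  for (p in unique(events$periodNumber)) {
    in_period <- which(events$periodNumber == p)
    start <- in_period[events$eventTypeDescKey[in_period] == "period-start"]
    if (length(start) != 1L || start == in_period[1]) next
    ahead <- in_period[in_period < start]
    # Conservative: repair only when all rows ahead of period-start are at 0:00.
    if (all(events$secondsElapsed[ahead] == 0L)) {
      pos[c(ahead, start)] <- c(start, ahead)
    }
  }
  events <- events[pos, ]
  events$sortOrder <- so           # keep sortOrder ascending after the move
  rownames(events) <- NULL
  events
}

fixed <- repair_period_start(events)
```

Rows later in the period are never touched, which keeps the repair limited to sequences that are impossible from the game clock.
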
situationCode.

The raw situationCode is parsed into the public state columns:
- homeIsEmptyNet
- awayIsEmptyNet
- homeSkaterCount
- awaySkaterCount
- manDifferential
- strengthState

These are the public summary columns that describe the intended rules
context of the play. The raw situationCode itself is kept
in the output as-is. The HTML report can later supply a
different-looking player identity set, but it does not rewrite these
situationCode-derived state columns.
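
A minimal sketch of the parsing step. It assumes the four digits of situationCode are, in order, away goalie, away skaters, home skaters, and home goalie; that digit order and the helper parse_situation_code() are assumptions, not something this article states. manDifferential and strengthState are left out because their exact conventions are not spelled out here.

```r
parse_situation_code <- function(code) {
  # Restore leading zeros in case the code was read as a number (101 -> "0101").
  code <- sprintf("%04d", as.integer(code))
  digit <- function(i) as.integer(substr(code, i, i))
  # Assumed digit order: away goalie, away skaters, home skaters, home goalie.
  data.frame(
    situationCode   = code,
    awayIsEmptyNet  = digit(1) == 0L,
    awaySkaterCount = digit(2),
    homeSkaterCount = digit(3),
    homeIsEmptyNet  = digit(4) == 0L
  )
}

states <- parse_situation_code(c("1551", "1451", "0651"))
```

Under the assumed digit order, "0651" describes an away team with its goalie pulled for a sixth skater against five home skaters and a home goalie.
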
After the structural cleanup, the play-by-play gets the geometric and shot-context features used elsewhere in the package. That includes:
These features are derived from the cleaned API event log, not from the HTML report.
The HTML report is parsed into a second event table that contains:
The parser also handles the known HTML-side quirks, including the fact that the HTML report records blocked shots from the defending perspective while the package standardizes blocked shots from the shooting perspective.
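
The blocked-shot quirk can be illustrated with a small sketch. The column names, the helper flip_blocked_shots(), and the sample rows are hypothetical. The sketch assumes the HTML side credits a BLOCK row to the blocking team and reassigns it to the shooting team.

```r
# Hypothetical parsed HTML rows.
html_events <- data.frame(
  eventType = c("SHOT", "BLOCK", "HIT"),
  eventTeam = c("EDM", "FLA", "EDM")
)

flip_blocked_shots <- function(html, home, away) {
  is_block <- html$eventType == "BLOCK"
  # Assumed: the HTML row names the defending (blocking) team, so the
  # shooting team is the opponent.
  html$eventTeam[is_block] <- ifelse(html$eventTeam[is_block] == home, away, home)
  html
}

flipped <- flip_blocked_shots(html_events, home = "FLA", away = "EDM")
```

Once both sides describe a blocked shot from the shooting perspective, the event team can take part in matching without a special case.
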
The package does not join HTML to API rows only on time, and it does
not require the HTML on-ice counts to agree with
situationCode. It builds a richer event signature on both
sides and uses that to align the reports. The matching logic uses
combinations of:
This step is where the earlier ordering repair pays off. The cleaner the API event sequence is, the safer the HTML match becomes.
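
The idea of an event signature can be sketched as follows. The fields used here (period, elapsed seconds, event type, team, and an occurrence counter) are assumptions chosen for illustration, as are the helper add_signature() and the sample rows. The package's actual matching logic may combine different fields.

```r
add_signature <- function(events) {
  base <- paste(events$periodNumber, events$secondsElapsed,
                events$eventType, events$teamAbbrev, sep = "|")
  # Number repeated events so two identical rows in one second stay distinct.
  occurrence <- ave(seq_along(base), base, FUN = seq_along)
  events$signature <- paste(base, occurrence, sep = "#")
  events
}

# Hypothetical API-side and HTML-side event tables.
api <- data.frame(
  eventId        = c(101L, 102L, 103L, 104L),
  periodNumber   = c(1L, 1L, 1L, 1L),
  secondsElapsed = c(0L, 42L, 42L, 60L),
  eventType      = c("faceoff", "shot", "shot", "hit"),
  teamAbbrev     = c("EDM", "FLA", "FLA", "EDM")
)
html <- data.frame(
  htmlRow        = c(1L, 2L, 3L),
  periodNumber   = c(1L, 1L, 1L),
  secondsElapsed = c(0L, 42L, 42L),
  eventType      = c("faceoff", "shot", "shot"),
  teamAbbrev     = c("EDM", "FLA", "FLA")
)

# Left join: every API row is kept, matched HTML rows are attached.
matched <- merge(
  add_signature(api),
  add_signature(html)[c("signature", "htmlRow")],
  by = "signature", all.x = TRUE
)
matched <- matched[order(matched$eventId), ]
```

The occurrence counter depends on row order, which is one reason a coherent API sequence makes the match safer. The unmatched hit keeps an NA htmlRow instead of being forced onto a nearby row.
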
A matched HTML row is used for the on-ice player ID columns. It is
not used to recalculate the situationCode-derived context
columns. This keeps two related but different concepts separate:
- situationCode, homeSkaterCount, awaySkaterCount, empty-net flags, manDifferential, and strengthState describe the intended rules state.
- homeGoaliePlayerId, awayGoaliePlayerId, and the skater ID columns describe the players listed in the HTML report for that event.

The HTML report can list more or fewer players than the intended state implies. That can happen around line changes, unusual bench situations, or report noise. The package preserves the listed identities without letting those counts change the intended state context.
Penalty-shot and shootout rows with 0101 or
1010 are different. Those states are genuinely one-on-one
plays. Even if the HTML report lists extra players, the package records
only the shooter and the defending goalie for those rows.
Once a matched HTML row is accepted, the package writes the scalar on-ice player ID columns into the play-by-play row. That includes:
- homeGoaliePlayerId and awayGoaliePlayerId
- homeSkater1PlayerId through homeSkater5PlayerId by default, with extra skater slots added only when the game needs them
- awaySkater1PlayerId through awaySkater5PlayerId by default, with extra skater slots added only when the game needs them
- ...For and ...Against columns

The base schema tracks the standard five skaters. If the HTML report
shows an extra attacker or any other overflow row, the package expands
dynamically to skater6, skater7,
skater8, and so on instead of truncating the row.
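
A minimal sketch of the dynamic expansion, assuming each event's listed skaters are held as an integer vector. The helper widen_skaters() and the sample IDs are hypothetical.

```r
widen_skaters <- function(skater_ids, prefix, min_slots = 5L) {
  # Use five slots by default and grow only when some row lists more skaters.
  n_slots <- max(min_slots, lengths(skater_ids))
  wide <- t(vapply(
    skater_ids,
    function(ids) c(ids, rep(NA_integer_, n_slots - length(ids))),
    integer(n_slots)
  ))
  colnames(wide) <- paste0(prefix, "Skater", seq_len(n_slots), "PlayerId")
  as.data.frame(wide)
}

# Hypothetical listings: the second event has an extra attacker on the ice.
home_listed <- list(
  c(1L, 2L, 3L, 4L, 5L),
  c(1L, 2L, 3L, 4L, 5L, 6L),
  c(1L, 2L, 3L, 4L)
)
home_wide <- widen_skaters(home_listed, "home")
```

Here the table gains a homeSkater6PlayerId column because one row needs it, while a game with no overflow rows would stay at five slots.
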
Two edge-case families need their own rules.
Rows with one-on-one states such as 0101 and
1010 populate only the shooter and the defending goalie.
Extra HTML-listed players on those rows are treated as report noise
because the play is supposed to be one shooter against one goalie.
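
The one-on-one rule can be sketched like this. The helper apply_one_on_one() and the sample IDs are hypothetical. The sketch assumes the same digit order as before (away goalie, away skaters, home skaters, home goalie), so "0101" is read as an away shooter against the home goalie and "1010" as a home shooter against the away goalie. Placing the shooter in the first skater slot is also an assumption.

```r
apply_one_on_one <- function(pbp) {
  skater_cols <- grep("Skater[0-9]+PlayerId$", names(pbp), value = TRUE)
  away_shoots <- pbp$situationCode == "0101"  # assumed: away shooter, home goalie
  home_shoots <- pbp$situationCode == "1010"  # assumed: home shooter, away goalie
  # Drop every HTML-listed skater on one-on-one rows, then restore the shooter.
  pbp[away_shoots | home_shoots, skater_cols] <- NA
  pbp$awaySkater1PlayerId[away_shoots] <- pbp$shootingPlayerId[away_shoots]
  pbp$homeSkater1PlayerId[home_shoots] <- pbp$shootingPlayerId[home_shoots]
  # The shooter's own goalie is not part of the play.
  pbp$awayGoaliePlayerId[away_shoots] <- NA
  pbp$homeGoaliePlayerId[home_shoots] <- NA
  pbp
}

# Hypothetical rows: one regular play and two one-on-one plays with HTML noise.
pbp <- data.frame(
  situationCode       = c("1551", "0101", "1010"),
  shootingPlayerId    = c(11L, 12L, 21L),
  homeGoaliePlayerId  = c(30L, 30L, 30L),
  awayGoaliePlayerId  = c(35L, 35L, 35L),
  homeSkater1PlayerId = c(21L, 21L, 22L),
  homeSkater2PlayerId = c(22L, 22L, NA),
  awaySkater1PlayerId = c(11L, 11L, 13L),
  awaySkater2PlayerId = c(12L, 12L, NA)
)
cleaned <- apply_one_on_one(pbp)
```
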
Some supported delayed-penalty rows do not appear in the HTML report at all. In those cases, the package can backfill the on-ice player IDs from the nearest prior populated row in the same period when:
This fixes cases where the HTML report skips the delayed-penalty marker but clearly preserves the same live-play skaters immediately before the whistle.
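
A minimal sketch of the backfill, assuming the only conditions are that the delayed-penalty row has no on-ice IDs and that an earlier populated row exists in the same period. The package applies its own conditions. The helper backfill_on_ice() and the sample rows are hypothetical.

```r
backfill_on_ice <- function(pbp, id_cols, target = "delayed-penalty") {
  for (i in which(pbp$eventTypeDescKey == target)) {
    if (!all(is.na(pbp[i, id_cols]))) next  # already populated, leave as-is
    prior <- which(
      pbp$periodNumber == pbp$periodNumber[i] &
        seq_len(nrow(pbp)) < i &
        rowSums(!is.na(pbp[id_cols])) > 0
    )
    # Copy from the nearest prior populated row in the same period, if any.
    if (length(prior) > 0L) pbp[i, id_cols] <- pbp[max(prior), id_cols]
  }
  pbp
}

# Hypothetical rows: two delayed penalties that are missing from the HTML report.
pbp <- data.frame(
  periodNumber        = c(1L, 1L, 1L, 2L),
  eventTypeDescKey    = c("shot-on-goal", "delayed-penalty", "stoppage", "delayed-penalty"),
  homeGoaliePlayerId  = c(30L, NA, 30L, NA),
  homeSkater1PlayerId = c(21L, NA, 22L, NA)
)
filled <- backfill_on_ice(pbp, c("homeGoaliePlayerId", "homeSkater1PlayerId"))
```

The second delayed-penalty row stays empty because the sketch never reaches across a period boundary for a donor row.
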
The last step is to expose the cleaned public-facing schema and hide
the internal staging details. Both gc_play_by_play() and
wsc_play_by_play() return one row per event with:
The only intentional difference is source-specific metadata such as
utc in the WSC output and GameCenter clip fields in the GC
output.
Where shift_chart() Fits In

shift_chart() is related, but it solves a different
problem. It provides shift windows, not event identities. In practical
use:

pbp <- nhlscraper::gc_play_by_play(2023030417)
shifts <- nhlscraper::shift_chart(2023030417)
pbp_with_shift_times <- nhlscraper::add_shift_times(pbp, shifts)

This is why the package keeps the HTML play-by-play report as the
primary on-ice identity source inside gc_play_by_play() and
wsc_play_by_play(), while shift_chart()
remains the right tool for shift-timing context after the play-by-play
is already built.
If you want the shortest mental model, it is this:
- the API play-by-play for the authoritative event log
- the HTML report for on-ice player identity
- situationCode for intended manpower context
- shift_chart() later when you need shift timing rather than event-level player identity

That balance is what lets the final play-by-play stay both practical and auditable.