Appendix 2: NuSEDS Data Processing

The R script containing the code for the NuSEDS data processing procedure described below is available on github.

Cleaning Procedure

The objective of the procedure is to obtain the yearly counts of each salmon populations in their respective stream. The NuSEDS data is separated in two datasets. The all_areas_nuseds dataset contains the observed yearly counts (related fields: NATURAL_ADULT_SPAWNERS, NATURAL_SPAWNERS_TOTAL, etc.) for each population (related fields: SPECIES, POP_ID, POPULATION) in their respective locations (related fields: AREA, GFE_ID, WATERBODY, GAZETTED_NAME, etc.), along with the associated methods used (related fields: ESTIMATE_METHOD, ESTIMATE_CLASSIFICATION, ENUMERATION_METHODS). The conservation_unit_system_sites dataset links each population (related fields: POP_ID, population) to their respective Conservation Unit (related fields: CU_NAME, FULL_CU_IN, CU_LONGT, CU_LAT, etc.) and streams (related fields: GFE_ID, SYSTEM_SITE, X_LONGT, Y_LAT). Ideally the two datasets could be merged using the population and location identification number POP_ID and GFE_ID, respectively. Unfortunately, numerous time series (i.e. a series of yearly counts for a given population in a given location) are problematic, which occurs when:

  • A time series is present in all_areas_nuseds but its POP_ID and GFE_ID association is absent in conservation_unit_system_sites.

  • Multiple time series of a same population (i.e. same POP_ID) are observed in multiple locations (i.e. different GFE_IDs), which should not occur because the a POP_ID should be defined for a unique location.

The observation of problematic time series revealed inconsistencies such as missing, duplicated, and conflicting data points.

In order to rescue as much data as possible, we attempted to solve these issues by either (i) deleting a problematic series, (ii) combining or (iii) summing it to an alternative series (i.e., a series present in the focal dataset that shares either the same POP_ID or GFE_ID and species), or (iv) creating a new reference to the time series in conservation_unit_system_sites. In each case, we used all the information available in the different fields and made assumptions based on our professional judgement. Below are more details about the different interventions applied to problematic time series:

  • We deleted a time series (or a portion of it) from all_areas_nuseds when (1) it was a duplicate of another time series, (2) it had less than four data points and was not spotted during the cleaning process, or (3) the lack of information prevented to combine or merge it to an alternative series or to create a new series in conservation_unit_system_sites (e.g. if run timing was unknown).

  • We combined a time series with an alternative series when data points where not in conflict (i.e. different value for a same year) and the information available in the other fields was not prescriptive (e.g. the population belong to the same CU, the locations have similar names or spatial coordinates).

  • We summed a time series to an alternative series having different POP_ID but same GFE_ID and CU_NAME and with conflicting data points.

  • We created a new reference of the problematic time series in conservation_unit_system_sites (i.e. a new row with a unique POP_ID - GFE_ID combination) when it could not be deleted, merged or combined to an alternative series.

The diverse and idiosyncratic nature of these problematic cases prevented the establishment of a systematic and simple cleaning process. Many of these cases were resolved individually by observing and comparing time series and accounting for the information available. We proceeded in consecutive steps in which we applied the different types of intervention mentionned above, either after visually inspecting individual time series or by applying systematic procedure. These steps are detailed below:

  1. Remove the time series (i.e. unique POP_ID - GFE_ID associations) with only NAs and/or 0s in all_areas_nuseds (n = 4491; 25%)

  2. Assess all the references to time series present in conservation_unit_system_sites but absent in all_areas_nuseds (n = 259)

2.2. Cases with presence of alternative series in NUSEDS (n = 4)

  • Each case was assessed visually.

2.3. Cases with no alternative time series in all_areas_nuseds (n = 254)

  • The rows corresponding to these series are removed from conservation_unit_system_sites.

2.4. Case with multiple GFE_ID for a single POP_ID (n = 1)

  • The problematic series was simply removed from conservation_unit_system_sites.
  1. Case of time series with a same POP_ID but different GFE_IDs in conservation_unit_system_sites and present in all_areas_nuseds (n = 1).
  • Two time series had the same POP_ID but different GFE_ID. Both series were kept because they are not compatible and have many and recent data points.
  1. Assess all the time series present in all_areas_nuseds but absent in conservation_unit_system_sites (n = 171)

4.1. Remove series with \(\leqslant\) 3 data points (n = 94)

4.2. Cases of series with alternative POP_ID and no alternative GFE_ID (n = 6)

  • Each case was assessed visually.

4.3. Cases of series with alternative GFE_ID with or without alternative POP_ID (n = 34)

  • Problematic series with the same POP_IDs (e.g. series with POP_ID = 47925 and GFE_ID = 15, 31516 ad 31740, respectively) having the same alternative series (e.g. series with POP_ID = 47925 and GFE_ID = 14) are assessed together.

4.4. Cases with no alternative series (n = 37)

  • These series were removed from all_areas_nuseds.

  • In future we could investigate and attempt to rescue the series with > 10 data points.

  1. Merge all_areas_nuseds and conservation_unit_system_sites.

  2. Assess time series with a same CU_NAME and GFE_ID (n = 62).

  • Case 1: there is only one data point in one of the series; if it is compatible, then merge to the alternative series, otherwise remove it from all_areas_nuseds.

  • Case 2: One of the series is 100% duplicated with the other one: to remove from all_areas_nuseds.

  • Case 3: the rest of the series are merged; adding the values when data point are in conflict.

Estimation of population counts

The observed count for each series is the maximum value of the following fields in all_areas_nuseds: NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS and TOTAL_RETURN_TO_RIVER.

Coordinates adjustment

There are 41 locations (with a unique GFE_ID) with identical coordinates in conservation_unit_system_sites. This is because the Y_LAT and X_LONGT field does not corresponds to the survey locations but to the “location of the mouth of the waterbody if flowing, or the centroid if not”. Consequently, all locations within a water body have the same coordinates. In large rivers (e.g. Thompson river), such imprecision can shift the coordinates 100s of km away from the true assessment location. We conseqeuntly manually adjusted the coordiantes of certain locations to prevent duplication and to improve spatial accuracy.