Appendix 2: NuSEDS Data Processing
This is a summary of the NuSEDS cleaning procedure and attribution of the data to Conservation Units (CUs) as defined in the Pacific Salmon Explorer (PSE). The complete details of the cleaning procedure can be found at 1_nuseds_collation.html and the corresponding Rmd script can be accessed from github/…/spawner-surveys/code/1_nuseds_collation.rmd.
This is the first step in preparing the spawner-survey data and it concerns the cleaning of NuSEDS exclusively. The subsequent steps and their associated documentation are listed at the end of this page.
Cleaning Procedure
The objective of the procedure is to obtain the yearly counts (i.e. “time series”) of each salmon population – defined as a group of salmon belonging to the same conservation unit (CU) and spawning in the same stream – and associate these populations with their CU. The NuSEDS data is separated into two datasets. The All Areas NuSEDS dataset contains the observed yearly counts (related fields: NATURAL_ADULT_SPAWNERS, NATURAL_SPAWNERS_TOTAL, etc.) for each population (related fields: SPECIES, POP_ID, POPULATION) in its respective site (related fields: AREA, GFE_ID, WATERBODY, GAZETTED_NAME, etc.), along with the associated methods used (related fields: ESTIMATE_METHOD, ESTIMATE_CLASSIFICATION, ENUMERATION_METHODS). Note that as of the 2025-11-03 NuSEDS update, the CU-related field FULL_CU_IN is now in All Areas NuSEDS as well.
The second dataset, Conservation Unit Census Sites, links each population (related fields: POP_ID) to its respective CU (related fields: CU_NAME, FULL_CU_IN, CU_LONGT, CU_LAT, etc.) and site (related fields: GFE_ID, CENSUS_SITE, X_LONGT, Y_LAT). Ideally, attributing to each time series in All Areas NuSEDS its corresponding CU and location coordinates from Conservation Unit Census Sites would simply consist of merging the two datasets using the population and location identification numbers, POP_ID and GFE_ID, respectively. Unfortunately, numerous time series are problematic, which occurs when:
A time series is present in All Areas NuSEDS but its POP_ID and GFE_ID association is absent in Conservation Unit Census Sites (there are 4447 populations in that case).
Multiple time series of the same population (i.e. same POP_ID) are observed in multiple locations (i.e. different GFE_IDs), which should not occur because a POP_ID should be defined for a unique location.
Multiple populations (i.e. different POP_ID) of a same CU are observed in the same location (i.e. same GFE_ID), suggesting that these populations should form one unique population (these POP_ID are probably related to different surveys of a same population).
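The three situations listed above can be detected mechanically. The following is a minimal Python sketch, not the project's code (the actual procedure lives in the R Markdown script). It assumes each dataset is a list of dictionaries keyed by the NuSEDS field names, and the function name find_problems is hypothetical.

```python
from collections import defaultdict

def find_problems(all_areas, census_sites):
    """Return the three kinds of problematic references (illustrative only)."""
    # (POP_ID, GFE_ID) associations known to Conservation Unit Census Sites
    known = {(r["POP_ID"], r["GFE_ID"]) for r in census_sites}
    # Assumption: each POP_ID maps to a single CU in the census sites
    cu_of_pop = {r["POP_ID"]: r["FULL_CU_IN"] for r in census_sites}

    # One time series per (POP_ID, GFE_ID) association in All Areas NuSEDS
    series = {(r["POP_ID"], r["GFE_ID"]) for r in all_areas}

    # 1. Time series whose reference is absent from the census sites
    missing = sorted(series - known)

    sites_per_pop = defaultdict(set)     # for check 2
    pops_per_cu_site = defaultdict(set)  # for check 3
    for pop_id, gfe_id in series:
        sites_per_pop[pop_id].add(gfe_id)
        cu = cu_of_pop.get(pop_id)
        if cu is not None:
            pops_per_cu_site[(cu, gfe_id)].add(pop_id)

    # 2. A POP_ID observed in several locations
    multi_site = {p: sorted(s) for p, s in sites_per_pop.items() if len(s) > 1}
    # 3. Several POP_IDs of the same CU observed in one location
    multi_pop = {k: sorted(v) for k, v in pops_per_cu_site.items() if len(v) > 1}
    return missing, multi_site, multi_pop
```

The function only reports the cases; deciding how to fix each one still requires the manual checks described in the rest of this page.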
Inspecting the problematic time series revealed inconsistencies such as missing, duplicated, and conflicting data points (i.e. different counts in the same year). The goal of the procedure is to fix these time series to rescue as many data points as possible.
Determine total count (MAX_ESTIMATE)
We first define the unique yearly count field MAX_ESTIMATE for each population as the maximum value of the count-related fields in All Areas NuSEDS, i.e., NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS and TOTAL_RETURN_TO_RIVER. MAX_ESTIMATE is the only count-related field we use in the rest of the procedure. A population’s MAX_ESTIMATE data points are referred to as its “time series”.
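As an illustration of this definition, here is a minimal Python sketch (the actual procedure is in the R Markdown script). It assumes a row is a dictionary keyed by field name and that NA is represented by None; the function name max_estimate is hypothetical.

```python
# Count-related fields of All Areas NuSEDS, as listed above
COUNT_FIELDS = [
    "NATURAL_ADULT_SPAWNERS",
    "NATURAL_JACK_SPAWNERS",
    "NATURAL_SPAWNERS_TOTAL",
    "ADULT_BROODSTOCK_REMOVALS",
    "JACK_BROODSTOCK_REMOVALS",
    "TOTAL_BROODSTOCK_REMOVALS",
    "OTHER_REMOVALS",
    "TOTAL_RETURN_TO_RIVER",
]

def max_estimate(row):
    """Row-wise maximum of the count fields, ignoring NAs (None here)."""
    values = [row[f] for f in COUNT_FIELDS if row.get(f) is not None]
    # Assumption: a row with no count at all yields NA rather than 0
    return max(values) if values else None
```

Keeping NA distinct from 0 matters for the later step that removes time series made only of NAs while keeping those that contain 0s.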
Remove duplicated and conflicting rows
There are 60 duplicated rows in All Areas NuSEDS when considering the fields related to population (SPECIES, POP_ID, POPULATION), location (GFE_ID) and counts (Year, NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS, TOTAL_RETURN_TO_RIVER, ENUMERATION_METHODS, ESTIMATE_CLASSIFICATION), most of them having NA for MAX_ESTIMATE. These rows are removed. There are no duplicated rows in Conservation Unit Census Sites.
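The duplicate check can be sketched as follows in Python (illustrative only; the actual procedure is in the R Markdown script). It assumes rows are dictionaries keyed by field name and that one copy of each duplicated row is kept, which the text above does not state explicitly. The function name drop_duplicates is hypothetical.

```python
# Fields used to decide whether two rows are duplicates, as listed above
KEY_FIELDS = [
    "SPECIES", "POP_ID", "POPULATION", "GFE_ID", "Year",
    "NATURAL_ADULT_SPAWNERS", "NATURAL_JACK_SPAWNERS",
    "NATURAL_SPAWNERS_TOTAL", "ADULT_BROODSTOCK_REMOVALS",
    "JACK_BROODSTOCK_REMOVALS", "TOTAL_BROODSTOCK_REMOVALS",
    "OTHER_REMOVALS", "TOTAL_RETURN_TO_RIVER",
    "ENUMERATION_METHODS", "ESTIMATE_CLASSIFICATION",
]

def drop_duplicates(rows, key_fields=KEY_FIELDS):
    """Keep the first occurrence of each combination of key fields."""
    seen = set()
    kept = []
    for row in rows:
        # Missing fields are treated as NA (None) when building the key
        key = tuple(row.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```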
There are two instances in All Areas NuSEDS where the same POP_ID has two different MAX_ESTIMATE values in the same Year. We keep the value corresponding to the better method (ESTIMATE_METHOD) or the most recent entry.
Find missing stream coordinates
There are nine locations (GFE_ID) without coordinates (Y_LAT, X_LONGT) in Conservation Unit Census Sites and 23 in All Areas NuSEDS (the coordinates could not be found in the other DFO files with GFE_ID that were sent to us). We define these coordinates manually using the best information available.
Remove time series only made of NAs
There are 4425 time series in All Areas NuSEDS that only have NAs for MAX_ESTIMATE. The corresponding 102,104 rows (24.4%) are removed (the time series with both NAs and 0s are kept).
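A minimal Python sketch of this filter follows (illustrative only; the actual procedure is in the R Markdown script). It assumes rows are dictionaries, that NA is represented by None, and that a time series is identified by its POP_ID and GFE_ID association. The function name drop_all_na_series is hypothetical.

```python
def drop_all_na_series(rows):
    """Remove every row of a time series whose MAX_ESTIMATE is NA in all years."""
    # Collect the series that have at least one non-NA value (0 counts as a value)
    has_value = set()
    for r in rows:
        if r["MAX_ESTIMATE"] is not None:
            has_value.add((r["POP_ID"], r["GFE_ID"]))
    # Keep all rows of those series, including their NA years
    return [r for r in rows if (r["POP_ID"], r["GFE_ID"]) in has_value]
```

Testing against None rather than truthiness is what keeps the series made of NAs and 0s.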
Time series not in Conservation Unit Census Sites
There are 264 time series in All Areas NuSEDS whose reference (i.e. POP_ID and GFE_ID association) is not in Conservation Unit Census Sites. Among those, only 53 have a CU_NAME and FULL_CU_IN, which we use to find the corresponding PSE’s cuid and cu_name_pse. For the remaining 211 time series without a CU_NAME and FULL_CU_IN (corresponding to 208 POP_ID), we first find their cuid and cu_name_pse by intersecting their stream coordinates (X_LONGT, Y_LAT) with the CUs’ shapefiles used in the PSE. When more than one CU layer is intersected (for a same species), we use the information in POPULATION and WATERBODY to manually select the correct CU. Once their cuid is found, we can find their CU_NAME and FULL_CU_IN.
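The spatial intersection can be pictured with a simple point-in-polygon test. The sketch below is in Python and uses the standard ray-casting method on a single ring of (longitude, latitude) vertices; it is only an illustration, since the actual procedure relies on the PSE shapefiles and R spatial tooling, and real CU boundaries may be multi-part polygons with holes. The names point_in_polygon and candidate_cus and the structure of cu_polygons are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def candidate_cus(x_longt, y_lat, species, cu_polygons):
    """Return the cuid of every CU of that species containing the point."""
    return [
        cu["cuid"]
        for cu in cu_polygons
        if cu["species"] == species
        and point_in_polygon(x_longt, y_lat, cu["polygon"])
    ]
```

When candidate_cus returns more than one cuid, the choice falls back to the manual inspection of POPULATION and WATERBODY described above.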
After the procedure, there remain (1) five populations for which we found their cuid and cu_name_pse but did not find the corresponding CU_NAME and FULL_CU_IN, and (2) two time series with a CU_NAME and FULL_CU_IN for which we could not find a cuid and cu_name_pse.
The reference of these time series is then added to Conservation Unit Census Sites.
Find the cuid and cu_name_pse of the remaining time series
We now find the cuid and cu_name_pse of all the remaining time series using FULL_CU_IN and CU_NAME in both All Areas NuSEDS and Conservation Unit Census Sites.
After the procedure, there remain (1) the five time series with a cuid and cu_name_pse but no CU_NAME and FULL_CU_IN (the ones mentioned in the section above) and (2) 86 time series (corresponding to 22 FULL_CU_IN and CU_NAME) for which we could not find a cuid and cu_name_pse. These series are kept at this stage.
Cases where a CU has multiple time series in a single location
There are 79 instances where multiple time series of a single CU are associated with one location (GFE_ID). Checking all these cases reveals clearly duplicated data points or single data points that are not worth keeping. To fix these issues we proceed as follows:
Case 1: one of the duplicated series has only one data point:
if it is complementary: merge to the other (longer) series
if it is in conflict or a duplicate: remove the focal series
Case 2: the shorter series is 100% duplicated: remove the focal series
Case 3: for the rest of the duplicated series:
points that are conflicting or duplicated are summed up
points that are complementary are merged
In the few cases where conflicting data points are summed, we assume that the different runs (e.g., “Chinook Run 1” and “Chinook Run 2”) can be considered a single population. For example, the Bridge River has “Summer” and “Late run” sockeye surveys, but these are both the MIDDLE FRASER river-type sockeye CU.
In the few instances where data points are summed, we define the ESTIMATE_CLASSIFICATION (e.g., “RELATIVE ABUNDANCE (TYPE-3)”) as the value corresponding to the highest MAX_ESTIMATE value between the two data points.
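The three cases can be summarised in a short Python sketch (illustrative only; in practice every instance was checked by hand in the R Markdown script). It assumes each time series is a dictionary mapping a year to a non-NA MAX_ESTIMATE and handles counts only, not ESTIMATE_CLASSIFICATION. The function name combine_series is hypothetical.

```python
def combine_series(a, b):
    """Combine two time series of the same CU observed at the same location."""
    short, long_ = sorted((a, b), key=len)
    shared_years = set(short) & set(long_)

    # Case 1: the shorter series has a single data point
    if len(short) == 1:
        if not shared_years:
            return {**long_, **short}  # complementary: merge into the longer series
        return dict(long_)             # conflicting or duplicate: drop it

    # Case 2: the shorter series is entirely duplicated in the longer one
    if all(y in long_ and long_[y] == v for y, v in short.items()):
        return dict(long_)

    # Case 3: sum the conflicting or duplicated points, merge complementary ones
    combined = dict(long_)
    for year, value in short.items():
        combined[year] = combined.get(year, 0) + value
    return combined
```

For example, combining {2001: 3, 2004: 2} with {2001: 7, 2002: 9, 2003: 1} sums the two 2001 values and adds 2004 as a new year.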
Cases where a population has time series in multiple locations
There are 37 instances where a single POP_ID is associated with multiple locations (GFE_ID). As in the previous section, checking all these cases reveals inconsistencies in the data. For instance, the POPULATION “Fennel Creek Early Summer Sockeye” (POP_ID = 3416) has two complementary time series, one in the WATERBODY “FENNEL CREEK AND SAKUM CREEK” (GFE_ID = 2746) and one in “FENNEL CREEK” (GFE_ID = 261), and these two locations have the same coordinates (Y_LAT and X_LONGT). In such cases, we merge the two series by replacing the location-related fields (i.e. WATERBODY, CENSUS_SITE, GFE_ID, etc.) of one time series with the values of the other time series in All Areas NuSEDS. We only make changes to the data when the issue is obvious and the appropriate information is available to make the correction.
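Merging two complementary series by overwriting location fields can be sketched as follows in Python (illustrative only; the actual procedure is in the R Markdown script). Rows are assumed to be dictionaries, and the function name relocate_series is hypothetical. Which of the two locations is retained is decided case by case and is not stated here, so the direction used in the usage example is an arbitrary assumption.

```python
def relocate_series(rows, pop_id, from_gfe_id, to_location):
    """Give the rows of one time series the location fields of another."""
    out = []
    for r in rows:
        if r["POP_ID"] == pop_id and r["GFE_ID"] == from_gfe_id:
            # Overwrite only the location-related fields; counts are untouched
            r = {**r, **to_location}
        out.append(r)
    return out

# Usage with the Fennel Creek example (direction chosen arbitrarily here)
rows = [
    {"POP_ID": 3416, "GFE_ID": 2746, "Year": 1990,
     "WATERBODY": "FENNEL CREEK AND SAKUM CREEK"},
    {"POP_ID": 3416, "GFE_ID": 261, "Year": 1991, "WATERBODY": "FENNEL CREEK"},
]
merged = relocate_series(
    rows, 3416, 2746, {"GFE_ID": 261, "WATERBODY": "FENNEL CREEK"}
)
```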
Additional corrections for the Northern Transboundary
PSF formed a technical working group (TWG) specifically to compile data for the Transboundary region (cf. transboundary-data). As part of the work the following modifications were requested:
TATSAMENIE RIVER coho (POP_ID = 45152) for 1994 and earlier changed to TATSATUA RIVER (POP_ID = 45154) and remaining records 1995+ are removed
any records of POP_ID = 45151 (Tatsamenie River lake-type sockeye) for 1994 or earlier get changed to POP_ID = 45153 (TATSATUA RIVER river-type sockeye)
one record with POP_ID = 45165 (Chinook Run 2) change to 45164 (Chinook Run 1) in Nahlin river
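The first two modifications are year-based rules and can be sketched in Python as below (illustrative only; the actual corrections are in the project's R code). Rows are assumed to be dictionaries with POP_ID and Year, and only POP_ID is changed here, whereas the real correction presumably also updates the associated population and location fields. The third modification is left out because the year of the record concerned is not given above. The function name apply_transboundary_fixes is hypothetical.

```python
def apply_transboundary_fixes(rows):
    """Apply the two year-based POP_ID changes requested by the TWG."""
    fixed = []
    for r in rows:
        pop_id, year = r["POP_ID"], r["Year"]
        if pop_id == 45152:
            if year >= 1995:
                continue  # Tatsamenie River coho records from 1995 on are removed
            r = {**r, "POP_ID": 45154}  # 1994 and earlier: Tatsatua River coho
        elif pop_id == 45151 and year <= 1994:
            r = {**r, "POP_ID": 45153}  # 1994 and earlier: Tatsatua River sockeye
        fixed.append(r)
    return fixed
```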
Merge the two datasets
We merge All Areas NuSEDS and Conservation Unit Census Sites and the resulting file is exported.
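A minimal Python sketch of such a merge on POP_ID and GFE_ID is given below (illustrative only; the actual merge is done in the R Markdown script). It assumes each dataset is a list of dictionaries, that each POP_ID and GFE_ID association appears once in Conservation Unit Census Sites, and that every All Areas NuSEDS row is kept even without a match (a left join), which is an assumption. The function name merge_datasets is hypothetical.

```python
def merge_datasets(all_areas, census_sites):
    """Attach the CU and site fields to each yearly count row."""
    # Index the census sites by their (POP_ID, GFE_ID) reference
    sites = {(s["POP_ID"], s["GFE_ID"]): s for s in census_sites}
    merged = []
    for row in all_areas:
        site = sites.get((row["POP_ID"], row["GFE_ID"]), {})
        # Fields present in both datasets keep the All Areas NuSEDS value
        merged.append({**site, **row})
    return merged
```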
Next steps
Several subsequent scripts process the spawner-survey data before they are ready for the PSE:
2_nuseds_cuid_pse.rmd: to do additional corrections of time series, modifications related to the PSE; for all details: 2_nuseds_cuid_pse.html; the different versions of the dataset can be downloaded from zenodo/…/2_nuseds_cuid_streamid.
3_data_extra_Reynolds_lab.rmd: to include the Reynolds’s Lab data for several populations in the Central Coast; for all details: 3_data_extra_Reynolds_lab.html.
yukon-data: to compile additional data for all the CUs.
columbia-data: to compile additional data for the lake sockeye “Osoyoos” CU.
transboundary-data: to compile additional data and corrections of data from NuSEDS for multiple CUs
central-coast-data: to compile additional data for the lake sockeye “South Atnarko Lake” CU.
steelhead-data: to compile the data for all CUs across all regions.
4_datasets_for_PSE.R: to combine all the datasets generated in the different repositories above.