cluster:scoring:data_flow

  
There are 3 general rules that we try to stick to:

  - The scorer/user **never** works directly with the 'raw copy' or backup copy; instead, the data being worked with should be copied to the workspace folder.
  - Datasets should be 'checked out' to scorers, with one or more scorers using a folder on the workspace to make changes (scoring) to the data.
  - We always keep a 'raw copy' of the data in the state in which we collected or received it, and we try to maintain backups of each stage of the process the data goes through. After scoring/changing raw data, the data is **not** placed back in the raw data folder, but rather moved to the 'scored data' folder.
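The checkout/check-in workflow described by these rules can be sketched as a small helper script. This is only an illustration: the folder names `raw_data`, `workspace`, and `scored_data`, and the function names, are placeholders, not the cluster's actual paths or tooling.

```python
import shutil
from pathlib import Path

# Hypothetical folder layout; the real cluster paths may differ.
RAW = Path("raw_data")         # rule 3: raw copy, never modified
WORKSPACE = Path("workspace")  # rule 1: all scoring work happens here
SCORED = Path("scored_data")   # rule 3: finished data goes here, not back to raw

def checkout(dataset: str, scorer: str) -> Path:
    """Rule 2: copy a dataset from the raw folder into a scorer's
    workspace folder. The raw copy itself is left untouched."""
    src = RAW / dataset
    dst = WORKSPACE / scorer / dataset
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst)  # copy, never move, out of the raw folder
    return dst

def check_in(dataset: str, scorer: str) -> Path:
    """Rule 3: after scoring, move the working copy to the scored-data
    folder instead of placing it back in the raw folder."""
    src = WORKSPACE / scorer / dataset
    dst = SCORED / dataset
    SCORED.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst
```

Because `checkout` copies rather than moves, the raw folder always retains the original data even if the workspace copy is corrupted during scoring.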
  
 These 3 rules, if followed consistently and enforced, stand a good chance of limiting issues such as duplicate datasets, duplicate datasets with disparate scorings (i.e. one dataset X with scoring A, and one copy of X with scoring B), lost data, and irrecoverably corrupted data.
  • Last modified: 2024/10/27 20:28
  • by benedikt