This page describes how data flows across the scoring part of the cluster.
There are three general rules that we try to stick to:

1. The scorer never works directly with the 'raw copy' or a backup copy; data is first copied to the workspace folder and worked on there.
2. Datasets are 'checked out' to scorers, with one or more scorers using a folder on the workspace to make changes (scoring) to the data.
3. We always keep a 'raw copy' of the data, in the state in which we collected or received it, and we try to maintain backups of each stage of the process the data goes through. After raw data has been scored or changed, it is not placed back in the raw data folder; it is moved to the 'scored data' folder.
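The flow above can be sketched as a short shell script. The folder names (`raw/`, `workspace/`, `scored/`), the `DATA_ROOT` variable, and the dataset name are hypothetical placeholders, not the cluster's actual paths; this is only an illustration of the copy-then-move discipline the rules describe.

```shell
#!/bin/sh
# Sketch of the checkout/score/promote flow under the three rules.
# Assumed layout (illustrative only): $DATA_ROOT/{raw,workspace,scored}/
set -eu

DATA_ROOT="${DATA_ROOT:-/tmp/demo_data_root}"
DATASET="dataset_X"

# Demo setup: a fresh tree with one raw dataset.
rm -rf "$DATA_ROOT"
mkdir -p "$DATA_ROOT/raw/$DATASET" "$DATA_ROOT/workspace" "$DATA_ROOT/scored"
echo "sample record" > "$DATA_ROOT/raw/$DATASET/records.txt"

# Rule 1: never edit the raw copy in place. Check the dataset out by
# COPYING it into a scorer's workspace folder.
cp -R "$DATA_ROOT/raw/$DATASET" "$DATA_ROOT/workspace/$DATASET"

# Rule 2: all scoring happens on the workspace copy only.
echo "score: A" >> "$DATA_ROOT/workspace/$DATASET/records.txt"

# Rule 3: scored data goes to the 'scored data' folder, never back
# into the raw data folder. The raw copy remains untouched.
mv "$DATA_ROOT/workspace/$DATASET" "$DATA_ROOT/scored/$DATASET"

ls "$DATA_ROOT/raw" "$DATA_ROOT/scored"
```

Because the checkout is a copy and the promotion is a move, the raw copy survives unchanged and exactly one scored copy exists afterward, which is what keeps duplicate or conflicting scorings from accumulating.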
These three rules, if followed consistently and enforced, stand a good chance of limiting issues such as duplicate datasets, duplicate datasets with disparate scorings (e.g. one copy of dataset X with scoring A and another copy of X with scoring B), lost data, and irrecoverably corrupted data.