How to prevent the number of files in a chunk from changing when submitting jobs
To process a huge number of files, there are two options:
1) start one project to process the big dataset;
2) split the big dataset into pieces and process them one by one.
I prefer the second option. However, the number of files in each chunk keeps changing.
The question is how to prevent the number of files in each chunk from changing.
#2 Updated by Paola Buitrago almost 6 years ago
- Status changed from New to Resolved
When processing a large number of files, data handling experts recommend splitting the big dataset (DS) into small chunks of about 20K files. To keep the initial big DS from changing its size (as the processing advances and output files reach SAM), there are two options:
1) Define the big dataset as a non-draining DS. Create the subsets from the big (non-dynamic) definition.
2) Define the big DS as a draining DS and take a snapshot. Create the subsets from the snapshot.
The dimension clauses used to create small definitions from a predefined big definition are:
with limit n     Limit the number of results to n
with offset n    Skip the first n results
For the full list of dimensions, see:
- samweb list-files --help-dimensions
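
As a rough sketch of how the two clauses combine, the Python helper below builds one dimensions string per chunk of about 20K files. The definition name my_big_definition, the chunk naming scheme, and the total of 50000 files are hypothetical. The clause order ("with limit" before "with offset") and the use of samweb create-definition to register each chunk are assumptions that should be checked against the samweb help output before use. The script only prints commands; it does not contact SAM.

```python
import math


def chunk_dimensions(base_dims, total_files, chunk_size=20000):
    """Return one SAM dimensions string per chunk of at most chunk_size files.

    base_dims is the dimensions string of the fixed parent dataset, for
    example "defname: my_big_definition" (hypothetical name).
    """
    n_chunks = math.ceil(total_files / chunk_size)
    dims = []
    for i in range(n_chunks):
        # Each chunk skips the files already covered by earlier chunks.
        offset = i * chunk_size
        # Assumed clause order; verify with: samweb list-files --help-dimensions
        dims.append(f"{base_dims} with limit {chunk_size} with offset {offset}")
    return dims


if __name__ == "__main__":
    # Hypothetical parent definition holding 50000 files.
    parent = "my_big_definition"
    for i, dims in enumerate(chunk_dimensions(f"defname: {parent}", 50000)):
        # Print the command instead of running it, so it can be reviewed first.
        print(f'samweb create-definition {parent}_chunk{i} "{dims}"')
```

The offsets only stay meaningful if the parent dataset does not change between chunks, which is why the chunks should be built from a non-draining definition or from a snapshot as described above.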