We want to build an application that uses asynchronous memory copies and concurrent kernel execution to overlap device transfers with kernel execution. The application should be configurable with a data block size: the amount of data that is pushed to the GPU for processing and then returned. The operation on the data can initially be very simple, such as adding one to every number. The data blocks can be randomly generated.
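
As a starting point for the host side, here is a minimal sketch of generating a random data block and checking the data returned from the GPU against a CPU reference of the "add one" operation. The element type (32-bit unsigned integers), the fixed seed, and the function names `make_random_block`, `add_one_reference`, and `matches_reference` are assumptions made for illustration; nothing above prescribes them.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fill a block of roughly `bytes` bytes with pseudo-random 32-bit words.
// The size is rounded down to a whole number of words (assumed element type).
std::vector<std::uint32_t> make_random_block(std::size_t bytes, std::uint32_t seed) {
    std::mt19937 gen(seed);  // fixed seed keeps runs reproducible
    std::vector<std::uint32_t> block(bytes / sizeof(std::uint32_t));
    for (auto& word : block) {
        word = static_cast<std::uint32_t>(gen());
    }
    return block;
}

// CPU reference for the trial operation: add one to every number.
// Unsigned arithmetic wraps around, so the maximum value becomes 0.
std::vector<std::uint32_t> add_one_reference(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] + 1u;
    }
    return out;
}

// True if the data returned from the device equals the CPU reference.
bool matches_reference(const std::vector<std::uint32_t>& sent,
                       const std::vector<std::uint32_t>& returned) {
    return returned == add_one_reference(sent);
}
```

Checking each returned block with `matches_reference` is cheap insurance that a timing run really executed the kernel and copied the result back, rather than measuring an operation that silently failed.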

Here are the results we want:

  1. report of throughput (MB/sec) for many data block sizes in the range of 100KB to 256MB. Because the range of sizes is so large, a logarithmic distribution of values will work best. If you increase the block size by a factor of 1.5 each time, you'll have 20 samples from 100KB up to about 220MB. You'll need to repeat each measurement at least a few times (5, maybe more if it is really fast) to give the results some statistical significance. For this, and for other such measurements, put the data into a simple ASCII tabular format. We want to read it with our standard data analysis program (R), and reading tabular data is easy. No fancy formatting is necessary; it would probably only make the data harder to read into R.
  2. the report in (1) for 1 through n concurrent streams. You will need to determine the maximum number of concurrent streams. It is probably something like 16.
  3. show that the memory copies to and from the device and kernel executions are actually overlapping.
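
The block-size schedule and the tabular report from item 1 can be sketched on the host side as follows. This is only an illustration under stated assumptions: the function names, the column names, the choice of 1 MB = 10^6 bytes, and the use of a space-separated layout are all hypothetical; the timing itself (and everything on the GPU side) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Geometric schedule of block sizes in bytes: first, first*factor, ...
// up to and including the largest value that does not exceed `last`.
std::vector<std::size_t> block_sizes(std::size_t first, std::size_t last, double factor) {
    std::vector<std::size_t> sizes;
    if (first == 0 || factor <= 1.0) {
        return sizes;  // guard against a schedule that would never terminate
    }
    for (double s = static_cast<double>(first); s <= static_cast<double>(last); s *= factor) {
        sizes.push_back(static_cast<std::size_t>(s));  // truncate to whole bytes
    }
    return sizes;
}

// Throughput in MB/sec, assuming 1 MB = 10^6 bytes (an assumption; state
// the convention you actually use alongside the results).
double throughput_mb_per_s(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / 1.0e6 / seconds;
}

// Header line naming the (hypothetical) columns of the table.
void print_header(std::FILE* out) {
    std::fprintf(out, "streams block_bytes repeat seconds mb_per_s\n");
}

// One measurement per line, whitespace separated, no other formatting.
void print_row(std::FILE* out, int streams, std::size_t bytes, int repeat, double seconds) {
    std::fprintf(out, "%d %zu %d %.9g %.6f\n",
                 streams, bytes, repeat, seconds, throughput_mb_per_s(bytes, seconds));
}
```

Calling `block_sizes(100 * 1024, 256 * 1024 * 1024, 1.5)` gives 20 sizes, the largest being about 227 million bytes. Writing one row per repetition, rather than a precomputed average, keeps the raw samples available for analysis; a file in this layout can be loaded in R with `read.table(file, header = TRUE)`.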

Why is this interesting and useful? We have a few things in mind:

  1. Darkside50 events are 6MB, made up of five 1.2MB fragments. It will be interesting to know whether pipelining the transfers and processing is useful at all, or at what data chunk size it becomes useful.
  2. Our plan for the processing framework is to have multiple processing channels or streams, so that many events can be active in the system at once. It may be possible to use multiple CUDA streams to accommodate multiple events at the same time.
  3. We want a benchmark program for transfers to see what our maximum processing rate is.