Studies on DAQ transfer rate from LArTF to dCache in FCC
DAQ Event Builder Configuration and Transfer Speeds
The MicroBooNE DAQ is designed with the Event Builder machine (ubdaq-prod-evb) as the server where data from all detector components is accumulated and built into the event structure. The events are then streamed onto a volume (/data/uboonedaq/rawdata), which is a 33 TB RAID-6 volume on a 3Ware 9750-8i RAID controller. Once the data has been written to this volume, the files are registered with the PUBS database for metadata generation and checksum calculation, and are then transferred to dCache and permanent tape storage at Feynman Computing Center (FCC).
The current event size read out from the MicroBooNE detector is ~34 Megabytes (MB) per event. When the Booster is delivering beam to the BNB Horn at a rate of 5 Hz, this translates to a data rate of 170 MB/s when no online filtering is being performed. With the additional readout from the NuMI beam, external triggers, and Muon Counter System triggers, however, the readout rate was consistently found to be in excess of 5.8 Hz for the month of December 2015. This leads to a data rate in excess of 200 MB/s.
While the 3Ware 9750 RAID controller is listed as having greater than 500 MB/s read and write performance with a single stream, real-time performance has shown that the RAID controller cannot keep up with simultaneous writing and reading at DAQ readout rates in excess of ~5.3 Hz. In addition, when the RAID controller performs its weekly verify, the performance of the RAID degrades significantly and the read rate for transfers to dCache drops to 150 MB/s. This leaves a potential backlog of 50 MB/s of data that is not transferred from the Event Builder to storage during the verify.
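The data rates quoted above follow from multiplying the event size by the readout rate. The short Python sketch below reproduces that arithmetic. The function names are hypothetical and chosen only for illustration, and the event size is the approximate ~34 MB figure from the text, so the results are estimates rather than measurements.

```python
# Back-of-the-envelope data rates from the figures quoted in the text.
# Function names are illustrative only; they are not part of the DAQ or PUBS.

EVENT_SIZE_MB = 34.0  # approximate event size quoted in the text


def data_rate_mb_per_s(readout_hz, event_size_mb=EVENT_SIZE_MB):
    """Raw data rate written by the Event Builder, in MB/s."""
    return readout_hz * event_size_mb


def backlog_mb_per_s(write_mb_per_s, transfer_mb_per_s):
    """Rate at which untransferred data accumulates (never negative)."""
    return max(0.0, write_mb_per_s - transfer_mb_per_s)


bnb_only = data_rate_mb_per_s(5.0)      # BNB at 5 Hz
december = data_rate_mb_per_s(5.8)      # readout at exactly 5.8 Hz
verify_backlog = backlog_mb_per_s(200.0, 150.0)  # during a RAID verify

print(f"BNB only: {bnb_only:.1f} MB/s")
print(f"5.8 Hz:   {december:.1f} MB/s")
print(f"Backlog during verify: {verify_backlog:.1f} MB/s")
```

At exactly 5.8 Hz the product is slightly below 200 MB/s; readout rates above 5.8 Hz, or an event size slightly above 34 MB, push it past that figure.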
Determining the maximum readout rate of the RAID controller
Much like traffic on a highway, the performance of the RAID controller does not degrade linearly with the readout rate of the DAQ. Instead, there is a threshold above which the RAID is unable to deliver data fast enough for transfer to permanent storage. Using Ganglia plots, we can compare the readout rate of the DAQ with the slope of the /data volume occupancy. Looking at the plots for these two quantities at the one-day scale (Dec 15, 2015) and the one-week scale (Dec 8-15, 2015) gives a measure of where this threshold occurs.
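The comparison described above can be sketched in Python: compute the slope of the disk usage between samples, then look for the readout rate at which the slope changes sign. This is only a minimal illustration. The function names are hypothetical, and the sample values are made-up numbers, not values read from the Ganglia plots.

```python
# Sketch: estimate the readout-rate threshold from the sign of the
# disk-usage slope. All sample values below are synthetic.


def occupancy_slopes(times_s, used_tb):
    """Finite-difference slope of disk usage (TB/s) between samples."""
    slopes = []
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        slopes.append((used_tb[i] - used_tb[i - 1]) / dt)
    return slopes


def estimate_threshold_hz(rates_hz, slopes):
    """Midpoint between the highest rate at which the volume drains
    and the lowest rate at which it fills; None if either is missing."""
    draining = [r for r, s in zip(rates_hz, slopes) if s < 0]
    filling = [r for r, s in zip(rates_hz, slopes) if s > 0]
    if not draining or not filling:
        return None
    return 0.5 * (max(draining) + min(filling))


# Synthetic hourly samples (illustrative only).
times_s = [0, 3600, 7200, 10800, 14400]
used_tb = [10.0, 10.2, 10.3, 10.25, 10.1]
rates_hz = [6.0, 5.5, 5.1, 4.5]  # mean readout rate in each interval

slopes = occupancy_slopes(times_s, used_tb)
threshold = estimate_threshold_hz(rates_hz, slopes)
print(threshold)
```

With noisy real data the draining and filling rate ranges may overlap, in which case a simple midpoint is not meaningful and the intervals would need to be averaged or binned first.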
Disk usage vs readout rate Dec 15, 2015
These two plots show the disk usage and the readout rate in Hz over 24 hours. Note that the percentage label on the disk usage is incorrect, but the trend and values are valid. The small-scale saw-tooth pattern is a consequence of the PUBS workflow: the data is streamed continuously, but files are only deleted from the /data volume periodically, since their transfer is validated in batches of 100 files. The readout rate varies from as low as 4.0 Hz up to 6.6 Hz. There are several times during the day when the DAQ is stopped and there are large drops in the disk usage (e.g. midnight, 05:00, 06:30, and 09:00). More interesting is that the volume usage changes from increasing to decreasing when the readout rate drops below 5.3 Hz.
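The origin of the saw-tooth pattern can be illustrated with a toy model in Python. It assumes, purely for simplicity, that one file is written and one file finishes its transfer in each time step, and that transferred files are removed only once a full batch of 100 has accumulated. The function name and the step-based timing are assumptions for illustration and do not represent the actual PUBS implementation.

```python
# Toy model of the saw-tooth in disk usage caused by batched deletion.
# Assumes transfers keep pace with writing exactly (one file per step).

BATCH_SIZE = 100  # files validated and deleted together, per the text


def simulate_file_count(n_steps, batch_size=BATCH_SIZE):
    """Number of files on disk after each step."""
    on_disk = 0
    awaiting_delete = 0
    history = []
    for _ in range(n_steps):
        on_disk += 1           # the DAQ writes a new file
        awaiting_delete += 1   # one file finishes its transfer
        if awaiting_delete >= batch_size:
            on_disk -= awaiting_delete  # batch validated, delete it
            awaiting_delete = 0
        history.append(on_disk)
    return history


history = simulate_file_count(250)
print(max(history), history[98], history[99], history[-1])
```

The file count climbs steadily and then drops each time a batch is deleted, which is the same shape seen at small scale in the disk usage plot.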
Disk usage vs readout rate Dec 8-15, 2015
With the finer detail from the 24-hour plot of readout rate and /data volume usage in hand, it becomes easier to identify the DAQ readout rate threshold at which the RAID controller can no longer deliver enough read bandwidth. The initial spike that rapidly decreases on Wednesday at ~04:00 is due to a beam downtime and the high read bandwidth available when data is not being written by the DAQ. An additional piece of information is that files are not deleted when the /data volume usage is below 10% occupancy. This is seen during the "flat" usage period from Thursday midday until Friday evening. During this time the readout was rather inconsistent, so the transfers were able to catch up during stoppages in data taking. During steady-state readout above 5.3 Hz, however, the occupancy always increases. By contrast, at Monday midday the readout was steady at 5.2 Hz and the occupancy was steady and above 10%. This means that the transfer was fast enough to keep pace with data taking, but not so fast as to clear out the files still waiting to be transferred.
Transfer rate during RAID verify Dec 21, 2015
The RAID array is configured from the factory to perform a verify of the RAID-6 parity information every Friday at midnight. This verify puts some load on the I/O performance of the RAID array, and while the array is configured to minimize the effect on data I/O, the effect can still be seen in the transfer rate from LArTF to FCC. The plot above shows the transfer rate during normal data taking with the DAQ readout at 5.8 Hz. The transfer rate is ~1.6 Gigabit/s (~200 MB/s) and does not quite keep pace with the readout at ~203 MB/s. The RAID verify was manually started at 11:15 am, and the almost immediate drop in the transfer speed from 1.6 Gb/s to 1.2 Gb/s can be seen. There is one small spike during this time that is associated with a run transition. With this degraded performance, the /data volume is expected to fill at a rate of approximately 50 MB/s, and with a total volume of 33 TB it would be completely full in roughly 7 days. Fortunately, the verify only takes approximately 2.5 days, but when it completes the /data volume would be 50% full, leaving only 26 hours of disk buffer if the network connection to FCC were to be interrupted.
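The unit conversions and the fill-time estimate above can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 TB = 10^6 MB) and a constant net fill rate of 50 MB/s. The function names are hypothetical. The sketch only computes the data added during the verify; the occupancy at the end of the verify also depends on how much data was already on the volume, which is not modeled here.

```python
# Sketch of the fill-time arithmetic for the /data volume during a verify.
# Decimal units are assumed throughout (1 TB = 1e6 MB, 1 Gb = 1000 Mb).

VOLUME_TB = 33.0
MB_PER_TB = 1.0e6
SECONDS_PER_DAY = 86400.0


def gbit_to_mbyte_per_s(gbit_per_s):
    """Convert a network rate in Gigabit/s to Megabyte/s."""
    return gbit_per_s * 1000.0 / 8.0


def days_to_fill(volume_tb, net_fill_mb_per_s):
    """Days needed to fill an empty volume at a constant net fill rate."""
    return volume_tb * MB_PER_TB / net_fill_mb_per_s / SECONDS_PER_DAY


def tb_added(net_fill_mb_per_s, days):
    """Data accumulated on the volume over the given number of days."""
    return net_fill_mb_per_s * days * SECONDS_PER_DAY / MB_PER_TB


normal_rate = gbit_to_mbyte_per_s(1.6)   # transfer rate before the verify
verify_rate = gbit_to_mbyte_per_s(1.2)   # transfer rate during the verify
fill_days = days_to_fill(VOLUME_TB, 50.0)
added_tb = tb_added(50.0, 2.5)

print(f"Normal transfer rate: {normal_rate:.0f} MB/s")
print(f"Verify transfer rate: {verify_rate:.0f} MB/s")
print(f"Days to fill 33 TB at 50 MB/s: {fill_days:.1f}")
print(f"Data added during a 2.5 day verify: {added_tb:.1f} TB")
```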
Quote for an improved RAID array and server for the Event Builder
This quote is from KOI for a machine with 36 HDDs (4 TB, 7200 rpm), which would increase the estimated disk I/O by a factor of 2 and the storage capacity of the Event Builder by a factor of 2. The price can easily be lowered by more than $2400 by installing 24 HDDs instead of 36.