Performance of locateFiles
How is locateFiles expected to perform with long lists of files? I was hoping for performance similar to the initial
translateConstraints that generated the list.
Is there some practical limit to the number of files requested that we should try to stay under, by splitting up one big request into multiple sequential ones?
I've had ten files complete in around a minute, which is still slower than locating them one-by-one. Thousands of files take longer than anyone has yet bothered to wait.
I now have 10 files not returning at all (after tens of minutes), though maybe there are other SAM problems currently, since I'm unable to start a project either.
#1 Updated by Christopher Backhouse over 4 years ago
SAM is healthier this morning and 10 or 100 files locate in a second or so again.
But the time does seem to increase super-linearly with the number of files passed. Locating 1000 files took around a minute.
For now I'm "chunking" my queries to get the best performance (making multiple queries for a subset of the files). If locateFiles() was linear or better the best thing to do would be to request all files at once, but I actually do better with much smaller chunks (though too many chunks and you're paying the network latency costs I was trying to avoid in the first place).
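The chunking described above can be sketched roughly as follows. This is only an illustration, not the code actually used: `locate` is a hypothetical callable standing in for whatever performs a single locateFiles query, and it is assumed to take a list of file names and return a dict keyed by file name.

```python
def chunks(items, size):
    """Yield successive slices of `items` holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def locate_in_chunks(locate, filenames, chunk_size=50):
    """Issue one query per chunk and merge the per-chunk results.

    `locate` is a hypothetical stand-in for a single locateFiles call.
    Assumption: it accepts a list of file names and returns a dict
    mapping each file name to its locations.
    """
    merged = {}
    for chunk in chunks(list(filenames), chunk_size):
        merged.update(locate(chunk))
    return merged
```

The default of 50 is just the value that happened to work best in the measurements in this ticket; it is passed as a parameter because the best value depends on where the client is running.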
Locating the 15086 files of dataset prod_mrcccaf_FA14-12-29_nd_genie_nonswap_downsampled, I get these times for different chunk sizes:

Chunk size (files/query)   Total time
200                        5m42
100                        3m23
75                         2m43
50                         2m25
25                         3m01
12                         8m35
So it looks like the sweet spot is around 50 files/query. The numbers are noisy though. I got 3m01 and 3m34 when retesting with 50 later.
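A scan like this could be automated along the following lines. It is a sketch under assumptions: `locate` is again a hypothetical callable performing one query for a list of file names, and the `clock` parameter exists only so the timing source can be swapped out. Since the numbers are noisy, each size would probably need to be timed more than once before trusting the result.

```python
import time


def time_chunk_sizes(locate, filenames, chunk_sizes, clock=time.perf_counter):
    """Return {chunk_size: elapsed_seconds} for locating all files.

    `locate` is a hypothetical stand-in for a single locateFiles call;
    its return value is ignored here, only the elapsed time is recorded.
    """
    names = list(filenames)
    timings = {}
    for size in chunk_sizes:
        start = clock()
        for i in range(0, len(names), size):
            locate(names[i:i + size])
        timings[size] = clock() - start
    return timings


def best_chunk_size(timings):
    """Pick the chunk size with the smallest measured total time."""
    return min(timings, key=timings.get)
```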
#3 Updated by Robert Illingworth over 4 years ago
This is probably at least partially due to memory overhead in the SQLAlchemy ORM layer. The upcoming new version of SQLAlchemy is supposed to reduce the overhead, so it may improve this. It would be possible to rewrite this to skip the ORM and use the DB query results more directly, but this would be quite a lot of work (and although I would expect it to improve the linearity, I'm more doubtful there would be a significant improvement over using the current implementation with the optimum chunk size).
#4 Updated by Christopher Backhouse over 4 years ago
Round trip time to the server makes a big difference to what the optimum chunk size is. I just looked up 15086 files (in chunks of 50) in 30s from Fermilab, while the same thing took 5 minutes from Caltech.
It would be nice not to have to worry about the chunking on our side, even if the server-side implementation is just to do a bunch of chunked queries and string all the results together in the response.
Presumably the optimal chunk size server-side is smaller (you have even less latency to try to overcome) and so the throughput should be higher doing it there than in our code.
#5 Updated by Robert Illingworth over 4 years ago
That's an interesting point which I hadn't considered. I don't know how you're running this, but you might benefit from using persistent HTTP connections, since that eliminates the three-way TCP handshake and the three (or even four)-way connection closedown for every query after the first. The samweb client will automatically do this if you're using the Python API and the requests library (http://docs.python-requests.org/en/latest/) is available; I don't think ifdh can do it.
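For illustration only, here is what connection reuse looks like using nothing but the Python standard library. This is not how the samweb client does it (it relies on requests, as noted above); the host, port and paths are hypothetical placeholders, and the real query URLs are not shown here.

```python
import http.client


def fetch_all(host, paths, port=80, timeout=30):
    """GET every path over one TCP connection and return the bodies.

    `host`, `port` and `paths` are placeholders, not real SAM endpoints.
    The connection is reused as long as the server keeps it alive;
    http.client reconnects transparently if the server closes it.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read in full before the connection
            # can be used for the next request.
            body = resp.read()
            if resp.status != 200:
                raise RuntimeError("GET %s failed with status %d" % (path, resp.status))
            bodies.append(body)
    finally:
        conn.close()
    return bodies
```

Combined with chunked queries, each chunk then costs roughly one round trip rather than several, which matters most when the client is far from the server.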
Breaking large lists down in the server and streaming the results back is a possibility. It'd require some code restructuring, but less than the idea of rewriting it to not use the ORM like I mentioned before.