Bug #7671

Performance of locateFiles

Added by Christopher Backhouse almost 5 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
Start date: 01/22/2015
Due date:
% Done: 0%
Estimated time:
Duration:

Description

How is locateFiles expected to perform with long lists of files? I was hoping for performance similar to the initial translateConstraints that generated the list.

Is there some practical limit to the number of files requested that we should try to stay under (by splitting one big request into multiple sequential ones)?

I've had ten files complete in around a minute, which is still slower than locating them one-by-one. Thousands of files take longer than anyone has yet bothered to wait.

I now have 10 files not returning at all (tens of minutes). Though maybe there are other SAM problems currently, since I'm unable to start a project either.

I'm querying
http://samweb.fnal.gov:8480/sam/nova/api/files/locations

with postdata

file_name=neardet_genie_fhc_nonswap_post-shutdown-geantonly_2000_r00010565_s38_FA14-12-29_v1_20141211_145203_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010391_s1637_FA14-12-29_v1_20141028_103044_fnpc6015.fnal.gov_1414605728_31048_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_post-shutdown-geantonly_2000_r00010571_s63_FA14-12-29_v1_20141211_145203_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010403_s665_FA14-12-29_v1_20141028_103044_fcdfcaf3021.fnal.gov_1414626059_26619_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010403_s578_FA14-12-29_v1_20141028_103044_fnpc4006.fnal.gov_1414727120_1733_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010404_s615_FA14-12-29_v1_20141028_103044_fcdfcaf3050.fnal.gov_1414628403_5854_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010377_s1340_FA14-12-29_v1_20141028_103044_fnpc4041.fnal.gov_1414766978_35529_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010401_s439_FA14-12-29_v1_20141028_103044_fcdfcaf3007.fnal.gov_1414697670_3110_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_2000_r00010377_s3443_FA14-12-29_v1_20141028_103044_fcdfcaf3032.fnal.gov_1414622970_28547_0_sim.mrcccaf.root&file_name=neardet_genie_fhc_nonswap_post-shutdown-geantonly_2000_r00010555_s127_FA14-12-29_v1_20141211_145203_sim.mrcccaf.root&format=json

History

#1 Updated by Christopher Backhouse almost 5 years ago

SAM is healthier this morning and 10 or 100 files locate in a second or so again.

But the time does seem to increase super-linearly with the number of files passed. Locating 1000 files took around a minute.

For now I'm "chunking" my queries to get the best performance (making multiple queries, each for a subset of the files). If locateFiles() were linear or better, the best thing to do would be to request all files at once, but I actually do better with much smaller chunks (though with too many chunks you're paying the network latency costs I was trying to avoid in the first place).

Locating the 15086 files of dataset prod_mrcccaf_FA14-12-29_nd_genie_nonswap_downsampled I get these times for different chunk sizes:

Chunk size (files/query)   Total time
200                        5m42
100                        3m23
 75                        2m43
 50                        2m25
 25                        3m01
 12                        8m35

So it looks like the sweet spot is around 50 files/query. The numbers are noisy though. I got 3m01 and 3m34 when retesting with 50 later.
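
The chunking described above can be sketched roughly as follows. This is only an illustration, not the code used for the timings: the helper names chunked and build_postdata are made up here, and the default of 50 simply reflects the sweet spot measured in this comment. The postdata layout (repeated file_name fields followed by format=json) follows the request shown in the description.

```python
from urllib.parse import urlencode


def chunked(names, size=50):
    """Yield successive slices of `names`, each at most `size` long.

    `chunked` is a hypothetical helper; 50 is the sweet spot reported
    in this thread, not a documented server limit.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


def build_postdata(chunk):
    """Encode one chunk as form data: repeated file_name fields, then format=json."""
    fields = [("file_name", name) for name in chunk]
    fields.append(("format", "json"))
    return urlencode(fields)


if __name__ == "__main__":
    files = ["file_%d_sim.mrcccaf.root" % i for i in range(5)]
    for chunk in chunked(files, size=2):
        print(build_postdata(chunk))
```

Each string produced by build_postdata would then be sent as the body of one POST to the locations URL.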

#2 Updated by Marc Mengel over 4 years ago

  • Project changed from IF Data Handling Client Tools (ifdhc) to SAM Web services

Reassigning this to SAM Web services; I don't think the client side is affecting the performance here.

#3 Updated by Robert Illingworth over 4 years ago

This is probably at least partially due to memory overhead in the SQLAlchemy ORM layer. The upcoming new version of SQLAlchemy is supposed to reduce the overhead, so it may improve this. It would be possible to rewrite this to skip the ORM and use the DB query results more directly, but this would be quite a lot of work (and although I would expect it to improve the linearity, I'm more doubtful there would be a significant improvement over using the current implementation with the optimum chunk size).

#4 Updated by Christopher Backhouse over 4 years ago

Round trip time to the server makes a big difference to what the optimum chunk size is. I just looked up 15086 files (in chunks of 50) in 30s from Fermilab, while the same thing took 5 minutes from Caltech.
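
As a rough back-of-the-envelope check on those two numbers, the average wall-clock time per query can be worked out from the totals. The function name below is made up, and the calculation assumes the queries are issued strictly one after another, which is an assumption about the client rather than something stated here.

```python
import math


def average_seconds_per_query(total_files, chunk_size, total_seconds):
    """Return (number of queries, average wall-clock seconds per query).

    Assumes sequential queries, so total time = queries * average time.
    """
    queries = math.ceil(total_files / chunk_size)
    return queries, total_seconds / queries


if __name__ == "__main__":
    # 15086 files in chunks of 50: 30 s from Fermilab, 5 minutes from Caltech.
    for site, seconds in (("Fermilab", 30.0), ("Caltech", 300.0)):
        queries, avg = average_seconds_per_query(15086, 50, seconds)
        print("%s: %d queries, about %.2f s each" % (site, queries, avg))
```

Both cases use 302 queries, so the difference is roughly 0.1 s versus 1 s per query, which includes both server time and network round trips.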

It would be nice not to have to worry about the chunking on our side, even if the server-side implementation just does a bunch of chunked queries and strings all the results together in the response.

Presumably the optimal chunk size server-side is smaller (you have even less latency to try to overcome) and so the throughput should be higher doing it there than in our code.

#5 Updated by Robert Illingworth over 4 years ago

That's an interesting point which I hadn't considered. I don't know how you're running this, but you might benefit from using persistent HTTP connections, since that eliminates the three-way TCP handshake and the three (or even four)-way connection closedown for every query after the first. The samweb client will automatically do this if you're using the Python API and the requests library (http://docs.python-requests.org/en/latest/) is available; I don't think ifdh can do it.
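
For illustration, here is a minimal sketch of reusing one connection for many chunked queries. It uses the standard library's http.client rather than the requests library mentioned above, purely so that the sketch has no dependencies; it is not how the samweb client does it. The function name is made up, and it assumes the JSON reply is an object keyed by file name, which has not been checked against the real service.

```python
import http.client
import json
from urllib.parse import urlencode


def locate_over_one_connection(conn, path, chunks):
    """POST each chunk over the same connection and merge the JSON replies.

    `conn` is an http.client.HTTPConnection (or anything with the same
    request/getresponse methods). Assumes each reply is a JSON object
    keyed by file name (an unverified assumption about the service).
    """
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    merged = {}
    for chunk in chunks:
        fields = [("file_name", name) for name in chunk] + [("format", "json")]
        conn.request("POST", path, body=urlencode(fields), headers=headers)
        response = conn.getresponse()
        payload = response.read()  # drain the reply so the connection can be reused
        if response.status != 200:
            raise RuntimeError("query failed with HTTP status %d" % response.status)
        merged.update(json.loads(payload))
    return merged


# Possible usage, with host, port and path taken from the URL in the description:
#   conn = http.client.HTTPConnection("samweb.fnal.gov", 8480, timeout=60)
#   locations = locate_over_one_connection(
#       conn, "/sam/nova/api/files/locations", [["a.root", "b.root"], ["c.root"]])
#   conn.close()
```

The TCP connection is opened on the first request and kept for the later ones as long as the server allows keep-alive, so only the first chunk pays the connection setup cost.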

Breaking large lists down in the server and streaming the results back is a possibility. It'd require some code restructuring, but less than the idea of rewriting it to not use the ORM like I mentioned before.
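
To make the idea concrete, a server-side version could look something like the generator below. This is a sketch only and not the actual SAM Web code: stream_locations and the lookup callable are hypothetical names, lookup stands in for whatever database query the server runs for one chunk, and the reply shape (a JSON object keyed by file name) is assumed.

```python
import json


def stream_locations(names, lookup, chunk_size=50):
    """Yield pieces of one JSON object, querying `chunk_size` names at a time.

    `lookup` is a hypothetical callable that takes a list of file names
    and returns a dict mapping each name to its list of locations.
    """
    yield "{"
    first = True
    for start in range(0, len(names), chunk_size):
        result = lookup(names[start:start + chunk_size])
        for name, locations in result.items():
            separator = "" if first else ","
            first = False
            yield separator + json.dumps(name) + ":" + json.dumps(locations)
    yield "}"


if __name__ == "__main__":
    def fake_lookup(chunk):
        # Stand-in for the real database query.
        return {name: ["example-location/" + name] for name in chunk}

    body = "".join(stream_locations(["a.root", "b.root", "c.root"], fake_lookup, 2))
    print(json.loads(body))
```

Because only one chunk's results are held at a time, memory use stays bounded by the chunk size rather than by the length of the whole request.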

#6 Updated by Christopher Backhouse over 4 years ago

I'm using ifdh from C++. Do I have any other options from C++?

#7 Updated by Robert Illingworth over 4 years ago

I think you'd have to do something like use cURL through its C API. I didn't find a lot of options when I was looking for C++ HTTP clients (I ended up writing my own, but it's very specialized for what I was doing, and not at all user friendly).


