Project

General

Profile

Bug #7224

Art allows metadata containing a comma, and sam_metadata_dumper produces invalid JSON output

Added by Christopher Backhouse almost 5 years ago. Updated almost 5 years ago.

Status:
Closed
Priority:
Normal
Category:
I/O
Target version:
Start date:
10/27/2014
Due date:
% Done:

100%

Estimated time:
32.00 h
Spent time:
Occurs In:
Scope:
Internal
Experiment:
NOvA
SSI Package:
-
Duration:

Description

If you include a comma in a metadata parameter, sam_metadata_dumper includes it unescaped in the output, leading to invalid JSON syntax.

The answer to this may be "well, don't do that then". But mistakes like this would be caught sooner if art refused to accept metadata parameters with a comma in (and probably a list of other special characters, like single and double quote, that indicate that the caller probably has some escaping issues of their own).

History

#1 Updated by Christopher Green almost 5 years ago

  • Category set to I/O
  • Status changed from New to Feedback
  • Assignee set to Christopher Green
  • Estimated time set to 8.00 h
  • Experiment NOvA added
  • Experiment deleted (-)
  • SSI Package art added
  • SSI Package deleted ()

It was intended that specifically the comma be handled by the string being wrapped in double-quotes, so this will need to be investigated.

Generally-speaking, we would expect to be able to sanitize and/or escape metadata values and reject nonconforming parameter names. Does this sound acceptable?

Please supply an example of input that produces illegal JSON output currently, and what you would expect to be the behavior of the system upon receiving this input.

#2 Updated by Christopher Backhouse almost 5 years ago

I believe it was this

nova --sam-file-type=\"importedDetector\", -c my.fcl

due to a bug on my side. I'm not sure about the slashes or not though.

and then the sam_metadata_dumper output looked like:

file_type: "importedDetector",,

Note the doubled comma.

In this case I provided a value that was bad. But we know there are only a handful of valid file_types, so it seems strange to sanitize my input just to allow it to bite me later. For me in this case the ideal would have been for ART to abort with a message like "invalid file_type specified".

#3 Updated by Christopher Green almost 5 years ago

Unfortunately, we cannot easily query the SAM system for the current list of valid types.

Currently, the code assumes that if the first character of the value is a {, [, or ", then the value is a valid JSON entity. If not, we attempt to interpret the value as a number and if that doesn't work we wrap it in double quotes, escaping any double quotes found inside.

Your case has " as the first character and therefore triggers the assumption that it is a valid JSON entity: a double-quoted string. We can extend this to requiring that a value starting with " ends in an un-escaped ", but the other assumptions ({, [) will remain. Is that acceptable to you as a solution?

We could incorporate a JSON parser into art and verify each identified sequence or dictionary as a correct JSON entity, but all we could do in the event of a failure is warn or throw an exception, either during the art job itself or when sam_metadata_dumper encounters the faulty information in the in-file SQLite database.

#4 Updated by Christopher Backhouse almost 5 years ago

Checking for the closing quote, or closing } or ], makes sense.

Warning or exception is ideal, but maybe including a JSON parser is too much complexity for such a trivial thing.

#5 Updated by Rob Kutschke almost 5 years ago

Does it make sense to provide a way for an experiment to register a callback (detail class or otherwise) to do the work? Or to do additional work over and above what you do? That would give each of us control and a possibility for fast turn around. And we can all migrate tests up to art do minimize duplicated work?

#6 Updated by Christopher Green almost 5 years ago

  • Target version set to 1.12.04
  • % Done changed from 0 to 100
  • SSI Package - added
  • SSI Package deleted (art)

Adding a plugin system to the sam_metadata_dumper application seems to me to be a seriously heavyweight solution, since the experiment is always free to take the output of sam_metadata_dumper and deal with it after the fact.

Various changes and additions have been made to alleviate this issue:

  • Functionality has been added to allow for JSON syntax checking during art execution: set services.FileCatalogMetadata.checkSyntax: true. An exception will be thrown at the earliest useful opportunity: upon calling art::FileCatalogMetadata::addMetadata() with faulty input, or for FileCatalogMetadata plugins, after all plugins have been called for their metadata.
  • All metadata added by the art system, including that provided by command-line argument or via user-provided FHiCL, is canonicalized where appropriate (wrapped in double quotes and suitably escaped.
  • The new function art::FileCatalogMetadata::addMetadataString() will automatically canonicalize the provided value as a double-quoted string.
  • cet::canonicalize_string() has a new signature which returns the canonicalized string rather than requiring a reference for append.
  • sam_metadata_dumper is more careful about deciding what is already in JSON-ready form:
    • Keys are automatically canonicalized.
    • Values are canonicalized unless they start with a `{' or `[' or they have a leading and trailing `"'.

A possibly breaking change for some users is that most string values are now canonicalized even when human-readable output is selected for sam_metadata_dumper, as that is how they are stored in the metadata database.

#7 Updated by Christopher Backhouse almost 5 years ago

A possibly breaking change for some users is that most string values are now canonicalized even when human-readable output is selected for sam_metadata_dumper, as that is how they are stored in the metadata database.

Does this mean that even simple strings will come out with quotes around them? Or is this just about various special characters?

If the former, this is indeed a breaking change for me. Basically this problem is how I ran into trouble in first place (it was the switch by default from human-readable to json output).

#8 Updated by Christopher Green almost 5 years ago

To ensure that we are on exactly the same page, please read the SAM_metadata_facilities wiki page and let us know exactly what does not yet meet your requirements. If you need for the short term, we can "uncanonicalize" the human-readable output, but for the long term we definitely recommend using syntax-checked JSON and a parser per the summary.

Please let us know what you need.

#9 Updated by Christopher Backhouse almost 5 years ago

"The value is printed exactly as it has been read from the metadata database, including with any double-quotes if they were present. Note that, as explained above, any strings added by art are canonicalized and will therefore be printed with surrounding quotes"

That makes it sound like I can't specifically rely on the quotes being present or not.

The context is that I need to inherit the SAM file_type field from the parent file. But because I specify data_tier on the command line, art insists that I specify several other parameters, including the file_type, explicitly. I believe there's an Issue open on that, because I still don't understand why that needs to be the case.

This is what I came up with. A bit of a hack, but it did the trick:

FILETYPE=`sam_metadata_dumper -H $INPUT | grep fileType | sed 's/.* //'`
SAMSTR="--sam-application-family=nova --sam-application-version=$SRT_BASE_RELEASE --sam-file-type=$FILETYPE --sam-data-tier=out1:mytier --sam-stream-name=out1:all" 
nova $SAMSTR -c myjob.fcl $INPUT

Initially I didn't have the "-H", and that worked for a while until I upgraded to a version where the default output was JSON format. That was the trigger for my initial report, that the screwed-up file_type I wound up putting in was allowed to propagate through so many steps before causing problems. That's fixed with your recent commits.

I can add more code to strip out quotes if they're present, but the whole thing now feels pretty unstable.

The best fix for my specific issue would be to stop requiring all that superfluous information on the command line.

Otherwise, I'm sure there are use-cases elsewhere where people might want something like

$ sam_metadata_dumper --field file_type myfile.root
importedDetector

which would cut way down on the parsing having to be done by the user script. Especially if it was guaranteed that quotes and escaping would be undone.

#10 Updated by Christopher Backhouse almost 5 years ago

Is there a particular command-line JSON parser you recommend that would allow something like:

$ sam_metadata_dumper myfile.root | json --get file_type
importedDetector

I guess there's nothing in SL6 by default? Should something get packaged up and distributed?

#11 Updated by Christopher Green almost 5 years ago

Since it's late, I will answer your easiest question now and defer a response to your longer comment to Monday.

yum --enablerepo=epel install jq
should get you where you need to be.

#12 Updated by Christopher Green almost 5 years ago

The requiring of the presence of particular items of metadata if SAM input were selected was part of the initial user-led requirements for SAM support within art. There were several issues subsequent to the initial implementation (including at least one related to "linked" metadata items) that were discussed in a wider context at a resolution meeting, and I believe those issues (as modified by that discussion where appropriate) were resolved without complaint several months ago. If you have a request for a further modification of this behavior though, please enter a separate issue so that it can be discussed at a stakeholder meeting.

As soon as it became apparent that complex input was required for certain SAM metadata items, we offered the JSON output as an alternative to attempting to machine-parse the human-readable output, for three reasons:

  1. The difficulty of parsing human-readable output that may have spaces, quotes and other "interesting" features.
  2. The requirement from SAM that the metadata be uploaded in JSON format; and
  3. The myriad tools available for parsing and manipulating JSON.

We would recommend against any use of the human-readable output format, because there is no precise grammar for that format. That said, if people still using it have a particular preference for enforcing canonicalized string values or unquoted, un-escaped string values, we can do that, with the understanding that, since we can't read the mind of the author of any given piece of metadata, we cannot guarantee to be able to do the right thing in all cases. If you need that, please transition to using the JSON output feature of sam_metadata_dumper and the syntax-checking feature of the art::FileCatalogMetadata service.

Please let us know what you want us to do with human-readable string output, and I can make this change in time for the imminent release of 1.12.04.

#13 Updated by Christopher Backhouse almost 5 years ago

In the human-readable output, I think strings should be displayed exactly as initially specified. ie if any escaping or wrapping happens internal to art it should be undone.

#14 Updated by Brian Rebel almost 5 years ago

I talked with Ryan Patterson about this issue and we agree that it is not something that should hold up any scheduled ART releases. Ryan and I also agreed that this issue should be taken offline from the art discussion for the time being so that NOvA could discuss it internally and come up with a consensus on how to proceed.

In the meantime, I have communicated to Chris Green that providing a human-readable output that contains the quotes as expected by JSON for strings would be acceptable in the next release of ART.

Once NOvA has reached a consensus, we can revisit that decision if necessary.

#15 Updated by Christopher Green almost 5 years ago

  • Status changed from Feedback to Resolved
  • Estimated time changed from 8.00 h to 32.00 h

Resolved per Brian's decision on canonicalization of strings in human-readable output.

#16 Updated by Christopher Green almost 5 years ago

  • Status changed from Resolved to Closed


Also available in: Atom PDF