Dimension Syntax

The dimension syntax is composed of simple predicates that are grouped together to build more
complicated queries. The dimension parser performs the necessary database table joins
and translates the query into a full SQL query.

Simple dimension predicates

These are simple predicates of the form dimension_name operator value, such as:

file_name = "myfile" 
end_time >= 2010/10/20T13:30:00

The operator can be left out, which implies "=".

The list of dimension names for these predicates can be obtained from the per-experiment dimensions listing,
with EXPERIMENT replaced by your experiment name (e.g. "nova", "mu2e", "minerva").

The operators can be any of:

op         meaning
=          equals
!=         not equals
<          less than
>          greater than
>=         greater than or equal
<=         less than or equal
like       wildcard string match (can also use =)
not like   inverse wildcard string match (can also use !=)

Wildcards are ? for any single character, % for one or more of any character.
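
As an illustration of the wildcard rules, the following Python sketch translates a like pattern into a regular expression. It takes the description above literally (? is exactly one character, % is one or more characters). The helper names like_to_regex and like are hypothetical, and the server's own matching may differ in edge cases.

```python
import re

def like_to_regex(pattern):
    """Translate a like pattern into a compiled regular expression.

    Assumption: '?' matches exactly one character and '%' matches one
    or more characters, as stated in the text above.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "%":
            parts.append(".+")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """Return True if the whole value matches the like pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

print(like("foo_run1_bar.root", "foo%bar.root"))
print(like("a1", "a?"))
```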

There are also list-based predicates of the form dimension_name op list, for the operators:

op         meaning
in         is one of
not in     is not one of
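
When the list of values comes from a script, a list predicate can be assembled as a string. The sketch below is one way to do that in Python. The helper name in_predicate is hypothetical, the quoting style follows the examples at the end of this page, and values containing a single quote are rejected rather than escaped because the escaping rules are not described here.

```python
def in_predicate(dimension, values, negate=False):
    """Build 'dimension in (...)' or 'dimension not in (...)'.

    Assumption: values are plain strings without single quotes.
    """
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    for v in values:
        if "'" in str(v):
            raise ValueError("value contains a single quote: %r" % (v,))
    quoted = ", ".join("'{}'".format(v) for v in values)
    op = "not in" if negate else "in"
    return "{} {} ({})".format(dimension, op, quoted)

print(in_predicate("data_tier", ["raw", "binary-raw"]))
```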

Compound predicates

Predicates can be combined with and, or, and not, and grouped with parentheses to make compound queries.

(file_name like "foo%bar.root" and file_size > 1024) or 
  (file_name like "foo%baz.root" and file_size > 2048)
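
A compound query like the one above can also be composed programmatically. This is a minimal Python sketch; all_of and any_of are hypothetical helper names. Each group is always wrapped in parentheses so that the grouping does not depend on operator precedence.

```python
def all_of(*predicates):
    """Join predicates with 'and' and wrap the group in parentheses."""
    return "(" + " and ".join(predicates) + ")"

def any_of(*predicates):
    """Join predicates with 'or' and wrap the group in parentheses."""
    return "(" + " or ".join(predicates) + ")"

query = any_of(
    all_of('file_name like "foo%bar.root"', "file_size > 1024"),
    all_of('file_name like "foo%baz.root"', "file_size > 2048"),
)
print(query)
```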

Set operations

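An expression of the form predicate minus predicate is a set-minus operation on the
two specified datasets: files matching the right-hand predicate are removed from those matching the left-hand predicate.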

Referring to existing definitions

Predicates of the form

defname: "foo"

refer to an existing defined dataset named foo. This effectively inserts the definition of foo at this point.

Query modifiers

Modifiers that alter the behaviour of all or part of the query are specified using the with operator, which must come at the end of a clause. A single with clause can specify multiple modifier terms.

The available modifiers are:

  • limit n which limits the maximum number of files to return
  • offset n which skips the first n files in the result set
  • stride n which returns every n'th file in the result set
  • availability x[,y[,z]] which limits the results to files that match the given availability status
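
To make the effect of limit, offset, and stride concrete, here is a Python sketch that applies them to an in-memory list. The function name apply_modifiers is hypothetical. The order of application (offset, then stride, then limit) and the choice that stride starts with the first remaining file are assumptions made for illustration. They are consistent with the "every 10th file, up to a maximum of 100" example later on this page, but are not confirmed as the server's behaviour.

```python
def apply_modifiers(files, limit=None, offset=0, stride=1):
    """Apply offset, stride, and limit to a list of file names.

    Assumed order: skip `offset` files, keep every `stride`-th file
    starting with the first remaining one, then truncate to `limit`.
    """
    selected = files[offset:]
    selected = selected[::stride]
    if limit is not None:
        selected = selected[:limit]
    return selected

files = ["f%04d" % i for i in range(2000)]
every_tenth = apply_modifiers(files, limit=100, stride=10)
print(len(every_tenth), every_tenth[:3])
```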

For the availability modifier, the possible flags fall into three categories. Multiple flags can be given as a comma-separated list.

  • active flags:
    • active - only list active files (default)
    • retired - only list files that have been retired
  • status flags:
    • good - only list files with a good content status (default)
    • bad - only list files with a bad content status
    • anystatus - list files regardless of content status
  • location flags:
    • virtual - only list files with no locations
    • physical - only list files with at least one location (default at the top level)
    • anylocation - list files regardless of locations (default within isparentof and ischildof clauses - see below)

Ancestry predicates

A predicate can be prefixed with isparentof: or ischildof: to select files that are the
immediate parent or child, respectively, of the files matching that predicate. So if you want to look for
a large file derived from a small raw file, you could say:

file_size > 2048 and ischildof: (data_tier = "raw" and file_size < 1024 )

The default behaviour inside the ancestry clause is to include virtual files (those with no location). To override this, include an availability term within the clause, such as availability: physical to only consider files with a location.

Alternatively, isdescendantof: or isancestorof: can be used to look at all descendants or ancestors of the selected file set. These operators should be used with care, as the file lineage tree can potentially get very big. They cannot be mixed with other lineage operators in the same query.
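
The difference between the immediate and the transitive lineage operators can be sketched on a small in-memory lineage table. Everything here is hypothetical illustration: the file names, the parents table, and the function names are made up, and the real evaluation happens in the database rather than in Python.

```python
# Hypothetical lineage: each file maps to its immediate parents.
parents = {
    "reco_1.root": ["raw_1.dat"],
    "reco_2.root": ["raw_2.dat"],
    "ntuple_1.root": ["reco_1.root"],
}

def children_of(selected):
    """Like ischildof: files with an immediate parent in `selected`."""
    return {f for f, ps in parents.items() if selected & set(ps)}

def parents_of(selected):
    """Like isparentof: immediate parents of the files in `selected`."""
    return {p for f in selected for p in parents.get(f, [])}

def descendants_of(selected):
    """Like isdescendantof: children, grandchildren, and so on."""
    found = set()
    frontier = set(selected)
    while frontier:
        # Only follow files not seen before, so the loop terminates.
        frontier = children_of(frontier) - found
        found |= frontier
    return found

print(sorted(children_of({"raw_1.dat"})))
print(sorted(descendants_of({"raw_1.dat"})))
```

The transitive version visits every generation, which is why it can become expensive on a large lineage tree.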

Note on negation

A subtle point is that referring to a named dimension forces the evaluator to require the existence of the relevant value for a file to match. In most cases this doesn't matter, but it can have some surprising effects when constraints are negated. For example:

data_tier A and not some.parameter B

will only return files that have some.parameter defined in their metadata with a value other than B. To also return the files that don't have some.parameter defined at all, use minus instead:

data_tier A minus some.parameter B

The right-hand side of the minus clause is evaluated separately from the left-hand side, so all files that match it are removed from the results without imposing any further constraints on the results.
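
The two behaviours can be reproduced with plain Python sets. This is only a model of the semantics described above, using three made-up files; the helper names matching and not_matching are hypothetical.

```python
# Hypothetical metadata: c.root has no some.parameter at all.
files = {
    "a.root": {"data_tier": "A", "some.parameter": "B"},
    "b.root": {"data_tier": "A", "some.parameter": "C"},
    "c.root": {"data_tier": "A"},
}

def matching(dimension, value):
    """Files where the dimension is defined and equals the value."""
    return {n for n, md in files.items() if md.get(dimension) == value}

def not_matching(dimension, value):
    """Negated predicate: the dimension must exist and differ."""
    return {n for n, md in files.items()
            if dimension in md and md[dimension] != value}

# data_tier A and not some.parameter B
and_not = matching("data_tier", "A") & not_matching("some.parameter", "B")

# data_tier A minus some.parameter B
minus = matching("data_tier", "A") - matching("some.parameter", "B")

print(sorted(and_not))
print(sorted(minus))
```

Only the minus form keeps c.root, the file with no some.parameter value.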

Interesting Examples

Recovery dataset for one or more projects

(project_name in ('mengelTest1355428720','mengelTest1355427845') 
     and not (consumed_status like 'consumed'))

Files analyzed already

  data_tier in ('raw', 'binary-raw') and
  run_number >= 3515 and run_number < 3522 and
  isparentof: (data_tier = 'rawdigits') and
  physical_datastream_name in ('numib', 'numil', 'numip')
  and not quality.minerva = 'bad'

Every 10th file from a previously defined dataset definition, up to a maximum of 100

defname: a_existing_definition_name with limit 100 stride 10

Use Latest Snapshot On A Definition Name

dataset_def_name_newest_snapshot a_existing_definition_name

More On Dimensions Using The Client

samweb list-files --help-dimensions
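
For scripted use, a query string can be handed to the client from Python. The sketch below only builds the command and does not run it. Passing the dimensions as a single positional argument to samweb list-files is an assumption here, and list_files_command is a hypothetical helper name; check the client's own help output before relying on it.

```python
import shlex

def list_files_command(dimensions):
    """Build an argument list for the samweb client.

    Assumption: the dimension query is passed as one positional argument.
    """
    return ["samweb", "list-files", dimensions]

cmd = list_files_command(
    "defname: a_existing_definition_name with limit 100 stride 10"
)
# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the query as a single list element avoids shell quoting problems if the list is later passed to subprocess.run.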