Reading data files

Opening files

You will normally access data from a run, which is stored as a directory containing HDF5 files. You can open a run using RunDirectory() with the path of the directory, or using open_run() with the proposal number and run number to look up the standard data paths on the Maxwell cluster.

extra_data.RunDirectory(path, include='*', file_filter=<function lc_any>)

Open data files from a ‘run’ at European XFEL.

run = RunDirectory("/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0001")

A ‘run’ is a directory containing a number of HDF5 files with data from the same time period.

Returns a DataCollection object.

Parameters
  • path (str) – Path to the run directory containing HDF5 files.

  • include (str) – Wildcard string to filter data files.

  • file_filter (callable) – Function to subset the list of filenames to open. Meant to be used with functions in the extra_data.locality module.

extra_data.open_run(proposal, run, data='raw', include='*', file_filter=<function lc_any>)

Access EuXFEL data on the Maxwell cluster by proposal and run number.

run = open_run(proposal=700000, run=1)

Returns a DataCollection object.

Parameters
  • proposal (str, int) – A proposal number, such as 2012, ‘2012’, ‘p002012’, or a path such as ‘/gpfs/exfel/exp/SPB/201701/p002012’.

  • run (str, int) – A run number such as 243, ‘243’ or ‘r0243’.

  • data (str) – ‘raw’ or ‘proc’ (processed) to access data from one of those folders. The default is ‘raw’.

  • include (str) – Wildcard string to filter data files.

  • file_filter (callable) – Function to subset the list of filenames to open. Meant to be used with functions in the extra_data.locality module.

New in version 0.5.

You can also open a single file. The methods described below all work for either a run or a single file.

extra_data.H5File(path)

Open a single HDF5 file generated at European XFEL.

file = H5File("RAW-R0017-DA01-S00000.h5")

Returns a DataCollection object.

Parameters

path (str) – Path to the HDF5 file

Data structure

A run (or file) contains data from various sources, each of which has keys. For instance, SA1_XTD2_XGM/XGM/DOOCS is one source, for an ‘XGM’ device which monitors the beam, and its keys include beamPosition.ixPos and beamPosition.iyPos.

European XFEL produces ten pulse trains per second, each of which can contain up to 2700 X-ray pulses. Each pulse train has a unique train ID, which is used to refer to all data associated with that 0.1 second window.
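
As a rough worked example of this timing, the sketch below converts the difference between two train IDs into elapsed seconds. It assumes that train IDs increase by exactly one per train, which the text above does not state explicitly; the function name and the constant are hypothetical.

```python
TRAINS_PER_SECOND = 10  # from the text: ten pulse trains per second


def seconds_between(first_train_id, last_train_id):
    """Approximate elapsed time between two trains, in seconds.

    Assumes train IDs are consecutive integers (one per 0.1 s window).
    """
    return (last_train_id - first_train_id) / TRAINS_PER_SECOND


# Example train IDs, five trains apart
elapsed = seconds_between(142844490, 142844495)
print(elapsed)
```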

class extra_data.DataCollection
train_ids

A list of the train IDs included in this data. The data recorded may not be the same for each train.

control_sources

A set of the control source names in this data, in the format "SA3_XTD10_VAC/TSENS/S30100K". Control data is always recorded exactly once per train.

instrument_sources

A set of the instrument source names in this data, in the format "FXE_DET_LPD1M-1/DET/15CH0:xtdf". Instrument data may be recorded zero to many times per train.

all_sources

A set of names for both instrument and control sources. This is the union of the two sets above.

keys_for_source(source)

Get a set of key names for the given source.

If you have used select() to filter keys, only selected keys are returned.

Only one file is used to find the keys. Within a run, all files should have the same keys for a given source, but if you use union() to combine two runs where the source was configured differently, the result can be unpredictable.

get_data_counts(source, key)

Get a count of data points in each train for the given data field.

Returns a pandas series with an index of train IDs.

Parameters
  • source (str) – Source name, e.g. “SPB_DET_AGIPD1M-1/DET/7CH0:xtdf”

  • key (str) – Key of parameter within that device, e.g. “image.data”.

info(details_for_sources=())

Show information about the selected data.

Getting data by source & key

Where data will fit into memory, it’s usually quickest and most convenient to load it like this.

class extra_data.DataCollection
get_array(source, key, extra_dims=None, roi=())

Return a labelled array for a particular data field.

arr = run.get_array("SA3_XTD10_PES/ADC/1:network", "digitizers.channel_4_A.raw.samples")

This should work for any data. The first axis of the returned data will be labelled with the train IDs.

Parameters
  • source (str) – Device name with optional output channel, e.g. “SA1_XTD2_XGM/DOOCS/MAIN” or “SPB_DET_AGIPD1M-1/DET/7CH0:xtdf”

  • key (str) – Key of parameter within that device, e.g. “beamPosition.iyPos.value” or “header.linkId”.

  • extra_dims (list of str) – Name extra dimensions in the array. The first dimension is automatically called ‘train’. The default for extra dimensions is dim_0, dim_1, …

  • roi (slice, tuple of slices, or by_index) – The region of interest. This expression selects data in all dimensions apart from the first (trains) dimension. If the data holds a 1D array for each entry, roi=np.s_[:8] would get the first 8 values from every train. If the data is 2D or more at each entry, selection looks like roi=np.s_[:8, 5:10] .
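
To illustrate what roi selects, the sketch below applies the same slice to each train's entry in plain Python, leaving the train axis untouched. The sample data is made up, and np.s_[:8] is written as its plain equivalent slice(None, 8) so that the snippet needs no third-party packages. get_array() itself reads from HDF5 files rather than from lists.

```python
# Hypothetical data: 3 trains, each holding a 1D entry of 16 samples
per_train = [list(range(t * 100, t * 100 + 16)) for t in range(3)]

# np.s_[:8] evaluates to the built-in slice object slice(None, 8)
roi = slice(None, 8)

# The ROI is applied within every train; no trains are dropped
selected = [entry[roi] for entry in per_train]

print(len(selected))     # number of trains is unchanged
print(len(selected[0]))  # each entry is cut down to its first 8 values
```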

See also

xarray documentation

How to use the arrays returned by get_array()

Reading data to analyse in memory

Examples using xarray & pandas with EuXFEL data

get_dask_array(source, key, labelled=False)

Get a Dask array for the specified data field.

Dask is a system for lazy parallel computation. This method doesn’t actually load the data, but gives you an array-like object which you can operate on. Dask loads the data and calculates results when you ask it to, e.g. by calling a .compute() method. See the Dask documentation for more details.

If your computation depends on reading lots of data, consider creating a dask.distributed.Client before calling this. If you don’t do this, Dask uses threads by default, which is not efficient for reading HDF5 files.

Parameters
  • source (str) – Source name, e.g. “SPB_DET_AGIPD1M-1/DET/7CH0:xtdf”

  • key (str) – Key of parameter within that device, e.g. “image.data”.

  • labelled (bool) – If True, label the train IDs for the data, returning an xarray.DataArray object wrapping a Dask array.

See also

Dask Array documentation

How to use the objects returned by get_dask_array()

Averaging detector data with Dask

An example using Dask with EuXFEL data

get_series(source, key)

Return a pandas Series for a particular data field.

s = run.get_series("SA1_XTD2_XGM/XGM/DOOCS", "beamPosition.ixPos")

This only works for 1-dimensional data.

Parameters
  • source (str) – Device name with optional output channel, e.g. “SA1_XTD2_XGM/DOOCS/MAIN” or “SPB_DET_AGIPD1M-1/DET/7CH0:xtdf”

  • key (str) – Key of parameter within that device, e.g. “beamPosition.iyPos.value” or “header.linkId”. The data must be 1D in the file.

get_dataframe(fields=None, *, timestamps=False)

Return a pandas dataframe for given data fields.

df = run.get_dataframe(fields=[
    ("*_XGM/*", "*.i[xy]Pos"),
    ("*_XGM/*", "*.photonFlux")
])

This links together multiple 1-dimensional datasets as columns in a table.

Parameters
  • fields (dict or list, optional) – Select data sources and keys to include in the dataframe. Selections are defined by lists or dicts as in select().

  • timestamps (bool) – If false (the default), exclude the timestamps associated with each control data field.

See also

pandas documentation

How to use the objects returned by get_series() and get_dataframe()

Reading data to analyse in memory

Examples using xarray & pandas with EuXFEL data

get_virtual_dataset(source, key, filename=None)

Create an HDF5 virtual dataset for a given source & key.

A dataset looks like a multidimensional array, but the data is loaded on-demand when you access it. So it’s suitable as an interface to data which is too big to load entirely into memory.

This returns an h5py.Dataset object. This exists in a real file as a ‘virtual dataset’, a collection of links pointing to the data in real datasets. If filename is passed, the file is written at that path, overwriting if it already exists. Otherwise, it uses a new temp file.

To access the dataset from other worker processes, give them the name of the created file along with the path to the dataset inside it (accessible as ds.name). They will need at least HDF5 1.10 to access the virtual dataset, and they must be on a system with access to the original data files, as the virtual dataset points to those.

New in version 0.5.

Getting data by train

Some kinds of data, e.g. from AGIPD, are too big to load a whole run into memory at once. In these cases, it’s convenient to load one train at a time.

When accessing data like this, it’s worth selecting which sources you’re interested in, either using select(), or the devices= parameter. This avoids reading all the other data.

class extra_data.DataCollection
trains(devices=None, train_range=None, *, require_all=False)

Iterate over all trains in the data and gather all sources.

from extra_data import RunDirectory

run = RunDirectory('/path/to/my/run/r0123')
for train_id, data in run.select("*/DET/*", "image.data").trains():
    mod0 = data["FXE_DET_LPD1M-1/DET/0CH0:xtdf"]["image.data"]

Parameters
  • devices (dict or list, optional) – Filter data by sources and keys. Refer to select() for how to use this.

  • train_range (by_id or by_index object, optional) – Iterate over only selected trains, by train ID or by index. Refer to select_trains() for how to use this.

  • require_all (bool) – False (default) returns any data available for the requested trains. True skips trains which don’t have all the selected data; this only makes sense if you make a selection with devices or select().

Yields
  • tid (int) – The train ID of the returned train

  • data (dict) – The data for this train, keyed by device name
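
The effect of require_all can be pictured with ordinary dictionaries. This is only a sketch of the behaviour described above, not extra_data's implementation: the train data, the source names and the helper iter_trains are invented for illustration, and it checks only for whole sources rather than individual keys.

```python
def iter_trains(trains, required_sources, require_all=False):
    """Yield (train_id, data) pairs, mimicking trains(require_all=...).

    `trains` maps train ID -> {source: {key: value}}; a source is simply
    absent from the inner dict when nothing was recorded for that train.
    """
    for train_id, data in sorted(trains.items()):
        if require_all and not all(src in data for src in required_sources):
            continue  # skip trains that lack any of the selected sources
        yield train_id, data


# Hypothetical recording: the detector missed train 1001
recorded = {
    1000: {"XGM": {"flux": 1.0}, "DET": {"image": "frame-a"}},
    1001: {"XGM": {"flux": 1.1}},
    1002: {"XGM": {"flux": 0.9}, "DET": {"image": "frame-c"}},
}

any_data = [tid for tid, _ in iter_trains(recorded, ["XGM", "DET"])]
complete = [tid for tid, _ in iter_trains(recorded, ["XGM", "DET"], require_all=True)]
print(any_data)
print(complete)
```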

train_from_id(train_id, devices=None)

Get train data for specified train ID.

Parameters
  • train_id (int) – The train ID

  • devices (dict or list, optional) – Filter data by sources and keys. Refer to select() for how to use this.

Returns

  • tid (int) – The train ID of the returned train

  • data (dict) – The data for this train, keyed by device name

Raises

KeyError – if train_id is not found in the run.

train_from_index(train_index, devices=None)

Get train data of the nth train in this data.

Parameters
  • train_index (int) – Index of the train in the file.

  • devices (dict or list, optional) – Filter data by sources and keys. Refer to select() for how to use this.

Returns

  • tid (int) – The train ID of the returned train

  • data (dict) – The data for this train, keyed by device name

Selecting & combining data

These methods all return a new DataCollection object with the selected data, so you use them like this:

sel = run.select("*/XGM/*")
# sel includes only XGM sources
# run still includes all the data

class extra_data.DataCollection
select(seln_or_source_glob, key_glob='*')

Select a subset of sources and keys from this data.

There are three possible ways to select data:

  1. With two glob patterns (see below) for source and key names:

    # Select data in the image group for any detector sources
    sel = run.select('*/DET/*', 'image.*')
    
  2. With an iterable of (source, key) glob patterns:

    # Select image.data and image.mask for any detector sources
    sel = run.select([('*/DET/*', 'image.data'), ('*/DET/*', 'image.mask')])
    

    Data is included if it matches any of the pattern pairs.

  3. With a dict of source names mapped to sets of key names (or empty sets to get all keys):

    # Select image.data from one detector source, and all data from one XGM
    sel = run.select({'SPB_DET_AGIPD1M-1/DET/0CH0:xtdf': {'image.data'},
                      'SA1_XTD2_XGM/XGM/DOOCS': set()})
    

    Unlike the others, this option doesn’t allow glob patterns. It’s a more precise but less convenient option for code that knows exactly what sources and keys it needs.

Returns a new DataCollection object for the selected data.

Note

‘Glob’ patterns may be familiar from selecting files in a Unix shell. * matches anything, so */DET/* selects sources with “/DET/” anywhere in the name. There are several kinds of wildcard:

  • *: anything

  • ?: any single character

  • [xyz]: one character, “x”, “y” or “z”

  • [0-9]: one digit character

  • [!xyz]: one character, not x, y or z

Anything else in the pattern must match exactly. It’s case-sensitive, so “x” does not match “X”.
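
The wildcard rules can be tried out with Python's standard fnmatch module, which implements the same shell-style syntax. That extra_data matches names this way internally is an assumption; the helper below and its name are illustrative only. It covers the list-of-pairs form of select(), where data is included if it matches any pair.

```python
from fnmatch import fnmatchcase


def matches_any(source, key, patterns):
    """True if (source, key) matches at least one (source_glob, key_glob)."""
    return any(
        fnmatchcase(source, src_glob) and fnmatchcase(key, key_glob)
        for src_glob, key_glob in patterns
    )


patterns = [("*/DET/*", "image.data"), ("*_XGM/*", "*.i[xy]Pos")]

print(matches_any("FXE_DET_LPD1M-1/DET/15CH0:xtdf", "image.data", patterns))
print(matches_any("SA1_XTD2_XGM/XGM/DOOCS", "beamPosition.ixPos", patterns))
# Matching is case-sensitive, so a lower-case "/det/" does not match "/DET/"
print(matches_any("FXE_DET_LPD1M-1/det/15CH0:xtdf", "image.data", patterns))
```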

deselect(seln_or_source_glob, key_glob='*')

Select everything except the specified sources and keys.

This takes the same arguments as select(), but the sources and keys you specify are dropped from the selection.

Returns a new DataCollection object for the remaining data.

select_trains(train_range)

Select a subset of trains from this data.

Choose a slice of trains by train ID:

from extra_data import by_id
sel = run.select_trains(by_id[142844490:142844495])

Or select a list of trains:

sel = run.select_trains(by_id[[142844490, 142844493, 142844494]])

Or select trains by index within this collection:

import numpy as np
sel = run.select_trains(np.s_[:5])

Returns a new DataCollection object for the selected trains.

Raises

ValueError – If given train IDs do not overlap with the trains in this data.

union(*others)

Join the data in this collection with one or more others.

This can be used to join multiple sources for the same trains, or to extend the same sources with data for further trains. The order of the datasets doesn’t matter.

Returns a new DataCollection object.

Writing selected data

class extra_data.DataCollection
write(filename)

Write the selected data to a new HDF5 file.

You can choose a subset of the data using methods like select() and select_trains(), then use this to write it to a new, smaller file.

The target filename will be overwritten if it already exists.

write_virtual(filename)

Write an HDF5 file with virtual datasets for the selected data.

This doesn’t copy the data, but each virtual dataset provides a view of data spanning multiple sequence files, which can be accessed as if it had been copied into one big file.

This is not the same as building virtual datasets to combine multi-module detector data. See AGIPD, LPD & DSSC data for that.

Creating and reading virtual datasets requires HDF5 version 1.10.

The target filename will be overwritten if it already exists.

Missing data

What happens if some data was not recorded for a given train?

Control data is duplicated for each train until it changes. If the device cannot send changes, the last values will be recorded for each subsequent train until it sends changes again. There is no general way to distinguish this scenario from values which genuinely aren’t changing.

Parts of instrument data may be missing from the file. These will also be missing from the data returned by extra_data:

  • The train-oriented methods trains(), train_from_id(), and train_from_index() give you dictionaries keyed by source and key name. Sources and keys are only included if they have data for that train.

  • get_array() and get_series() skip over trains which are missing data. The indexes on the returned DataArray or Series objects link the returned data to train IDs. Further operations with xarray or pandas may drop misaligned data or introduce fill values.

  • get_dataframe() includes rows for which any column has data. Where some but not all columns have data, the missing values are filled with NaN by pandas’ missing data handling.
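
The row alignment described for get_dataframe() can be sketched without pandas. The snippet below builds a table from two made-up per-train series, keeping every train that has data in any column and filling the gaps with NaN. The column names, the values and the align helper are hypothetical.

```python
import math


def align(columns):
    """Combine {name: {train_id: value}} into rows keyed by train ID.

    A row is created for every train that appears in any column; values
    missing from a column are filled with NaN.
    """
    train_ids = sorted(set().union(*(col.keys() for col in columns.values())))
    return {
        tid: {name: col.get(tid, math.nan) for name, col in columns.items()}
        for tid in train_ids
    }


columns = {
    "fast_device.value": {1000: 1.0, 1001: 1.1, 1002: 0.9},
    "slow_device.value": {1000: 5.0, 1002: 5.5},  # no data for train 1001
}
table = align(columns)
for tid, row in table.items():
    print(tid, row)
```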

Missing data does not necessarily mean that something has gone wrong: some devices send data at less than 10 Hz (the train rate), so they always have gaps between updates.

Data problems

If you encounter problems accessing data with extra_data, there may be problems with the data files themselves. Use the extra-data-validate command to check for this (see Checking data files).

Here are some problems we’ve seen, and possible solutions or workarounds:

  • Indexes point to data beyond the end of datasets: this has previously been caused by bugs in the detector calibration pipeline. If you see this in calibrated data (in the proc/ folder), ask for the relevant runs to be re-calibrated.

  • Train IDs are not strictly increasing: issues with the timing system when the data is recorded can create an occasional train ID which is completely out of sequence. Usually it seems to be possible to ignore this and use the remaining data, but if you have any issues, please let us know.

    • In one case, a train ID had the maximum possible value (2^64 - 1), causing info() to fail. You can select everything except this train using select_trains():

      from extra_data import by_id
      sel = run.select_trains(by_id[:2**64-1])
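
For reference, 2**64 - 1 is the largest value an unsigned 64-bit integer can hold. The sketch below mimics the selection on a plain list, assuming (as the text above implies) that the stop value of a by_id slice is exclusive; the train IDs are invented.

```python
MAX_TRAIN_ID = 2**64 - 1  # largest unsigned 64-bit integer

# Hypothetical train IDs, with one corrupted entry at the maximum value
train_ids = [142844490, 142844491, MAX_TRAIN_ID, 142844492]

# Equivalent of by_id[:2**64-1] on a plain list: keep IDs below the stop value
valid = [tid for tid in train_ids if tid < MAX_TRAIN_ID]

print(MAX_TRAIN_ID)
print(valid)
```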
      

If you’re having problems with extra_data, you can also try searching previously reported issues to see if anyone has encountered similar symptoms.

Cached run data maps

When you open a run in extra_data, it needs to know what data is in each file. Each file has metadata describing its contents, but reading this from every file is slow, especially on GPFS. extra_data therefore tries to cache this information the first time a run is opened, and reuse it when opening that run again.

This should happen automatically, without the user needing to know about it. You only need these details if you think caching may be causing problems.

  • Caching is triggered when you use RunDirectory() or open_run().

  • There are two possible locations for the cached data map:

    • In the run directory: (run dir)/karabo_data_map.json.

    • In the proposal scratch directory: (proposal dir)/scratch/.karabo_data_maps/raw_r0032.json. This will normally be the one used on Maxwell, as users can’t write to the run directory.

  • The format is a JSON array, with an object for each file in the run.

    • This holds the list of train IDs in the file, and the lists of control and instrument sources.

    • It also stores the file size and last modified time of each data file, to check if the file has changed since the cache was created. If either of these attributes doesn’t match, extra_data ignores the cached information and reads the metadata from the HDF5 file.

  • If any file in the run wasn’t listed in the data map, or its entry was outdated, a new data map is written automatically. It tries the same two locations described above, but it will continue without error if it can’t write to either.
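
The staleness check can be sketched as follows. This illustrates the idea rather than reproducing extra_data's code: the JSON field names (filename, size, mtime, train_ids) and the helper entry_is_fresh are assumptions, and the demonstration uses a throwaway temporary file instead of a real run.

```python
import json
import os
import tempfile


def entry_is_fresh(entry, directory):
    """Check a cached entry against the file's current size and mtime."""
    path = os.path.join(directory, entry["filename"])
    try:
        st = os.stat(path)
    except OSError:
        return False  # file is gone, so the cached entry cannot be trusted
    return st.st_size == entry["size"] and st.st_mtime == entry["mtime"]


with tempfile.TemporaryDirectory() as run_dir:
    data_file = os.path.join(run_dir, "RAW-R0001-DA01-S00000.h5")
    with open(data_file, "wb") as f:
        f.write(b"placeholder")  # stand-in for real HDF5 content

    st = os.stat(data_file)
    # Round-trip through JSON, as the cache would be stored on disk
    cached = json.loads(json.dumps([
        {"filename": os.path.basename(data_file),
         "size": st.st_size, "mtime": st.st_mtime, "train_ids": [1000, 1001]},
    ]))
    fresh_before = entry_is_fresh(cached[0], run_dir)

    with open(data_file, "ab") as f:
        f.write(b"more")  # the file changes after the cache was written
    fresh_after = entry_is_fresh(cached[0], run_dir)

print(fresh_before, fresh_after)
```

When the check fails for an entry, the cached information for that file would be discarded and the metadata read from the HDF5 file again, as described above.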

JSON was chosen as it can be easily inspected manually, and it’s reasonably efficient to load the entire file.

Issues reading archived data

Files in European XFEL storage migrate over time from GPFS (designed for fast access) to PNFS (designed for archiving). The data on PNFS is usually available for reading, but sometimes it must first be staged from tape to disk. If there is a staging queue, this can take an indefinitely long time (days or even weeks), and any I/O operations will be blocked for that time.

To determine the files which require staging or are lost, use the script:

extra-data-locality <run directory>

It returns a list of files which are currently located only on slow media and, separately, any which have been lost.

If the files are not essential for analysis, they can be filtered out using the lc_ondisk() filter from extra_data.locality:

from extra_data.locality import lc_ondisk
run = open_run(proposal=700000, run=1, file_filter=lc_ondisk)

file_filter must be a callable which takes a list as its single argument and returns a filtered list.
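
Since a file filter only maps one list to another, you can also write your own. The sketch below is a hypothetical filter that keeps only files whose name contains "-DA01-". It assumes the list holds file paths; the function name and the file names are invented, and unlike lc_ondisk() it does not check where the files are stored.

```python
import os
from fnmatch import fnmatchcase


def only_da01(paths):
    """Hypothetical file_filter: keep files whose name contains '-DA01-'."""
    return [p for p in paths if fnmatchcase(os.path.basename(p), "*-DA01-*")]


files = [
    "/path/to/run/RAW-R0001-DA01-S00000.h5",
    "/path/to/run/RAW-R0001-DA02-S00000.h5",
    "/path/to/run/RAW-R0001-DA01-S00001.h5",
]
print(only_da01(files))
```

A function like this could be passed as file_filter=only_da01 to RunDirectory() or open_run(). For simple name patterns the include parameter already does the job, so a custom filter is mainly useful for other criteria.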

Note: Reading the file locality on PNFS is an expensive operation. Use it only as a last resort.

If you find any files which are located only on tape or are unavailable, please let ITDM know. If you need these files for analysis, mention that explicitly.