Multi-module detector data

Several X-ray pixel detectors are composed of multiple modules, which are stored as separate sources at EuXFEL. extra_data includes convenient interfaces to access data from AGIPD, LPD, DSSC and JUNGFRAU, pulling together the separate modules into a single array.
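A typical starting point is to open a run and wrap it in the matching component class. A minimal sketch (the run path is hypothetical; substitute your own proposal and run):

```python
from extra_data import RunDirectory
from extra_data.components import AGIPD1M

# Hypothetical run path - replace with your own proposal/run
run = RunDirectory('/gpfs/exfel/exp/SPB/201830/p900022/raw/r0034')

# Combine the separate AGIPD module sources into one interface,
# keeping only trains where all 16 modules have data
agipd = AGIPD1M(run, min_modules=16)
```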

Note

These detectors can record a lot of data, so loading it all into memory at once may be impossible. You can use the split_trains() method to work on a chunk at a time, or work with Dask arrays.

class extra_data.components.AGIPD1M(data: DataCollection, detector_name=None, modules=None, *, min_modules=1)

An interface to AGIPD-1M data.

Parameters:
  • data (DataCollection) – A data collection, e.g. from RunDirectory().

  • modules (set of ints, optional) – Detector module numbers to use. By default, all available modules are used.

  • detector_name (str, optional) – Name of a detector, e.g. ‘SPB_DET_AGIPD1M-1’. This is only needed if the dataset includes more than one AGIPD detector.

  • min_modules (int) – Include trains where at least this many modules have data. Default is 1.

The methods of this class are identical to those of LPD1M, below.

class extra_data.components.AGIPD500K(data: DataCollection, detector_name=None, modules=None, *, min_modules=1)

An interface to AGIPD-500K data.

Detector names are like ‘HED_DET_AGIPD500K2G’, otherwise this is identical to AGIPD1M.

class extra_data.components.DSSC1M(data: DataCollection, detector_name=None, modules=None, *, min_modules=1)

An interface to DSSC-1M data.

Parameters:
  • data (DataCollection) – A data collection, e.g. from RunDirectory().

  • modules (set of ints, optional) – Detector module numbers to use. By default, all available modules are used.

  • detector_name (str, optional) – Name of a detector, e.g. ‘SCS_DET_DSSC1M-1’. This is only needed if the dataset includes more than one DSSC detector.

  • min_modules (int) – Include trains where at least this many modules have data. Default is 1.

The methods of this class are identical to those of LPD1M, below.

class extra_data.components.LPD1M(data: DataCollection, detector_name=None, modules=None, *, min_modules=1, parallel_gain=False)

An interface to LPD-1M data.

Parameters:
  • data (DataCollection) – A data collection, e.g. from RunDirectory().

  • modules (set of ints, optional) – Detector module numbers to use. By default, all available modules are used.

  • detector_name (str, optional) – Name of a detector, e.g. ‘FXE_DET_LPD1M-1’. This is only needed if the dataset includes more than one LPD detector.

  • min_modules (int) – Include trains where at least this many modules have data. Default is 1.

  • parallel_gain (bool) – Set to True to read this data as parallel gain data, where high, medium and low gain data are stored sequentially within each train. This will repeat the pulse & cell IDs from the first 1/3 of each train, and add gain stage labels from 0 (high-gain) to 2 (low-gain).

Selecting a key from the detector, e.g. det['image.data'], gives an object similar to a single-source KeyData, but with the modules arranged along the first axis. So det['image.data'].ndarray() will load all the selected data as a NumPy array.
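For example, with an LPD1M object det (a sketch; be careful with memory when loading large selections):

```python
# Selecting a key gives a KeyData-like object, modules on the first axis
kd = det['image.data']

# Load every selected frame into memory as one NumPy array:
# shape is (modules, frames, ...) with the pixel axes last
arr = kd.ndarray()
```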

get_array(key, pulses=slice(None, None, None), unstack_pulses=True, *, fill_value=None, subtrain_index='pulseId', roi=(), astype=None)

Get a labelled array of detector data

Parameters:
  • key (str) – The data to get, e.g. ‘image.data’ for pixel values.

  • pulses (slice, array, by_id or by_index) – Select the pulses to include from each train. by_id selects by pulse ID, by_index by index within the data being read. The default includes all pulses. Only used for per-pulse data.

  • unstack_pulses (bool) – Whether to separate train and pulse dimensions.

  • fill_value (int or float, optional) – Value to use for missing values. If None (default) the fill value is 0 for integers and np.nan for floats.

  • subtrain_index (str) – Specify ‘pulseId’ (default) or ‘cellId’ to label the frames recorded within each train. Pulse ID should allow this data to be matched with other devices, but depends on how the detector was manually configured when the data was taken. Cell ID refers to the memory cell used for that frame in the detector hardware.

  • roi (tuple) – Specify e.g. np.s_[10:60, 100:200] to select pixels within each module when reading data. The selection is applied to each individual module, so it may only be useful when working with a single module. For AGIPD raw data, each module records a frame as a 3D array with 2 entries on the first dimension, for data & gain information, so roi=np.s_[0] will select only the data part of each frame.

  • astype (Type, optional) – Data type of the output array. If None (default), the dtype matches the input array's dtype.
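A sketch combining the options above (the pulse counts and pixel ranges are illustrative, and assume 2D frames within each module):

```python
import numpy as np
from extra_data import by_index

# First 10 frames of each train, a central pixel block, missing data as 0
arr = det.get_array(
    'image.data',
    pulses=by_index[:10],
    roi=np.s_[64:192, 64:192],
    fill_value=0,
)
```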

get_dask_array(key, subtrain_index='pulseId', fill_value=None, astype=None)

Get a labelled Dask array of detector data

Dask does lazy, parallelised computing, and can work with large data volumes. This method doesn’t immediately load the data: that only happens once you trigger a computation.

Parameters:
  • key (str) – The data to get, e.g. ‘image.data’ for pixel values.

  • subtrain_index (str, optional) – Specify ‘pulseId’ (default) or ‘cellId’ to label the frames recorded within each train. Pulse ID should allow this data to be matched with other devices, but depends on how the detector was manually configured when the data was taken. Cell ID refers to the memory cell used for that frame in the detector hardware.

  • fill_value (int or float, optional) – Value to use for missing values. If None (default) the fill value is 0 for integers and np.nan for floats.

  • astype (Type, optional) – Data type of the output array. If None (default), the dtype matches the input array's dtype.
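A lazy-computation sketch: build the task graph first, then trigger the read with .compute(). This assumes the frame dimension of the labelled array is called 'train_pulse', as for AGIPD/LPD/DSSC data:

```python
darr = det.get_dask_array('image.data')

# No data is read yet; this only builds a task graph
mean_image = darr.mean(dim='train_pulse')

# Now Dask reads and reduces the data, in parallel where possible
result = mean_image.compute()
```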

trains(pulses=slice(None, None, None), require_all=True)

Iterate over trains for detector data.

Parameters:
  • pulses (slice, array, by_index or by_id) – Select which pulses to include for each train. The default is to include all pulses.

  • require_all (bool) – If True (default), skip trains where any of the selected detector modules are missing data.

Yields:

train_data (dict) – A dictionary mapping key names (e.g. image.data) to labelled arrays.
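A sketch of iterating train by train, following the yield described above:

```python
from extra_data import by_index

# One dict of labelled arrays per train, first 8 pulses of each
for train_data in det.trains(pulses=by_index[:8]):
    frames = train_data['image.data']
    ...  # process this train
```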

data_availability(module_gaps=False)

Get an array indicating what image data is available

Returns a boolean array (modules, entries), True where a module has data for a given train, False for missing data.
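The (modules, entries) layout makes completeness checks a one-liner. A stand-in sketch with a hand-written availability array in place of a real detector:

```python
import numpy as np

# Stand-in for det.data_availability(): 2 modules x 3 trains
avail = np.array([[True, True, False],
                  [True, False, True]])

# Trains where every module has data
complete = avail.all(axis=0)
print(complete)  # [ True False False]
```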

select_trains(trains)

Select a subset of trains from this data as a new object.

Slice trains by position within this data:

sel = det.select_trains(np.s_[:5])

Or select trains by train ID, with a slice or a list:

from extra_data import by_id
sel1 = det.select_trains(by_id[142844490 : 142844495])
sel2 = det.select_trains(by_id[[142844490, 142844493, 142844494]])

split_trains(parts=None, trains_per_part=None, frames_per_part=None)

Split this data into chunks with a fraction of the trains each.

At least one of parts, trains_per_part or frames_per_part must be specified. You can pass any combination of these.

Parameters:
  • parts (int) – How many parts to split the data into. If trains_per_part is also specified, this is a minimum, and it may make more parts. It may also make fewer if there are fewer trains in the data.

  • trains_per_part (int) – A maximum number of trains in each part. Parts will often have fewer trains than this.

  • frames_per_part (int) – A target number of frames in each part. Each chunk should have up to this many frames, but chunks always contain complete trains, so if this is less than one train, you may get single train chunks with more frames. When frames_per_part is used, the final chunk may be much smaller than the others.
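A sketch of bounding memory use by processing a chunk at a time (the chunk size here is arbitrary):

```python
# Each chunk is a detector object covering a subset of trains
for chunk in det.split_trains(frames_per_part=256):
    data = chunk['image.data'].ndarray()
    ...  # process, then let this chunk's data be freed
```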

write_frames(filename, trains, pulses)

Write selected detector frames to a new EuXFEL HDF5 file

trains and pulses should be 1D arrays of the same length, containing train IDs and pulse IDs (corresponding to the pulse IDs recorded by the detector), so that (trains[i], pulses[i]) identifies one frame.
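A sketch of selecting specific frames to write out (the train and pulse IDs are made up):

```python
import numpy as np

# (trains[i], pulses[i]) pairs: three frames across two trains
trains = np.array([10000, 10000, 10001], dtype=np.uint64)
pulses = np.array([0, 1, 0], dtype=np.uint64)

det.write_frames('selected_frames.h5', trains, pulses)
```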

write_virtual_cxi(filename, fillvalues=None)

Write a virtual CXI file to access the detector data.

The virtual datasets in the file provide a view of the detector data as if it was a single huge array, but without copying the data. Creating and using virtual datasets requires HDF5 1.10.

Parameters:
  • filename (str) – The file to be written. Will be overwritten if it already exists.

  • fillvalues (dict, optional) – Keys are dataset names (one of: data, gain, mask), each mapped to the fill value for missing data. The default is np.nan for float arrays and zero for integer arrays.

See also

Accessing LPD data: An example using the class above.

class extra_data.components.JUNGFRAU(data: DataCollection, detector_name=None, modules=None, *, min_modules=1, n_modules=None, first_modno=1)

An interface to JUNGFRAU data.

JUNGFRAU detectors (JNGFR, JF1M, JF4M) all store data in a “data” group, with trains along the first dimension and memory cells along the second. This means only a fixed number of frames can be stored for each train.

Parameters:
  • data (DataCollection) – A data collection, e.g. from RunDirectory().

  • detector_name (str, optional) – Name of a detector, e.g. ‘SPB_IRDA_JNGFR’. This is only needed if the dataset includes more than one JUNGFRAU detector.

  • modules (set of ints, optional) – Detector module numbers to use. By default, all available modules are used.

  • min_modules (int) – Include trains where at least this many modules have data. Default is 1.

  • n_modules (int) – Number of detector modules in the experiment setup. Default is None, in which case it will be estimated from the available data.

  • first_modno (int) – The module number in the source name for the first detector module. e.g. FXE_XAD_JF500K/DET/JNGFR03:daqOutput should have first_modno = 3

Selecting a key from the detector, e.g. jf['data.adc'], gives an object similar to a single-source KeyData, but with the modules arranged along the first axis. So jf['data.adc'].ndarray() will load all the selected data as a NumPy array.

get_array(key, *, fill_value=None, roi=(), astype=None)

Get a labelled array of detector data

Parameters:
  • key (str) – The data to get, e.g. ‘data.adc’ for pixel values.

  • fill_value (int or float, optional) – Value to use for missing values. If None (default) the fill value is 0 for integers and np.nan for floats.

  • roi (tuple) – Specify e.g. np.s_[:, 10:60, 100:200] to select data within each module & each train when reading data. The first dimension is pulses, then there are two pixel dimensions. The same selection is applied to data from each module, so selecting pixels may only make sense if you’re using a single module.

  • astype (Type, optional) – Data type of the output array. If None (default), the dtype matches the input array's dtype.

get_dask_array(key, fill_value=None, astype=None)

Get a labelled Dask array of detector data

Dask does lazy, parallelised computing, and can work with large data volumes. This method doesn’t immediately load the data: that only happens once you trigger a computation.

Parameters:
  • key (str) – The data to get, e.g. ‘data.adc’ for pixel values.

  • fill_value (int or float, optional) – Value to use for missing values. If None (default) the fill value is 0 for integers and np.nan for floats.

  • astype (Type, optional) – Data type of the output array. If None (default), the dtype matches the input array's dtype.

trains(require_all=True)

Iterate over trains for detector data.

Parameters:

require_all (bool) – If True (default), skip trains where any of the selected detector modules are missing data.

Yields:

train_data (dict) – A dictionary mapping key names (e.g. ‘data.adc’) to labelled arrays.

data_availability(module_gaps=False)

Get an array indicating what image data is available

Returns a boolean array (modules, entries), True where a module has data for a given train, False for missing data.

select_trains(trains)

Select a subset of trains from this data as a new object.

Slice trains by position within this data:

sel = det.select_trains(np.s_[:5])

Or select trains by train ID, with a slice or a list:

from extra_data import by_id
sel1 = det.select_trains(by_id[142844490 : 142844495])
sel2 = det.select_trains(by_id[[142844490, 142844493, 142844494]])

split_trains(parts=None, trains_per_part=None, frames_per_part=None)

Split this data into chunks with a fraction of the trains each.

At least one of parts, trains_per_part or frames_per_part must be specified. You can pass any combination of these.

Parameters:
  • parts (int) – How many parts to split the data into. If trains_per_part is also specified, this is a minimum, and it may make more parts. It may also make fewer if there are fewer trains in the data.

  • trains_per_part (int) – A maximum number of trains in each part. Parts will often have fewer trains than this.

  • frames_per_part (int) – A target number of frames in each part. Each chunk should have up to this many frames, but chunks always contain complete trains, so if this is less than one train, you may get single train chunks with more frames. When frames_per_part is used, the final chunk may be much smaller than the others.

write_virtual_cxi(filename, fillvalues=None)

Write a virtual CXI file to access the detector data.

The virtual datasets in the file provide a view of the detector data as if it was a single huge array, but without copying the data. Creating and using virtual datasets requires HDF5 1.10.

Parameters:
  • filename (str) – The file to be written. Will be overwritten if it already exists.

  • fillvalues (dict, optional) – Keys are dataset names (one of: data, gain, mask), each mapped to the fill value for missing data. The default is np.nan for float arrays and zero for integer arrays.

extra_data.components.identify_multimod_detectors(data, detector_name=None, *, single=False, clses=None)

Identify multi-module detectors in the data

Various detectors record data for individual X-ray pulses within trains, and we often want to process whichever detector was used in a run. This tries to identify the detector, so a user doesn’t have to specify it manually.

If single=True, this returns a tuple of (detector_name, access_class), raising ValueError unless exactly one detector is found. If single=False, it returns a set of these tuples.

clses may be a list of acceptable detector classes to check.
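A sketch of handling whichever detector a run contains, assuming a run object opened with RunDirectory():

```python
from extra_data.components import identify_multimod_detectors

# Expect exactly one multi-module detector in this run
name, cls = identify_multimod_detectors(run, single=True)
det = cls(run, detector_name=name)
```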

If you get data for a train from the main DataCollection interface, there is another way to combine detector modules from AGIPD, DSSC or LPD:

extra_data.stack_detector_data(train, data, axis=-3, modules=16, fillvalue=None, real_array=True, *, pattern='/DET/(\\d+)CH', starts_at=0)

Stack data from detector modules in a train.

Parameters:
  • train (dict) – Train data.

  • data (str) – The path to the device parameter of the data you want to stack, e.g. ‘image.data’.

  • axis (int) – Array axis on which you wish to stack (default is -3).

  • modules (int) – Number of modules composing a detector (default is 16).

  • fillvalue (number) – Value to use in place of data for missing modules. The default is nan (not a number) for floating-point data, and 0 for integers.

  • real_array (bool) – If True (default), copy the data together into a real numpy array. If False, avoid copying the data and return a limited array-like wrapper around the existing arrays. This is sufficient for assembling images using detector geometry, and allows better performance.

  • pattern (str) – Regex to find the module number in source names. Should contain a group which can be converted to an integer. E.g. r'/DET/JNGFR(\d+)' for one JUNGFRAU naming convention.

  • starts_at (int) – By default, uses module numbers starting at 0 (e.g. 0-15 inclusive). If the numbering is e.g. 1-16 instead, pass starts_at=1. This is not automatic because the first or last module may be missing from the data.

Returns:

combined – Stacked data for requested data path.

Return type:

numpy.ndarray
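For a train where every module is present, the stacking itself is conceptually equivalent to np.stack on a new module axis (the real call looks like stack_detector_data(train, 'image.data')). A pure-NumPy sketch with toy shapes:

```python
import numpy as np

# Toy stand-ins for 16 per-module arrays of shape (frames, slow, fast)
modules = [np.full((2, 4, 4), m, dtype=np.float32) for m in range(16)]

# axis=-3 inserts the module axis just before the two pixel axes,
# matching stack_detector_data's default
combined = np.stack(modules, axis=-3)
print(combined.shape)  # (2, 16, 4, 4)
```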