Release Notes

1.24

New options for lsxfel and info() to show data counts for instrument sources, separate from key names (PR #727).
New lsxfel --source option to select sources (PR #727).
lsxfel and info() will now group together similar source names, e.g. numbered motors, for a more concise display (PR #715). This can be disabled if necessary.
New method KeyData.train_index_bounds() to get the start & end location of each train in an array of data (PR #704).
New virtual overview files will now contain a copy of the RUN section from runs, rather than external links, making it faster to scan sources by device class (PR #723).
Fix getting units in some specific cases, and don’t fail in xarray() if there’s an error getting units (PR #696).
Fix data_counts() in with out-of-order train IDs and labelled=False (PR #697).
Opening many files in a folder in parallel no longer uses forked processes, avoiding some rare but tricky bugs (PR #702).
Detector component classes now warn if they fall back to raw data without the caller specifying raw=True. This will likely become an error in the future.
The identification of raw & corrected data in detector component classes was made more robust (PR #713).

1.23.2

2025-12-09

Fix a bug with the new per-instrument default aliases files, which are not currently accessible for external users (PR #694).

1.23.1

2025-12-08

Internal fix for testing mechanisms.

1.23

2025-12-04

Aliases can now be set in a per-instrument default file at /gpfs/exfel/sw/<instrument>/extra-data-aliases-default.yml (PR #682). This file is copied to proposal folders the first time a run is opened, so changing the defaults only affects new proposals.
Coordinate labels can now be specified for KeyData.xarray() with the extra_coords parameter (PR #672). Also, if dimensions are named using extra_dims, default integer coordinate labels will be added to any dimensions not given in extra_coords.
Support for reading the ERRATA & REDUCTION sections of EXDF files through run.auxiliary (PR #651).
New command extra-data-readable for checking if HDF5 data files can be read without hanging (PR #675).
Fix finding an arbitrary key for control sources with keys selected (PR #648).
Fix getting information from filenames when using a virtual overview file (PR #652).
Speed up getting information from filenames when not using a virtual overview file (PR #688).

1.22

2025-07-16

Train selection now accepts boolean arrays, either Numpy arrays of suitable length or xarray arrays with a trainId coordinate (PR #612).
The run.alias repr now includes links to edit any alias files in use, via max-jhub (PR #640).
The lsxfel --detail option will do a substring match if the argument doesn’t look like a glob pattern (containing *?[) (PR #628).
New lsxfel --aggregators option to show the data aggregator saving each source (PR #625).
Reading compressed detector data in parallel (added in 1.21) can now work around 15% faster (PR #633)
Fix tab completion on aliases (PR #626).
Fix as_single_value() with string data (PR #623).
Fix handling of CONTROL sources with only RUN keys (PR #622).
Fix streaming karabo-bridge data from files including control data with arrays of strings (PR #616).
open_run() accepts a path-like object for proposal= (PR #641).
Make zlib_into an optional dependency rather than absolutely required (PR #621).
Fix a rare race condition when creating a virtual overview file, which could allow something to read a partially written file (PR #624).
Support for custom error messages in SourceNameError (PR #619).

Breaking changes

String data stored from control sources is now returned as str objects rather than bytes from various methods (PR #623).

1.21

2025-03-24

Detector data classes such as AGIPD1M can now decompress data in parallel, providing a significant speedup for reading compressed data (PR #593). This is used by default with 16 threads on suitable data, and can be controlled by passing decompress_threads=N to .ndarray() or .xarray() methods, or setting the EXTRA_NUM_THREADS environment variable. Specify 1 thread to use HDF5’s single-threaded decompression.
The .pulse_id_coordinates() and .cell_id_coordinates() methods on AGIPD, DSSC & LPD data objects now respect pulse selections (PR #604).
Fix running the extra-data-validate command with no --skip parameter (PR #606).

1.20

2025-02-26

Loading data as an xarray object will now include the units symbol as a attribute called units (PR #592).
Some improvements to virtual overview files when one sequence file is missing data (PR #600) or when no data was recorded for a particular source in an entire run (PR #601, PR #602).
EXtra-data now requires Python 3.10 or above (PR #294).

1.19

2025-01-24

open_run() now combines raw & corrected data by default, preferring raw for source names found in both (PR #569). This means corrected detector data is visible by default in recent runs.
Detector data classes can now select corrected or raw data with the parameter raw=False or True (PR #558). If this is not specified, they will use corrected data if available, and raw if not, in line with the previous behaviour. This also depends on how you open the run.
source_name in run and (source_name, key_name) in run now work (PR #582).
You can now select train IDs in DataCollection and SourceData like run[tids] (PR #559)
Make it easier to select a single train ID using by_id, and fix raising IndexError when selecting a single train index as an integer (PR #558).
You can use the | operator to combine multiple DataCollection or SourceData objects, equivalent to their union() methods (PR #582).
New option run[source].run_values(inc_timestamps=False) to get a dict of run values excluding timestamps (PR #581).
Specific parts of validation can now be skipped with a new extra-data-validate --skip option (PR #522).
Avoid memory errors & improve performance of reading XTDF detector data with a pulse selection (PR #576).
Fix det.masked_data().select_pulses() in XTDF detector components (PR #571)
Fix using file_filter parameter when opening a run (PR #566)
PyYAML is now a full dependency (PR #577).

1.18

2024-09-23

EXtra-data now requires Python 3.9 or above (PR #554).
Aliases are now case-insensitive, and allow - & _ interchangeably, so las-x and Las_X are considered the same (PR #515).
Add concept of ‘legacy’ source names, references to sources which have been renamed (PR #527). This will be used for calibrated detector data.
Add source, key & alias completions for IPython (PR #514).
New .masked_data() method to load detector data with mask (PR #518). See Multi-module detector data.
A new euxfel_local_time option for DataCollection.train_timestamps() to convert timestamps to local (German) time (PR #550).
Return timezone-aware values from train_timestamps() where possible (PR #550).
Allow kd[trains] for multi-module KeyData objects (PR #520).
Add optional index group filter to SourceData.one_key() (PR #526).
Fixed various compatibility issues with Numpy 2.0 (PR #530).
Allow caching file maps from ‘open’ & ‘red’ run folders in the proposal scratch folder (PR #548, PR #549).
When the file map is cached in multiple places, read the newest version (PR #524).
Prevent unwanted iteration over a KeyData object (PR #519).
Fix making virtual CXI files for JUNGFRAU data if the ‘mask’ dataset is not present (PR #511).
Fix the message shown when skipping files because of how they’re stored (PR #525).

1.17

2024-04-10

open_run() can now combine additional data locations besides the main raw & proc folders (PR #298):
```
run = open_run(6616, 31, data=['raw', 'scratch/test_cal'])
```
This specifies a list of paths under the proposal directory. The folders given should contain run folders with 4 digit run numbers, e.g. r0031. If the same source names appear, those sources will be visible from the last location in the list.
Add .pulse_id_coordinates() & .train_id_coordinates() for XTDF image data (PR #506).
Add data_availability() method for multi-module detectors (PR #504).
New include_empty option to include empty trains when iterating KeyData with trains() (PR #501)
Support selecting down DataCollection by SourceData objects (PR #499)
Merge attributes of key group and value dataset for CONTROL keys (PR #498)
Add warning when select() with require_all discards all trains (PR #497).
Miscellaneous improvements to .buffer_shape() method for multi-module detector data (PR #505).
Return a copy of the array for detector_key.train_id_coordinates() (PR #502)

1.16

2024-02-26

Fix loading aliases for old proposals (PR #490).
Hide the message about proposal aliases when opening a run. (PR #478).
extra-data-validate gives clearer messages for filesystem errors (PR #472).
Fix OverflowError in lsxfel & run.info() with some corrupted train IDs (PR #489).
Fix a selection of deprecation warnings (PR #469).
Add a development tool to copy the structure of EuXFEL data files without the data (PR #467).

1.15.1

2023-11-17

JUNGFRAU recognises some additional naming patterns seen in new detector instances (PR #464).

1.15

2023-11-06

New properties units and units_name on KeyData objects to retrieve units metadata written by Karabo (PR #449).
New command karabo-bridge-serve-run to more conveniently stream data from a saved run in Karabo Bridge format (PR #458).
Fix split_trains() being very slow when splitting a long run into many pieces (PR #459).
Include XTDF sources in lsxfel when details are enabled (PR #440).

1.14

2023-07-27

New train_id_coordinates method for source data, like the one for key data (PR #431).
New attributes .nbytes, .size_mb and .size_gb to conveniently see how much data is present for a given source & key (PR #430).
Fix .ndarray(module_gaps=True) for xtdf detector data (PR #432).

1.13

2023-06-15

Support for aliases (PR #367), to provide shorter, more meaningful names for specific sources & keys, and support for loading a default set of aliases for the proposal when using open_run() (PR #398). See Using aliases for more information.
New APIs for multi-module detector data to work more like regular sources and keys, e.g. agipd['image.data'].ndarray() (PR #337). These changes also change how Dask arrays are created for multi-module detector data, hopefully making them more efficient for typical use cases.
New method plot_missing_data() to show where sources are missing data for some trains (PR #402).
Merging data with union() now applies the same train IDs to all included sources, whereas previously sources could have different train IDs selected (PR #416).
A new property run[src].device_class exposes the Karabo device class name for control sources (PR #390).
JUNGFRAU now accepts a first_modno for detectors where the first module is named with e.g. JNGFR03 (PR #379).
run[src].is_control and .is_instrument properties (PR #403).
SourceData objects now have .data_counts(), .drop_empty_trains() and .split_trains() methods like KeyData (PR #404, PR #405, PR #407).
New method SourceData.one_key() to quickly find an arbitrary key for a source.
select() now accepts a require_any=True parameter to filter trains where at least one of the selected sources & keys has data, complementing require_all (PR #400).
New property KeyData.source_file_paths to locate real data files even if the run was opened using a virtual overview file (PR #325).
New SourceData properties storage_class, data_category and aggregator to extract details from the filename & folder path, for the main folder structure on EuXFEL compute clusters (PR #399).
It’s now possible to pip install extra-data[complete] to install EXtra-data along with all optional dependencies (PR #414).
Fix for missing CONTROL data when accessing data by train (PR #359).
Fix using with to open & close runs when a virtual overview file is found (PR #375).
Fix calling open_run() with data='all', parallelize=False (PR #338).
Fix using DataCollection objects with multiprocessing and spawned subprocesses (PR #348).
Better error messages when files are missing INDEX or METADATA sections (PR #361).
Fix creating virtual overview files with extended metadata when source files are format version 1.1 or newer (PR #332).

1.12

2022-06-10

SourceData objects now expose RUN information for control sources via new .run_value() and .run_values() methods, and metadata about the run from a new .run_metadata() method (PR #293).
KeyData.ndarray() can now read into a pre-allocated array passed as the out parameter (PR #307)
KeyData.xarray() can return an xarray Dataset object to represent data with named fields (PR #301).
The JUNGFRAU data access class now recognises ‘JF500K’ in source names (PR #300).
Fix sending around FileAccess objects with cloudpickle, which is used by Dask and clusterfutures (PR #303).
Fix permissions errors from opening the run files map JSON files (PR #304).
Fix errors opening runs with data='all' with an empty proc folder (PR #317).
The QuickView class deprecated in version 1.9 was removed.

1.11

2022-03-21

New keep_dims option for trains(), train_from_id() and train_from_index(). Normally the trains/pulses dimension is dropped from the arrays these methods return if it has length 1, but passing keep_dims=True will preserve this dimension (PR #288).
New select_trains() and split_trains() methods for multi-module detector data (PR #278).
select() now accepts a list of source name patterns, which is more convenient for some use cases (PR #287).
Fix open_run(..., data='all') for runs with no proc data (PR #281).
Fix single run status when opening a run with a virtual overview file (PR #290).
Sources with no data recorded in a run are now represented in virtual overview files (PR #287).
Fix a race condition where files were closed in one thread as they were opened in another (PR #289).

1.10

2022-02-01

EXtra-data can now generate and use “virtual overview” files (PR #69). A virtual overview file is a single file containing the metadata and indices of an entire run, and links to the source files for the data (using HDF5 virtual datasets). When virtual overview files are available, open_run() and RunDirectory() will use them automatically; this should make it faster to open and explore runs (but not to read data).
You can now specify parallelize=False for open_run() and RunDirectory() to open files in serial (PR #158). This can be necessary if you’re opening runs inside a parallel worker.
Fix various features to work when 0 trains of data are selected (PR #260).
Fix union() when starting with already-unioned data from different runs (PR #261).
Fix for opening runs with data='all' and combining data in certain ways (PR #274).
Fixes to ensure that files are not unnecessarily reopened (PR #264).

1.9.1

2021-11-30

Fix errors from data_counts() and drop_empty_trains() when different train IDs exist for different sources (PR #257).

1.9

2021-11-25

New KeyData.as_single_value() method to check that a key remains constant (within a specified tolerance) through the data, and return it as a single value (PR #228).
New KeyData.train_id_coordinates() method to get train IDs associated with specific data as a NumPy array (PR #226).
extra-data-validate now checks that timestamps in control data are in increasing order (PR #94).
Ensure basic DataCollection functionality, including getting values from RUN and inspecting the shape & dtype of other data, works when no trains are selected (PR #244).
Fix reading data where some files in a run contain zero trains, as seen in some of the oldest EuXFEL data (PR #225).
Minor performance improvements for select() when selecting single keys (no wildcards) and when selecting all keys along with require_all=True (PR #248).

Deprecations & potentially breaking changes:

The QuickView class is deprecated. We believe no-one is using this. If you are, please get in touch with da-support@xfel.eu .
Removed the h5index module and the hdf5_paths function, which were deprecated in 1.7.

1.8.1

2021-11-01

Fixed two different bugs introduced in 1.8 affecting loading data for multi-module detectors with get_array() when only some of the modules captured data for a given train (PR #234).
Fix open_run(..., data='all') when all sources in the raw data are copied to the corrected run folder (PR #236).

1.8

2021-10-06

New API for inspecting the data associated with a single source (PR #206). Use a source name to get a SourceData object:
```
xgm = run['SPB_XTD9_XGM/DOOCS/MAIN']
xgm.keys()  # List the available keys
beam_x = xgm['beamPosition.ixPos'].ndarray()
```
See Getting data by source & key for more details.
Combining data from the same run with union() now preserves ‘single run’ status, so run_metadata() still works (PR #208). This only works with more recent data (file format version 1.0 and above).
Reading data for multi-module detectors with get_array() is now faster, especially when selecting a subset of pulses (PR #218, PR #220).
Fix data_counts() when data is missing for some selected trains (PR #222).

Deprecations & potentially breaking changes:

The numpy_to_cbf and hdf5_to_cbf functions have been removed (PR #213), after they were deprecated in 1.7. If you need to create CBF files, consult the Fabio package.
Some packages required for karabo-bridge-serve-files are no longer installed along with EXtra-data by default (PR #211). Install with pip install extra-data[bridge] if you need this functionality.

1.7

2021-08-03

New methods to split data into chunks with a similar number of trains in each: DataCollection.split_trains() and KeyData.split_trains() (PR #184).
New method KeyData.drop_empty_trains() to select only trains with data for a given key (PR #193).
Virtual CXI files can now be made for multi-module JUNGFRAU detectors (PR #62).
extra-data-validate now checks INDEX for control sources as well as instrument sources (PR #188).
Fix opening some files written by a test version of the DAQ, marked with format version 1.1 (PR #198).
Fix making virtual CXI files with h5py 3.3 (PR #195).

Deprecations & potentially breaking changes:

Remove special behaviour for get_series() with big detector data, deprecated in 1.4 (PR #196).
Deprecated some functions for converting data to CBF format, and the h5index module (PR #197). We believe these were unused.

1.6.1

2021-05-14

Fix a check which made it very slow to open runs with thousands of files (PR #183).

1.6

2021-05-11

‘Suspect’ train IDs are now included by default (PR #178). Pass inc_suspect_trains=False to exclude them (as in 1.5), or the --exc-suspect-trains option for extra-data-make-virtual-cxi.
open_run() can now combine raw & proc data when called with data='all' (PR #174).
Several new methods for accessing different kinds of metadata:
- DataCollection.run_metadata() - per-run metadata including timestamps and proposal number (PR #175)
- DataCollection.get_run_value() and DataCollection.get_run_values() - per-run data from the control system (PR #164)
Selecting pulses should work for LPD1M.get_array() in parallel gain mode (PR #173)
Several fixes for handling ‘suspect’ train IDs (PR #172).
h5py >= 2.10 is now required (PR #177).

1.5

2021-04-22

Exclude ‘Suspect’ train IDs, fixing occasional issues in particular with AGIPD data containing bad train IDs (PR #121).
Avoid converting train IDs to floats when using run.select(..., require_all=True) (PR #159).
New method train_timestamps() to get approximate timestamps for each train in the data (PR #165)
Checking whether a given source & key is present is now much faster in some cases (PR #170).
lsxfel can display structured datatypes nicely (PR #160).
karabo-bridge-serve-files can now send data on any ZMQ endpoint, not only tcp:// sockets (PR #169).
Ensure virtual CXI files created with EXtra-data can be read using HDF5 1.10 (PR #171).
Some fixes to make the test suite more robust (PR #156, PR #167, PR #169).

1.4.1

2021-03-10

Fix get_array() for raw DSSC & LPD data with multiple sequence files per module (PR #155).
Drop unnecessary dependency on scipy (PR #147).

1.4

2021-02-12

New features:

select() has a new option require_all=True to include only trains where all the selected sources & keys have data (PR #113).
select() now accepts DataCollection and KeyData objects, making it easy to re-select the same sources in another run (PR #114).
New classes for accessing data from AGIPD500K and JUNGFRAU multi-module detectors (PR #139, PR #140).
New options for stack_detector_data() to allow it to work with different data formats, including JUNGFRAU detectors (PR #141).
New option for LPD1M to read data taken in ‘parallel gain’ mode, giving it useful axis labels (PR #122).
get_array() for multi-module detectors has a new option to label frames with memory cell IDs instead of pulse IDs (PR #101).
DataCollection.trains() can now optionally yield flat, single level dictionaries with (source, key) keys instead of nested dictionaries (PR #112).
New method KeyData.data_counts() (PR #92).
Labelled arrays from KeyData.xarray() and DataCollection.get_array() now have a name made from the source & key names, or as specified by the name= parameter (PR #87).

Deprecations & potentially breaking changes:

Earlier versions of EXtra-data unintentionally converted integer data from multi-module detectors to floats (in get_array() and get_dask_array()) with the special value NaN for missing data. This version preserves the data type, but missing integer data will be filled with 0. If this is not suitable, you can use the min_modules parameter to get only trains where all modules have data, or pass astype=np.float64, fill_value=np.nan to convert data to floats and fill gaps with NaN as before.
Special handling in get_series() to label some fast detector data with pulse IDs was deprecated (PR #131). We believe no-one is using this. If you are, please contact da-support@xfel.eu to discuss alternatives.

Fixes and improvements

Prevent select() from rediscovering things that had previously been excluded from the selection (PR #128).
Fix default fill value for uint64 data in stack_detector_data() (PR #103).
Don’t convert integer data to floats in get_array() and get_dask_array() methods for multi-module detector data (PR #98).
Documented the KeyData interface added in 1.3 (PR #96)
Fix extra-data-validate when a file cannot be opened (PR #93).
Fix name of extra-data-validate in its own help info (PR #90).

1.3

2020-08-03

New features:

A new interface for data from a single source & key: use run[source, key] to get a KeyData object, which can inspect and load the data from several sequence files (PR #70).
Methods which took a by_index object now accept slices (e.g. numpy.s_[:10]) or indices directly (PR #68, PR #79). This includes select_trains(), get_array() and various methods for multi-module detectors, described in Multi-module detector data.
extra-data-make-virtual-cxi --fill-value now accepts numbers in hexadecimal, octal & binary formats, e.g. 0xfe (PR #73).
Added an unstack parameter to the get_array() method for multi-module detectors, making it possible to retrieve an array as the data is stored, without separating the train & pulse axes (PR #72).
Added a require_all parameter to the trains() method for multi-module detectors, to allow iterating with incomplete frames included (PR #77).
New identify_multimod_detectors() function to find multi-module detectors in the data (PR #61).

Fixes and improvements:

Fix writing selected detector frames with write_frames() for corrected data (PR #82).
Fix compatibility with pandas 1.1 (PR #83).
The trains() iterator no longer includes zero-length arrays when a source has no data for that train (PR #75).
Fix a test which failed when run as root (PR #67).

1.2

2020-06-04

New features:

New karabo-bridge-serve-files --append-detector-modules option to combine data from multiple detector modules. This makes streaming large detector data more similar to the live data streams (PR #40 and PR #51).
karabo-bridge-serve-files has new options to control the ZMQ socket and the use of an infiniband network interface (PR #50). It also works with newer versions of the karabo_bridge Python package.
New options to filter files from dCache which are unavailable or need to be read from tape when opening a run (PR #35). This also comes with a new command extra-data-locality to inspect this information.
New lsxfel --detail option to show more detail on selected sources (PR #38).
New extra-data-make-virtual-cxi --fill-value option to control the fill value for missing data (PR #59)
New method write_frames() to save a subset of detector frames to a new file in EuXFEL HDF5 format (PR #47).
DataCollection.select() can take arbitrary iterables of patterns, rather than just lists (PR #43).

Fixes and improvements:

EXtra-data now tries to manage how many HDF5 files it has open at one time, to avoid hitting a limit on the total number of open files in a process (PR #25 and PR #48). Importing EXtra-data will now raise this limit as far as it can (to 4096 on Maxwell), and try to keep the files it handles to no more than half of this. Files should be silently closed and reopened as needed, so this shouldn’t affect how you use it.
A better way of creating Dask arrays to avoid problems with Dask’s local schedulers, and with arrays comprising very large numbers of files (PR #63).
The classes for accessing multi-module detector data (see Multi-module detector data) and writing virtual CXI files no longer assume that the same number of frames are recorded in every train (PR #44).
Fix validation where a file has no trains at all (PR #42).
More testing of EuXFEL file format version 1.0 (PR #56).
Test coverage measurement fixed with multiprocessing (PR #37).
Tests switched from mock module to unittest.mock (PR #52).

1.1

2020-03-06

Opening and validating run directories now handles files in parallel, which should make it substantially faster (PR #30).
Various data access operations no longer require finding all the keys for a given data source, which saves time in certain situations (PR #24).
open_run() now accepts numpy integers for proposal and run numbers, as well as standard Python integers (PR #34).
Run map cache files can be saved on the EuXFEL online cluster, which speeds up reopening runs there (PR #36).
Added tests with simulated bad files for the validation code (PR #23).

1.0

2020-02-21

New get_dask_array() method for accessing detector data with Dask (PR #18).
Fix extra-data-validate with a run directory without a cached data map (PR #12).
Add .squeeze() method for virtual stacks of detector data from stack_detector_data() (PR #16).
Close each file after reading its metadata, to avoid hitting the limit of open files when opening a large run (PR #8). This is a mitigation: you will still hit the limit if you access data from enough files. The default limit on Maxwell is 1024 files, but you can raise this to 4096 using the Python resource module.
Display progress information while validating a run directory (PR #19).
Display run duration to only one decimal place (PR #5).
Documentation reorganised to emphasise tutorials and examples (PR #10).

This version requires Python 3.6 or above.

0.8

2019-11-18

First separated version. No functional changes from karabo_data 0.7.

Earlier history

The code in EXtra-data was previously released as karabo_data, up to version 0.7. See the karabo_data release notes for changes before the renaming.