Release Notes


  • Fix a check which made it very slow to open runs with thousands of files (PR #183).




  • Fix get_array() for raw DSSC & LPD data with multiple sequence files per module (PR #155).

  • Drop unnecessary dependency on scipy (PR #147).


New features:

Deprecations & potentially breaking changes:

  • Earlier versions of EXtra-data unintentionally converted integer data from multi-module detectors to floats (in get_array() and get_dask_array()) with the special value NaN for missing data. This version preserves the data type, but missing integer data will be filled with 0. If this is not suitable, you can use the min_modules parameter to get only trains where all modules have data, or pass astype=np.float64, fill_value=np.nan to convert data to floats and fill gaps with NaN as before.

  • Special handling in get_series() to label some fast detector data with pulse IDs was deprecated (PR #131). We believe no-one is using this. If you are, please get in touch to discuss alternatives.
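
The change in fill behaviour for integer detector data can be illustrated with plain NumPy. This is only a sketch of the idea, not EXtra-data's implementation: the arrays `present` and `values` are made-up stand-ins for "which modules have data" and the stored integers.

```python
import numpy as np

# Hypothetical example data: 3 trains x 2 modules of uint16 values.
# Module 1 has no data in train 1.
present = np.array([[True, True],
                    [True, False],
                    [True, True]])
values = np.arange(1, 7, dtype=np.uint16).reshape(3, 2)

# New default behaviour: the integer dtype is kept and gaps become 0.
as_int = np.where(present, values, np.uint16(0))

# Equivalent of astype=np.float64, fill_value=np.nan:
# convert to floats and mark gaps with NaN, as earlier versions did.
as_float = np.where(present, values.astype(np.float64), np.nan)

# Equivalent in spirit to requiring all modules (min_modules):
# keep only trains where every module has data.
complete = present.all(axis=1)
only_complete = values[complete]
```

With integers there is no spare value to mean "missing", so a real 0 and a filled 0 look the same; that is why the float conversion or the all-modules selection may be preferable for some analyses.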

Fixes and improvements:

  • Prevent select() from rediscovering things that had previously been excluded from the selection (PR #128).

  • Fix default fill value for uint64 data in stack_detector_data() (PR #103).

  • Don’t convert integer data to floats in get_array() and get_dask_array() methods for multi-module detector data (PR #98).

  • Documented the KeyData interface added in 1.3 (PR #96).

  • Fix extra-data-validate when a file cannot be opened (PR #93).

  • Fix name of extra-data-validate in its own help info (PR #90).


New features:

  • A new interface for data from a single source & key: use run[source, key] to get a KeyData object, which can inspect and load the data from several sequence files (PR #70).

  • Methods which took a by_index object now accept slices (e.g. numpy.s_[:10]) or indices directly (PR #68, PR #79). This includes select_trains(), get_array() and various methods for multi-module detectors, described in Multi-module detector data.

  • extra-data-make-virtual-cxi --fill-value now accepts numbers in hexadecimal, octal & binary formats, e.g. 0xfe (PR #73).

  • Added an unstack parameter to the get_array() method for multi-module detectors, making it possible to retrieve an array as the data is stored, without separating the train & pulse axes (PR #72).

  • Added a require_all parameter to the trains() method for multi-module detectors, to allow iterating with incomplete frames included (PR #77).

  • New identify_multimod_detectors() function to find multi-module detectors in the data (PR #61).
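
Two of the items above rest on standard Python and NumPy behaviour that can be checked without any data files. The sketch below shows what numpy.s_[:10] evaluates to, and one way a string such as 0xfe could be parsed as a number. `parse_fill_value` is a hypothetical helper written for this illustration; it is not taken from extra-data-make-virtual-cxi, and `train_ids` is made-up data.

```python
import numpy as np

# numpy.s_ builds ordinary slice objects from indexing syntax,
# so they can be passed around as function arguments.
sel = np.s_[:10]

# Made-up train IDs, just to show the slice being applied.
train_ids = np.arange(10000, 10050)
first_ten = train_ids[sel]


def parse_fill_value(text):
    """Hypothetical helper: parse an integer written in decimal,
    hexadecimal (0x...), octal (0o...) or binary (0b...) notation.

    Passing base 0 makes int() infer the base from the prefix.
    """
    return int(text, 0)
```

For example, `parse_fill_value("0xfe")` and `parse_fill_value("254")` give the same integer.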

Fixes and improvements:

  • Fix writing selected detector frames with write_frames() for corrected data (PR #82).

  • Fix compatibility with pandas 1.1 (PR #83).

  • The trains() iterator no longer includes zero-length arrays when a source has no data for that train (PR #75).

  • Fix a test which failed when run as root (PR #67).


New features:

Fixes and improvements:

  • EXtra-data now tries to manage how many HDF5 files it has open at one time, to avoid hitting a limit on the total number of open files in a process (PR #25 and PR #48). Importing EXtra-data will now raise this limit as far as it can (to 4096 on Maxwell), and try to keep the files it handles to no more than half of this. Files should be silently closed and reopened as needed, so this shouldn’t affect how you use it.

  • A better way of creating Dask arrays to avoid problems with Dask’s local schedulers, and with arrays comprising very large numbers of files (PR #63).

  • The classes for accessing multi-module detector data (see Multi-module detector data) and writing virtual CXI files no longer assume that the same number of frames are recorded in every train (PR #44).

  • Fix validation where a file has no trains at all (PR #42).

  • More testing of EuXFEL file format version 1.0 (PR #56).

  • Test coverage measurement fixed with multiprocessing (PR #37).

  • Tests switched from mock module to unittest.mock (PR #52).
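
The idea of keeping the number of open files under a limit, closing and reopening them as needed, can be sketched as a small least-recently-used cache. This is a generic illustration using ordinary Python file objects; the class name `FileCache` and its methods are invented here and do not describe how EXtra-data manages HDF5 files internally.

```python
from collections import OrderedDict


class FileCache:
    """Illustrative cache keeping at most max_open files open at once."""

    def __init__(self, max_open):
        self.max_open = max_open
        # Maps path -> open file object, least recently used first.
        self._open = OrderedDict()

    def get(self, path):
        """Return an open file for path, reopening it if it was closed."""
        f = self._open.pop(path, None)
        if f is None:
            f = open(path, "rb")
        # (Re)insert at the end to mark it as most recently used.
        self._open[path] = f
        # Close the least recently used files if we are over the limit.
        while len(self._open) > self.max_open:
            _, oldest = self._open.popitem(last=False)
            oldest.close()
        return f

    def close_all(self):
        """Close every file still held by the cache."""
        while self._open:
            _, f = self._open.popitem()
            f.close()
```

Callers would always go through `get()` rather than holding on to file objects, since a file returned earlier may have been closed to make room for another.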


  • Opening and validating run directories now handles files in parallel, which should make it substantially faster (PR #30).

  • Various data access operations no longer require finding all the keys for a given data source, which saves time in certain situations (PR #24).

  • open_run() now accepts numpy integers for proposal and run numbers, as well as standard Python integers (PR #34).

  • Run map cache files can be saved on the EuXFEL online cluster, which speeds up reopening runs there (PR #36).

  • Added tests with simulated bad files for the validation code (PR #23).


  • New get_dask_array() method for accessing detector data with Dask (PR #18).

  • Fix extra-data-validate with a run directory without a cached data map (PR #12).

  • Add .squeeze() method for virtual stacks of detector data from stack_detector_data() (PR #16).

  • Close each file after reading its metadata, to avoid hitting the limit of open files when opening a large run (PR #8). This is a mitigation: you will still hit the limit if you access data from enough files. The default limit on Maxwell is 1024 files, but you can raise this to 4096 using the Python resource module.

  • Display progress information while validating a run directory (PR #19).

  • Display run duration to only one decimal place (PR #5).

  • Documentation reorganised to emphasise tutorials and examples (PR #10).
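
Raising the open file limit with the Python resource module, as mentioned above, might look like the following sketch. It assumes a Unix-like system (the resource module is not available on Windows); `raise_open_file_limit` is a name chosen for this example, and the default target of 4096 is simply the Maxwell figure quoted above.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft limit on open files towards target.

    The soft limit can only be raised as far as the hard limit,
    so the target is capped at that. Returns the soft limit in
    effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the current limit is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


new_limit = raise_open_file_limit()
```

The change applies only to the current process and any child processes it starts afterwards.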

This version requires Python 3.6 or above.


First separated version. No functional changes from karabo_data 0.7.

Earlier history

The code in EXtra-data was previously released as karabo_data, up to version 0.7. See the karabo_data release notes for changes before the renaming.