Release Notes



  • open_run() can now combine additional data locations besides the main raw & proc folders (PR #298):

    run = open_run(6616, 31, data=['raw', 'scratch/test_cal'])

    This specifies a list of paths under the proposal directory. The folders given should contain run folders with 4-digit run numbers, e.g. r0031. If the same source names appear in more than one location, the sources from the last location in the list take precedence.

  • Add .pulse_id_coordinates() & .train_id_coordinates() for XTDF image data (PR #506).

  • Add data_availability() method for multi-module detectors (PR #504).

  • New include_empty option to include empty trains when iterating KeyData with trains() (PR #501).
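
The semantics can be sketched with a toy stand-in for the iterator (the train IDs and data here are made up; the real option is include_empty on KeyData.trains()):

```python
# Toy illustration of include_empty: train 2 recorded no data.
data_per_train = {1: [10, 11], 2: None, 3: [30]}

def iter_trains(include_empty=False):
    for tid, data in sorted(data_per_train.items()):
        if data is None and not include_empty:
            continue  # default behaviour: skip trains with no data
        yield tid, data

assert [tid for tid, _ in iter_trains()] == [1, 3]
assert [tid for tid, _ in iter_trains(include_empty=True)] == [1, 2, 3]
```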

  • Support selecting down DataCollection by SourceData objects (PR #499).

  • Merge attributes of key group and value dataset for CONTROL keys (PR #498).

  • Add warning when select() with require_all discards all trains (PR #497).

  • Miscellaneous improvements to .buffer_shape() method for multi-module detector data (PR #505).

  • Return a copy of the array for detector_key.train_id_coordinates() (PR #502)



  • Fix loading aliases for old proposals (PR #490).

  • Hide the message about proposal aliases when opening a run (PR #478).

  • extra-data-validate gives clearer messages for filesystem errors (PR #472).

  • Fix an OverflowError in lsxfel and when handling some corrupted train IDs (PR #489).

  • Fix a selection of deprecation warnings (PR #469).

  • Add a development tool to copy the structure of EuXFEL data files without the data (PR #467).



  • JUNGFRAU recognises some additional naming patterns seen in new detector instances (PR #464).





  • New train_id_coordinates method for source data, like the one for key data (PR #431).

  • New attributes .nbytes, .size_mb and .size_gb to conveniently see how much data is present for a given source & key (PR #430).
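
For a sense of scale, the byte count follows directly from shape and dtype. This sketch uses NumPy with a made-up detector-sized array, and assumes decimal megabytes (10**6 bytes), which may not match the property's actual convention:

```python
import numpy as np

# Back-of-envelope check of what .nbytes would report for a key:
# shape and dtype determine the byte count exactly.
shape, dtype = (1000, 16, 512, 128), np.dtype(np.uint16)
nbytes = int(np.prod(shape)) * dtype.itemsize
size_mb = nbytes / 1e6  # assuming decimal megabytes

assert nbytes == 2_097_152_000
```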

  • Fix .ndarray(module_gaps=True) for xtdf detector data (PR #432).



  • Support for aliases (PR #367), to provide shorter, more meaningful names for specific sources & keys, and support for loading a default set of aliases for the proposal when using open_run() (PR #398). See Using aliases for more information.

  • New APIs for multi-module detector data to work more like regular sources and keys, e.g. agipd['image.data'].ndarray() (PR #337). This also changes how Dask arrays are created for multi-module detector data, hopefully making them more efficient for typical use cases.

  • New method plot_missing_data() to show where sources are missing data for some trains (PR #402).

  • Merging data with union() now applies the same train IDs to all included sources, whereas previously sources could have different train IDs selected (PR #416).

  • A new property run[src].device_class exposes the Karabo device class name for control sources (PR #390).

  • JUNGFRAU now accepts a first_modno for detectors where the first module is named with e.g. JNGFR03 (PR #379).

  • run[src].is_control and .is_instrument properties (PR #403).

  • SourceData objects now have .data_counts(), .drop_empty_trains() and .split_trains() methods like KeyData (PR #404, PR #405, PR #407).
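
A toy sketch of the kind of chunking split_trains() performs, dividing consecutive train IDs into roughly equal parts (the helper name and IDs here are invented, not the real API):

```python
# Split a run's train IDs into `parts` roughly equal consecutive chunks,
# mimicking what split_trains-style methods do.
def split_ids(train_ids, parts):
    n = len(train_ids)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [train_ids[a:b] for a, b in zip(bounds, bounds[1:])]

chunks = split_ids(list(range(10000, 10010)), 3)
assert [len(c) for c in chunks] == [3, 3, 4]
assert sum(chunks, []) == list(range(10000, 10010))
```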

  • New method SourceData.one_key() to quickly find an arbitrary key for a source.

  • select() now accepts a require_any=True parameter to filter trains where at least one of the selected sources & keys has data, complementing require_all (PR #400).
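
The difference between the two filters can be illustrated with a made-up per-train availability table (plain Python, not the real API):

```python
# Which of the selected sources have data in each train (hypothetical).
available = {
    101: {'xgm': True,  'camera': True},
    102: {'xgm': True,  'camera': False},
    103: {'xgm': False, 'camera': False},
}

keep_all = [t for t, av in available.items() if all(av.values())]
keep_any = [t for t, av in available.items() if any(av.values())]

assert keep_all == [101]       # require_all: every source must have data
assert keep_any == [101, 102]  # require_any: at least one source has data
```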

  • New property KeyData.source_file_paths to locate real data files even if the run was opened using a virtual overview file (PR #325).

  • New SourceData properties storage_class, data_category and aggregator to extract details from the filename & folder path, for the main folder structure on EuXFEL compute clusters (PR #399).

  • It’s now possible to pip install extra-data[complete] to install EXtra-data along with all optional dependencies (PR #414).

  • Fix for missing CONTROL data when accessing data by train (PR #359).

  • Fix using with to open & close runs when a virtual overview file is found (PR #375).

  • Fix calling open_run() with data='all', parallelize=False (PR #338).

  • Fix using DataCollection objects with multiprocessing and spawned subprocesses (PR #348).

  • Better error messages when files are missing INDEX or METADATA sections (PR #361).

  • Fix creating virtual overview files with extended metadata when source files are format version 1.1 or newer (PR #332).



  • SourceData objects now expose RUN information for control sources via new .run_value() and .run_values() methods, and metadata about the run from a new .run_metadata() method (PR #293).

  • KeyData.ndarray() can now read into a pre-allocated array passed as the out parameter (PR #307).
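
The out pattern is the usual NumPy one: fill a pre-allocated buffer rather than allocating a new array on each read. A minimal sketch with an in-memory array standing in for file data:

```python
import numpy as np

# An in-memory array stands in for data read from files.
file_data = np.arange(12, dtype=np.uint16).reshape(3, 4)

out = np.empty((3, 4), dtype=np.uint16)  # allocated once, reusable per read
out[:] = file_data                       # read into the existing buffer

assert np.array_equal(out, file_data)
```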

  • KeyData.xarray() can return an xarray Dataset object to represent data with named fields (PR #301).

  • The JUNGFRAU data access class now recognises ‘JF500K’ in source names (PR #300).

  • Fix sending around FileAccess objects with cloudpickle, which is used by Dask and clusterfutures (PR #303).

  • Fix permission errors from opening run map cache (JSON) files (PR #304).

  • Fix errors opening runs with data='all' with an empty proc folder (PR #317).

  • The QuickView class deprecated in version 1.9 was removed.



  • New keep_dims option for trains(), train_from_id() and train_from_index(). Normally the trains/pulses dimension is dropped from the arrays these methods return if it has length 1, but passing keep_dims=True will preserve this dimension (PR #288).
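
What keep_dims controls, shown with plain NumPy (a length-1 trains axis is dropped by default, kept on request):

```python
import numpy as np

arr = np.zeros((1, 512))       # one train of 512-element data

dropped = arr.squeeze(axis=0)  # default: length-1 axis removed
kept = arr                     # keep_dims=True: shape unchanged

assert dropped.shape == (512,)
assert kept.shape == (1, 512)
```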

  • New select_trains() and split_trains() methods for multi-module detector data (PR #278).

  • select() now accepts a list of source name patterns, which is more convenient for some use cases (PR #287).
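
Pattern lists behave like glob matching over source names; a sketch with fnmatch and hypothetical sources (the real matching may differ in details):

```python
from fnmatch import fnmatch

# A source is selected if it matches at least one of the patterns.
sources = ['SPB_XTD9_XGM/DOOCS/MAIN', 'SPB_IRU_CAM/CAM/1',
           'SA1_XTD2_XGM/DOOCS/MAIN']
patterns = ['*_XGM/DOOCS/*', 'SPB_IRU_CAM/*']

selected = [s for s in sources if any(fnmatch(s, p) for p in patterns)]
assert selected == sources  # all three match at least one pattern
```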

  • Fix open_run(..., data='all') for runs with no proc data (PR #281).

  • Fix single run status when opening a run with a virtual overview file (PR #290).

  • Sources with no data recorded in a run are now represented in virtual overview files (PR #287).

  • Fix a race condition where files were closed in one thread as they were opened in another (PR #289).



  • EXtra-data can now generate and use “virtual overview” files (PR #69). A virtual overview file is a single file containing the metadata and indices of an entire run, and links to the source files for the data (using HDF5 virtual datasets). When virtual overview files are available, open_run() and RunDirectory() will use them automatically; this should make it faster to open and explore runs (but not to read data).

  • You can now specify parallelize=False for open_run() and RunDirectory() to open files in serial (PR #158). This can be necessary if you’re opening runs inside a parallel worker.

  • Fix various features to work when 0 trains of data are selected (PR #260).

  • Fix union() when starting with already-unioned data from different runs (PR #261).

  • Fix for opening runs with data='all' and combining data in certain ways (PR #274).

  • Fixes to ensure that files are not unnecessarily reopened (PR #264).





  • New KeyData.as_single_value() method to check that a key remains constant (within a specified tolerance) through the data, and return it as a single value (PR #228).
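
A minimal stand-in for the idea, assuming a tolerance on the deviation from the median (the real method's exact tolerance semantics may differ):

```python
import numpy as np

# Check the values stay within `tolerance` of their median, then
# collapse them to a single number; raise if they vary too much.
def as_single_value(values, tolerance):
    values = np.asarray(values)
    centre = np.median(values)
    if np.abs(values - centre).max() > tolerance:
        raise ValueError("values are not constant within tolerance")
    return centre

assert as_single_value([9.8, 10.0, 10.1], tolerance=0.5) == 10.0
```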

  • New KeyData.train_id_coordinates() method to get train IDs associated with specific data as a NumPy array (PR #226).

  • extra-data-validate now checks that timestamps in control data are in increasing order (PR #94).

  • Ensure basic DataCollection functionality, including getting values from RUN and inspecting the shape & dtype of other data, works when no trains are selected (PR #244).

  • Fix reading data where some files in a run contain zero trains, as seen in some of the oldest EuXFEL data (PR #225).

  • Minor performance improvements for select() when selecting single keys (no wildcards) and when selecting all keys along with require_all=True (PR #248).

Deprecations & potentially breaking changes:

  • The QuickView class is deprecated. We believe no-one is using this. If you are, please get in touch.

  • Removed the h5index module and the hdf5_paths function, which were deprecated in 1.7.



  • Fixed two different bugs introduced in 1.8 affecting loading data for multi-module detectors with get_array() when only some of the modules captured data for a given train (PR #234).

  • Fix open_run(..., data='all') when all sources in the raw data are copied to the corrected run folder (PR #236).



  • New API for inspecting the data associated with a single source (PR #206). Use a source name to get a SourceData object:

    xgm = run['SPB_XTD9_XGM/DOOCS/MAIN']
    xgm.keys()  # List the available keys
    beam_x = xgm['beamPosition.ixPos'].ndarray()

    See Getting data by source & key for more details.

  • Combining data from the same run with union() now preserves ‘single run’ status, so run_metadata() still works (PR #208). This only works with more recent data (file format version 1.0 and above).

  • Reading data for multi-module detectors with get_array() is now faster, especially when selecting a subset of pulses (PR #218, PR #220).

  • Fix data_counts() when data is missing for some selected trains (PR #222).

Deprecations & potentially breaking changes:

  • The numpy_to_cbf and hdf5_to_cbf functions have been removed (PR #213), after they were deprecated in 1.7. If you need to create CBF files, consult the Fabio package.

  • Some packages required for karabo-bridge-serve-files are no longer installed along with EXtra-data by default (PR #211). Install with pip install extra-data[bridge] if you need this functionality.



Deprecations & potentially breaking changes:

  • Remove special behaviour for get_series() with big detector data, deprecated in 1.4 (PR #196).

  • Deprecated some functions for converting data to CBF format, and the h5index module (PR #197). We believe these were unused.



  • Fix a check which made it very slow to open runs with thousands of files (PR #183).







  • Fix get_array() for raw DSSC & LPD data with multiple sequence files per module (PR #155).

  • Drop unnecessary dependency on scipy (PR #147).



New features:

Deprecations & potentially breaking changes:

  • Earlier versions of EXtra-data unintentionally converted integer data from multi-module detectors to floats (in get_array() and get_dask_array()) with the special value NaN for missing data. This version preserves the data type, but missing integer data will be filled with 0. If this is not suitable, you can use the min_modules parameter to get only trains where all modules have data, or pass astype=np.float64, fill_value=np.nan to convert data to floats and fill gaps with NaN as before.
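
The difference can be demonstrated with plain NumPy, using a made-up two-module example where module 1 recorded no data:

```python
import numpy as np

modules = {0: np.array([1, 2], dtype=np.uint16)}  # module 1 missing

# New behaviour: dtype preserved, gaps filled with 0.
stacked_int = np.zeros((2, 2), dtype=np.uint16)
# Old behaviour, recoverable via astype=np.float64, fill_value=np.nan.
stacked_float = np.full((2, 2), np.nan)
for modno, data in modules.items():
    stacked_int[modno] = data
    stacked_float[modno] = data

assert stacked_int.dtype == np.uint16 and stacked_int[1].sum() == 0
assert np.isnan(stacked_float[1]).all()
```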

  • Special handling in get_series() to label some fast detector data with pulse IDs was deprecated (PR #131). We believe no-one is using this. If you are, please get in touch to discuss alternatives.

Fixes and improvements:

  • Prevent select() from rediscovering things that had previously been excluded from the selection (PR #128).

  • Fix default fill value for uint64 data in stack_detector_data() (PR #103).

  • Don’t convert integer data to floats in get_array() and get_dask_array() methods for multi-module detector data (PR #98).

  • Documented the KeyData interface added in 1.3 (PR #96).

  • Fix extra-data-validate when a file cannot be opened (PR #93).

  • Fix name of extra-data-validate in its own help info (PR #90).



New features:

  • A new interface for data from a single source & key: use run[source, key] to get a KeyData object, which can inspect and load the data from several sequence files (PR #70).

  • Methods which took a by_index object now accept slices (e.g. numpy.s_[:10]) or indices directly (PR #68, PR #79). This includes select_trains(), get_array() and various methods for multi-module detectors, described in Multi-module detector data.
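
numpy.s_ is simply a convenient way of writing slice objects, which these methods now accept directly:

```python
import numpy as np

# np.s_[...] evaluates to ordinary slice objects.
assert np.s_[:10] == slice(None, 10)
assert np.s_[5:50:5] == slice(5, 50, 5)
```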

  • extra-data-make-virtual-cxi --fill-value now accepts numbers in hexadecimal, octal & binary formats, e.g. 0xfe (PR #73).

  • Added an unstack parameter to the get_array() method for multi-module detectors, making it possible to retrieve an array as the data is stored, without separating the train & pulse axes (PR #72).

  • Added a require_all parameter to the trains() method for multi-module detectors, making it possible to include trains where some modules are missing data (PR #77).

  • New identify_multimod_detectors() function to find multi-module detectors in the data (PR #61).

Fixes and improvements:

  • Fix writing selected detector frames with write_frames() for corrected data (PR #82).

  • Fix compatibility with pandas 1.1 (PR #83).

  • The trains() iterator no longer includes zero-length arrays when a source has no data for that train (PR #75).

  • Fix a test which failed when run as root (PR #67).



New features:

Fixes and improvements:

  • EXtra-data now tries to manage how many HDF5 files it has open at one time, to avoid hitting a limit on the total number of open files in a process (PR #25 and PR #48). Importing EXtra-data will now raise this limit as far as it can (to 4096 on Maxwell), and try to keep the files it handles to no more than half of this. Files should be silently closed and reopened as needed, so this shouldn’t affect how you use it.

  • A better way of creating Dask arrays to avoid problems with Dask’s local schedulers, and with arrays comprising very large numbers of files (PR #63).

  • The classes for accessing multi-module detector data (see Multi-module detector data) and writing virtual CXI files no longer assume that the same number of frames are recorded in every train (PR #44).

  • Fix validation where a file has no trains at all (PR #42).

  • More testing of EuXFEL file format version 1.0 (PR #56).

  • Test coverage measurement fixed with multiprocessing (PR #37).

  • Tests switched from mock module to unittest.mock (PR #52).



  • Opening and validating run directories now handles files in parallel, which should make it substantially faster (PR #30).

  • Various data access operations no longer require finding all the keys for a given data source, which saves time in certain situations (PR #24).

  • open_run() now accepts numpy integers for proposal and run numbers, as well as standard Python integers (PR #34).

  • Run map cache files can be saved on the EuXFEL online cluster, which speeds up reopening runs there (PR #36).

  • Added tests with simulated bad files for the validation code (PR #23).



  • New get_dask_array() method for accessing detector data with Dask (PR #18).

  • Fix extra-data-validate with a run directory without a cached data map (PR #12).

  • Add .squeeze() method for virtual stacks of detector data from stack_detector_data() (PR #16).

  • Close each file after reading its metadata, to avoid hitting the limit of open files when opening a large run (PR #8). This is a mitigation: you will still hit the limit if you access data from enough files. The default limit on Maxwell is 1024 files, but you can raise this to 4096 using the Python resource module.
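
To inspect or raise the limit with the resource module (Unix only):

```python
import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
assert soft > 0

# The soft limit can be raised up to the hard limit, e.g.:
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```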

  • Display progress information while validating a run directory (PR #19).

  • Display run duration to only one decimal place (PR #5).

  • Documentation reorganised to emphasise tutorials and examples (PR #10).

This version requires Python 3.6 or above.



First separated version. No functional changes from karabo_data 0.7.

Earlier history

The code in EXtra-data was previously released as karabo_data, up to version 0.7. See the karabo_data release notes for changes before the renaming.