Release Notes
1.16
2024-02-26
Fix loading aliases for old proposals (PR #490).
Hide the message about proposal aliases when opening a run (PR #478).
extra-data-validate gives clearer messages for filesystem errors (PR #472).
Fix OverflowError in lsxfel & run.info() with some corrupted train IDs (PR #489).
Fix a selection of deprecation warnings (PR #469).
Add a development tool to copy the structure of EuXFEL data files without the data (PR #467).
1.15.1
2023-11-17
1.15
2023-11-06
New properties units and units_name on KeyData objects to retrieve units metadata written by Karabo (PR #449).
New command karabo-bridge-serve-run to more conveniently stream data from a saved run in Karabo Bridge format (PR #458).
Fix split_trains() being very slow when splitting a long run into many pieces (PR #459).
Include XTDF sources in lsxfel when details are enabled (PR #440).
1.14
2023-07-27
1.13
2023-06-15
Support for aliases (PR #367), to provide shorter, more meaningful names for specific sources & keys, and support for loading a default set of aliases for the proposal when using open_run() (PR #398). See Using aliases for more information.
New APIs for multi-module detector data to work more like regular sources and keys, e.g. agipd['image.data'].ndarray() (PR #337). These changes also alter how Dask arrays are created for multi-module detector data, hopefully making them more efficient for typical use cases.
New method plot_missing_data() to show where sources are missing data for some trains (PR #402).
Merging data with union() now applies the same train IDs to all included sources, whereas previously sources could have different train IDs selected (PR #416).
A new property run[src].device_class exposes the Karabo device class name for control sources (PR #390).
JUNGFRAU now accepts a first_modno parameter for detectors where the first module is named with e.g. JNGFR03 (PR #379).
New run[src].is_control and .is_instrument properties (PR #403).
SourceData objects now have .data_counts(), .drop_empty_trains() and .split_trains() methods like KeyData (PR #404, PR #405, PR #407).
New method SourceData.one_key() to quickly find an arbitrary key for a source.
select() now accepts a require_any=True parameter to filter trains where at least one of the selected sources & keys has data, complementing require_all (PR #400).
New property KeyData.source_file_paths to locate real data files even if the run was opened using a virtual overview file (PR #325).
New SourceData properties storage_class, data_category and aggregator to extract details from the filename & folder path, for the main folder structure on EuXFEL compute clusters (PR #399).
It’s now possible to pip install extra-data[complete] to install EXtra-data along with all optional dependencies (PR #414).
Fix for missing CONTROL data when accessing data by train (PR #359).
Fix using with to open & close runs when a virtual overview file is found (PR #375).
Fix calling open_run() with data='all', parallelize=False (PR #338).
Fix using DataCollection objects with multiprocessing and spawned subprocesses (PR #348).
Better error messages when files are missing INDEX or METADATA sections (PR #361).
Fix creating virtual overview files with extended metadata when source files are format version 1.1 or newer (PR #332).
1.12
2022-06-10
SourceData objects now expose RUN information for control sources via new .run_value() and .run_values() methods, and metadata about the run from a new .run_metadata() method (PR #293).
KeyData.ndarray() can now read into a pre-allocated array passed as the out parameter (PR #307).
KeyData.xarray() can return an xarray Dataset object to represent data with named fields (PR #301).
The JUNGFRAU data access class now recognises ‘JF500K’ in source names (PR #300).
Fix sending around FileAccess objects with cloudpickle, which is used by Dask and clusterfutures (PR #303).
Fix permissions errors from opening the run files map JSON files (PR #304).
Fix errors opening runs with data='all' with an empty proc folder (PR #317).
The QuickView class deprecated in version 1.9 was removed.
1.11
2022-03-21
New keep_dims option for trains(), train_from_id() and train_from_index(). Normally the trains/pulses dimension is dropped from the arrays these methods return if it has length 1, but passing keep_dims=True will preserve this dimension (PR #288).
New select_trains() and split_trains() methods for multi-module detector data (PR #278).
select() now accepts a list of source name patterns, which is more convenient for some use cases (PR #287).
Fix open_run(..., data='all') for runs with no proc data (PR #281).
Fix single run status when opening a run with a virtual overview file (PR #290).
Sources with no data recorded in a run are now represented in virtual overview files (PR #287).
Fix a race condition where files were closed in one thread as they were opened in another (PR #289).
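The chunking behaviour of split_trains() can be sketched with a few lines of plain Python (split_chunks is a hypothetical helper illustrating the idea, not the real implementation):

```python
def split_chunks(n_trains, parts):
    """Yield (start, stop) index ranges dividing n_trains into roughly
    equal pieces - a sketch of how a run can be split into chunks with
    a similar number of trains in each."""
    for i in range(parts):
        start = (i * n_trains) // parts
        stop = ((i + 1) * n_trains) // parts
        if stop > start:  # skip empty chunks when parts > n_trains
            yield (start, stop)

print(list(split_chunks(10, 3)))  # [(0, 3), (3, 6), (6, 10)]
```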
1.10
2022-02-01
EXtra-data can now generate and use “virtual overview” files (PR #69). A virtual overview file is a single file containing the metadata and indices of an entire run, and links to the source files for the data (using HDF5 virtual datasets). When virtual overview files are available, open_run() and RunDirectory() will use them automatically; this should make it faster to open and explore runs (but not to read data).
You can now specify parallelize=False for open_run() and RunDirectory() to open files in serial (PR #158). This can be necessary if you’re opening runs inside a parallel worker.
Fix various features to work when 0 trains of data are selected (PR #260).
Fix union() when starting with already-unioned data from different runs (PR #261).
Fix for opening runs with data='all' and combining data in certain ways (PR #274).
Fixes to ensure that files are not unnecessarily reopened (PR #264).
1.9.1
2021-11-30
Fix errors from data_counts() and drop_empty_trains() when different train IDs exist for different sources (PR #257).
1.9
2021-11-25
New KeyData.as_single_value() method to check that a key remains constant (within a specified tolerance) through the data, and return it as a single value (PR #228).
New KeyData.train_id_coordinates() method to get train IDs associated with specific data as a NumPy array (PR #226).
extra-data-validate now checks that timestamps in control data are in increasing order (PR #94).
Ensure basic DataCollection functionality, including getting values from RUN and inspecting the shape & dtype of other data, works when no trains are selected (PR #244).
Fix reading data where some files in a run contain zero trains, as seen in some of the oldest EuXFEL data (PR #225).
Minor performance improvements for select() when selecting single keys (no wildcards) and when selecting all keys along with require_all=True (PR #248).
Deprecations & potentially breaking changes:
The QuickView class is deprecated. We believe no-one is using this. If you are, please get in touch with da-support@xfel.eu.
Removed the h5index module and the hdf5_paths function, which were deprecated in 1.7.
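The constant-value check performed by as_single_value() above can be sketched in plain Python (this as_single_value is a stand-in illustrating the described behaviour; the real method lives on KeyData and its signature may differ):

```python
import statistics

def as_single_value(values, tolerance=0.0):
    """Check that a sequence of readings stays within +/- tolerance of
    its median, and return that single value; raise otherwise.
    A sketch of the behaviour, not the real implementation."""
    mid = statistics.median(values)
    if any(abs(v - mid) > tolerance for v in values):
        raise ValueError(f"values vary by more than {tolerance}")
    return mid

print(as_single_value([7.0, 7.001, 6.999], tolerance=0.01))  # 7.0
```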
1.8.1
2021-11-01
Fixed two different bugs introduced in 1.8 affecting loading data for multi-module detectors with get_array() when only some of the modules captured data for a given train (PR #234).
Fix open_run(..., data='all') when all sources in the raw data are copied to the corrected run folder (PR #236).
1.8
2021-10-06
New API for inspecting the data associated with a single source (PR #206). Use a source name to get a SourceData object:

xgm = run['SPB_XTD9_XGM/DOOCS/MAIN']
xgm.keys()  # List the available keys
beam_x = xgm['beamPosition.ixPos'].ndarray()

See Getting data by source & key for more details.
Combining data from the same run with union() now preserves ‘single run’ status, so run_metadata() still works (PR #208). This only works with more recent data (file format version 1.0 and above).
Reading data for multi-module detectors with get_array() is now faster, especially when selecting a subset of pulses (PR #218, PR #220).
Fix data_counts() when data is missing for some selected trains (PR #222).
Deprecations & potentially breaking changes:
The numpy_to_cbf and hdf5_to_cbf functions have been removed (PR #213), after they were deprecated in 1.7. If you need to create CBF files, consult the Fabio package.
Some packages required for karabo-bridge-serve-files are no longer installed along with EXtra-data by default (PR #211). Install with pip install extra-data[bridge] if you need this functionality.
1.7
2021-08-03
New methods to split data into chunks with a similar number of trains in each: DataCollection.split_trains() and KeyData.split_trains() (PR #184).
New method KeyData.drop_empty_trains() to select only trains with data for a given key (PR #193).
Virtual CXI files can now be made for multi-module JUNGFRAU detectors (PR #62).
extra-data-validate now checks INDEX for control sources as well as instrument sources (PR #188).
Fix opening some files written by a test version of the DAQ, marked with format version 1.1 (PR #198).
Fix making virtual CXI files with h5py 3.3 (PR #195).
Deprecations & potentially breaking changes:
Remove special behaviour for get_series() with big detector data, deprecated in 1.4 (PR #196).
Deprecated some functions for converting data to CBF format, and the h5index module (PR #197). We believe these were unused.
1.6.1
2021-05-14
Fix a check which made it very slow to open runs with thousands of files (PR #183).
1.6
2021-05-11
‘Suspect’ train IDs are now included by default (PR #178). Pass inc_suspect_trains=False to exclude them (as in 1.5), or the --exc-suspect-trains option for extra-data-make-virtual-cxi.
open_run() can now combine raw & proc data when called with data='all' (PR #174).
Several new methods for accessing different kinds of metadata:
DataCollection.run_metadata() - per-run metadata including timestamps and proposal number (PR #175)
DataCollection.get_run_value() and DataCollection.get_run_values() - per-run data from the control system (PR #164)
Selecting pulses should work for LPD1M.get_array() in parallel gain mode (PR #173).
Several fixes for handling ‘suspect’ train IDs (PR #172).
h5py >= 2.10 is now required (PR #177).
1.5
2021-04-22
Exclude ‘Suspect’ train IDs, fixing occasional issues in particular with AGIPD data containing bad train IDs (PR #121).
Avoid converting train IDs to floats when using run.select(..., require_all=True) (PR #159).
New method train_timestamps() to get approximate timestamps for each train in the data (PR #165).
Checking whether a given source & key is present is now much faster in some cases (PR #170).
karabo-bridge-serve-files can now send data on any ZMQ endpoint, not only tcp:// sockets (PR #169).
Ensure virtual CXI files created with EXtra-data can be read using HDF5 1.10 (PR #171).
Some fixes to make the test suite more robust (PR #156, PR #167, PR #169).
1.4.1
2021-03-10
Fix get_array() for raw DSSC & LPD data with multiple sequence files per module (PR #155).
Drop unnecessary dependency on scipy (PR #147).
1.4
2021-02-12
New features:
select() has a new option require_all=True to include only trains where all the selected sources & keys have data (PR #113).
select() now accepts DataCollection and KeyData objects, making it easy to re-select the same sources in another run (PR #114).
New classes for accessing data from AGIPD500K and JUNGFRAU multi-module detectors (PR #139, PR #140).
New options for stack_detector_data() to allow it to work with different data formats, including JUNGFRAU detectors (PR #141).
New option for LPD1M to read data taken in ‘parallel gain’ mode, giving it useful axis labels (PR #122).
get_array() for multi-module detectors has a new option to label frames with memory cell IDs instead of pulse IDs (PR #101).
DataCollection.trains() can now optionally yield flat, single-level dictionaries with (source, key) keys instead of nested dictionaries (PR #112).
New method KeyData.data_counts() (PR #92).
Labelled arrays from KeyData.xarray() and DataCollection.get_array() now have a name made from the source & key names, or as specified by the name= parameter (PR #87).
Deprecations & potentially breaking changes:
Earlier versions of EXtra-data unintentionally converted integer data from multi-module detectors to floats (in get_array() and get_dask_array()) with the special value NaN for missing data. This version preserves the data type, but missing integer data will be filled with 0. If this is not suitable, you can use the min_modules parameter to get only trains where all modules have data, or pass astype=np.float64, fill_value=np.nan to convert data to floats and fill gaps with NaN as before.
Special handling in get_series() to label some fast detector data with pulse IDs was deprecated (PR #131). We believe no-one is using this. If you are, please contact da-support@xfel.eu to discuss alternatives.
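Why the integer fill value changed can be seen without any detector data: NaN exists only for floating-point types, so preserving integer dtypes forces a different placeholder. A pure-Python sketch (module_data is a made-up stand-in for per-module readings, not real API):

```python
# A made-up stand-in for per-module detector readings; None marks a
# module that recorded no data for this train.
module_data = [12, None, 7]

# New behaviour: integer dtype preserved, missing modules filled with 0
as_ints = [v if v is not None else 0 for v in module_data]

# Old behaviour (equivalent to astype=np.float64, fill_value=np.nan):
# convert to floats so missing data can be marked with NaN
as_floats = [float(v) if v is not None else float('nan') for v in module_data]

print(as_ints)                       # [12, 0, 7]
print(as_floats[1] != as_floats[1])  # True - NaN is not equal to itself
```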
Fixes and improvements:
Prevent select() from rediscovering things that had previously been excluded from the selection (PR #128).
Fix default fill value for uint64 data in stack_detector_data() (PR #103).
Don’t convert integer data to floats in get_array() and get_dask_array() methods for multi-module detector data (PR #98).
Fix extra-data-validate when a file cannot be opened (PR #93).
Fix name of extra-data-validate in its own help info (PR #90).
1.3
2020-08-03
New features:
A new interface for data from a single source & key: use run[source, key] to get a KeyData object, which can inspect and load the data from several sequence files (PR #70).
Methods which took a by_index object now accept slices (e.g. numpy.s_[:10]) or indices directly (PR #68, PR #79). This includes select_trains(), get_array() and various methods for multi-module detectors, described in Multi-module detector data.
extra-data-make-virtual-cxi --fill-value now accepts numbers in hexadecimal, octal & binary formats, e.g. 0xfe (PR #73).
Added an unstack parameter to the get_array() method for multi-module detectors, making it possible to retrieve an array as the data is stored, without separating the train & pulse axes (PR #72).
Added a require_all parameter to the trains() method for multi-module detectors, to allow iterating with incomplete frames included (PR #77).
New identify_multimod_detectors() function to find multi-module detectors in the data (PR #61).
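The slice-based selection above can be tried without any detector data: numpy.s_[:10] is simply shorthand for Python's built-in slice(None, 10), so either form can be passed where a by_index object was previously needed. A stdlib-only sketch (train_ids is a made-up list, not real data):

```python
# A made-up stand-in for the train IDs in a run
train_ids = list(range(10000, 10020))

# slice(None, 10) is exactly what numpy.s_[:10] produces
first_ten = train_ids[slice(None, 10)]

# slice(None, None, 2) is exactly what numpy.s_[::2] produces
every_other = train_ids[slice(None, None, 2)]

print(first_ten[-1])     # 10009
print(len(every_other))  # 10
```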
Fixes and improvements:
1.2
2020-06-04
New features:
New karabo-bridge-serve-files --append-detector-modules option to combine data from multiple detector modules. This makes streaming large detector data more similar to the live data streams (PR #40 and PR #51).
karabo-bridge-serve-files has new options to control the ZMQ socket and the use of an infiniband network interface (PR #50). It also works with newer versions of the karabo_bridge Python package.
New options to filter files from dCache which are unavailable or need to be read from tape when opening a run (PR #35). This also comes with a new command extra-data-locality to inspect this information.
New lsxfel --detail option to show more detail on selected sources (PR #38).
New extra-data-make-virtual-cxi --fill-value option to control the fill value for missing data (PR #59).
New method write_frames() to save a subset of detector frames to a new file in EuXFEL HDF5 format (PR #47).
DataCollection.select() can take arbitrary iterables of patterns, rather than just lists (PR #43).
Fixes and improvements:
EXtra-data now tries to manage how many HDF5 files it has open at one time, to avoid hitting a limit on the total number of open files in a process (PR #25 and PR #48). Importing EXtra-data will now raise this limit as far as it can (to 4096 on Maxwell), and try to keep the files it handles to no more than half of this. Files should be silently closed and reopened as needed, so this shouldn’t affect how you use it.
A better way of creating Dask arrays to avoid problems with Dask’s local schedulers, and with arrays comprising very large numbers of files (PR #63).
The classes for accessing multi-module detector data (see Multi-module detector data) and writing virtual CXI files no longer assume that the same number of frames are recorded in every train (PR #44).
Fix validation where a file has no trains at all (PR #42).
More testing of EuXFEL file format version 1.0 (PR #56).
Test coverage measurement fixed with multiprocessing (PR #37).
Tests switched from the mock module to unittest.mock (PR #52).
1.1
2020-03-06
Opening and validating run directories now handles files in parallel, which should make it substantially faster (PR #30).
Various data access operations no longer require finding all the keys for a given data source, which saves time in certain situations (PR #24).
open_run() now accepts numpy integers for proposal and run numbers, as well as standard Python integers (PR #34).
Run map cache files can be saved on the EuXFEL online cluster, which speeds up reopening runs there (PR #36).
Added tests with simulated bad files for the validation code (PR #23).
1.0
2020-02-21
New get_dask_array() method for accessing detector data with Dask (PR #18).
Fix extra-data-validate with a run directory without a cached data map (PR #12).
Add .squeeze() method for virtual stacks of detector data from stack_detector_data() (PR #16).
Close each file after reading its metadata, to avoid hitting the limit of open files when opening a large run (PR #8). This is a mitigation: you will still hit the limit if you access data from enough files. The default limit on Maxwell is 1024 files, but you can raise this to 4096 using the Python resource module.
Display progress information while validating a run directory (PR #19).
Display run duration to only one decimal place (PR #5).
Documentation reorganised to emphasise tutorials and examples (PR #10).
This version requires Python 3.6 or above.
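The resource-module workaround mentioned in the 1.0 notes above can be sketched as follows (Unix-only; a generic sketch, not part of EXtra-data, and the limits shown depend on your system):

```python
import resource

# Query the current (soft) and maximum (hard) limits on open files
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Raise the soft limit as far as the hard limit allows
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```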
0.8
2019-11-18
First separated version. No functional changes from karabo_data 0.7.
Earlier history
The code in EXtra-data was previously released as karabo_data, up to version 0.7. See the karabo_data release notes for changes before the renaming.