Aligning data from different sources
Instruments recording data sometimes miss a train, and different sources may start and finish recording at slightly different times. As a result, two arrays loaded from the same run don’t necessarily line up:
[1]:
import numpy as np
from extra_data import open_run
[2]:
run = open_run(proposal=700000, run=26)
[3]:
intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']
photflux_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS', 'pulseEnergy.photonFlux']
print(f"# trains measured: {intensity_sase3.shape[0]}, {photflux_sase3.shape[0]}")
# trains measured: 7263, 7264
Even if we get the same number of trains, they may not line up if different instruments miss different trains:
[4]:
intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']
print(f"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}")
train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()
print("Train IDs all match:", train_ids_eq.all())
print("Train IDs matching (every 100th train):")
print(train_ids_eq[::100])
# trains measured: 7263, 7263
Train IDs all match: False
Train IDs matching (every 100th train):
[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False True True True True True True True True
True]
We typically want to look at only the trains with data for all the sources we’re using. There are a few ways we can get these.
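Before choosing one of these, it can help to see why equal lengths are not enough. The following is a minimal sketch using plain NumPy and made-up train IDs (the arrays are hypothetical, not taken from this run). It shows two sources with the same number of trains whose IDs drift out of step, and how set operations reveal which trains each one missed:

```python
import numpy as np

# Hypothetical train IDs: source A missed train 102, source B missed train 105.
train_ids_a = np.array([100, 101, 103, 104, 105, 106], dtype=np.uint64)
train_ids_b = np.array([100, 101, 102, 103, 104, 106], dtype=np.uint64)

# Same number of trains, but comparing position by position shows a mismatch
# between the first missed train and the second one.
same_length = len(train_ids_a) == len(train_ids_b)
elementwise_match = train_ids_a == train_ids_b

# Set operations on the IDs show what is shared and what each source lacks.
common = np.intersect1d(train_ids_a, train_ids_b)
only_in_a = np.setdiff1d(train_ids_a, train_ids_b)
only_in_b = np.setdiff1d(train_ids_b, train_ids_a)

print("Same length:", same_length)
print("Position-wise match:", elementwise_match)
print("Common trains:", common)
print("Only in A:", only_in_a, "- only in B:", only_in_b)
```

With real data, the same checks could be applied to the arrays returned by train_id_coordinates().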
By selecting sources
Use .select() to select specified sources & keys in the run. The require_all=True option discards trains where any of the selected data is missing.
[5]:
# Select a list of sources & keys
sel = run.select([
    ('SA3_XTD10_XGM/XGM/DOOCS:output', '*'),
    ('SCS_BLU_XGM/XGM/DOOCS:output', '*'),
], require_all=True)
# Or select sources by pattern - this gets any sources with /XGM/ in the name
sel = run.select('*/XGM/*', require_all=True)
sel.info()
# of trains: 7262
Duration: 0:12:06.4
First train ID: 517755296
Last train ID: 517762559
0 detector modules ()
2 instrument sources (excluding detectors):
- SA3_XTD10_XGM/XGM/DOOCS:output
- SCS_BLU_XGM/XGM/DOOCS:output
2 control sources:
- SA3_XTD10_XGM/XGM/DOOCS
- SCS_BLU_XGM/XGM/DOOCS
[6]:
intensity_sase3 = sel['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']
intensity_scs = sel['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']
print(f"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}")
train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()
print("Train IDs all match:", train_ids_eq.all())
# trains measured: 7262, 7262
Train IDs all match: True
If .select(..., require_all=True) gives you 0 trains, it probably means that one of the sources you have selected didn’t record any data in that run.
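Conceptually, requiring all sources amounts to intersecting the sets of train IDs. The sketch below illustrates that idea with plain NumPy. It is not how EXtra-data implements require_all, and the function name and source names are made up for illustration. It also shows why a single source with no data leaves nothing to select:

```python
from functools import reduce

import numpy as np


def trains_with_all_sources(train_ids_per_source):
    """Return the sorted train IDs present in every source (hypothetical helper)."""
    return reduce(np.intersect1d, train_ids_per_source.values())


# Hypothetical train IDs recorded by two sources.
sources = {
    "xgm_a": np.array([100, 101, 103, 104]),
    "xgm_b": np.array([100, 102, 103, 104]),
}
common = trains_with_all_sources(sources)
print("Trains in both sources:", common)

# Add a source which recorded nothing: the intersection becomes empty.
sources["idle_source"] = np.array([], dtype=np.int64)
none_left = trains_with_all_sources(sources)
print("Trains left with an empty source included:", len(none_left))
```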
By selecting train IDs
We can keep all the data for one source, and drop from another source the trains which the first source missed, with code like this:
[7]:
from extra_data import by_id
# Keep all data from this source:
intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']
intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].select_trains(
    by_id[intensity_sase3.train_id_coordinates()]
)
print(f"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}")
# trains measured: 7263, 7262
This only excluded trains missing from the first source, so in this case, the first source still has one extra train which the second does not.
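The one-sided nature of this selection can be reproduced with a boolean mask in plain NumPy. This is a minimal sketch on made-up train IDs and values, not the select_trains() implementation. The reference source keeps everything, so a train that only the other source missed is still unmatched afterwards:

```python
import numpy as np

# Hypothetical data: the reference source missed train 102,
# the other source missed train 104.
ref_ids = np.array([100, 101, 103, 104, 105])
other_ids = np.array([100, 101, 102, 103, 105])
other_data = np.array([1.0, 1.1, 1.2, 1.3, 1.5])

# Keep only the entries of the other source whose train ID
# also appears in the reference source.
keep = np.isin(other_ids, ref_ids)
other_ids_sel = other_ids[keep]
other_data_sel = other_data[keep]

# The reference source was not filtered, so it may still hold
# trains which the other source does not have.
still_unmatched = np.setdiff1d(ref_ids, other_ids_sel)

print("# trains:", len(ref_ids), len(other_ids_sel))
print("In reference only:", still_unmatched)
```

Applying the same mask in the other direction as well would give a fully matched pair.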
Using xarray
The options above exclude trains before loading the data. We can also align data after loading it as xarray labelled arrays:
[8]:
intensity_sase3_arr = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()
intensity_scs_arr = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()
[9]:
intensity_scs_arr
[9]:
<xarray.DataArray 'SCS_BLU_XGM/XGM/DOOCS:output.data.intensityTD' (trainId: 7263, dim_0: 1000)>
array([[ 4.4886490e+01,  4.2309365e+03, -4.5598242e+03, ...,  1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [ 1.0151898e+02,  2.2400598e+03, -2.7732441e+03, ...,  1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [-1.3794557e+02,  2.4830901e+03, -3.6583892e+03, ...,  1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       ...,
       [-4.2194626e+02,  5.4188824e+02, -8.9533582e+02, ...,  1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [-1.3200552e+02,  1.1471447e+03, -1.4556660e+03, ...,  1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [-2.3156431e+01,  2.2287026e+03, -3.3196895e+03, ...,  1.0000000e+00,  1.0000000e+00,  1.0000000e+00]],
      dtype=float32)
Coordinates:
  * trainId  (trainId) uint64 517755296 517755297 ... 517762558 517762559
Dimensions without coordinates: dim_0
We’ll use the xarray.align() function to line up the arrays by their train ID labels:
[10]:
import xarray as xr
intensity_sase3_arr, intensity_scs_arr = xr.align(
    intensity_sase3_arr, intensity_scs_arr, join='inner'
)
[11]:
(intensity_scs_arr.coords['trainId'] == intensity_sase3_arr.coords['trainId']).all().item()
[11]:
True
Using join='inner' (which is the default) discards data to align the arrays. If we specified join='outer' instead, it would insert gaps in the arrays where data is missing.
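To make the difference between the two join modes concrete, here is a minimal sketch in plain NumPy on made-up 1D data. It only mimics what the two modes mean for the train ID labels and is not xarray's implementation. The reindex helper is hypothetical, and NaN is assumed as the gap marker for floating point data:

```python
import numpy as np

# Hypothetical labelled data: train IDs plus one value per train.
ids_a = np.array([100, 101, 103])
vals_a = np.array([1.0, 2.0, 3.0])
ids_b = np.array([100, 102, 103])
vals_b = np.array([10.0, 20.0, 30.0])

# Inner join: keep only the trains present in both arrays.
inner_ids, idx_a, idx_b = np.intersect1d(ids_a, ids_b, return_indices=True)
inner_a = vals_a[idx_a]
inner_b = vals_b[idx_b]


def reindex(ids, vals, new_ids):
    """Place vals at their positions in the sorted new_ids, NaN elsewhere."""
    out = np.full(new_ids.shape, np.nan)
    out[np.searchsorted(new_ids, ids)] = vals
    return out


# Outer join: keep every train seen by either array, with gaps as NaN.
outer_ids = np.union1d(ids_a, ids_b)
outer_a = reindex(ids_a, vals_a, outer_ids)
outer_b = reindex(ids_b, vals_b, outer_ids)

print("inner:", inner_ids, inner_a, inner_b)
print("outer:", outer_ids, outer_a, outer_b)
```

The inner result is shorter than either input, while the outer result is longer and needs code that tolerates the gaps.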
Multi-module detectors
Several detectors at European XFEL have modules recording data as separate sources. This run contains data from a DSSC detector:
[12]:
from extra_data.components import DSSC1M
dssc = DSSC1M(run)
dssc
[12]:
<DSSC1M: Data interface for detector 'SCS_DET_DSSC1M-1' with 16 modules>
[13]:
len(dssc.train_ids)
[13]:
5120
By default, we get trains where any detector module recorded data. We can specify min_modules to get trains where all modules recorded data:
[14]:
dssc_allmod = DSSC1M(run, min_modules=16)
len(dssc_allmod.train_ids)
[14]:
5049
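The selection above can be pictured as counting, for each train, how many modules recorded data and comparing that count with a threshold. The following is a minimal sketch of that idea on a made-up 4-module detector. The function and arrays are hypothetical and do not reflect how the detector classes are implemented:

```python
import numpy as np

# Hypothetical record of which modules have data: rows are trains, columns are modules.
recorded = np.array([
    [True, True, True, True],
    [True, False, True, True],
    [False, False, False, True],
    [True, True, True, True],
])
train_ids = np.array([100, 101, 102, 103])


def trains_with_min_modules(train_ids, recorded, min_modules):
    """Keep trains where at least min_modules modules recorded data."""
    modules_per_train = recorded.sum(axis=1)
    return train_ids[modules_per_train >= min_modules]


n_modules = recorded.shape[1]
any_module = trains_with_min_modules(train_ids, recorded, 1)
all_modules = trains_with_min_modules(train_ids, recorded, n_modules)

print("Trains with any module:", any_module)
print("Trains with all modules:", all_modules)
```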
Or we can allow a certain number of missing modules in each train, to keep more of the data:
[15]:
dssc_mostmod = DSSC1M(run, min_modules=15)
len(dssc_mostmod.train_ids)
[15]:
5118
In this case, missing data will be filled in as 0 (for integers) or NaN (for floating point data) when we read the data. You should check that the code you’re using to process the data will behave correctly with the fill value.
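As an illustration of why the fill value matters, the sketch below uses small made-up arrays (one row per module, with the middle module missing) to compare a naive mean with one that accounts for the gaps. The module_present mask is an assumption for the example: zeros in integer data cannot be told apart from real measurements, so some separate record of which modules were present is needed.

```python
import numpy as np

# Hypothetical floating point data: the missing module is filled with NaN.
float_data = np.array([[1.0, 2.0], [np.nan, np.nan], [3.0, 4.0]])
naive_mean = float_data.mean()           # NaN propagates through the mean
nan_aware_mean = np.nanmean(float_data)  # ignores the NaN entries

# Hypothetical integer data: the missing module is filled with 0.
int_data = np.array([[5, 7], [0, 0], [6, 8]])
module_present = np.array([True, False, True])  # assumed to be known separately
naive_int_mean = int_data.mean()                   # silently biased by the zeros
masked_int_mean = int_data[module_present].mean()  # uses present modules only

print("float:", naive_mean, nan_aware_mean)
print("int:", naive_int_mean, masked_int_mean)
```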