Aligning data from different sources

Instruments recording data sometimes miss a train, and different sources may start and finish recording at slightly different times. As a result, two arrays loaded from the same run don’t necessarily line up:

[1]:
import numpy as np
from extra_data import open_run
[2]:
run = open_run(proposal=700000, run=26)
[3]:
intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']
photflux_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS', 'pulseEnergy.photonFlux']
print(f"# trains measured: {intensity_sase3.shape[0]}, {photflux_sase3.shape[0]}")
# trains measured: 7263, 7264

Even if we get the same number of trains, they may not line up if different instruments miss different trains:

[4]:
intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']
print(f"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}")

train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()
print("Train IDs all match:", train_ids_eq.all())
print("Train IDs matching (every 100th train):")
print(train_ids_eq[::100])
# trains measured: 7263, 7263
Train IDs all match: False
Train IDs matching (every 100th train):
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False  True  True  True  True  True  True  True  True
  True]
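
The long block of False values in the middle is consistent with each source having missed a different train: after the first gap the two arrays are offset by one position, and they only line up again after the second gap. The following is a minimal sketch of that effect using plain NumPy and made-up train IDs (the IDs and the missed trains are assumptions for illustration, not values from this run):

```python
import numpy as np

# Hypothetical train IDs for illustration; a real run has thousands.
all_trains = np.arange(1000, 1010, dtype=np.uint64)

# Assume source A missed train 1003 and source B missed train 1007.
tids_a = all_trains[all_trains != 1003]
tids_b = all_trains[all_trains != 1007]

# Both have the same length, but comparing position by position
# fails for every entry between the two gaps.
same_position = tids_a == tids_b
print(len(tids_a), len(tids_b))
print(same_position)

# Train IDs that are present in both sources.
common = np.intersect1d(tids_a, tids_b)
print(common)
```

Comparing the train ID labels, rather than the array lengths, is what reveals the mismatch.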

We typically want to look at only the trains with data for all the sources we’re using. There are a few ways we can get these.

By selecting sources

Use .select() to select specified sources & keys in the run. The require_all=True option discards trains where any of the selected data is missing.

[5]:
# Select a list of sources & keys
sel = run.select([
    ('SA3_XTD10_XGM/XGM/DOOCS:output', '*'),
    ('SCS_BLU_XGM/XGM/DOOCS:output', '*'),
], require_all=True)

# Or select sources by pattern - this gets any sources with /XGM/ in the name
sel = run.select('*/XGM/*', require_all=True)
sel.info()
# of trains:    7262
Duration:       0:12:06.4
First train ID: 517755296
Last train ID:  517762559

0 detector modules ()

2 instrument sources (excluding detectors):
  - SA3_XTD10_XGM/XGM/DOOCS:output
  - SCS_BLU_XGM/XGM/DOOCS:output

2 control sources:
  - SA3_XTD10_XGM/XGM/DOOCS
  - SCS_BLU_XGM/XGM/DOOCS

[6]:
intensity_sase3 = sel['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']
intensity_scs = sel['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']
print(f"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}")

train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()
print("Train IDs all match:", train_ids_eq.all())
# trains measured: 7262, 7262
Train IDs all match: True

If .select(..., require_all=True) gives you 0 trains, it probably means that one of the sources you have selected didn’t record any data in that run.

By selecting train IDs

We can keep all the data for one source, and drop from another source any trains which the first one missed, with code like this:

[7]:
from extra_data import by_id

# Keep all data from this source:
intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']

intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].select_trains(
    by_id[intensity_sase3.train_id_coordinates()]
)
print(f"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}")
# trains measured: 7263, 7262

This only excludes trains which the first source missed. In this case, the first source still has one train which the second does not.
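
The one-directional nature of this filter can be sketched with plain NumPy. The train IDs below are made up for illustration, and the arrays stand in for the results of train_id_coordinates(); this is not how select_trains() is implemented:

```python
import numpy as np

# Hypothetical train IDs: each source missed a different train.
tids_first = np.array([1000, 1001, 1002, 1004, 1005], dtype=np.uint64)   # missed 1003
tids_second = np.array([1000, 1001, 1003, 1004, 1005], dtype=np.uint64)  # missed 1002

# One-directional filter: keep the second source's trains that the first also has.
second_filtered = tids_second[np.isin(tids_second, tids_first)]

# The first source is untouched, so it still holds a train the second lacks.
extra_in_first = np.setdiff1d(tids_first, second_filtered)
print(extra_in_first)

# Filtering in the other direction as well makes the two sets identical.
first_filtered = tids_first[np.isin(tids_first, second_filtered)]
print(np.array_equal(first_filtered, second_filtered))
```

If both arrays need to match exactly, filtering in both directions (or using one of the other approaches in this section) is needed.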

Using xarray

The options above exclude trains before loading the data. We can also align data after loading it as xarray labelled arrays:

[8]:
intensity_sase3_arr = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()
intensity_scs_arr = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()
[9]:
intensity_scs_arr
[9]:
<xarray.DataArray 'SCS_BLU_XGM/XGM/DOOCS:output.data.intensityTD' (trainId: 7263, dim_0: 1000)>
array([[ 4.4886490e+01,  4.2309365e+03, -4.5598242e+03, ...,
         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [ 1.0151898e+02,  2.2400598e+03, -2.7732441e+03, ...,
         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [-1.3794557e+02,  2.4830901e+03, -3.6583892e+03, ...,
         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       ...,
       [-4.2194626e+02,  5.4188824e+02, -8.9533582e+02, ...,
         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [-1.3200552e+02,  1.1471447e+03, -1.4556660e+03, ...,
         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],
       [-2.3156431e+01,  2.2287026e+03, -3.3196895e+03, ...,
         1.0000000e+00,  1.0000000e+00,  1.0000000e+00]], dtype=float32)
Coordinates:
  * trainId  (trainId) uint64 517755296 517755297 ... 517762558 517762559
Dimensions without coordinates: dim_0

We’ll use the xarray.align() function to line up the arrays by their train ID labels:

[10]:
import xarray as xr

intensity_sase3_arr, intensity_scs_arr = xr.align(
    intensity_sase3_arr, intensity_scs_arr, join='inner'
)
[11]:
(intensity_scs_arr.coords['trainId'] == intensity_sase3_arr.coords['trainId']).all().item()
[11]:
True

Using join='inner' (which is the default) keeps only the trains present in both arrays, discarding the rest. If we specified join='outer' instead, it would keep every train and fill in NaN where data is missing.
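
To see the difference between the two join modes on a small scale, here is a sketch with two tiny labelled arrays. The train IDs and values are invented for illustration; only xarray.DataArray and xarray.align() are used:

```python
import numpy as np
import xarray as xr

# Two toy arrays labelled by (hypothetical) train ID; each misses one train.
a = xr.DataArray([1.0, 2.0, 3.0], dims=["trainId"],
                 coords={"trainId": [100, 101, 103]})
b = xr.DataArray([10.0, 20.0, 30.0], dims=["trainId"],
                 coords={"trainId": [100, 102, 103]})

# Inner join: only trains present in both arrays survive.
a_in, b_in = xr.align(a, b, join="inner")
print(a_in.trainId.values)

# Outer join: every train is kept, with NaN where an array had no data.
a_out, b_out = xr.align(a, b, join="outer")
print(a_out.trainId.values)
print(a_out.values)
print(b_out.values)
```

With the outer join, later processing has to cope with NaN values, for example by using NaN-aware functions such as np.nanmean().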

Multi-module detectors

Several detectors at European XFEL have modules recording data as separate sources. This run contains data from a DSSC detector:

[12]:
from extra_data.components import DSSC1M

dssc = DSSC1M(run)
dssc
[12]:
<DSSC1M: Data interface for detector 'SCS_DET_DSSC1M-1' with 16 modules>
[13]:
len(dssc.train_ids)
[13]:
5120

By default, we get trains where any detector module recorded data. We can specify min_modules to get only trains where all 16 modules recorded data:

[14]:
dssc_allmod = DSSC1M(run, min_modules=16)

len(dssc_allmod.train_ids)
[14]:
5049
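
The idea behind this threshold can be sketched with a small presence table. Everything below (the table, the helper name trains_with_min_modules) is hypothetical and only illustrates the counting; it is not how DSSC1M is implemented:

```python
import numpy as np

# Hypothetical presence table: rows are modules, columns are trains.
# True means that module recorded data for that train.
present = np.array([
    [True, True,  True, False, True],
    [True, False, True, False, True],
    [True, True,  True, False, False],
    [True, True,  True, False, False],
])
n_modules = present.shape[0]


def trains_with_min_modules(present, min_modules):
    """Return indices of trains where at least min_modules modules have data."""
    modules_per_train = present.sum(axis=0)
    return np.flatnonzero(modules_per_train >= min_modules)


# At least one module present (analogous to the default behaviour).
any_module = trains_with_min_modules(present, 1)
# All modules present (analogous to min_modules equal to the module count).
all_modules = trains_with_min_modules(present, n_modules)
print(any_module, all_modules)
```

Raising the threshold can only shrink the set of selected trains, which matches the drop from 5120 to 5049 trains seen above.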

Or we can allow a certain number of missing modules in each train, to keep more of the data:

[15]:
dssc_mostmod = DSSC1M(run, min_modules=15)

len(dssc_mostmod.train_ids)
[15]:
5118

In this case, missing data will be filled in as 0 (for integers) or NaN (for floating point data) when we read the data. You should check that the code you’re using to process the data will behave correctly with the fill value.
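
As a rough illustration of why the fill value matters, the sketch below averages over modules for a single train in which one module is missing. The arrays and the module_present mask are invented for this example; how you find out which modules are present in real data depends on your own code:

```python
import numpy as np

# Hypothetical stack of 3 modules x 4 pixels for one train; module 1 is missing.
float_data = np.array([
    [2.0, 4.0, 6.0, 8.0],
    [np.nan, np.nan, np.nan, np.nan],  # missing module filled with NaN
    [4.0, 6.0, 8.0, 10.0],
], dtype=np.float32)

# A plain mean propagates NaN; a NaN-aware mean ignores the missing module.
plain_mean = float_data.mean(axis=0)
nan_aware_mean = np.nanmean(float_data, axis=0)

# The same data as integers, where the missing module is filled with 0.
int_data = np.array([
    [2, 4, 6, 8],
    [0, 0, 0, 0],  # missing module filled with 0
    [4, 6, 8, 10],
], dtype=np.uint16)

# Assumed to be known separately: which modules actually recorded data.
module_present = np.array([True, False, True])

# Zeros silently pull the mean down unless the missing module is masked out.
biased_mean = int_data.mean(axis=0)
masked_mean = int_data[module_present].mean(axis=0)

print(plain_mean, nan_aware_mean)
print(biased_mean, masked_mean)
```

NaN fill values tend to make the problem visible, whereas a 0 fill value in integer data can bias results without any warning.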