Reading data train by train

If the data you want to work with is too big to load into memory all at once, one simple alternative is to process data from one train at a time.

Other options, such as Dask, may run faster or make certain kinds of processing easier, but code that iterates through the trains is usually easier to understand.
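
The general shape of train-by-train processing can be sketched without any real data. In the minimal example below, `fake_trains` is a hypothetical stand-in for an iterator that yields (train ID, data) pairs, and the source and key names are made up. Only one train's values are held in memory at a time while a running total is updated.

```python
def fake_trains(n_trains, first_id=1000):
    """Hypothetical stand-in for a train iterator: yields (train_id, data)."""
    for i in range(n_trains):
        # One small list of numbers per train, built lazily on demand
        values = [float(i + j) for j in range(4)]
        yield first_id + i, {"EXAMPLE/SOURCE": {"example.key": values}}


def running_mean(trains):
    """Average all values across trains, holding one train at a time."""
    total = 0.0
    count = 0
    for _tid, data in trains:
        values = data["EXAMPLE/SOURCE"]["example.key"]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


mean = running_mean(fake_trains(3))
print("Mean over all trains:", mean)
```

Because the accumulator only keeps a sum and a count, memory use does not grow with the number of trains.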

[1]:
from extra_data import open_run

run = open_run(proposal=700000, run=2)
run.info()  # Show overview info about this data
# of trains:    3392
Duration:       0:05:39.2
First train ID: 79726751
Last train ID:  79730142

16 detector modules (SPB_DET_AGIPD1M-1)
  e.g. module SPB_DET_AGIPD1M-1 0 : 512 x 128 pixels
  SPB_DET_AGIPD1M-1/DET/0CH0:xtdf
  64 frames per train, up to 217088 frames total

3 instrument sources (excluding detectors):
  - SA1_XTD2_XGM/XGM/DOOCS:output
  - SPB_IRU_SIDEMIC_CAM:daqOutput
  - SPB_XTD9_XGM/XGM/DOOCS:output

13 control sources:
  - ACC_SYS_DOOCS/CTRL/BEAMCONDITIONS
  - SA1_XTD2_XGM/XGM/DOOCS
  - SPB_IRU_AGIPD1M/PSC/HV
  - SPB_IRU_AGIPD1M/TSENS/H1_T_EXTHOUS
  - SPB_IRU_AGIPD1M/TSENS/H2_T_EXTHOUS
  - SPB_IRU_AGIPD1M/TSENS/Q1_T_BLOCK
  - SPB_IRU_AGIPD1M/TSENS/Q2_T_BLOCK
  - SPB_IRU_AGIPD1M/TSENS/Q3_T_BLOCK
  - SPB_IRU_AGIPD1M/TSENS/Q4_T_BLOCK
  - SPB_IRU_AGIPD1M1/CTRL/MC1
  - SPB_IRU_AGIPD1M1/CTRL/MC2
  - SPB_IRU_VAC/GAUGE/GAUGE_FR_6
  - SPB_XTD9_XGM/XGM/DOOCS

To iterate through the trains in this run, we need the .trains() method.

But first, it’s always a good idea to select the sources and keys we want, so we don’t waste time loading irrelevant data. Let’s select the image data from all AGIPD modules:

[2]:
sel = run.select('SPB_DET_AGIPD1M-1/DET/*CH0:xtdf', 'image.data')
sel.all_sources
[2]:
frozenset({'SPB_DET_AGIPD1M-1/DET/0CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/10CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/11CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/12CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/13CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/14CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/15CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/1CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/2CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/3CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/4CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/5CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/6CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/7CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/8CH0:xtdf',
           'SPB_DET_AGIPD1M-1/DET/9CH0:xtdf'})
[3]:
for tid, data in sel.trains():
    print("Processing train", tid)
    print("Detector data module 0 shape:", data['SPB_DET_AGIPD1M-1/DET/0CH0:xtdf']['image.data'].shape)

    break  # Stop after the first train to keep the demo quick
Processing train 79726751
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-630f9647c3c0> in <module>
      1 for tid, data in sel.trains():
      2     print("Processing train", tid)
----> 3     print("Detector data module 0 shape:", data['SPB_DET_AGIPD1M-1/DET/0CH0:xtdf']['image.data'].shape)
      4
      5     break  # Stop after the first train to keep the demo quick

KeyError: 'image.data'

Oops, module 0 has no image data for this train. We can pass require_all=True to .trains() to skip over trains where some modules are missing data.

[4]:
for tid, data in sel.trains(require_all=True):
    print("Processing train", tid)
    print("Detector data module 0 shape:", data['SPB_DET_AGIPD1M-1/DET/0CH0:xtdf']['image.data'].shape)

    break  # Stop after the first train to keep the demo quick
Processing train 79726787
Detector data module 0 shape: (64, 2, 512, 128)
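
Conceptually, skipping incomplete trains amounts to a filter over the (train ID, data) pairs. The plain-Python sketch below illustrates that idea only; `only_complete` is a hypothetical helper, the source names are made up, and this is not how the library implements require_all.

```python
def only_complete(trains, required):
    """Yield only trains where every (source, key) pair in `required` is present.

    Hypothetical helper written for illustration; not part of any library.
    """
    for tid, data in trains:
        if all(key in data.get(src, {}) for src, key in required):
            yield tid, data


# Made-up trains: the first has an entry for MOD/0 but no 'image.data' in it
trains = [
    (101, {"MOD/0": {}, "MOD/1": {"image.data": [1, 2]}}),
    (102, {"MOD/0": {"image.data": [3, 4]}, "MOD/1": {"image.data": [5, 6]}}),
]
required = [("MOD/0", "image.data"), ("MOD/1", "image.data")]

kept = [tid for tid, _ in only_complete(trains, required)]
print("Trains with all data:", kept)
```

Train 101 is dropped for the same reason the earlier loop failed: the source is present in the dictionary, but the requested key is not.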

The data for each train is organised in nested dictionaries: data[source][key]. Train-by-train iteration is often used with multi-module detectors like AGIPD, so the stack_detector_data function offers a convenient way to combine data from multiple similar modules into a single array.

[5]:
from extra_data import stack_detector_data

for tid, data in sel.trains(require_all=True):
    print("Detector data module 0 shape:", data['SPB_DET_AGIPD1M-1/DET/0CH0:xtdf']['image.data'].shape)
    stacked = stack_detector_data(data, 'image.data')
    print("Stacked data shape:", stacked.shape)

    break  # Stop after the first train to keep the demo quick
Detector data module 0 shape: (64, 2, 512, 128)
Stacked data shape: (64, 2, 16, 512, 128)
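
To see where the extra dimension of 16 comes from, the sketch below stacks per-module arrays by hand with NumPy. It is an illustration under stated assumptions, not the implementation of stack_detector_data: the source names and the small array shapes are made up, `module_number` is a hypothetical helper, and the module axis is placed third from last only because that matches the output shape shown above.

```python
import numpy as np

# Made-up, much smaller module shape than the real (64, 2, 512, 128)
n_modules = 16
data = {
    f"EXAMPLE_DET/DET/{m}CH0:xtdf": {"image.data": np.full((3, 2, 8, 4), m)}
    for m in range(n_modules)
}


def module_number(source):
    """Extract the module number from a name like '.../DET/10CH0:xtdf'."""
    return int(source.split("/")[-1].split("CH")[0])


# Sort numerically so that module 10 comes after module 9, not after module 1
sources = sorted(data, key=module_number)
stacked = np.stack([data[s]["image.data"] for s in sources], axis=-3)
print("Stacked data shape:", stacked.shape)
```

Sorting by module number matters here: a plain string sort would order the names as 0, 1, 10, 11, and so on, so the position along the new axis would no longer match the module number.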

There are also methods which can get one train in the same format, from either a train ID or an index within this data:

[6]:
tid, data = sel.train_from_id(79726787)  # Look up a train by its train ID
tid, data = sel.train_from_index(36)     # ...or by its position within this data
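
The overview at the top lists 3392 trains with IDs from 79726751 to 79730142, which is consistent with consecutive train IDs. Under that assumption (other runs can have gaps, and a selection may not contain every train), the relationship between a train ID and its index is simple arithmetic, as this small sketch shows:

```python
# Values copied from the run overview above
first_train_id = 79726751
n_trains = 3392

# Assumption: train IDs in this run are consecutive with no gaps
train_ids = range(first_train_id, first_train_id + n_trains)

print("Last train ID:", train_ids[-1])
print("Train ID at index 36:", train_ids[36])
print("Index of train 79726787:", train_ids.index(79726787))
```

If the assumption holds for the selected data, the two calls above refer to the same train, the first one for which every module has image data.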