Reading data to analyse in memory

It’s often quickest and easiest to load data into memory before analysing it.

Some types of data, especially from large pixel detectors, may be bigger than the available memory. Other examples show how to work with very large amounts of data. But the machines in the Maxwell cluster have 250–750 GB of memory, so you can use the simple approach for many cases.

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import re
import xarray as xr

from extra_data import RunDirectory

Tabular data (with pandas)

A run at European XFEL is stored as a folder containing a number of files. We can open a run with EXtra-data:

[2]:
run = RunDirectory('/gpfs/exfel/exp/SA1/201830/p900025/raw/r0150/')
run.info()  # Show overview info about this data
# of trains:    3721
Duration:       0:06:12.1
First train ID: 142844490
Last train ID:  142848210

0 detector modules ()

2 instrument sources (excluding detectors):
  - SA1_XTD2_XGM/XGM/DOOCS:output
  - SPB_XTD9_XGM/XGM/DOOCS:output

2 control sources:
  - SA1_XTD2_XGM/XGM/DOOCS
  - SPB_XTD9_XGM/XGM/DOOCS

This example works with data from two X-Ray Gas Monitors (XGMs). These measure properties of the X-ray beam in different parts of the tunnel. This data refers to one XGM in XTD2 and one in XTD9.

pandas is a popular Python library for working with tabular data. We’ll create a pandas dataframe containing the beam x and y position at each XGM, and the photon flux. We select the columns using ‘glob’ patterns, as often used for selecting files on Unix platforms.

  • [abc]: one character, a/b/c

  • ?: any one character

  • *: any sequence of characters

[3]:
df = run.get_dataframe(fields=[("*_XGM/*", "*.i[xy]Pos"), ("*_XGM/*", "*.photonFlux")])
df.head()
[3]:
SPB_XTD9_XGM/XGM/DOOCS/beamPosition.iyPos SPB_XTD9_XGM/XGM/DOOCS/beamPosition.ixPos SPB_XTD9_XGM/XGM/DOOCS/pulseEnergy.photonFlux SA1_XTD2_XGM/XGM/DOOCS/beamPosition.iyPos SA1_XTD2_XGM/XGM/DOOCS/beamPosition.ixPos SA1_XTD2_XGM/XGM/DOOCS/pulseEnergy.photonFlux
142844490 1.717195 -2.277912 1327.06958 0.161399 2.035218 1410.723755
142844491 1.717195 -2.277912 1327.06958 0.161399 2.035218 1410.137451
142844492 1.717195 -2.277912 1327.06958 0.161399 2.035218 1410.137451
142844493 1.717195 -2.277912 1327.06958 0.161399 2.035218 1410.137451
142844494 1.717195 -2.277912 1327.06958 0.161399 2.035218 1410.137451

We can now make plots to compare the parameters at the two XGM positions. As expected, there’s a strong correlation for each parameter.

[4]:
df.plot.scatter(
    x='SA1_XTD2_XGM/XGM/DOOCS/pulseEnergy.photonFlux',
    y='SPB_XTD9_XGM/XGM/DOOCS/pulseEnergy.photonFlux',
)
[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ab99bb80fd0>
_images/xpd_examples_7_1.png
[5]:
ax = df.plot.scatter(
    x='SA1_XTD2_XGM/XGM/DOOCS/beamPosition.ixPos',
    y='SPB_XTD9_XGM/XGM/DOOCS/beamPosition.ixPos',
)
_images/xpd_examples_8_0.png
[6]:
ay = df.plot.scatter(
    x='SA1_XTD2_XGM/XGM/DOOCS/beamPosition.iyPos',
    y='SPB_XTD9_XGM/XGM/DOOCS/beamPosition.iyPos',
)
_images/xpd_examples_9_0.png

We can also export the dataframe to a CSV file - or any other format pandas supports - for further analysis with other tools.

[7]:
df.to_csv('xtd2_xtd9_xgm_r150.csv')

As arrays (with xarray)

We’ll open a different run for this example:

[8]:
run = RunDirectory('/gpfs/exfel/exp/SA3/201830/p900027/raw/r0067/')
run.info()  # Show overview info about the data
# of trains:    1475
Duration:       0:02:27.5
First train ID: 128146446
Last train ID:  128147920

0 detector modules ()

1 instrument sources (excluding detectors):
  - SA3_XTD10_PES/ADC/1:network

11 control sources:
  - SA3_XTD10_PES/ADC/1
  - SA3_XTD10_PES/ASENS/MAGN_X
  - SA3_XTD10_PES/ASENS/MAGN_Y
  - SA3_XTD10_PES/ASENS/MAGN_Z
  - SA3_XTD10_PES/DCTRL/V30300S_NITROGEN
  - SA3_XTD10_PES/DCTRL/V30310S_NEON
  - SA3_XTD10_PES/DCTRL/V30320S_KRYPTON
  - SA3_XTD10_PES/DCTRL/V30330S_XENON
  - SA3_XTD10_PES/GAUGE/G30300F
  - SA3_XTD10_PES/GAUGE/G30310F
  - SA3_XTD10_PES/MCPS/MPOD

This run includes data from a Photo-Electron Spectrometer (PES), a monitoring device which records energy spectra for each train. Here’s the data from one of its 16 spectrometers:

[9]:
run.get_array('SA3_XTD10_PES/ADC/1:network', 'digitizers.channel_4_A.raw.samples')
[9]:
<xarray.DataArray (trainId: 1475, dim_0: 40000)>
array([[ -6, -10,  -7, ..., -10,  -8,  -9],
       [ -8,  -8,  -7, ...,  -9,  -2, -11],
       [ -8, -10,  -7, ...,  -6,  -8, -11],
       ...,
       [ -7,  -9,  -8, ...,  -9,  -2,  -5],
       [ -5, -10,  -8, ...,  -5,  -4, -10],
       [ -7,  -8,  -7, ...,  -6,  -5,  -8]], dtype=int16)
Coordinates:
  * trainId  (trainId) uint64 128146446 128146447 ... 128147919 128147920
Dimensions without coordinates: dim_0

The PES consists of 16 spectrometers arranged in a circle around the beamline. We’ll retrieve the data for two of these, separated by 90°. N and E refer to their positions in the circle, although these are not literally North and East.

xarray extends numpy arrays with metadata about the dimensions: we use this to annotate the data with the pulse train IDs they relate to. This is important when correlating data from different sources, as each source may be missing data for some pulse trains, so we need to match them up based on train IDs. The xarray.align() function does this, and by specifying join='inner', we keep only the trains which have data in both sets.

[10]:
data_n = run.get_array('SA3_XTD10_PES/ADC/1:network', 'digitizers.channel_4_A.raw.samples')
data_e = run.get_array('SA3_XTD10_PES/ADC/1:network', 'digitizers.channel_3_A.raw.samples')

data_n, data_e = xr.align(data_n, data_e, join='inner')
nsamples = data_n.shape[1]
data_n.shape
[10]:
(1475, 40000)

We’ll get a few other values from the run to annotate the plot. This uses pandas - see the section above for more about that.

[11]:
# Get the first values from four channels measuring voltage
electr = run.get_dataframe([('SA3_XTD10_PES/MCPS/MPOD', 'channels.U20[0123].measurementSenseVoltage')])
electr_voltages = electr.iloc[0].sort_index()
electr_voltages
[11]:
SA3_XTD10_PES/MCPS/MPOD/channels.U200.measurementSenseVoltage   -0.101792
SA3_XTD10_PES/MCPS/MPOD/channels.U201.measurementSenseVoltage   -0.111782
SA3_XTD10_PES/MCPS/MPOD/channels.U202.measurementSenseVoltage   -0.106823
SA3_XTD10_PES/MCPS/MPOD/channels.U203.measurementSenseVoltage   -0.107910
Name: 128146446, dtype: float32
[12]:
gas_interlocks = run.get_dataframe([('SA3_XTD10_PES/DCTRL/*', 'interlock.AActionState')])

# Take the first row of the gas interlock data and find which gas was unlocked
row = gas_interlocks.iloc[0]
print(row)
if (row == 0).any():
    key = row[row == 0].index[0]
    target_gas = re.search(r'(XENON|KRYPTON|NITROGEN|NEON)', key).group(1).title()
else:
    target_gas = 'No gas'
SA3_XTD10_PES/DCTRL/V30330S_XENON/interlock.AActionState       1
SA3_XTD10_PES/DCTRL/V30300S_NITROGEN/interlock.AActionState    1
SA3_XTD10_PES/DCTRL/V30320S_KRYPTON/interlock.AActionState     1
SA3_XTD10_PES/DCTRL/V30310S_NEON/interlock.AActionState        0
Name: 128146446, dtype: uint32

Now we can average the spectra across the trains in this run, and plot them.

[13]:
x = np.linspace(0, 0.0005*nsamples, nsamples, endpoint=False)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, dataset, start_time in zip(axes, [data_n, data_e], [15.76439411, 15.76289411]):
    ax.plot(x, dataset.sum(axis=0))
    ax.yaxis.major.formatter.set_powerlimits((0, 0))
    ax.set_xlim(15.75, 15.85)
    ax.set_xlabel('time ($\mu$s)')

    ax.axvline(start_time, color='red', linestyle='dotted', label='Start time')
    ax.axvline(start_time + 0.0079, color='magenta', linestyle='dotted', label='Neon K 1s')
    ax.axvline(start_time + 0.041, color='black', label='Auger peak')
    ax.legend()

axes[0].set_title('Spectrometer 00 N')
axes[1].set_title('Spectrometer 04 E')
fig.suptitle('{gas};  900 eV; [A B C D] = [{voltages[0]:.1f} {voltages[1]:.1f} {voltages[2]:.1f} {voltages[3]:.1f}] V; run 67'
             .format(gas=target_gas, voltages=electr_voltages.values), y=1.05);
_images/xpd_examples_23_0.png

The spectra look different because the beam is horizontally polarised, so the E spectrometer sees a peak that the N spectrometer doesn’t.