{ "cells": [ { "cell_type": "markdown", "id": "a517d16c", "metadata": {}, "source": [ "# Aligning data from different sources\n", "\n", "Sometimes, instruments recording data miss a train. In particular, different sources may start & finish recording at slightly different times. So two arrays loaded from the same run don't necessarily line up:" ] }, { "cell_type": "code", "execution_count": 1, "id": "d3757db4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from extra_data import open_run" ] }, { "cell_type": "code", "execution_count": 2, "id": "983da37d", "metadata": {}, "outputs": [], "source": [ "run = open_run(proposal=700000, run=26)" ] }, { "cell_type": "code", "execution_count": 3, "id": "01e2987e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7263, 7264\n" ] } ], "source": [ "intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "photflux_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS', 'pulseEnergy.photonFlux']\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {photflux_sase3.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "1d15a1ab", "metadata": {}, "source": [ "Even if we get the same *number* of trains, they may not line up if different instruments miss different trains:" ] }, { "cell_type": "code", "execution_count": 4, "id": "489c9a14", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7263, 7263\n", "Train IDs all match: False\n", "Train IDs matching (every 100th train):\n", "[ True True True True True True True True True True True True\n", " True True True True True True True True True True True True\n", " True True True True True True True True True False False False\n", " False False False False False False False False False False False False\n", " False False False False False False False False False False False False\n", " False False False False True True True True True True True 
True\n", " True]\n" ] } ], "source": [ "intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}\")\n", "\n", "train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()\n", "print(\"Train IDs all match:\", train_ids_eq.all())\n", "print(\"Train IDs matching (every 100th train):\")\n", "print(train_ids_eq[::100])" ] }, { "cell_type": "markdown", "id": "e000813c", "metadata": {}, "source": [ "We typically want to look at only the trains with data for all the sources we're using.\n", "There are a few ways we can get these." ] }, { "cell_type": "markdown", "id": "6792e2ba", "metadata": {}, "source": [ "## By selecting sources\n", "\n", "Use `.select()` to select specified sources & keys in the run. The `require_all=True` option discards trains where any of the selected data is missing." ] }, { "cell_type": "code", "execution_count": 5, "id": "1a336880", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of trains: 7262\n", "Duration: 0:12:06.4\n", "First train ID: 517755296\n", "Last train ID: 517762559\n", "\n", "0 detector modules ()\n", "\n", "2 instrument sources (excluding detectors):\n", " - SA3_XTD10_XGM/XGM/DOOCS:output\n", " - SCS_BLU_XGM/XGM/DOOCS:output\n", "\n", "2 control sources:\n", " - SA3_XTD10_XGM/XGM/DOOCS\n", " - SCS_BLU_XGM/XGM/DOOCS\n", "\n" ] } ], "source": [ "# Select a list of sources & keys\n", "sel = run.select([\n", " ('SA3_XTD10_XGM/XGM/DOOCS:output', '*'),\n", " ('SCS_BLU_XGM/XGM/DOOCS:output', '*'),\n", "], require_all=True)\n", "\n", "# Or select sources by pattern - this gets any sources with /XGM/ in the name\n", "sel = run.select('*/XGM/*', require_all=True)\n", "sel.info()" ] }, { "cell_type": "code", "execution_count": 6, "id": "b80cdcda", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7262, 7262\n", "Train 
IDs all match: True\n" ] } ], "source": [ "intensity_sase3 = sel['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "intensity_scs = sel['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}\")\n", "\n", "train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()\n", "print(\"Train IDs all match:\", train_ids_eq.all())" ] }, { "cell_type": "markdown", "id": "b74273e1", "metadata": {}, "source": [ "If `.select(..., require_all=True)` gives you 0 trains, it probably means that one of the sources you have selected didn't record any data in that run." ] }, { "cell_type": "markdown", "id": "f7c9a83d", "metadata": {}, "source": [ "## By selecting train IDs\n", "\n", "We can keep all the data for one source and drop, from another source, the trains which the first source missed, with code like this:" ] }, { "cell_type": "code", "execution_count": 7, "id": "1034e9f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7263, 7262\n" ] } ], "source": [ "from extra_data import by_id\n", "\n", "# Keep all data from this source:\n", "intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "\n", "# Keep only the trains from this source that the first source also recorded:\n", "intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].select_trains(\n", " by_id[intensity_sase3.train_id_coordinates()]\n", ")\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "aa99e276", "metadata": {}, "source": [ "This only excludes trains missing from the first source, so in this case the first source still has one extra train which the second does not have." ] }, { "cell_type": "markdown", "id": "43fb35ab", "metadata": {}, "source": [ "## Using xarray\n", "\n", "The options above exclude trains before loading the data. 
We can also align data after loading it as [xarray](https://xarray.pydata.org/en/stable/) labelled arrays:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8db4346b", "metadata": {}, "outputs": [], "source": [ "intensity_sase3_arr = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()\n", "intensity_scs_arr = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()" ] }, { "cell_type": "code", "execution_count": 9, "id": "82c0dd22", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'SCS_BLU_XGM/XGM/DOOCS:output.data.intensityTD' (trainId: 7263, dim_0: 1000)>\n",
       "array([[ 4.4886490e+01,  4.2309365e+03, -4.5598242e+03, ...,\n",
       "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n",
       "       [ 1.0151898e+02,  2.2400598e+03, -2.7732441e+03, ...,\n",
       "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n",
       "       [-1.3794557e+02,  2.4830901e+03, -3.6583892e+03, ...,\n",
       "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n",
       "       ...,\n",
       "       [-4.2194626e+02,  5.4188824e+02, -8.9533582e+02, ...,\n",
       "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n",
       "       [-1.3200552e+02,  1.1471447e+03, -1.4556660e+03, ...,\n",
       "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n",
       "       [-2.3156431e+01,  2.2287026e+03, -3.3196895e+03, ...,\n",
       "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00]], dtype=float32)\n",
       "Coordinates:\n",
       "  * trainId  (trainId) uint64 517755296 517755297 ... 517762558 517762559\n",
       "Dimensions without coordinates: dim_0
" ], "text/plain": [ "<xarray.DataArray 'SCS_BLU_XGM/XGM/DOOCS:output.data.intensityTD' (trainId: 7263, dim_0: 1000)>\n", "array([[ 4.4886490e+01,  4.2309365e+03, -4.5598242e+03, ...,\n", "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n", "       [ 1.0151898e+02,  2.2400598e+03, -2.7732441e+03, ...,\n", "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n", "       [-1.3794557e+02,  2.4830901e+03, -3.6583892e+03, ...,\n", "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n", "       ...,\n", "       [-4.2194626e+02,  5.4188824e+02, -8.9533582e+02, ...,\n", "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n", "       [-1.3200552e+02,  1.1471447e+03, -1.4556660e+03, ...,\n", "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00],\n", "       [-2.3156431e+01,  2.2287026e+03, -3.3196895e+03, ...,\n", "         1.0000000e+00,  1.0000000e+00,  1.0000000e+00]], dtype=float32)\n", "Coordinates:\n", "  * trainId  (trainId) uint64 517755296 517755297 ... 517762558 517762559\n", "Dimensions without coordinates: dim_0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "intensity_scs_arr" ] }, { "cell_type": "markdown", "id": "67d5d6b4", "metadata": {}, "source": [ "We'll use the [xarray.align()](https://xarray.pydata.org/en/stable/generated/xarray.align.html#xarray.align) function to line up the arrays by their train ID labels:" ] }, { "cell_type": "code", "execution_count": 10, "id": "eb212424", "metadata": {}, "outputs": [], "source": [ "import xarray as xr\n", "\n", "intensity_sase3_arr, intensity_scs_arr = xr.align(\n", " intensity_sase3_arr, intensity_scs_arr, join='inner'\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "2b810651", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(intensity_scs_arr.coords['trainId'] == intensity_sase3_arr.coords['trainId']).all().item()" ] }, { "cell_type": "markdown", "id": "550ba7cc", "metadata": {}, "source": [ "Using `join='inner'` (which is the default) keeps only the train IDs present in every array and discards the rest.\n", "If we specified `join='outer'` 
instead, it would insert gaps in the arrays where data is missing." ] }, { "cell_type": "markdown", "id": "c8b99b6a", "metadata": {}, "source": [ "## Multi-module detectors\n", "\n", "Several detectors at European XFEL have modules recording data as separate sources.\n", "This run contains data from a DSSC detector:" ] }, { "cell_type": "code", "execution_count": 12, "id": "66ba9d4a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from extra_data.components import DSSC1M\n", "\n", "dssc = DSSC1M(run)\n", "dssc" ] }, { "cell_type": "code", "execution_count": 13, "id": "2c112e18", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5120" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(dssc.train_ids)" ] }, { "cell_type": "markdown", "id": "6a0dc7d9", "metadata": {}, "source": [ "By default, we get trains where any detector module recorded data.\n", "We can specify `min_modules` to get trains where *all* modules recorded data:" ] }, { "cell_type": "code", "execution_count": 14, "id": "6918131e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5049" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dssc_allmod = DSSC1M(run, min_modules=16)\n", "\n", "len(dssc_allmod.train_ids)" ] }, { "cell_type": "markdown", "id": "410ae8bc", "metadata": {}, "source": [ "Or we can allow a certain number of missing modules in each train, to keep more of the data:" ] }, { "cell_type": "code", "execution_count": 15, "id": "e16e74f0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5118" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dssc_mostmod = DSSC1M(run, min_modules=15)\n", "\n", "len(dssc_mostmod.train_ids)" ] }, { "cell_type": "markdown", "id": "47f71f21", "metadata": {}, "source": [ "In this case, missing data 
will be filled in as 0 (for integers) or NaN (for floating-point data) when we read the data. Check that the code you use to process the data behaves correctly with these fill values." ] } ], "metadata": { "kernelspec": { "display_name": "xfel (Python 3.7)", "language": "python", "name": "xfel" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 5 }