{ "cells": [ { "cell_type": "markdown", "id": "a517d16c", "metadata": {}, "source": [ "# Aligning data from different sources\n", "\n", "Sometimes, instruments recording data miss a train. In particular, different sources may start & finish recording at slightly different times. So two arrays loaded from the same run don't necessarily line up:" ] }, { "cell_type": "code", "execution_count": 1, "id": "d3757db4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from extra_data import open_run" ] }, { "cell_type": "code", "execution_count": 2, "id": "983da37d", "metadata": {}, "outputs": [], "source": [ "run = open_run(proposal=700000, run=26)" ] }, { "cell_type": "code", "execution_count": 3, "id": "01e2987e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7263, 7264\n" ] } ], "source": [ "intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "photflux_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS', 'pulseEnergy.photonFlux']\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {photflux_sase3.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "1d15a1ab", "metadata": {}, "source": [ "Even if we get the same *number* of trains, they may not line up if different instruments miss different trains:" ] }, { "cell_type": "code", "execution_count": 4, "id": "489c9a14", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7263, 7263\n", "Train IDs all match: False\n", "Train IDs matching (every 100th train):\n", "[ True True True True True True True True True True True True\n", " True True True True True True True True True True True True\n", " True True True True True True True True True False False False\n", " False False False False False False False False False False False False\n", " False False False False False False False False False False False False\n", " False False False False True True True True True True True 
True\n", " True]\n" ] } ], "source": [ "intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}\")\n", "\n", "train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()\n", "print(\"Train IDs all match:\", train_ids_eq.all())\n", "print(\"Train IDs matching (every 100th train):\")\n", "print(train_ids_eq[::100])" ] }, { "cell_type": "markdown", "id": "e000813c", "metadata": {}, "source": [ "We typically want to look at only the trains with data for all the sources we're using.\n", "There are a few ways we can get these." ] }, { "cell_type": "markdown", "id": "6792e2ba", "metadata": {}, "source": [ "## By selecting sources\n", "\n", "Use `.select()` to select specified sources & keys in the run. The `require_all=True` option discards trains where any of the selected data is missing." ] }, { "cell_type": "code", "execution_count": 5, "id": "1a336880", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of trains: 7262\n", "Duration: 0:12:06.4\n", "First train ID: 517755296\n", "Last train ID: 517762559\n", "\n", "0 detector modules ()\n", "\n", "2 instrument sources (excluding detectors):\n", " - SA3_XTD10_XGM/XGM/DOOCS:output\n", " - SCS_BLU_XGM/XGM/DOOCS:output\n", "\n", "2 control sources:\n", " - SA3_XTD10_XGM/XGM/DOOCS\n", " - SCS_BLU_XGM/XGM/DOOCS\n", "\n" ] } ], "source": [ "# Select a list of sources & keys\n", "sel = run.select([\n", " ('SA3_XTD10_XGM/XGM/DOOCS:output', '*'),\n", " ('SCS_BLU_XGM/XGM/DOOCS:output', '*'),\n", "], require_all=True)\n", "\n", "# Or select sources by pattern - this gets any sources with /XGM/ in the name\n", "sel = run.select('*/XGM/*', require_all=True)\n", "sel.info()" ] }, { "cell_type": "code", "execution_count": 6, "id": "b80cdcda", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7262, 7262\n", "Train 
IDs all match: True\n" ] } ], "source": [ "intensity_sase3 = sel['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "intensity_scs = sel['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}\")\n", "\n", "train_ids_eq = intensity_sase3.train_id_coordinates() == intensity_scs.train_id_coordinates()\n", "print(\"Train IDs all match:\", train_ids_eq.all())" ] }, { "cell_type": "markdown", "id": "b74273e1", "metadata": {}, "source": [ "If `.select(..., require_all=True)` gives you 0 trains, it probably means that one of the sources you have selected didn't record any data in that run." ] }, { "cell_type": "markdown", "id": "f7c9a83d", "metadata": {}, "source": [ "## By selecting train IDs\n", "\n", "We can keep all the data for one source, and drop from the other sources any trains which that source missed, with code like this:" ] }, { "cell_type": "code", "execution_count": 7, "id": "1034e9f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# trains measured: 7263, 7262\n" ] } ], "source": [ "from extra_data import by_id\n", "\n", "# Keep all data from this source:\n", "intensity_sase3 = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD']\n", "\n", "# Keep only the trains from this source which the first source also recorded:\n", "intensity_scs = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].select_trains(\n", "    by_id[intensity_sase3.train_id_coordinates()]\n", ")\n", "print(f\"# trains measured: {intensity_sase3.shape[0]}, {intensity_scs.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "aa99e276", "metadata": {}, "source": [ "This only excludes trains missing from the first source, so in this case the first source still has one train which the second does not." ] }, { "cell_type": "markdown", "id": "43fb35ab", "metadata": {}, "source": [ "## Using xarray\n", "\n", "The options above exclude trains before loading the data. 
We can also align data after loading it as [xarray](https://xarray.pydata.org/en/stable/) labelled arrays:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8db4346b", "metadata": {}, "outputs": [], "source": [ "intensity_sase3_arr = run['SA3_XTD10_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()\n", "intensity_scs_arr = run['SCS_BLU_XGM/XGM/DOOCS:output', 'data.intensityTD'].xarray()" ] }, { "cell_type": "code", "execution_count": 9, "id": "82c0dd22", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'SCS_BLU_XGM/XGM/DOOCS:output.data.intensityTD' (trainId: 7263, dim_0: 1000)>\n",
"array([[ 4.4886490e+01, 4.2309365e+03, -4.5598242e+03, ...,\n",
" 1.0000000e+00, 1.0000000e+00, 1.0000000e+00],\n",
" [ 1.0151898e+02, 2.2400598e+03, -2.7732441e+03, ...,\n",
" 1.0000000e+00, 1.0000000e+00, 1.0000000e+00],\n",
" [-1.3794557e+02, 2.4830901e+03, -3.6583892e+03, ...,\n",
" 1.0000000e+00, 1.0000000e+00, 1.0000000e+00],\n",
" ...,\n",
" [-4.2194626e+02, 5.4188824e+02, -8.9533582e+02, ...,\n",
" 1.0000000e+00, 1.0000000e+00, 1.0000000e+00],\n",
" [-1.3200552e+02, 1.1471447e+03, -1.4556660e+03, ...,\n",
" 1.0000000e+00, 1.0000000e+00, 1.0000000e+00],\n",
" [-2.3156431e+01, 2.2287026e+03, -3.3196895e+03, ...,\n",
" 1.0000000e+00, 1.0000000e+00, 1.0000000e+00]], dtype=float32)\n",
"Coordinates:\n",
" * trainId (trainId) uint64 517755296 517755297 ... 517762558 517762559\n",
"Dimensions without coordinates: dim_0