{ "cells": [ { "cell_type": "markdown", "id": "ca09761db8545719", "metadata": {}, "source": [ "# Optimizing Access Patterns\n", "\n", "```{article-info}\n", ":author: Altay Sansal\n", ":date: \"{sub-ref}`today`\"\n", ":read-time: \"{sub-ref}`wordcount-minutes` min read\"\n", ":class-container: sd-p-0 sd-outline-muted sd-rounded-3 sd-font-weight-light\n", "```\n", "\n", "## Introduction\n", "\n", "In this page we will be showing how we can take an existing MDIO and add\n", "fast access, lossy, versions of the data in IL/XL/TWT cross-sections (slices).\n", "\n", "We can re-use the MDIO dataset we created in the [Quickstart](#quickstart) page.\n", "Please run it first.\n", "\n", "Let's open the original MDIO first." ] }, { "cell_type": "code", "execution_count": null, "id": "45558306-ab9c-46aa-a299-8758a911b373", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 403MB\n",
       "Dimensions:           (inline: 345, crossline: 188, time: 1501)\n",
       "Coordinates:\n",
       "  * inline            (inline) int32 1kB 1 2 3 4 5 6 ... 340 341 342 343 344 345\n",
       "  * crossline         (crossline) int32 752B 1 2 3 4 5 6 ... 184 185 186 187 188\n",
       "  * time              (time) int32 6kB 0 2 4 6 8 10 ... 2992 2994 2996 2998 3000\n",
       "    cdp_y             (inline, crossline) float64 519kB ...\n",
       "    cdp_x             (inline, crossline) float64 519kB ...\n",
       "Data variables:\n",
       "    amplitude         (inline, crossline, time) float32 389MB ...\n",
       "    headers           (inline, crossline) [('trace_seq_num_line', '<i4'), ('trace_seq_num_reel', '<i4'), ('orig_field_record_num', '<i4'), ('trace_num_orig_record', '<i4'), ('energy_source_point_num', '<i4'), ('ensemble_num', '<i4'), ('trace_num_ensemble', '<i4'), ('trace_id_code', '<i2'), ('vertically_summed_traces', '<i2'), ('horizontally_stacked_traces', '<i2'), ('data_use', '<i2'), ('source_to_receiver_distance', '<i4'), ('receiver_group_elevation', '<i4'), ('source_surface_elevation', '<i4'), ('source_depth_below_surface', '<i4'), ('receiver_datum_elevation', '<i4'), ('source_datum_elevation', '<i4'), ('source_water_depth', '<i4'), ('receiver_water_depth', '<i4'), ('elevation_depth_scalar', '<i2'), ('coordinate_scalar', '<i2'), ('source_coord_x', '<i4'), ('source_coord_y', '<i4'), ('group_coord_x', '<i4'), ('group_coord_y', '<i4'), ('coordinate_unit', '<i2'), ('weathering_velocity', '<i2'), ('subweathering_velocity', '<i2'), ('source_uphole_time', '<i2'), ('group_uphole_time', '<i2'), ('source_static_correction', '<i2'), ('receiver_static_correction', '<i2'), ('total_static_applied', '<i2'), ('lag_time_a', '<i2'), ('lag_time_b', '<i2'), ('delay_recording_time', '<i2'), ('mute_time_start', '<i2'), ('mute_time_end', '<i2'), ('samples_per_trace', '<i2'), ('sample_interval', '<i2'), ('instrument_gain_type', '<i2'), ('instrument_gain_const', '<i2'), ('instrument_gain_initial', '<i2'), ('correlated_data', '<i2'), ('sweep_freq_start', '<i2'), ('sweep_freq_end', '<i2'), ('sweep_length', '<i2'), ('sweep_type', '<i2'), ('sweep_taper_start', '<i2'), ('sweep_taper_end', '<i2'), ('taper_type', '<i2'), ('alias_filter_freq', '<i2'), ('alias_filter_slope', '<i2'), ('notch_filter_freq', '<i2'), ('notch_filter_slope', '<i2'), ('low_cut_freq', '<i2'), ('high_cut_freq', '<i2'), ('low_cut_slope', '<i2'), ('high_cut_slope', '<i2'), ('year_recorded', '<i2'), ('day_of_year', '<i2'), ('hour_of_day', '<i2'), ('minute_of_hour', '<i2'), ('second_of_minute', '<i2'), 
('time_basis_code', '<i2'), ('trace_weighting_factor', '<i2'), ('group_num_roll_switch', '<i2'), ('group_num_first_trace', '<i2'), ('group_num_last_trace', '<i2'), ('gap_size', '<i2'), ('taper_overtravel', '<i2'), ('inline', '<i4'), ('crossline', '<i4'), ('cdp_x', '<i4'), ('cdp_y', '<i4')] 13MB ...\n",
       "    segy_file_header  <U1 4B ...\n",
       "    trace_mask        (inline, crossline) bool 65kB ...\n",
       "Attributes:\n",
       "    apiVersion:  1.1.1\n",
       "    createdOn:   2025-12-19 16:05:58.230520+00:00\n",
       "    name:        PostStack3DTime\n",
       "    attributes:  {'surveyType': '3D', 'gatherType': 'stacked', 'defaultVariab...
" ], "text/plain": [ " Size: 403MB\n", "Dimensions: (inline: 345, crossline: 188, time: 1501)\n", "Coordinates:\n", " * inline (inline) int32 1kB 1 2 3 4 5 6 ... 340 341 342 343 344 345\n", " * crossline (crossline) int32 752B 1 2 3 4 5 6 ... 184 185 186 187 188\n", " * time (time) int32 6kB 0 2 4 6 8 10 ... 2992 2994 2996 2998 3000\n", " cdp_y (inline, crossline) float64 519kB ...\n", " cdp_x (inline, crossline) float64 519kB ...\n", "Data variables:\n", " amplitude (inline, crossline, time) float32 389MB ...\n", " headers (inline, crossline) [('trace_seq_num_line', '\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 2GB\n",
       "Dimensions:           (inline: 345, crossline: 188, time: 1501)\n",
       "Coordinates:\n",
       "  * inline            (inline) int32 1kB 1 2 3 4 5 6 ... 340 341 342 343 344 345\n",
       "  * crossline         (crossline) int32 752B 1 2 3 4 5 6 ... 184 185 186 187 188\n",
       "  * time              (time) int32 6kB 0 2 4 6 8 10 ... 2992 2994 2996 2998 3000\n",
       "    cdp_x             (inline, crossline) float64 519kB ...\n",
       "    cdp_y             (inline, crossline) float64 519kB ...\n",
       "Data variables:\n",
       "    segy_file_header  <U1 4B ...\n",
       "    trace_mask        (inline, crossline) bool 65kB ...\n",
       "    amplitude         (inline, crossline, time) float32 389MB ...\n",
       "    headers           (inline, crossline) [('trace_seq_num_line', '<i4'), ('trace_seq_num_reel', '<i4'), ('orig_field_record_num', '<i4'), ('trace_num_orig_record', '<i4'), ('energy_source_point_num', '<i4'), ('ensemble_num', '<i4'), ('trace_num_ensemble', '<i4'), ('trace_id_code', '<i2'), ('vertically_summed_traces', '<i2'), ('horizontally_stacked_traces', '<i2'), ('data_use', '<i2'), ('source_to_receiver_distance', '<i4'), ('receiver_group_elevation', '<i4'), ('source_surface_elevation', '<i4'), ('source_depth_below_surface', '<i4'), ('receiver_datum_elevation', '<i4'), ('source_datum_elevation', '<i4'), ('source_water_depth', '<i4'), ('receiver_water_depth', '<i4'), ('elevation_depth_scalar', '<i2'), ('coordinate_scalar', '<i2'), ('source_coord_x', '<i4'), ('source_coord_y', '<i4'), ('group_coord_x', '<i4'), ('group_coord_y', '<i4'), ('coordinate_unit', '<i2'), ('weathering_velocity', '<i2'), ('subweathering_velocity', '<i2'), ('source_uphole_time', '<i2'), ('group_uphole_time', '<i2'), ('source_static_correction', '<i2'), ('receiver_static_correction', '<i2'), ('total_static_applied', '<i2'), ('lag_time_a', '<i2'), ('lag_time_b', '<i2'), ('delay_recording_time', '<i2'), ('mute_time_start', '<i2'), ('mute_time_end', '<i2'), ('samples_per_trace', '<i2'), ('sample_interval', '<i2'), ('instrument_gain_type', '<i2'), ('instrument_gain_const', '<i2'), ('instrument_gain_initial', '<i2'), ('correlated_data', '<i2'), ('sweep_freq_start', '<i2'), ('sweep_freq_end', '<i2'), ('sweep_length', '<i2'), ('sweep_type', '<i2'), ('sweep_taper_start', '<i2'), ('sweep_taper_end', '<i2'), ('taper_type', '<i2'), ('alias_filter_freq', '<i2'), ('alias_filter_slope', '<i2'), ('notch_filter_freq', '<i2'), ('notch_filter_slope', '<i2'), ('low_cut_freq', '<i2'), ('high_cut_freq', '<i2'), ('low_cut_slope', '<i2'), ('high_cut_slope', '<i2'), ('year_recorded', '<i2'), ('day_of_year', '<i2'), ('hour_of_day', '<i2'), ('minute_of_hour', '<i2'), ('second_of_minute', '<i2'), 
('time_basis_code', '<i2'), ('trace_weighting_factor', '<i2'), ('group_num_roll_switch', '<i2'), ('group_num_first_trace', '<i2'), ('group_num_last_trace', '<i2'), ('gap_size', '<i2'), ('taper_overtravel', '<i2'), ('inline', '<i4'), ('crossline', '<i4'), ('cdp_x', '<i4'), ('cdp_y', '<i4')] 13MB ...\n",
       "    fast_crossline    (inline, crossline, time) float32 389MB ...\n",
       "    fast_inline       (inline, crossline, time) float32 389MB ...\n",
       "    fast_time         (inline, crossline, time) float32 389MB ...\n",
       "Attributes:\n",
       "    apiVersion:  1.1.1\n",
       "    createdOn:   2025-12-19 16:05:58.230520+00:00\n",
       "    name:        PostStack3DTime\n",
       "    attributes:  {'surveyType': '3D', 'gatherType': 'stacked', 'defaultVariab...
" ], "text/plain": [ " Size: 2GB\n", "Dimensions: (inline: 345, crossline: 188, time: 1501)\n", "Coordinates:\n", " * inline (inline) int32 1kB 1 2 3 4 5 6 ... 340 341 342 343 344 345\n", " * crossline (crossline) int32 752B 1 2 3 4 5 6 ... 184 185 186 187 188\n", " * time (time) int32 6kB 0 2 4 6 8 10 ... 2992 2994 2996 2998 3000\n", " cdp_x (inline, crossline) float64 519kB ...\n", " cdp_y (inline, crossline) float64 519kB ...\n", "Data variables:\n", " segy_file_header " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "from mdio.builder.schemas.v1.stats import SummaryStatistics\n", "\n", "stats = SummaryStatistics.model_validate_json(ds.amplitude.attrs[\"statsV1\"])\n", "imshow_kw = {\n", " \"vmin\": -3 * stats.std,\n", " \"vmax\": 3 * stats.std,\n", " \"cmap\": \"gray_r\",\n", " \"interpolation\": \"bilinear\",\n", " \"yincrease\": False,\n", " \"add_colorbar\": False,\n", "}\n", "\n", "fig, ax = plt.subplots(1, 4, sharex=\"all\", sharey=\"all\", figsize=(8, 5))\n", "\n", "ds_inline = ds.sel(inline=200)\n", "\n", "ds_inline.amplitude.T.plot.imshow(ax=ax[0], **imshow_kw)\n", "ds_inline.fast_inline.T.plot.imshow(ax=ax[1], **imshow_kw)\n", "\n", "diff = ds_inline.amplitude - ds_inline.fast_inline\n", "diff.T.plot.imshow(ax=ax[2], **imshow_kw)\n", "(1000 * diff).T.plot.imshow(ax=ax[3], **imshow_kw)\n", "\n", "for axis, title in zip(ax.ravel(), [\"original\", \"lossy\", \"difference\", \"1,000xdifference\"], strict=False):\n", " if title != \"original\":\n", " axis.set_ylabel(\"\")\n", " axis.set_title(title)\n", "\n", "fig.tight_layout();" ] }, { "cell_type": "markdown", "id": "220399c2-d0a3-48cc-89a3-2594af073f73", "metadata": {}, "source": [ "## Adjusting the Compressor\n", "\n", "The compressor can be modified for fast access patterns but the default setting usually works quite well.\n", "Given 1:10 compression ratio, the fidelity is quite high with the default `ZfpQuality.LOW` setting.\n", "\n", "If 
you still want to use ZFP compression but change the quality setting, follow the instructions below.\n", "MDIO also provides a `Blosc` compressor, but we will not demonstrate it here." ] }, { "cell_type": "code", "execution_count": null, "id": "877160c9-9bf3-47b9-92ca-1e3dd87584e2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ZFP(name='zfp', mode=, tolerance=0.09305394453239418, rate=None, precision=None)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mdio.optimize import ZfpQuality\n", "from mdio.optimize import get_default_zfp\n", "\n", "get_default_zfp(stats, ZfpQuality.HIGH)" ] }, { "cell_type": "markdown", "id": "48a0ece3-2ff4-41f6-9867-6296c733e7e9", "metadata": {}, "source": [ "Here is an example with the medium quality setting. Note that the tolerance changes because it is derived from the dataset statistics and the compression quality setting." ] }, { "cell_type": "code", "execution_count": null, "id": "255713b5-988a-431f-a171-846bba87b228", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ZFP(name='zfp', mode=, tolerance=0.9305394453239417, rate=None, precision=None)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_default_zfp(stats, ZfpQuality.MEDIUM)" ] }, { "cell_type": "markdown", "id": "2900c40b-c332-4334-a4cc-f0e5571c7387", "metadata": {}, "source": [ "In conclusion, we showed that generating optimized, lossy-compressed copies of the data\n", "for certain access patterns yields big performance benefits when reading the data.\n", "\n", "The differences can be orders of magnitude larger on big datasets and remote stores, depending on the available\n", "network bandwidth." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }