Wranglers

In this notebook we give a brief overview of wrangling with netCDF-SCM.

[1]:
# NBVAL_IGNORE_OUTPUT
import glob
from pathlib import Path

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymagicc
[2]:
plt.style.use("bmh")
%matplotlib inline

Wrangling help

The wrangling help can be accessed via our command line interface.

[3]:
# NBVAL_IGNORE_OUTPUT
!netcdf-scm wrangle -h
Usage: netcdf-scm wrangle [OPTIONS] SRC DST WRANGLE_CONTACT

  Wrangle netCDF-SCM ``.nc`` files into other formats and directory
  structures.

  ``src`` is searched recursively and netcdf-scm will attempt to wrangle all
  the files found.

  ``wrangle_contact`` is written into the header of the output files.

Options:
  --regexp TEXT                   Regular expression to apply to file
                                  directory (only wrangles matches). Be
                                  careful, if you use a very complex regexp
                                  directory sorting can be extremely slow (see
                                  e.g. discussion at
                                  https://stackoverflow.com/a/5428712)!
                                  [default: ^(?!.*(fx)).*$]

  --prefix TEXT                   Prefix to apply to output file names (not
                                  paths).

  --out-format [mag-files|mag-files-average-year-start-year|mag-files-average-year-mid-year|mag-files-average-year-end-year|mag-files-point-start-year|mag-files-point-mid-year|mag-files-point-end-year|magicc-input-files|magicc-input-files-average-year-start-year|magicc-input-files-average-year-mid-year|magicc-input-files-average-year-end-year|magicc-input-files-point-start-year|magicc-input-files-point-mid-year|magicc-input-files-point-end-year|tuningstrucs-blend-model]
                                  Format to re-write crunched data into. The
                                  time operation conventions follow those in
                                  `Pymagicc <https://pymagicc.readthedocs.io/e
                                  n/latest/file_conventions.html#namelists>`_.
                                  [default: mag-files]

  --drs [None|MarbleCMIP5|CMIP6Input4MIPs|CMIP6Output]
                                  Data reference syntax to use to decipher
                                  paths. This is required to ensure the output
                                  folders match the input data reference
                                  syntax.  [default: None]

  -f, --force / --do-not-force    Overwrite any existing files.  [default:
                                  False]

  --number-workers INTEGER        Number of worker (threads) to use when
                                  wrangling.  [default: 4]

  --target-units-specs PATH       csv containing target units for wrangled
                                  variables.

  -h, --help                      Show this message and exit.
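
The default `--regexp` value shown in the help text, `^(?!.*(fx)).*$`, relies on a negative lookahead to skip any path that contains `fx`. A minimal sketch of how such patterns select directories, using Python's `re` module, is given here. The paths are hypothetical and the use of `re.match` is an assumption for illustration; it is not necessarily how netcdf-scm applies the pattern internally.

```python
import re

DEFAULT_REGEXP = r"^(?!.*(fx)).*$"  # default value taken from the help text
CUSTOM_REGEXP = r".*cSoilFast.*"  # pattern that keeps only cSoilFast paths

# Hypothetical directory paths, for illustration only
paths = [
    "CMIP6/CMIP/NCAR/CESM2/historical/r7i1p1f1/Lmon/cSoilFast/gn/v20190311",
    "CMIP6/CMIP/NCAR/CESM2/historical/r7i1p1f1/fx/sftlf/gn/v20190311",
    "CMIP6/CMIP/NCAR/CESM2/historical/r7i1p1f1/Amon/tas/gn/v20190311",
]


def select(pattern, candidates):
    """Return the candidates that the pattern matches from the start."""
    compiled = re.compile(pattern)
    return [path for path in candidates if compiled.match(path)]


default_matches = select(DEFAULT_REGEXP, paths)  # everything except the fx path
custom_matches = select(CUSTOM_REGEXP, paths)  # only the cSoilFast path

print(len(default_matches), len(custom_matches))
```

Note that the lookahead rejects a path if the characters `fx` appear anywhere in it, not only as a directory name.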

MAG file wrangling

The most common format to wrangle to is the .MAG format. This is a custom MAGICC format (see https://pymagicc.readthedocs.io/en/latest/file_conventions.html#the-future). Data which has already been crunched can be wrangled to this format as shown below.

[4]:
# NBVAL_IGNORE_OUTPUT
!netcdf-scm wrangle \
    "../../../tests/test-data/expected-crunching-output/cmip6output/Lmon/CMIP6/CMIP/NCAR" \
    "../../../output-examples/wrangled-files" "notebook example <email address>" \
    --force \
    --drs "CMIP6Output" \
    --out-format "mag-files" \
    --regexp ".*cSoilFast.*"
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:netcdf-scm: 2.0.0rc5+3.gc7d2d42.dirty
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:wrangle_contact: notebook example <email address>
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:source: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/tests/test-data/expected-crunching-output/cmip6output/Lmon/CMIP6/CMIP/NCAR
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:destination: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/output-examples/wrangled-files
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:regexp: .*cSoilFast.*
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:prefix: None
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:drs: CMIP6Output
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:out_format: mag-files
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:force: True
69278 2020-10-03 23:35:01,799 INFO:netcdf_scm:Finding directories with files
Walking through directories and applying `check_func`: 11it [00:00, 11542.99it/s]
69278 2020-10-03 23:35:01,807 INFO:netcdf_scm:Found 1 directories with files
69278 2020-10-03 23:35:01,808 INFO:netcdf_scm.cli_parallel:Processing in parallel with 4 workers
69278 2020-10-03 23:35:01,808 INFO:netcdf_scm.cli_parallel:Forcing dask to use a single thread when reading
100%|████████████████████████████████████████| 1.00/1.00 [00:05<00:00, 5.28s/it]

We can then load the .MAG files using Pymagicc.

[5]:
written_files = [
    f for f in Path("../../../output-examples/wrangled-files").rglob("*.MAG")
]
written_files
[5]:
[PosixPath('../../../output-examples/wrangled-files/CMIP6/CMIP/NCAR/CESM2/historical/r7i1p1f1/Lmon/cSoilFast/gn/v20190311/netcdf-scm_cSoilFast_Lmon_CESM2_historical_r7i1p1f1_gn_195701-195703.MAG')]
[6]:
wrangled = pymagicc.io.MAGICCData(str(written_files[0]))
[7]:
# NBVAL_IGNORE_OUTPUT
wrangled.timeseries()
[7]:
time                                                                                         1957-01-15 12:00:00  1957-02-14 00:00:00  1957-03-15 12:00:00
climate_model model       region                         scenario    todo unit    variable
unspecified   unspecified World|Land                     unspecified SET  kg m^-2 cSoilFast             0.085600             0.085547             0.085422
                          World                          unspecified SET  kg m^-2 cSoilFast             0.085600             0.085547             0.085422
                          World|Northern Hemisphere      unspecified SET  kg m^-2 cSoilFast             0.097727             0.097910             0.098135
                          World|Southern Hemisphere|Land unspecified SET  kg m^-2 cSoilFast             0.060421             0.059879             0.059024
                          World|Southern Hemisphere      unspecified SET  kg m^-2 cSoilFast             0.060421             0.059879             0.059024
                          World|Northern Hemisphere|Land unspecified SET  kg m^-2 cSoilFast             0.097727             0.097910             0.098135
[8]:
# NBVAL_IGNORE_OUTPUT
wrangled.lineplot(hue="region")
[8]:
<AxesSubplot:xlabel='time', ylabel='kg m^-2'>
../_images/usage_wranglers_11_1.png

Adjusting units

The units of the wrangled data are kg m^-2. This may not be the most convenient unit. As such, netcdf-scm wrangle allows users to specify a csv which defines the target units to use for each variable when wrangling.

The conversion csv should look like the below.

[9]:
conv_csv = pd.DataFrame(
    [["cSoilFast", "t / m**2"], ["tos", "K"]], columns=["variable", "unit"]
)
conv_csv_path = "../../../output-examples/conversion-new-units.csv"
conv_csv.to_csv(conv_csv_path, index=False)
with open(conv_csv_path) as f:
    conv_csv_content = f.read()

print(conv_csv_content)
variable,unit
cSoilFast,t / m**2
tos,K

With such a csv, we can now wrangle to our desired units.

[10]:
# NBVAL_IGNORE_OUTPUT
!netcdf-scm wrangle \
    "../../../tests/test-data/expected-crunching-output/cmip6output/Lmon/CMIP6/CMIP/NCAR" \
    "../../../output-examples/wrangled-files-new-units" \
    "notebook example <email address>" \
    --force --drs "CMIP6Output" \
    --out-format "mag-files" \
    --regexp ".*cSoilFast.*" \
    --target-units-specs "../../../output-examples/conversion-new-units.csv"
69297 2020-10-03 23:35:12,106 INFO:netcdf_scm:netcdf-scm: 2.0.0rc5+3.gc7d2d42.dirty
69297 2020-10-03 23:35:12,106 INFO:netcdf_scm:wrangle_contact: notebook example <email address>
69297 2020-10-03 23:35:12,106 INFO:netcdf_scm:source: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/tests/test-data/expected-crunching-output/cmip6output/Lmon/CMIP6/CMIP/NCAR
69297 2020-10-03 23:35:12,106 INFO:netcdf_scm:destination: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/output-examples/wrangled-files-new-units
69297 2020-10-03 23:35:12,106 INFO:netcdf_scm:regexp: .*cSoilFast.*
69297 2020-10-03 23:35:12,106 INFO:netcdf_scm:prefix: None
69297 2020-10-03 23:35:12,107 INFO:netcdf_scm:drs: CMIP6Output
69297 2020-10-03 23:35:12,107 INFO:netcdf_scm:out_format: mag-files
69297 2020-10-03 23:35:12,107 INFO:netcdf_scm:force: True
69297 2020-10-03 23:35:12,109 INFO:netcdf_scm:Finding directories with files
Walking through directories and applying `check_func`: 11it [00:00, 12103.19it/s]
69297 2020-10-03 23:35:12,117 INFO:netcdf_scm:Found 1 directories with files
69297 2020-10-03 23:35:12,117 INFO:netcdf_scm.cli_parallel:Processing in parallel with 4 workers
69297 2020-10-03 23:35:12,117 INFO:netcdf_scm.cli_parallel:Forcing dask to use a single thread when reading
100%|████████████████████████████████████████| 1.00/1.00 [00:06<00:00, 6.87s/it]
[11]:
# NBVAL_IGNORE_OUTPUT
written_files = [
    f
    for f in Path("../../../output-examples/wrangled-files-new-units").rglob(
        "*.MAG"
    )
]
wrangled_new_units = pymagicc.io.MAGICCData(str(written_files[0]))
wrangled_new_units.timeseries()
[11]:
time                                                                                         1957-01-15 12:00:00  1957-02-14 00:00:00  1957-03-15 12:00:00
climate_model model       region                         scenario    todo unit    variable
unspecified   unspecified World                          unspecified SET  t / m^2 cSoilFast             0.000086             0.000086             0.000085
                          World|Land                     unspecified SET  t / m^2 cSoilFast             0.000086             0.000086             0.000085
                          World|Southern Hemisphere      unspecified SET  t / m^2 cSoilFast             0.000060             0.000060             0.000059
                          World|Northern Hemisphere|Land unspecified SET  t / m^2 cSoilFast             0.000098             0.000098             0.000098
                          World|Northern Hemisphere      unspecified SET  t / m^2 cSoilFast             0.000098             0.000098             0.000098
                          World|Southern Hemisphere|Land unspecified SET  t / m^2 cSoilFast             0.000060             0.000060             0.000059
[12]:
# NBVAL_IGNORE_OUTPUT
wrangled_new_units.lineplot(hue="region")
[12]:
<AxesSubplot:xlabel='time', ylabel='t / m^2'>
../_images/usage_wranglers_18_1.png
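
As a rough cross-check of the conversion, 1 t equals 1000 kg, so going from kg m^-2 to t / m^2 should scale the values by a factor of 1e-3. The sketch below uses only the "World" values displayed in the two tables above. The tolerance is an assumption that accounts for the displayed values being rounded to six decimal places.

```python
import math

KG_PER_TONNE = 1000.0

# "World" values copied from the kg m^-2 table
world_kg_per_m2 = [0.085600, 0.085547, 0.085422]
# "World" values copied from the t / m^2 table (as displayed, i.e. rounded)
world_t_per_m2 = [0.000086, 0.000086, 0.000085]

# Convert by hand: kg m^-2 -> t m^-2
converted = [value / KG_PER_TONNE for value in world_kg_per_m2]

# Compare with the displayed values, allowing for rounding to 6 decimal places
consistent = all(
    math.isclose(expected, shown, abs_tol=5e-7)
    for expected, shown in zip(converted, world_t_per_m2)
)
print(consistent)
```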

Taking area sum

We can also set the units to include an area sum. For example, if we set our units to Gt / yr rather than Gt / m**2 / yr, then the wrangler will automatically take an area sum of the data (weighted by the effective area used in the crunching) before returning the data. Below, we request Gt for cSoilFast (originally in kg m^-2), which triggers such an area sum.
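
Conceptually, an area sum multiplies an area-mean value by the area it applies to and then converts the result to the target unit. The sketch below illustrates this arithmetic only. The function `area_sum_gt` and the input numbers are hypothetical, and netcdf-scm performs the equivalent step itself using the areas stored during crunching.

```python
KG_PER_GT = 1e12  # 1 Gt = 1e9 t = 1e12 kg


def area_sum_gt(density_kg_per_m2, area_m2):
    """Convert an area-mean density (kg m^-2) into a total mass (Gt)."""
    return density_kg_per_m2 * area_m2 / KG_PER_GT


# Hypothetical values, for illustration only
density = 0.1  # kg m^-2
area = 1.5e14  # m^2

total = area_sum_gt(density, area)
print(f"{total:.1f} Gt")
```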

[13]:
conv_csv = pd.DataFrame(
    [["cSoilFast", "Gt"], ["tos", "K"]], columns=["variable", "unit"]
)
conv_csv_path = "../../../output-examples/conversion-area-sum-units.csv"
conv_csv.to_csv(conv_csv_path, index=False)
with open(conv_csv_path) as f:
    conv_csv_content = f.read()

print(conv_csv_content)
variable,unit
cSoilFast,Gt
tos,K

[14]:
# NBVAL_IGNORE_OUTPUT
!netcdf-scm wrangle \
    "../../../tests/test-data/expected-crunching-output/cmip6output/Lmon/CMIP6/CMIP/NCAR" \
    "../../../output-examples/wrangled-files-area-sum-units" \
    "notebook example <email address>" \
    --force \
    --drs "CMIP6Output" \
    --out-format "mag-files" \
    --regexp ".*cSoilFast.*" \
    --target-units-specs "../../../output-examples/conversion-area-sum-units.csv"
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:netcdf-scm: 2.0.0rc5+3.gc7d2d42.dirty
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:wrangle_contact: notebook example <email address>
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:source: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/tests/test-data/expected-crunching-output/cmip6output/Lmon/CMIP6/CMIP/NCAR
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:destination: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/output-examples/wrangled-files-area-sum-units
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:regexp: .*cSoilFast.*
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:prefix: None
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:drs: CMIP6Output
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:out_format: mag-files
69328 2020-10-03 23:35:23,549 INFO:netcdf_scm:force: True
69328 2020-10-03 23:35:23,551 INFO:netcdf_scm:Finding directories with files
Walking through directories and applying `check_func`: 11it [00:00, 10889.15it/s]
69328 2020-10-03 23:35:23,560 INFO:netcdf_scm:Found 1 directories with files
69328 2020-10-03 23:35:23,560 INFO:netcdf_scm.cli_parallel:Processing in parallel with 4 workers
69328 2020-10-03 23:35:23,560 INFO:netcdf_scm.cli_parallel:Forcing dask to use a single thread when reading
100%|████████████████████████████████████████| 1.00/1.00 [00:05<00:00, 5.42s/it]
[15]:
# NBVAL_IGNORE_OUTPUT
written_files = [
    f
    for f in Path(
        "../../../output-examples/wrangled-files-area-sum-units"
    ).rglob("*.MAG")
]
wrangled_area_sum_units = pymagicc.io.MAGICCData(str(written_files[0]))
wrangled_area_sum_units.timeseries()
[15]:
time                                                                                      1957-01-15 12:00:00  1957-02-14 00:00:00  1957-03-15 12:00:00
climate_model model       region                         scenario    todo unit variable
unspecified   unspecified World                          unspecified SET  Gt   cSoilFast             12.79290              12.7849             12.76610
                          World|Land                     unspecified SET  Gt   cSoilFast             12.79290              12.7849             12.76610
                          World|Northern Hemisphere      unspecified SET  Gt   cSoilFast              9.85760               9.8760              9.89873
                          World|Northern Hemisphere|Land unspecified SET  Gt   cSoilFast              9.85760               9.8760              9.89873
                          World|Southern Hemisphere      unspecified SET  Gt   cSoilFast              2.93526               2.9089              2.86740
                          World|Southern Hemisphere|Land unspecified SET  Gt   cSoilFast              2.93526               2.9089              2.86740
[16]:
# NBVAL_IGNORE_OUTPUT
solid_regions = [
    "World",
    "World|Northern Hemisphere",
    "World|Southern Hemisphere",
]
ax = wrangled_area_sum_units.filter(region=solid_regions).lineplot(
    hue="region", linestyle="-"
)
wrangled_area_sum_units.filter(region=solid_regions, keep=False).lineplot(
    hue="region", linestyle="--", dashes=(5, 7.5), ax=ax
)
[16]:
<AxesSubplot:xlabel='time', ylabel='Gt'>
../_images/usage_wranglers_23_1.png

As one last sanity check, we can make sure that the world total equals the sum of the Northern and Southern Hemisphere totals to within rounding errors.

[17]:
np.testing.assert_allclose(
    wrangled_area_sum_units.filter(region="World")
    .timeseries()
    .values.squeeze(),
    wrangled_area_sum_units.filter(
        region=["World|Northern Hemisphere", "World|Southern Hemisphere"]
    )
    .timeseries()
    .sum()
    .values.squeeze(),
    rtol=1e-5,
)
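
A further rough check, using only the "World" values displayed in the tables above: if the Gt values are an area sum of the kg m^-2 values over a fixed effective area, then dividing one by the other should give the same area at every time step. This is a back-of-the-envelope sketch based on rounded, displayed numbers, so only approximate agreement should be expected.

```python
import math

KG_PER_GT = 1e12  # 1 Gt = 1e12 kg

# "World" values as displayed in the tables above
world_kg_per_m2 = [0.085600, 0.085547, 0.085422]  # area mean
world_gt = [12.79290, 12.7849, 12.76610]  # area sum

# Effective area implied by each time step, in m^2
implied_areas = [
    total * KG_PER_GT / mean for total, mean in zip(world_gt, world_kg_per_m2)
]

for area in implied_areas:
    print(f"{area:.4e} m^2")

# The implied areas should agree with each other up to rounding of the inputs
areas_agree = all(
    math.isclose(area, implied_areas[0], rel_tol=1e-4) for area in implied_areas
)
print(areas_agree)
```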

Time operations

Wrangling can also include a few basic time operations, e.g. taking annual means or interpolating onto different grids. The different out-format codes follow those in Pymagicc (link to be updated once PR is merged). Here we show one example where we take the annual mean as part of the wrangling process.
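
To make the "average-year-mid-year" idea concrete, the sketch below averages monthly values within each year and stamps the result at 1 July of that year. The function `annual_mean_mid_year` and the input data are hypothetical, and the simple unweighted mean is an assumption; the wrangler's own averaging may differ in detail (for example in how months are weighted).

```python
from collections import defaultdict
from datetime import datetime


def annual_mean_mid_year(monthly):
    """Average (datetime, value) pairs per year, stamped at 1 July."""
    by_year = defaultdict(list)
    for time, value in monthly:
        by_year[time.year].append(value)

    # Unweighted mean of each year's values (assumption for this sketch)
    return {
        datetime(year, 7, 1): sum(values) / len(values)
        for year, values in sorted(by_year.items())
    }


# Hypothetical monthly data for a single year, for illustration only
monthly = [(datetime(2000, month, 15), 280.0 + month) for month in range(1, 13)]

annual = annual_mean_mid_year(monthly)
for time, value in annual.items():
    print(time, value)
```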

[18]:
# NBVAL_IGNORE_OUTPUT
!netcdf-scm wrangle \
    "../../../tests/test-data/expected-crunching-output/cmip6output/Amon/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/piControl" \
    "../../../output-examples/wrangled-files-average-year" \
    "notebook example <email address>" \
    --force \
    --drs "CMIP6Output" \
    --out-format "mag-files-average-year-mid-year"
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:netcdf-scm: 2.0.0rc5+3.gc7d2d42.dirty
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:wrangle_contact: notebook example <email address>
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:source: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/tests/test-data/expected-crunching-output/cmip6output/Amon/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/piControl
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:destination: /Users/znicholls/Documents/AGCEC/netCDF-SCM/netcdf-scm/output-examples/wrangled-files-average-year
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:regexp: ^(?!.*(fx)).*$
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:prefix: None
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:drs: CMIP6Output
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:out_format: mag-files-average-year-mid-year
69348 2020-10-03 23:35:33,338 INFO:netcdf_scm:force: True
69348 2020-10-03 23:35:33,339 INFO:netcdf_scm:Finding directories with files
Walking through directories and applying `check_func`: 6it [00:00, 7212.90it/s]
69348 2020-10-03 23:35:33,346 INFO:netcdf_scm:Found 1 directories with files
69348 2020-10-03 23:35:33,346 INFO:netcdf_scm.cli_parallel:Processing in parallel with 4 workers
69348 2020-10-03 23:35:33,346 INFO:netcdf_scm.cli_parallel:Forcing dask to use a single thread when reading
100%|████████████████████████████████████████| 1.00/1.00 [00:09<00:00, 9.98s/it]
[19]:
# NBVAL_IGNORE_OUTPUT
written_files = [
    f
    for f in Path(
        "../../../output-examples/wrangled-files-average-year"
    ).rglob("*.MAG")
]
wrangled_annual_mean = pymagicc.io.MAGICCData(str(written_files[0]))
wrangled_annual_mean.timeseries()
[19]:
time 2840-07-01 00:00:00 2841-07-01 00:00:00 2842-07-01 00:00:00 2843-07-01 00:00:00 2844-07-01 00:00:00 2845-07-01 00:00:00 2846-07-01 00:00:00 2847-07-01 00:00:00 2848-07-01 00:00:00 2849-07-01 00:00:00 2850-07-01 00:00:00 2851-07-01 00:00:00 2852-07-01 00:00:00 2853-07-01 00:00:00 2854-07-01 00:00:00 2855-07-01 00:00:00 2856-07-01 00:00:00 2857-07-01 00:00:00 2858-07-01 00:00:00 2859-07-01 00:00:00
climate_model model region scenario todo unit variable
unspecified unspecified World|Northern Hemisphere|Land unspecified SET K tas 280.975 280.939 280.976 281.026 281.236 281.023 280.592 280.870 280.705 281.066 281.098 280.853 281.057 280.965 281.010 281.415 281.226 281.031 280.868 281.210
World|Ocean unspecified SET K tas 288.500 288.444 288.414 288.467 288.608 288.484 288.306 288.338 288.335 288.491 288.483 288.636 288.773 288.625 288.714 288.694 288.680 288.360 288.394 288.547
World|Southern Hemisphere|Ocean unspecified SET K tas 287.289 287.278 287.332 287.291 287.417 287.318 287.169 287.177 287.203 287.373 287.328 287.434 287.644 287.428 287.517 287.469 287.432 287.189 287.274 287.340
World|Southern Hemisphere unspecified SET K tas 285.184 285.185 285.270 285.167 285.320 285.206 284.970 285.033 285.090 285.292 285.150 285.318 285.596 285.275 285.479 285.338 285.331 285.057 285.135 285.240
World|Northern Hemisphere|Ocean unspecified SET K tas 290.066 289.951 289.813 289.987 290.148 289.991 289.774 289.840 289.798 289.936 289.975 290.191 290.232 290.172 290.260 290.277 290.292 289.872 289.841 290.107
World unspecified SET K tas 285.883 285.841 285.847 285.860 286.026 285.880 285.612 285.717 285.700 285.913 285.862 285.964 286.154 285.959 286.096 286.110 286.074 285.771 285.768 285.969
World|El Nino N3.4 unspecified SET K tas 297.656 296.947 296.951 297.666 297.818 297.051 296.019 296.918 296.887 297.067 297.055 298.488 297.646 297.392 298.025 297.780 296.800 295.766 297.053 298.179
World|Northern Hemisphere unspecified SET K tas 286.566 286.482 286.411 286.538 286.717 286.539 286.240 286.387 286.297 286.521 286.558 286.596 286.700 286.628 286.699 286.866 286.802 286.469 286.387 286.682
World|Southern Hemisphere|Land unspecified SET K tas 276.052 276.101 276.320 275.950 276.218 276.038 275.422 275.728 275.919 276.259 275.697 276.138 276.706 275.931 276.631 276.089 276.209 275.804 275.852 276.126
World|North Atlantic Ocean unspecified SET K tas 291.011 291.097 290.916 290.877 290.998 290.970 290.771 290.576 290.815 290.834 290.735 290.661 291.040 291.011 291.188 291.100 291.354 291.034 290.831 291.135
World|Land unspecified SET K tas 279.388 279.379 279.475 279.390 279.618 279.416 278.925 279.212 279.162 279.516 279.357 279.333 279.654 279.342 279.598 279.698 279.608 279.346 279.251 279.571
[20]:
# NBVAL_IGNORE_OUTPUT
wrangled_annual_mean.lineplot(hue="region")
[20]:
<AxesSubplot:xlabel='time', ylabel='K'>
../_images/usage_wranglers_29_1.png
[21]:
# NBVAL_IGNORE_OUTPUT
fig = plt.figure(figsize=(16, 9))
ax = fig.add_subplot(221)
wrangled_annual_mean.filter(region=["World", "World|*Hemisphere"]).lineplot(
    hue="region", ax=ax
)

ax = fig.add_subplot(222, sharey=ax, sharex=ax)
wrangled_annual_mean.filter(
    region=["World", "World|Land", "World|Ocean"]
).lineplot(hue="region", ax=ax)

ax = fig.add_subplot(223, sharey=ax, sharex=ax)
wrangled_annual_mean.filter(region=["World", "World|*Hemis*|*"]).lineplot(
    hue="region", ax=ax
)

ax = fig.add_subplot(224, sharey=ax, sharex=ax)
wrangled_annual_mean.filter(
    region=["World", "World|*El*", "World|*Ocean*"]
).lineplot(hue="region", ax=ax)
[21]:
<AxesSubplot:xlabel='time', ylabel='K'>
../_images/usage_wranglers_30_1.png
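
The `region` filters above use `*` as a wildcard. As a rough illustration of which regions each pattern is expected to pick out, the sketch below uses `fnmatch.fnmatchcase` from the standard library as a stand-in. This is an assumption: the matching rules of `MAGICCData.filter` are not guaranteed to be identical to `fnmatch`, so treat the result as indicative only.

```python
from fnmatch import fnmatchcase

# Region names as they appear in the annual mean table above
regions = [
    "World",
    "World|Land",
    "World|Ocean",
    "World|Northern Hemisphere",
    "World|Northern Hemisphere|Land",
    "World|Northern Hemisphere|Ocean",
    "World|Southern Hemisphere",
    "World|Southern Hemisphere|Land",
    "World|Southern Hemisphere|Ocean",
    "World|El Nino N3.4",
    "World|North Atlantic Ocean",
]


def select(patterns, names):
    """Keep the names that match at least one shell-style pattern."""
    return [
        name
        for name in names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


hemispheres = select(["World", "World|*Hemisphere"], regions)
hemisphere_subregions = select(["World|*Hemis*|*"], regions)

print(hemispheres)
print(hemisphere_subregions)
```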