GENERATING NETCDF OUTPUTS

Objective:


Computational architecture nowadays is 64-bit based. When dealing with numbers, this implies that computers have the capacity to store (in memory or disk) up to \(2^{64}\) different values. For practical purposes, this implies that one has up to 64 spaces to store any possible combination of 1’s and 0’s (that ultimately represents the actual value we want). Rainfall is the variable we’re interested here, so one can easily store 100 mm (or mm/h) in that 64-space “vector”. Compared to \(2^{64}\), \(100\) is a pretty small number, having the following representation in a binary system: \(1100100\). If you noticed, it only took 7 “spaces” to represent 100 in binary. Hence, wouldn’t be smarter to use 7 instead of 64 “spaces” to store the number 100?. For historical reasons, 8-bits (i.e., 1-byte) is the minimum “space”-configuration of modern computers. Thus, using 8 instead of 64 spaces means saving actual “space” up to 4 times!.

This notebook shows you how to take advantage of computer architecture concepts so you can actually save space in your PC, while storing/exporting NetCDF files in the most efficient/optimal way.


The masking of the data and the generation of NetCDF files is done by STORM through the functions: ROUNDX, NCBYTES_RAIN, NC_FILE_I, and NC_FILE_II of the rainfall.py module.


STORM PARAMETERS

[36]:
# OGC-WKT for HAD [taken from https://epsg.io/42106]
WKT_OGC = (
    'PROJCS["WGS84_/_Lambert_Azim_Mozambique",'
    'GEOGCS["unknown",'
    'DATUM["unknown",'
    'SPHEROID["Normal Sphere (r=6370997)",6370997,0]],'
    'PRIMEM["Greenwich",0,'
    'AUTHORITY["EPSG","8901"]],'
    'UNIT["degree",0.0174532925199433,'
    'AUTHORITY["EPSG","9122"]]],'
    'PROJECTION["Lambert_Azimuthal_Equal_Area"],'
    'PARAMETER["latitude_of_center",5],'
    'PARAMETER["longitude_of_center",20],'
    'PARAMETER["false_easting",0],'
    'PARAMETER["false_northing",0],'
    'UNIT["metre",1,'
    'AUTHORITY["EPSG","9001"]],'
    'AXIS["Easting",EAST],'
    'AXIS["Northing",NORTH],'
    'AUTHORITY["EPSG","42106"]]'
)

T_RES = 30  # in minutes! -> TEMPORAL.RES of TIME.SLICES
TIME_DICT_ = dict(seconds=1, minutes=1 / 60, hours=1 / 60**2, days=1 / (60**2 * 24))
# UNITS (since DATE_ORIGIN) for NC.TIME dim
TIME_OUTNC = "minutes"

# to store dates as INT
DATE_ORIGIN = "1970-01-01"
# Local Time Zone (see links below for more names)
TIME_ZONE = "Africa/Addis_Ababa"

# RAINFMT = 'f4'
# 'u' for UNSIGNED.INT  ||  'i' for SIGNED.INT  ||  'f' for FLOAT
# number of Bytes (1, 2, 4 or 8) to store the RAINFALL variable (into)
RAINFMT = "u2"
TIMEINT = "u4"  # format for integers in TIME dimension
TIMEFIL = +(2 ** (int(TIMEINT[-1]) * 8)) - 1

PRECISION = 0.025
iMIN = 8.0

ANOTHER SET OF PARAMETERS

[37]:
INTEGER = int(RAINFMT[-1])

# if one wants un-signed integers (e.g.) -> 1 (1 because you need 0 for NaN-filling)
MINIMUM = 1
# 65535  (largest unsigned integer of 16 Bits -> 0 also counts)
MAXIMUM = +(2 ** (INTEGER * 8)) - 1
# # 4294967296  (largest unsigned integer of 32 Bits)
# MAXIMUM = +(2 ** (32)) - 1

print(MINIMUM)
print(MAXIMUM)
1
65535
[38]:
# TO SCALE 'DOWNW' FLOATS TO INTEGERS

# NORMALIZING THE RAINFALL SO IT CAN BE STORED AS 16-BIT INTEGER (65,536 -> unsigned)
# https://stackoverflow.com/a/59193141/5885810      (scaling 'integers')
# https://stats.stackexchange.com/a/70808/354951    (normalize data 0-1)

# if you want a larger precision (or your variable is in the 'low' scale,
# ...say Summer Temperatures in Celsius) you must/could lower this limit.
iMAX = PRECISION * (MAXIMUM - MINIMUM - 1) + iMIN
print(f"iMAX = {iMAX}")

# SCALING FACTOR:
SCL = (iMAX - iMIN) / (MAXIMUM - MINIMUM - 1)  # -1 because 0 doesn't count

# ADDITIVE FACTOR:
# ADD = iMAX - SCL * MAXIMUM
ADD = iMIN - SCL * MINIMUM

print(f"SCLf = {SCL}")
print(f"ADDf = {ADD}")
iMAX = 1646.325
SCLf = 0.025
ADDf = 7.975
[39]:
# loading libraries

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

import netCDF4 as nc4
import numpy as np
import rioxarray as rio
import xarray as xr
from dateutil.tz import tzlocal
[40]:
# help functions


def ROUNDX(x, prec=3, base=PRECISION):
    # FUNCTION to CUSTOM.ROUNDING TO SPECIFIC.BASE
    # https://stackoverflow.com/a/18666678/5885810
    return (base * (np.array(x) / base).round()).round(prec)


def BASE_ROUND(stamps, base=T_RES):
    # FUNCTION to ROUND TIME.STAMPS to the 'T_RES' FLOOR! (or NEAREST 'T_RES')
    # https://stackoverflow.com/a/2272174/5885810
    # "*60" because STAMPS come in seconds; and T_RES is in minutes
    base = base * TIME_DICT_[TIME_OUTNC] * 60
    iround = base * (np.ceil(stamps / base) - 1)  # .astype( TIMEINT )
    return iround

We won’t compute anything here. Somehow, you’ll have arrived to the stage of having some data in memory, which you’re about to export as NetCDF file. For this purpose, we’ll load in memory the for_wrong-1.nc dataset. (exported in notebook for_)

[41]:
# what's the data?
nc_file = "for_wrong-1.nc"
# read the data
data = nc4.Dataset(nc_file)

# how does it look like?... in the inside!
data.variables
[41]:
{'x': <class 'netCDF4._netCDF4.Variable'>
 float64 x(x)
     _FillValue: nan
     axis: X
     long_name: x coordinate of projection
     standard_name: projection_x_coordinate
     units: metre
 unlimited dimensions:
 current shape = (408,)
 filling on,
 'y': <class 'netCDF4._netCDF4.Variable'>
 float64 y(y)
     _FillValue: nan
     axis: Y
     long_name: y coordinate of projection
     standard_name: projection_y_coordinate
     units: metre
 unlimited dimensions:
 current shape = (470,)
 filling on,
 'xomethin': <class 'netCDF4._netCDF4.Variable'>
 int32 xomethin()
     crs_wkt: PROJCS["WGS84_/_Lambert_Azim_Mozambique",GEOGCS["unknown",DATUM["unknown",SPHEROID["Normal Sphere (r=6370997)",6370997,0]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433]],PROJECTION["Lambert_Azimuthal_Equal_Area"],PARAMETER["latitude_of_center",5],PARAMETER["longitude_of_center",20],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","42106"]]
     spatial_ref: PROJCS["WGS84_/_Lambert_Azim_Mozambique",GEOGCS["unknown",DATUM["unknown",SPHEROID["Normal Sphere (r=6370997)",6370997,0]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433]],PROJECTION["Lambert_Azimuthal_Equal_Area"],PARAMETER["latitude_of_center",5],PARAMETER["longitude_of_center",20],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","42106"]]
     GeoTransform: 1340000.0 5000.0 0.0 1170000.0 0.0 -5000.0
 unlimited dimensions:
 current shape = ()
 filling on, default _FillValue of -2147483647 used,
 'rain': <class 'netCDF4._netCDF4.Variable'>
 float32 rain(y, x)
     _FillValue: nan
     lat#axis: Y
     lat#bounds: lat_bnds
     lat#DimensionNames: lat
     lat#LongName: Latitude at the center of
                        0.05 degree grid intervals of latitude
                        from -90 to 90.
     lat#origname: lat
     lat#standard_name: latitude
     lat#units: degrees_north
     lat#_FillValue: nan
     lon#axis: X
     lon#bounds: lon_bnds
     lon#DimensionNames: lon
     lon#LongName: Longitude at the center of
                        0.05 degree grid intervals of longitude
                        from -180 to 180.
     lon#origname: lon
     lon#standard_name: longitude
     lon#units: degrees_east
     lon#_FillValue: nan
     scale_factor: 1.0
     add_offset: 0.0
     grid_mapping: xomethin
 unlimited dimensions:
 current shape = (470, 408)
 filling on,
 'mask': <class 'netCDF4._netCDF4.Variable'>
 uint8 mask(y, x)
     _FillValue: 0
     lat#axis: Y
     lat#bounds: lat_bnds
     lat#DimensionNames: lat
     lat#LongName: Latitude at the center of
                        0.05 degree grid intervals of latitude
                        from -90 to 90.
     lat#origname: lat
     lat#standard_name: latitude
     lat#units: degrees_north
     lat#_FillValue: nan
     lon#axis: X
     lon#bounds: lon_bnds
     lon#DimensionNames: lon
     lon#LongName: Longitude at the center of
                        0.05 degree grid intervals of longitude
                        from -180 to 180.
     lon#origname: lon
     lon#standard_name: longitude
     lon#units: degrees_east
     lon#_FillValue: nan
     scale_factor: 1.0
     add_offset: 0.0
     grid_mapping: xomethin
 unlimited dimensions:
 current shape = (470, 408)
 filling on}

rain is the variable we’re interested in (and it’s the one stored in the nc.file).

[42]:
# what does the inside of RAIN look like?
data["rain"][:]
[42]:
masked_array(
  data=[[  8.85314465,  12.18742561,  12.00609589, ...,  19.86124039,
          21.57285881,  22.31961441],
        [ 11.41961765,  12.11657238,  11.99657345, ...,  21.30685425,
          22.31961441,  22.31961441],
        [ 11.08961964,  12.00933456,  11.87809658, ...,  21.30685425,
          22.31961441,  22.31961441],
        ...,
        [461.05093384, 461.05093384, 462.56283569, ..., 530.9977417 ,
         516.53491211, 516.53491211],
        [472.34197998, 472.34197998, 470.32394409, ..., 530.9977417 ,
         516.53491211, 516.53491211],
        [459.64810181, 459.64810181, 462.9229126 , ..., 527.46783447,
         526.47857666, 526.47857666]],
  mask=False,
  fill_value=1e+20)

Print some statistics…

[43]:
print(data["rain"][:].min())
print(data["rain"][:].max())
8.853144645690918
1527.81982421875

Having previously established the limits, let’s work out if the precision (also previously established) is correct enough.

[44]:
# run your own (customized) tests

# initial seed for "temp"
temp = 3.14159
# seed larger than the maximum of RAIN data
seed = 4000
# test different PRECISIONs
epsilon = [0.05, 0.03, 0.025, 0.02, 0.01]

for epsn in epsilon:
    while temp > epsn:
        temp = (seed - iMIN) / (MAXIMUM - MINIMUM - 1)
        seed = seed - 1
    print(
        (
            f"starting from {iMIN}, you'd need a max. of ~{seed+1} to guarantee an epsilon of {epsn}"
        )
    )
starting from 8.0, you'd need a max. of ~3284 to guarantee an epsilon of 0.05
starting from 8.0, you'd need a max. of ~1973 to guarantee an epsilon of 0.03
starting from 8.0, you'd need a max. of ~1646 to guarantee an epsilon of 0.025
starting from 8.0, you'd need a max. of ~1318 to guarantee an epsilon of 0.02
starting from 8.0, you'd need a max. of ~663 to guarantee an epsilon of 0.01

So, starting a bit lower that the MINIMUM, and going about the PRECISION (of 0.025), we’ll surpass our MAXIMUM. This means that we can indeed represent all the numbers in rain using just 16-bit un-signed integers.

Let’s test if that’s at all true.

[45]:
# testing if i'm using my whole 16-INT "bit-space"
allv = PRECISION * (np.linspace(MINIMUM, MAXIMUM - 1, MAXIMUM - 1) - 1) + iMIN
print(allv.shape)

# last value
print(allv[-1])
print(allv[-1] == iMAX)

# is there any non-conscutive integer?
# ...if so, we're in trouble(s)
allt = (allv - ADD) / SCL
allt = allt.round(0).astype("u2")
print(np.where(np.diff(allt) != 1))

# go back from 16-INT to our "actual" values
vall = (allt * SCL) + ADD
vall = ROUNDX((allt * SCL) + ADD)
print(vall)
(65534,)
1646.325
True
(array([], dtype=int64),)
[   8.       8.025    8.05  ... 1646.275 1646.3   1646.325]
How does the rain-mask look like?.
Note that having a mask of 1’s and 0’s means we can easily save it as 8-bit unsigned INT.
[46]:
data["mask"][:].data
[46]:
array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       ...,
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.]])

First, we get some help from our void set.

[47]:
# create VOID data
void = np.empty((len(data.dimensions["y"]), len(data.dimensions["x"])))
void.fill(np.nan)

# create xarray
xoid = xr.DataArray(
    data=void,
    dims=["y", "x"],
    # name='void',
    coords=dict(
        y=(["y"], data["y"][:].data),
        x=(["x"], data["x"][:].data),
    ),
    attrs=dict(
        _FillValue=np.nan,
        units="mm",
    ),
)
xoid.rio.write_crs(rio.crs.CRS(WKT_OGC), grid_mapping_name="spatial_ref", inplace=True)
[47]:
<xarray.DataArray (y: 470, x: 408)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * y            (y) float64 1.168e+06 1.162e+06 ... -1.172e+06 -1.178e+06
  * x            (x) float64 1.342e+06 1.348e+06 ... 3.372e+06 3.378e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:  nan
    units:       mm

We’ll interested mainly in the following meta-data:

[48]:
# the CRS (which we already have)
print(xoid.rio.crs)
PROJCS["WGS84_/_Lambert_Azim_Mozambique",GEOGCS["unknown",DATUM["unknown",SPHEROID["Normal Sphere (r=6370997)",6370997,0]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433]],PROJECTION["Lambert_Azimuthal_Equal_Area"],PARAMETER["latitude_of_center",5],PARAMETER["longitude_of_center",20],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","42106"]]
[49]:
# ...and the transform
print(
    " ".join(
        map(
            str,
            np.roll(np.asarray(xoid.rio.transform()).reshape(3, 3), shift=1, axis=1)[
                :-1
            ]
            .ravel()
            .tolist(),
        )
    )
)
1340000.0 5000.0 0.0 1170000.0 0.0 -5000.0
In dealing with NetCDFs, most of the time we’ll also have to deal with dates.
The thing is… one can also store the date-dimension (and any other dimension –if possible–) as integer.
[50]:
# today's date
date_now = datetime.now(tzlocal())
print(date_now)

# append or modify its time.zone
date_now = date_now.replace(tzinfo=ZoneInfo(TIME_ZONE))
print(date_now)
2023-10-15 16:53:15.432848+01:00
2023-10-15 16:53:15.432848+03:00

To compute integers from dates, we need an origin time from which we can count seconds, hours, days… etc.

[51]:
# date of origin
DATE_ORIGEN = datetime.strptime(DATE_ORIGIN, "%Y-%m-%d").replace(
    tzinfo=ZoneInfo(TIME_ZONE)
)
print(DATE_ORIGEN)

# seconds since that date (to now... which is our date in analysis)
date_sec = (date_now - DATE_ORIGEN).total_seconds()
print(date_sec)
1970-01-01 00:00:00+03:00
1697388795.432848

STORM can interpret/transform seconds to the desired unit. (minutes in this case)

[52]:
# seconds to minutes
date_min = date_sec * TIME_DICT_[TIME_OUTNC]
print(date_min)
28289813.257214133

Rounding minutes…

[53]:
o_date_min = BASE_ROUND(date_min)
print(o_date_min)
28289790.0

Finally, we have all the ingredients to create the NetCDF.

[54]:
# the name of the output-NetCDF
out_name = "fiv_ok.nc"

# CREATE NC.FILE
nc = nc4.Dataset(out_name, "w", format="NETCDF4")
nc.created_on = datetime.now(tzlocal()).strftime("%Y-%m-%d %H:%M:%S %Z")  # %z

# define SUB.GROUP
sub_grp = nc.createGroup("group_one")
# define its dimensions
sub_grp.createDimension("y", len(data.dimensions["y"]))
sub_grp.createDimension("x", len(data.dimensions["x"]))
# time could be created here... but it's not
# sub_grp.createDimension('t', None)  # unlimited

# name for the CRS
sref_name = "spatial_ref"

# create variables
grid = sub_grp.createVariable(sref_name, "u1")

# define some attributes -> PART 1
grid.long_name = sref_name
grid.crs_wkt = xoid.spatial_ref.attrs["crs_wkt"]
grid.spatial_ref = xoid.spatial_ref.attrs["crs_wkt"]
# https://www.simplilearn.com/tutorials/python-tutorial/list-to-string-in-python
# https://www.geeksforgeeks.org/how-to-delete-last-n-rows-from-numpy-array/
# grid.GeoTransform = ' '.join( map(str, list(xoid.rio.transform()) ) )
# # this is apparently the "correct" way to store the GEOTRANSFORM!
grid.GeoTransform = " ".join(
    map(
        str,
        np.roll(np.asarray(xoid.rio.transform()).reshape(3, 3), shift=1, axis=1)[:-1]
        .ravel()
        .tolist(),
    )
)  # [:-1] removes last row

# define some attributes -> PART 2
# ~~~ from CFCONVENTIONS.ORG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [start]
grid.grid_mapping_name = "lambert_azimuthal_equal_area"
grid.latitude_of_projection_origin = 5
grid.longitude_of_projection_origin = 20
grid.false_easting = 0
grid.false_northing = 0
# grid.horizontal_datum_name = 'WGS84'  # in pple this can also be un-commented!
grid.reference_ellipsoid_name = "sphere"
# new in CF-1.7 [https://cfconventions.org/wkt-proj-4.html]
grid.projected_crs_name = "WGS84_/_Lambert_Azim_Mozambique"
# ~~~ from CFCONVENTIONS.ORG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [end]

# define some attributes -> PART 3
# ~~~ from https://publicwiki.deltares.nl/display/NETCDF/Coordinates [start]
grid._CoordinateTransformType = "Projection"
grid._CoordinateAxisTypes = "GeoY GeoX"
# ~~~ from https://publicwiki.deltares.nl/display/NETCDF/Coordinates ~ [end]

# STORING LOCAL COORDINATES (which are seen also as variables)
yy = sub_grp.createVariable("projection_y_coordinate", "i4", dimensions=("y"))
xx = sub_grp.createVariable("projection_x_coordinate", "i4", dimensions=("x"))
yy[:] = data["y"][:].data
xx[:] = data["x"][:].data
yy.coordinates = "projection_y_coordinate"
xx.coordinates = "projection_x_coordinate"
yy.units = "meter"
xx.units = "meter"
yy.long_name = "y coordinate of projection"
xx.long_name = "x coordinate of projection"
yy._CoordinateAxisType = "GeoY"
xx._CoordinateAxisType = "GeoX"
yy.grid_mapping = sref_name
xx.grid_mapping = sref_name

# store these attributes (for later use)
tag_y = yy.getncattr("coordinates")
tag_x = xx.getncattr("coordinates")

# STORING THE MASK (as 8-bit INT)
ncmask = sub_grp.createVariable(
    "mask", "i1", dimensions=("y", "x"), zlib=True, complevel=9
)
ncmask[:] = data["mask"][:].data.astype("i1")
ncmask.grid_mapping = sref_name
ncmask.long_name = "catchment mask"
ncmask.description = "1 means catchment or region : 0 is void"
ncmask.coordinates = f"{tag_y} {tag_x}"


# # if ANOTHER/OTHER variable is needed (e.g.)
# maskxx = sub_grp.createVariable(
#     "k_means",
#     datatype="i1",
#     dimensions=("y", "x"),
#     dimensions=("y", "x"),
#     zlib=True,
#     complevel=9,
#     fill_value=np.array(-1).astype("i1"),
# )
# maskxx.grid_mapping = sref_name
# maskxx.long_name = "k-means clusters"
# maskxx.description = "-1 indicates region out of any cluster"
# maskxx.coordinates = f"{tag_y} {tag_x}"


# define the TIME.variable/dimension
nctnam = "time"
sub_grp.createDimension(nctnam, None)
timexx = sub_grp.createVariable(nctnam, TIMEINT, (nctnam), fill_value=TIMEFIL)
timexx[:] = o_date_min.astype(TIMEINT)
timexx.long_name = "starting time"
# '%Y-%m-%d %H:%M:%S %Z%z'
timexx.units = f"{TIME_OUTNC} since {DATE_ORIGEN.strftime('%Y-%m-%d %H:%M:%S')}"
timexx.calendar = "proleptic_gregorian"  # 'gregorian'#
timexx._CoordinateAxisType = "Time"
timexx.coordinates = nctnam

# SWITCH IF YOU ONE STILL TO PRESERVE YOUR var AS FLOAT
if RAINFMT[0] == "f":
    # -DOING.FLOATS------------------------------------------------------------------
    ncvarx = sub_grp.createVariable(
        "rain",
        datatype=f"{RAINFMT}",
        dimensions=(nctnam, "y", "x"),
        zlib=True,
        complevel=9,
        least_significant_digit=3,
        fill_value=np.nan,
    )
    ncvarx[:] = (data["rain"][:].data).astype(f"{RAINFMT}")
else:
    # -DOING.INTEGERS--------------------------------------------------------------
    ncvarx = sub_grp.createVariable(
        "rain",
        datatype=f"{RAINFMT}",
        dimensions=(nctnam, "y", "x"),
        zlib=True,
        complevel=9,
        # least_significant_digit=3),
        fill_value=np.array(0).astype(f"{RAINFMT}"),
    )
    ncvarx[0, :] = (((data["rain"][:].data - ADD) / SCL).round(0)).astype(f"{RAINFMT}")
    # ncvarx[0,:] -> ONLY FOR THIS EXERCISE

# attributes for the RAIN variable (so later can be properly read)
ncvarx.precision = PRECISION
ncvarx.scale_factor = SCL
ncvarx.add_offset = ADD
# ncvarx._FillValue = np.array(MINIMUM).astype(f"{RAINFMT}")
# ncvarx._FillValue = np.array(0).astype(f"{RAINFMT}")
ncvarx.units = "mm"
ncvarx.long_name = "rainfall"
ncvarx.grid_mapping = sref_name
ncvarx.coordinates = f"{tag_y} {tag_x}"

nc.close()

Time to check if this technique is at all useful??