AIModels.AIutil module¶
Utility Routines for ML Prediction Models¶
This module contains a set of routines for preparing data for insertion into the transformer models used for prediction of multivariate time series data. The classes are contained in the companion file AIClasses.py and are imported in this module.
Utilities¶
- AIModels.AIutil.copy_dict(data, strip_values=False, remove_keys=[])[source]¶
Copy dictionary
- Parameters:
data (dict) -- Dictionary to be copied
strip_values (boolean) -- If True, strip values
remove_keys (list) -- List of keys to be removed
- Returns:
out -- Copied dictionary without the keys in remove_keys
- Return type:
dict
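A minimal usage sketch, assuming the behavior documented above; the configuration dictionary and its keys are hypothetical:

```python
from AIModels.AIutil import copy_dict

# Hypothetical configuration dict; 'scaler' holds a non-serializable object
cfg = {'var': 'SST', 'level': 'surface', 'scaler': object()}

# Copy without the 'scaler' entry, e.g. before writing the config to disk
clean = copy_dict(cfg, remove_keys=['scaler'])
print(sorted(clean))  # ['level', 'var']
```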
- AIModels.AIutil.get_arealat_arealon(area)[source]¶
Obtain lat lon extent for various choices of areas
- Parameters:
area (string) -- Area to be analyzed, possible values are
'TROPIC': Tropics (-35,35), [0,360]
'GLOBAL': Global (-60,60), [120,290]
'PACTROPIC': Pacific Tropics (-25,25), [180,290]
'WORLD': World (-70,70), [0,360]
'EUROPE': Europe (30,70), [-15,50]
'NORTH_AMERICA': North America (25,70), [200,310]
'NH-ML': Northern Hemisphere Mid-Latitudes (20,90), [0,360]
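A usage sketch, assuming the function returns the (arealat, arealon) pair listed above for each area name:

```python
from AIModels.AIutil import get_arealat_arealon

arealat, arealon = get_arealat_arealon('EUROPE')
# Expected, per the list above: arealat -> (30, 70), arealon -> (-15, 50)
```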
- AIModels.AIutil.select_field(INX, outfield, verbose=False)[source]¶
Select field for outfield in dict
- Parameters:
INX (dict) -- Dictionary with the fields to be analyzed
outfield (string) -- Field to be analyzed
verbose (boolean) -- If True, print the field information
- Returns:
out (numpy array) -- Field data
arealat (numpy array) -- Latitude limits of the field
arealon (numpy array) -- Longitude limits of the field
centlon (numpy array) -- Central longitude of the field
- AIModels.AIutil.select_field_key(INX, outfield, dataname)[source]¶
Select field for outfield in dict according to dataname
- Parameters:
INX (dict) -- Dictionary with the fields to be analyzed
outfield (string) -- Field to be analyzed
dataname (string) -- Key to be analyzed
- Returns:
out -- Field data
- Return type:
numpy array
- AIModels.AIutil.select_field_eof(INX, outfield)[source]¶
Select EOF-related fields for outfield in dict
- Parameters:
INX (dict) -- Dictionary with the fields to be analyzed
outfield (string) -- Field to be analyzed
- Returns:
U (numpy array) -- EOF modes
S (numpy array) -- Singular values
V (numpy array) -- EOF coefficients
- AIModels.AIutil.select_area(area, ZZ)[source]¶
Select area for dataset ZZ
- Parameters:
area (string) -- Area to be analyzed; possible values are listed below (only 'EUROPE' is currently implemented)
'TROPIC': Tropics
'GLOBAL': Global
'PACTROPIC': Pacific Tropics
'WORLD': World
'EUROPE': Europe
'NORTH_AMERICA': North America
'NH-ML': Northern Hemisphere Mid-Latitudes
ZZ (xarray dataset) -- Dataset from which the area is selected
- Returns:
Z -- Dataset restricted to the selected area
- Return type:
xarray dataset
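An illustrative sketch of the kind of lat-lon subsetting select_area performs, here for a 'EUROPE'-like window; the coordinate names 'lat' and 'lon' and the synthetic dataset are assumptions, not the library's internals:

```python
import numpy as np
import xarray as xr

ZZ = xr.Dataset(
    {'t2m': (('lat', 'lon'), np.random.rand(181, 360))},
    coords={'lat': np.arange(-90, 91), 'lon': np.arange(0, 360)},
)
# EUROPE latitudes per the list above; longitudes clipped to the [0, 360) grid
Z = ZZ.sel(lat=slice(30, 70), lon=slice(0, 50))
```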
- AIModels.AIutil.make_matrix(S, SMOOTH, normalization, shift, **kwargs)[source]¶
Make matrix for SVD analysis
- Parameters:
S (xarray dataset) -- Dataset with the field to be analyzed
SMOOTH (boolean) -- If True, smooth the data
normalization (string) -- Type of normalization to be applied
shift (string) -- Type of shift to be applied to select the data
dropnan (string) -- Option to drop NaN values from Xmat: True uses the indexed dropna in Xmat; False uses the standard dropna from xarray
detrend (boolean) -- If True, detrend the data
- Returns:
X -- Data matrix with EOF coefficients
- Return type:
xarray dataset
- AIModels.AIutil.make_eof(X, mr, eof_interval=None)[source]¶
Compute EOF analysis on data matrix
- Parameters:
X (X matrix) -- Data matrix in X format
mr (int) -- Number of modes to be retained
eof_interval (list) -- List with the starting and ending date for EOF analysis
- Returns:
mr (int) -- Number of modes actually retained, which may be fewer than requested if the request exceeds the matrix rank. Set the input mr to math.inf to keep all modes.
var_retained (float) -- Percentage of variance retained
udat (numpy array) -- Matrix of EOF modes
vdat (numpy array) -- Matrix of EOF coefficients
sdat (numpy array) -- Vector of singular values
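A minimal sketch of an EOF decomposition via SVD under the same naming conventions (udat: modes, sdat: singular values, vdat: coefficients); this illustrates the computation, not the library's exact implementation, and the (space, time) orientation of X is an assumption:

```python
import numpy as np

X = np.random.rand(500, 240)   # data matrix, here assumed (space, time)
udat, sdat, vdat = np.linalg.svd(X, full_matrices=False)
# udat: spatial EOF modes, sdat: singular values, vdat: time coefficients

mr = 20                        # modes to retain
var_retained = 100 * (sdat[:mr] ** 2).sum() / (sdat ** 2).sum()
```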
- AIModels.AIutil.make_field(*args, **kwargs)[source]¶
Make field for analysis
- Parameters:
area (string (positional argument)) -- Area to be analyzed, possible values are
'TROPIC': Tropics
'GLOBAL': Global
'PACTROPIC': Pacific Tropics
'WORLD': World
'EUROPE': Europe
'NORTH_AMERICA': North America
'NH-ML': Northern Hemisphere Mid-Latitudes
var (string (keyword argument)) -- Variable to be analyzed
level (string (keyword argument)) -- Level to be analyzed
period (string (keyword argument)) -- Period to be analyzed
version (string (keyword argument)) -- Version of the dataset
loc (string (keyword argument)) -- Location of the dataset
- Returns:
Z (xarray dataset) -- Field to be analyzed
arealat (numpy array) -- Latitude limits of the field
arealon (numpy array) -- Longitude limits of the field
- AIModels.AIutil.make_field_V5(area, **kwargs)[source]¶
Make field for analysis
- Parameters:
area (string) -- Area to be analyzed, possible values are
'TROPIC': Tropics
'GLOBAL': Global
'PACTROPIC': Pacific Tropics
'WORLD': World
'EUROPE': Europe
'NORTH_AMERICA': North America
'NH-ML': Northern Hemisphere Mid-Latitudes
var (string) -- Variable to be analyzed
level (string) -- Level to be analyzed
period (string) -- Period to be analyzed
version (string) -- Version of the dataset
loc (string) -- Location of the dataset
- Returns:
Z (xarray dataset) -- Field to be analyzed
arealat (numpy array) -- Latitude limits of the field
arealon (numpy array) -- Longitude limits of the field
- AIModels.AIutil.make_field_HAD(area, **kwargs)[source]¶
Make field for analysis for HADSST
- Parameters:
area (string) -- Area to be analyzed, possible values are
'TROPIC': Tropics
'GLOBAL': Global
'PACTROPIC': Pacific Tropics
'WORLD': World
'EUROPE': Europe
'NORTH_AMERICA': North America
'NH-ML': Northern Hemisphere Mid-Latitudes
var (string) -- Variable to be analyzed
level (string) -- Level to be analyzed
period (string) -- Period to be analyzed
version (string) -- Version of the dataset
loc (string) -- Location of the dataset
- Returns:
Z (xarray dataset) -- Field to be analyzed
arealat (numpy array) -- Latitude limits of the field
arealon (numpy array) -- Longitude limits of the field
- AIModels.AIutil.normalize_training_data(params, vdat, period='train', scaler=None, feature_scale=1)[source]¶
Normalize training data
- Parameters:
params (dict) -- Dictionary with the parameters for the analysis
vdat (numpy array) -- Data to be normalized
period (string) -- Period to be analyzed
scaler (object) -- Scaler object
feature_scale (float) -- Feature scale
- Returns:
datatr (numpy array) -- Normalized data
sstr (object) -- Scaler object
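A hedged sketch of the fit-once, reuse-elsewhere pattern this routine supports; StandardScaler is an assumption (the actual scaler is driven by params):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

vdat = np.random.rand(120, 20)             # (time, features)

# 'train' period: fit the scaler and transform the data
sstr = StandardScaler().fit(vdat)
datatr = sstr.transform(vdat)

# other periods: reuse the fitted scaler passed back in via `scaler`
datate = sstr.transform(np.random.rand(36, 20))
```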
- AIModels.AIutil.make_data(INX, params)[source]¶
Prepare data for analysis and concatenate as needed. Modifies the input INX dictionary by adding, for each field, the scaler used and the index of the field's data in the concatenated matrix.
The convention for indices is that they point to the real date; if Python ranges need to be defined, they must account for the extra 1 at the end of the range.
- Parameters:
INX (dict) -- Dictionary with the fields to be analyzed
params (dict) -- Dictionary with the parameters for the analysis
- Returns:
datain (numpy array) -- Matrix with the data to be analyzed
INX (dict) -- Input dictionary updated with the information from the data analysis
- AIModels.AIutil.make_features(INX)[source]¶
Prepare features for analysis and compute features boundaries
- Parameters:
INX (dict) -- Dictionary with the fields to be analyzed
- Returns:
num_features (int) -- Number of features
m_limit (list) -- List with the boundaries of the features
- AIModels.AIutil.make_data_base(InputVars, period='ANN', version='V5', SMOOTH=False, normalization='STD', eof_interval=None, detrend=False, shift='ERA5', case=None, datatype='Source_data', location='DDIR')[source]¶
Organize data variables in data base INX
- Parameters:
InputVars (list) -- List of variables to be analyzed
period (string) -- Period to be analyzed
version (string) -- Version of the dataset
SMOOTH (boolean) -- If True, smooth the data
normalization (string) -- Type of normalization to be applied
eof_interval (list) -- List with the starting and ending date for EOF analysis
detrend (boolean) -- If True, detrend the data
shift (string) -- Type of shift to be applied to select the data
case (string) -- Case to be analyzed
datatype (string) -- Type of data to be analyzed, either Source_data or Target_data
location (string) -- Data directory
- Returns:
INX -- Dictionary with the fields to be analyzed. The dictionary is organized as follows: INX = {'id1': {'case': case, 'datatype': datatype, 'field': invar.name, 'level': inlevel, 'centlon': centlon, 'arealat': arealat, 'arealon': arealon, 'X': X, 'xdat': xdat, 'mr': mr, 'var_retained': varr, 'udat': udat, 'vdat': vdat, 'sdat': sdat}}
- Return type:
dict
- AIModels.AIutil.create_time_features(data_time, start, device)[source]¶
Create the past features for monthly means
- Parameters:
data_time (xarray dataset) -- Time data
start (datetime) -- Starting date
device (string) -- Device to be used for computation
- Returns:
pasft -- Tensor with the past features, on three levels:
The first level is the month sequence
The second level is the seasonal cycle
The third is the year
- Return type:
torch tensor
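A hedged sketch of the three feature levels named above (month sequence, seasonal cycle, year) for a monthly series; the exact encodings are assumptions:

```python
import numpy as np
import pandas as pd
import torch

times = pd.date_range('2000-01-01', periods=24, freq='MS')
month_seq = np.arange(len(times))                     # running month index
season = np.sin(2 * np.pi * (times.month - 1) / 12)   # one seasonal harmonic
year = times.year - times.year[0]                     # years since start

pasft = torch.tensor(np.stack([month_seq, season, year], axis=-1),
                     dtype=torch.float32)             # (time, 3)
```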
- AIModels.AIutil.rescale(params, PDX, out_train0, out_val0, out_test0, verbose=True)[source]¶
Rescale data to original values, according to the scaling choice in params
- Parameters:
params (dict) -- Dictionary with the parameters for the analysis
PDX (dict) -- Dictionary with the information for the fields to be analyzed
out_train0 (numpy array) -- Training data
out_val0 (numpy array) -- Validation data
out_test0 (numpy array) -- Test data
- Returns:
out_train (numpy array) -- Rescaled training data
out_val (numpy array) -- Rescaled validation data
out_test (numpy array) -- Rescaled test data
true (numpy array) -- Original data
- AIModels.AIutil.matrix_rank_light(X, S)[source]¶
Compute the rank of a matrix using the singular values
- Parameters:
X (numpy array) -- Matrix to be analyzed
S (numpy array) -- Singular values of the matrix
- Returns:
rank -- Rank of the matrix
- Return type:
int
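An illustrative sketch of computing rank from precomputed singular values, using the default tolerance rule of numpy.linalg.matrix_rank; the cutoff actually used by matrix_rank_light is an assumption:

```python
import numpy as np

def rank_from_singular_values(X, S):
    # numpy's default tolerance: S.max() * max(X.shape) * machine epsilon
    tol = S.max() * max(X.shape) * np.finfo(S.dtype).eps
    return int((S > tol).sum())
```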
- AIModels.AIutil.CPRSS(observations, climate, forecasts)[source]¶
Compute the CRPSS score for ensemble forecasts with respect to the climate reference.
- Positive CRPSS (0 < CRPSS ≤ 1): the forecast model has better skill than the reference model; closer to 1 means greater improvement over the reference.
- Zero CRPSS (CRPSS = 0): the forecast model performs equally to the reference model.
- Negative CRPSS (CRPSS < 0): the forecast model performs worse than the reference model; more negative means greater underperformance.
- Parameters:
observations (numpy array) -- Observations data
climate (numpy array) -- Reference (climatology) data
forecasts (numpy array) -- Forecasts data
- Returns:
score -- CRPSS score
- Return type:
float
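A hedged sketch of the skill-score relation implied by the sign conventions above, CRPSS = 1 - CRPS_forecast / CRPS_reference, with the standard kernel estimator of the ensemble CRPS; this is not necessarily the library's implementation:

```python
import numpy as np

def crps_ensemble(obs, ens):
    """Kernel CRPS estimate: obs has shape (n,), ens has shape (n, m)."""
    spread_to_obs = np.abs(ens - obs[:, None]).mean(axis=1)
    spread_within = 0.5 * np.abs(ens[:, :, None] - ens[:, None, :]).mean(axis=(1, 2))
    return (spread_to_obs - spread_within).mean()

def crpss(observations, climate, forecasts):
    return 1.0 - crps_ensemble(observations, forecasts) / crps_ensemble(observations, climate)
```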
- AIModels.AIutil.transform_strings(strings)[source]¶
Transforms a list of strings by extracting the repeating pattern (2, 3, or 7 characters long) from each string.
- Parameters:
strings (list of str) -- List of input strings.
- Returns:
out -- Transformed strings containing only the repeating pattern.
- Return type:
list of str
- AIModels.AIutil.make_fcst_array(startdate, enddate, leads, data)[source]¶
Make forecast array for verification. The input uses the xarray DataArray format of dimension (ntim, lead, z) where z is a stacked coordinate.
The output is an xarray DataArray with the time, lead, and z dimensions, and the valid_time coordinate as a 2D array.
Lead times run from 0 (corresponding to the last element of the input sequence) to leads-1.
- Parameters:
startdate (string) -- Starting date for the forecast
enddate (string) -- Ending date for the forecast
leads (int) -- Number of leads, including the IC
data (xarray DataArray) -- DataArray with the forecast data
- Returns:
out -- Forecast array for verification
- Return type:
xarray DataArray
- AIModels.AIutil.eof_to_grid(field, forecasts, startdate, enddate, params=None, INX=None, truncation=None)[source]¶
Transform from the EOF representation to the grid representation, putting data in the format (Np, Tpredict, gridpoints) in stacked format, where Np is the total number of cases, given by $N_p = L - T_{IN} - T_{predict} + 1$, L is the total length of the test period, Tpredict is the number of lead times, and gridpoints is the number of grid points in the field. All fields start at month 1 of the prediction.
- Parameters:
field (string) -- Field to be analyzed
forecasts (numpy array) -- Forecasts data from the network
startdate (int) -- Starting date for the forecast (IC)
enddate (int) -- Ending date for the forecast
params (dict) -- Dictionary with the parameters for the analysis
INX (dict) -- Dictionary with the information for the fields to be analyzed
truncation (int) -- Number of modes to be retained in the observations
- Returns:
The routine returns arrays as xarray datasets with dims (Np, lead, gridpoints) in stacked format.
Fcst (xarray dataset) -- Forecast data in grid representation with dims (Np, lead, gridpoints)
Per (xarray dataset) -- Persistence data in grid representation with dims (Np, lead, gridpoints)
- AIModels.AIutil.advance_months(ts, n=1)[source]¶
Advance or reduce a Timestamp by n months
- Parameters:
ts (pd.Timestamp or str) -- Timestamp to be modified. If str, it should be in the YYYY-MM-DD format.
n (int) -- Number of months to advance (positive) or reduce (negative)
- Returns:
Modified timestamp. If input was a string, output will be a string in the YYYY-MM-DD format.
- Return type:
pd.Timestamp or str
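A minimal sketch consistent with the documented behavior, using pandas month offsets (which clip to month end); the string round-trip mirrors the description above:

```python
import pandas as pd

def advance_months_sketch(ts, n=1):
    as_str = isinstance(ts, str)
    out = pd.Timestamp(ts) + pd.DateOffset(months=n)
    return out.strftime('%Y-%m-%d') if as_str else out

advance_months_sketch('2000-01-31', 1)    # '2000-02-29' (clipped to month end)
advance_months_sketch('2000-03-15', -2)   # '2000-01-15'
```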
- AIModels.AIutil.project_dyn(data, INX, field, truncation=None)[source]¶
Project the dynamic data into the EOF space
- Parameters:
data (xarray dataset) -- Dynamic data
INX (dict) -- Dictionary with the information for the fields to be analyzed
field (string) -- Field to be analyzed
truncation (int) -- Number of modes to be retained in the observations
- Returns:
out -- Projected data, in stacked format with NaN values
- Return type:
xarray dataset
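An illustrative sketch of projecting a stacked grid-point field onto a truncated orthonormal EOF basis, which is the core operation described here; the shapes and names are assumptions:

```python
import numpy as np

U = np.linalg.qr(np.random.rand(500, 50))[0]   # stand-in orthonormal EOF modes
x = np.random.rand(500)                        # stacked grid-point field
k = 20                                         # truncation

coeffs = U[:, :k].T @ x                        # EOF-space representation
```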
- AIModels.AIutil.make_dyn_verification_new(ver_f, area, dyn_cases, dynddr, times, filever)[source]¶
Make dynamic verification data, reading all time levels of the GCM data.
- Parameters:
ver_f (numpy array) -- Verification data
area (string) -- Area to be analyzed
dyn_cases (list) -- List of cases to be analyzed
dynddr (string) -- Address of the dynamic data
times (numpy array) -- Time data
filever (string) -- Name of the file to be written
- Returns:
ver_d -- Verification data in numpy format
- Return type:
numpy array
- AIModels.AIutil.compute_increments(tensor, axis=0)[source]¶
Take the differences of a torch tensor along the specified axis and also return the initial value along that axis
- Parameters:
tensor (torch tensor) -- Tensor to be analyzed
axis (int) -- Axis along which to compute the differences
- Returns:
diff (torch tensor) -- Tensor with the differences along the specified axis
init_value (torch tensor) -- Initial value along the specified axis
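A hedged sketch matching the documented outputs, using torch.diff plus the leading slice along the chosen axis:

```python
import torch

def compute_increments_sketch(tensor, axis=0):
    diff = torch.diff(tensor, dim=axis)      # successive differences
    init_value = tensor.select(axis, 0)      # first slice along the axis
    return diff, init_value
```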
- AIModels.AIutil.cumsum_with_init(differences, init_value)[source]¶
Compute cumulative sum with initial values for a PyTorch tensor.
This function calculates the cumulative sum of a tensor containing differences, while incorporating specific initial values. The input tensor differences has dimensions (time, nlead, neof), and the initial values tensor init_value has dimensions (neof). The initial values are added to the first time step of each lead slice, and the cumulative sum is computed along the time dimension.
- Parameters:
differences (torch.Tensor) -- Tensor containing the differences with dimensions (time, nlead, neof).
init_value (torch.Tensor) -- 1D tensor representing the initial values, with size matching the last dimension (neof) of differences.
- Returns:
Tensor after computing the cumulative sum with initial values, having the same shape as the input differences.
- Return type:
torch.Tensor
- Raises:
ValueError -- If the size of init_value does not match the last dimension of differences.
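A hedged sketch of the add-then-accumulate logic described above for a (time, nlead, neof) tensor; it reproduces the documented behavior but is not the library source:

```python
import torch

def cumsum_with_init_sketch(differences, init_value):
    if init_value.shape[0] != differences.shape[-1]:
        raise ValueError("init_value must match the last (neof) dimension")
    x = differences.clone()
    x[0] += init_value               # broadcast (neof,) across the nlead axis
    return torch.cumsum(x, dim=0)    # accumulate along the time dimension
```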
- AIModels.AIutil.select_fcst(IC, my_data)[source]¶
Select the forecast data for the given initial condition IC from the xarray my_data dataset.
- Parameters:
IC (string) -- Initial condition for the forecast
my_data (xarray dataset) -- Forecast data
- Returns:
ds_single_init -- Forecast data for verification with coordinate "time" as the valid_time
- Return type:
xarray dataset
- AIModels.AIutil.variance_features(INX)[source]¶
Return the variance of the features
- Parameters:
INX (dict) -- Dictionary with the information for the fields to be analyzed
- Returns:
ssvar -- Variance of the features
- Return type:
numpy array
- AIModels.AIutil.create_subdirectory(parent_dir, subdirectory_name)[source]¶
Create a subdirectory within the specified parent directory if it does not exist.
- Parameters:
parent_dir (str) -- The path to the parent directory.
subdirectory_name (str) -- The name of the subdirectory to create.
- Returns:
The full path of the created or existing subdirectory.
- Return type:
str
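A sketch of the documented behavior using the standard library; the real implementation may differ:

```python
import os

def create_subdirectory_sketch(parent_dir, subdirectory_name):
    path = os.path.join(parent_dir, subdirectory_name)
    os.makedirs(path, exist_ok=True)   # no error if it already exists
    return path
```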
- AIModels.AIutil.eof_to_grid_new(field, forecasts, startdate, enddate, params=None, INX=None, truncation=None)[source]¶
Transform from the EOF representation to the grid representation. The routine now supports forecasts arrays with an extra ensemble dimension.
For the original case, forecasts has shape (Np, T, n_eofs) and the output forecast dataset has dims (Np, lead, gridpoints). Now, if forecasts has shape (Np, K, T, n_eofs), the output forecast dataset will have dims (Np, member, lead, gridpoints).
- Parameters:
field (string) -- Field to be analyzed
forecasts (numpy array) -- Forecast data from the network. Expected shape is either (Np, T, n_eofs) or (Np, K, T, n_eofs), where K is the ensemble size.
startdate (datetime-like) -- Starting date for the forecast (initial condition)
enddate (datetime-like) -- Ending date for the forecast
params (dict) -- Dictionary with the parameters for the analysis. Must include 'Tpredict', 'TIN', and 'T'
INX (dict) -- Dictionary with information for the fields to be analyzed (including the EOF patterns)
truncation (int, optional) -- Number of modes to be retained in the observations
- Returns:
Fcst (xarray DataArray) -- Forecast data in grid representation: if forecasts is 3D, dims are (time, lead, z); if forecasts is 4D, dims are (time, member, lead, z).
Per (xarray DataArray) -- Persistence data in grid representation, with the same dims as Fcst.
Obs (xarray DataArray) -- Observation data in grid representation, with dims (time, z)
- AIModels.AIutil.get_common_dates(Ft, Dt)[source]¶
Get common dates between two xarray datasets and their indices.
- Parameters:
Ft (xarray.DataArray) -- First dataset with a time dimension.
Dt (xarray.DataArray) -- Second dataset with a time dimension.
- Returns:
common_dates (numpy.ndarray) -- Array of common dates.
Ft_indices (numpy.ndarray) -- Indices of common dates in the first dataset.
Dt_indices (numpy.ndarray) -- Indices of common dates in the second dataset.
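An illustrative sketch of the intersection logic via numpy.intersect1d, which returns the common values together with the indices into both inputs; reading the dates from a 'time' coordinate is an assumption:

```python
import numpy as np

def get_common_dates_sketch(Ft, Dt):
    common_dates, Ft_indices, Dt_indices = np.intersect1d(
        Ft.time.values, Dt.time.values, return_indices=True)
    return common_dates, Ft_indices, Dt_indices
```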