API reference¶

This page provides an auto-generated summary of synthia’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

Data generators¶

class synthia.CopulaDataGenerator(verbose=False)¶

Estimates the characteristics of a set of multi-feature samples and generates synthetic samples with the same or modified characteristics, based on copulas.

The input can be a numpy array or xarray DataArray of shape (sample, feature), or an xarray Dataset where all variables have shapes like (sample[, …]). For Datasets, all dimensions other than the first (sample) dimension are treated as features.

The output is in the same form as the input.

Algorithm:

Fitting phase

(Gaussian copula) The multivariate correlation between features is estimated and stored as a correlation matrix with shape (feature, feature). Matrix values are between -1 and 1 inclusive.

(Vine copula) The pyvinecopulib package is used to fit a vine copula that captures the multivariate correlation between features. See that package for further details.

(Optional) Some or all features of the input data are parameterized. If a feature is not parameterized, then the original data is used during generation. Parameterization may impact the quality of the synthetic samples. It can be useful for storing/re-distributing a data generator for later use without requiring the original data.

Per-feature summary statistics (min, max, median) of the input data are computed. These statistics are only used if the synthetic samples should be generated with modified characteristics (uniformization, stretching).

Note that all three steps in the fitting phase (copula fitting, parameterization, summary statistics) are independent of each other.

Generation phase

Generate new samples from the fitted copula model (Gaussian or Vine).

Transform copula samples to the feature scale by the quantile transform.

(Optional) Apply modifications (uniformization, stretching) if asked for.
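The fitting and generation phases above can be sketched with NumPy and SciPy alone. This is a minimal illustration of the Gaussian copula technique, not synthia's implementation: the rank-standardization formula, the variable names, and the use of empirical quantiles of the original data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy input of shape (sample, feature).
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)
n_samples, n_features = data.shape

# Fitting: rank-standardize each feature into (0, 1) (assumed formula),
# map to normal scores, and estimate the (feature, feature) correlation matrix.
ranks = np.argsort(np.argsort(data, axis=0), axis=0) + 1
u = ranks / (n_samples + 1)
z = norm.ppf(u)
corr = np.corrcoef(z, rowvar=False)

# Generation: draw from the Gaussian copula, i.e. correlated normals
# mapped back to uniform margins.
z_new = rng.multivariate_normal(np.zeros(n_features), corr, size=2000)
u_new = norm.cdf(z_new)

# Quantile transform: map copula samples to the feature scale using the
# empirical quantiles of the original (non-parameterized) data.
synth = np.column_stack(
    [np.quantile(data[:, j], u_new[:, j]) for j in range(n_features)]
)
```

Because the quantile transform reads values off the original data, the synthetic samples in this sketch never leave the observed range of each feature.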

Example

>>> import xarray as xr
>>> from scipy.stats import multivariate_normal
>>> import synthia as syn
>>> # Generate a bivariate normal dataset (zero mean, unit variance, correlation 0.5) with 1000 samples and 2 features.
>>> mvnorm = multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]])
>>> arr = xr.DataArray(mvnorm.rvs(1000))
>>> # Initialize the generator
>>> generator = syn.CopulaDataGenerator(verbose=False)
>>> # Fit the generator to the data using a Gaussian copula model.
>>> generator.fit(arr, copula=syn.GaussianCopula(), parameterize_by=None)
>>> # Generate twice as many samples from the Gaussian copula model.
>>> synth = generator.generate(n_samples=2000)

__init__(verbose=False)¶

Parameters: verbose (bool, optional) – If True, prints progress messages during computation.

fit(data, copula, types=None, parameterize_by=None)¶

Fit the marginal distributions and copula model for all features.

Parameters

data (ndarray or DataArray or Dataset) – The input data, either a 2D array of shape (sample, feature) or a dataset where all variables have the shape (sample[, …]).
copula – The underlying copula to use, for example a GaussianCopula object.
types (str or mapping, optional) – Indicates whether features are categorical (‘cat’), discrete (‘disc’), or continuous (‘cont’). The following forms are valid:
- str
- per-feature mapping {feature idx: str} – ndarray/DataArray only
- per-variable mapping {var name: str} – Dataset only
parameterize_by (Parameterizer or mapping, optional) – The following forms are valid:
- Parameterizer
- per-feature mapping {feature idx: Parameterizer} – ndarray/DataArray only
- per-variable mapping {var name: Parameterizer} – Dataset only

Returns

None

generate(n_samples, uniformization_ratio=0, stretch_factor=1, **copula_kws)¶

Generate synthetic data from the model.

Parameters

n_samples (int) – Number of samples to generate.
uniformization_ratio (float or mapping, optional) – The following forms are valid:
- ratio
- per-feature mapping {feature idx: ratio} – ndarray/DataArray only
- per-variable mapping {var name: ratio} – Dataset only
stretch_factor (float or mapping, optional) – The following forms are valid:
- stretch factor
- per-feature mapping {feature idx: stretch factor} – ndarray/DataArray only
- per-variable mapping {var name: stretch factor} – Dataset only

Returns

Synthetic samples in the form of the input data
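This reference does not define the two modifications precisely. Since min, max, and median are the stored per-feature statistics, one plausible reading is that uniformization blends each sample with a uniform draw on [min, max] and stretching scales the distance from the median. The sketch below implements that reading only; the function `modify` and its formulas are assumptions, not synthia's code.

```python
import numpy as np

def modify(samples, u, lo, hi, median,
           uniformization_ratio=0.0, stretch_factor=1.0):
    """Hypothetical per-feature modification step (assumed formulas)."""
    # Uniform counterpart of each sample at the same copula probability u.
    uniform = lo + (hi - lo) * u
    # Ratio 0 keeps the original distribution; ratio 1 makes it uniform.
    blended = (1.0 - uniformization_ratio) * samples + uniformization_ratio * uniform
    # Factor 1 leaves values unchanged; larger factors widen the spread
    # around the median.
    return median + stretch_factor * (blended - median)

rng = np.random.default_rng(0)
original = rng.normal(size=1000)
lo, hi, median = original.min(), original.max(), np.median(original)

u = rng.uniform(size=500)            # stand-in for copula samples of one feature
samples = np.quantile(original, u)   # quantile transform to the feature scale

unchanged = modify(samples, u, lo, hi, median)
stretched = modify(samples, u, lo, hi, median, stretch_factor=2.0)
uniform = modify(samples, u, lo, hi, median, uniformization_ratio=1.0)
```

With the defaults (ratio 0, factor 1) the samples pass through unchanged, which matches the default arguments in the signature above.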

class synthia.FPCADataGenerator¶

Estimates the characteristics of a set of multi-feature samples and generates synthetic samples with the same or modified characteristics, based on (functional) principal component analysis.

The input can be a numpy array or xarray DataArray of shape (sample, feature), or an xarray Dataset where all variables have shapes like (sample[, …]). For Datasets, all dimensions other than the first (sample) dimension are treated as features.

The output is in the same form as the input.

Algorithm:

Fitting phase

Compute principal component vectors and, for every input sample, corresponding principal component scores.

Fit a distribution model for each score (using all samples).

Generation phase

Generate new samples of principal component scores from the fitted distributions.

Transform scores into synthetic data on the feature scale by multiplying with principal component vectors.
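The two phases above can be sketched with a plain NumPy PCA. This is an illustration under assumptions, not synthia's implementation: the data are centered before decomposition, and each score is modeled with an independent normal distribution, whereas the reference does not say which distribution model is fitted.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy input of shape (sample, feature) with correlated features.
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Fitting: principal component vectors and per-sample scores.
n_components = 3
mean = data.mean(axis=0)
centered = data - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:n_components]        # (component, feature)
scores = centered @ components.T      # (sample, component)

# Fit a distribution model for each score (assumption: normal).
score_mean = scores.mean(axis=0)
score_std = scores.std(axis=0, ddof=1)

# Generation: sample new scores, then multiply with the component vectors
# to return to the feature scale.
new_scores = rng.normal(score_mean, score_std, size=(400, n_components))
synth = new_scores @ components + mean
```

Keeping only the leading components means the synthetic data lie in a three-dimensional subspace of the five features in this sketch.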

fit(data, n_fpca_components, n_samples_reduced=None)¶

Find the first n_fpca_components principal component vectors and fit marginal distributions to the corresponding scores.

Parameters

data (ndarray or DataArray or Dataset) – The input data, either a 2D array of shape (sample, feature) or a dataset where all variables have the shape (sample[, …]).
n_fpca_components (int) – Number of principal components to retain, which reduces the number of features
n_samples_reduced (int, optional) – Reduces the number of samples

Returns

None

generate(n_samples, scaling_factor=None)¶

Generate synthetic data from the model.

Parameters

n_samples (int) – Number of samples to generate.
scaling_factor (float, optional) – tbd

Returns

Synthetic samples in the form of the input data

reconstruct(eig_scores)¶

Copulas¶

class synthia.GaussianCopula¶

A Gaussian copula.

fit(rank_standardized)¶

Fit a Gaussian copula to data.

Parameters: rank_standardized (ndarray) – 2D array of shape (sample, feature) with values in range [0,1]
Returns: None

generate(n_samples, qrng=False, seed=None)¶

Generate n_samples Gaussian copula entries.

Parameters

n_samples (int) – Number of samples to generate.
qrng (bool, optional) – If True, quasirandom numbers are generated using pyvinecopulib.

Returns

2D array of shape (n_samples, feature) with Gaussian copula entries.

class synthia.VineCopula(controls=None)¶

A Vine copula.

__init__(controls=None)¶

Parameters: controls (pyvinecopulib.FitControlsVinecop, optional) – Controls for fitting vine copula models.

fit(rank_standardized)¶

Fit a Vine copula to continuous data.

Parameters: rank_standardized (ndarray) – 2D array of shape (sample, feature) with values in range [0,1]
Returns: None

fit_with_discrete(rank_standardized, is_discrete)¶

Fit a Vine copula to mixed continuous/discrete data.

Parameters

rank_standardized (ndarray) – 2D array of shape (sample, feature) with values in range [0,1]
is_discrete (List[bool]) – 1D list of booleans of shape (feature) indicating whether features are discrete or continuous

Returns

None

generate(n_samples, qrng=False, num_threads=1, seed=None)¶

Generate n_samples Vine copula entries.

Parameters: n_samples (int) – Number of samples to generate.
Returns: 2D array of shape (sample, feature) with Vine copula entries.

Parameterizers¶

class synthia.ConstParameterizer(val)¶

Preserves the size of the original data. No downsampling is performed at fitting and no interpolation is performed at generation.

__init__(val)¶

fit(data)¶

generate(n_samples)¶: Returns original samples without interpolation.

class synthia.QuantileParameterizer(n_quantiles)¶

Compresses the original data. Downsampling is performed at fitting using quantiles, and samples are produced at generation using piecewise cubic interpolation.

__init__(n_quantiles)¶

fit(data)¶: Downsamples by fitting a quantile vector of user-defined size.

generate(n_samples)¶: Generates a vector of samples using scipy.interpolate.PchipInterpolator.
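The downsample-then-interpolate idea can be sketched as follows. This is a minimal illustration, not synthia's implementation: the evenly spaced probability grid and the variable names are assumptions, and only `np.quantile` and `scipy.interpolate.PchipInterpolator` are real library calls.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

rng = np.random.default_rng(0)
data = rng.gamma(2.0, size=10_000)   # one feature, 1D

# Fitting: compress the data into a small quantile vector.
n_quantiles = 20
probs = np.linspace(0.0, 1.0, n_quantiles)
quantiles = np.quantile(data, probs)

# Generation: interpolate the quantile function with a monotone piecewise
# cubic and evaluate it at uniform probabilities.
inverse_cdf = PchipInterpolator(probs, quantiles)
samples = inverse_cdf(rng.uniform(size=1000))
```

PCHIP interpolation preserves monotonicity of the quantile vector, so the interpolated quantile function does not overshoot between the stored quantiles and the samples stay within the observed range.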

class synthia.DistributionParameterizer(dist_names=None, verbose=False)¶

Compresses the original data. Downsampling is performed by fitting the “best” parametric distribution from those supported in scipy. Generation is performed using the respective random variates method in scipy.

__init__(dist_names=None, verbose=False)¶

Parameters

dist_names (sequence of str, optional) – Names of scipy distributions to use.
verbose (bool, optional) – If True, print progress messages.

fit(data)¶

Fits data on multiple distributions and keeps the best fit according to the Kolmogorov-Smirnov test.

Parameters: data (ndarray) – 1D array

generate(n_samples)¶

Generate n_samples random samples from the best-fitting distribution.

Parameters: n_samples (int) – Number of samples to generate.
Returns: 1D array of shape (n_samples,)
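The fit-and-select procedure described above can be sketched with scipy.stats directly. This is an illustration under assumptions, not synthia's implementation: the candidate list is a hypothetical three-name subset, and the "best" fit is taken to be the one with the smallest Kolmogorov-Smirnov statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=500)   # one feature, 1D

# Hypothetical candidate subset of scipy distribution names.
candidates = ["norm", "expon", "uniform"]

best_name, best_params, best_stat = None, None, np.inf
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)                       # maximum-likelihood fit
    stat, _ = stats.kstest(data, name, args=params)
    if stat < best_stat:                          # smaller statistic = closer fit
        best_name, best_params, best_stat = name, params, stat

# Generation: draw random variates from the selected distribution.
samples = getattr(stats, best_name).rvs(*best_params, size=100, random_state=rng)
```

Only the distribution name and its fitted parameters need to be stored, which is what makes this form of parameterization compact.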

static get_dist_names()¶

Returns a list of names of all supported scipy distributions.

Returns: List of distribution names

Transformers¶

class synthia.ArcTanhTransformer(var_names)¶

Preview: Performs inverse hyperbolic tangent transformations on original data. Assumption: the first dimension is samples and all others are features. That is, transformations are applied per feature.

__init__(var_names)¶

apply(ds)¶

revert(ds)¶

class synthia.BoxCoxTransformer(var_names, lmbda, boundary_location='left')¶

Preview: Performs Box-Cox transformations on original data. Assumption: the first dimension is samples and all others are features. That is, transformations are applied per feature.

__init__(var_names, lmbda, boundary_location='left')¶

apply(ds)¶

revert(ds)¶
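The forward and inverse Box-Cox transformations can be illustrated with SciPy. This sketch shows the mathematical transform only; it does not use BoxCoxTransformer, and it says nothing about how the boundary_location parameter is handled.

```python
import numpy as np
from scipy.special import boxcox, inv_boxcox

# Toy positive data of shape (sample, feature).
data = np.array([[0.5, 1.0], [2.0, 4.0], [8.0, 16.0]])
lmbda = 0.5

# Forward transform: (x**lmbda - 1) / lmbda for lmbda != 0, log(x) for lmbda == 0.
transformed = boxcox(data, lmbda)

# Inverse transform recovers the original values.
restored = inv_boxcox(transformed, lmbda)
```

The transform is elementwise, so it acts on each feature independently, in line with the per-feature assumption stated above.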

class synthia.CombinedTransformer(transformers)¶

Preview: Performs combined transformations.

__init__(transformers)¶

apply(ds)¶

revert(ds)¶
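Chaining transformations means that reverting must undo them in the opposite order. The sketch below shows this with plain functions. The `steps` list and the helpers `apply_all` and `revert_all` are hypothetical stand-ins, and the apply order is an assumption rather than documented behavior of CombinedTransformer.

```python
import numpy as np

# Hypothetical (apply, revert) pairs standing in for transformer objects.
steps = [
    (lambda x: 2.0 * x - 1.0, lambda y: (y + 1.0) / 2.0),  # rescale (0, 1) to (-1, 1)
    (np.arctanh, np.tanh),                                  # inverse hyperbolic tangent
]

def apply_all(x):
    # Apply each transformation in list order (assumed order).
    for apply, _ in steps:
        x = apply(x)
    return x

def revert_all(y):
    # Undo the transformations in reverse order.
    for _, revert in reversed(steps):
        y = revert(y)
    return y

data = np.array([0.1, 0.5, 0.9])
round_trip = revert_all(apply_all(data))
```

Reversing the order on revert is what makes the round trip return the original values, whichever order is used on apply.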

Utilities¶

synthia.util.load_dataset(name='SAF-Synthetic')¶: Return a dataset of 25 000 synthetic temperature profiles from the SAF dataset (http://dx.doi.org/10.13140/2.1.4476.8963). These were fitted with 6 fPCA components in Synthia version 0.2.0.