API reference¶
This page provides an auto-generated summary of synthia’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
Data generators¶
- class synthia.CopulaDataGenerator(verbose=False)¶
Estimates the characteristics of a set of multi-feature samples and generates synthetic samples with the same or modified characteristics, based on copulas.
The input can be a numpy array or xarray DataArray of shape (sample, feature), or an xarray Dataset where all variables have shapes like (sample[, …]). For Datasets, all extra dimensions except the first are treated as features.
The output is in the same form as the input.
Algorithm:
Fitting phase
(Gaussian copula) The multivariate correlation between features is estimated and stored as a correlation matrix of shape (feature, feature). Matrix values are between -1 and 1 inclusive.
(Vine copula) The pyvinecopulib package is used to fit a vine copula that captures the multivariate correlation between features. See that package for further details.
(Optional) Some or all features of the input data are parameterized. If a feature is not parameterized, then the original data is used during generation. Parameterization may impact the quality of the synthetic samples. It can be useful for storing/re-distributing a data generator for later use without requiring the original data.
Per-feature summary statistics (min, max, median) of the input data are computed. These statistics are only used if the synthetic samples should be generated with modified characteristics (uniformization, stretching).
Note that all three steps in the fitting phase are independent of each other.
Generation phase
Generate new samples from the fitted copula model (Gaussian or Vine).
Transform copula samples to the feature scale by the quantile transform.
(Optional) Apply modifications (uniformization, stretching) if asked for.
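The fitting and generation phases above can be sketched with plain numpy for the Gaussian copula case. This is an illustrative sketch under stated assumptions, not synthia's implementation: the rank scaling `ranks / (n + 1)`, the use of normal scores to estimate the correlation matrix, and the empirical quantile transform via `np.quantile` are all choices made for this example.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
std_normal = NormalDist()

# Toy input of shape (sample, feature): two positively correlated features.
n, d = 1000, 2
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)

# --- Fitting phase ---
# Assumption: rank-standardize each feature into (0, 1), then map to normal scores.
ranks = x.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n per feature
u = ranks / (n + 1)
z = np.vectorize(std_normal.inv_cdf)(u)
corr = np.corrcoef(z, rowvar=False)             # shape (feature, feature)

# --- Generation phase ---
n_new = 2000
# 1. Sample from the Gaussian copula: correlated normals mapped to uniforms.
z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_new)
u_new = np.vectorize(std_normal.cdf)(z_new)
# 2. Quantile transform back to the feature scale, here using the original data.
synth = np.column_stack([np.quantile(x[:, j], u_new[:, j]) for j in range(d)])
```

Because the quantile transform in this sketch uses the original data directly, it corresponds to the non-parameterized case; the optional modification step is not shown.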
Example
>>> import xarray as xr
>>> from scipy.stats import multivariate_normal
>>> import synthia as syn
>>> # Generate dataset ~ N(0, .5) with 1000 random samples and 2 features.
>>> mvnorm = multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]])
>>> arr = xr.DataArray(mvnorm.rvs(1000))
>>> # Initialize the generator
>>> generator = syn.CopulaDataGenerator(verbose=False)
>>> # Fit the generator to the data using a Gaussian copula model.
>>> generator.fit(arr, copula=syn.GaussianCopula(), parameterize_by=None)
>>> # Generate twice as many samples from the Gaussian copula model.
>>> synth = generator.generate(n_samples=2000)
- __init__(verbose=False)¶
- Parameters
verbose (bool, optional) – If True, prints progress messages during computation.
- fit(data, copula, types=None, parameterize_by=None)¶
Fit the marginal distributions and copula model for all features.
- Parameters
data (ndarray or DataArray or Dataset) – The input data, either a 2D array of shape (sample, feature) or a dataset where all variables have the shape (sample[, …]).
copula – The underlying copula to use, for example a GaussianCopula object.
types (str or mapping, optional) – Indicates whether features are categorical (‘cat’), discrete (‘disc’), or continuous (‘cont’). The following forms are valid:
str
per-feature mapping {feature idx: str} – ndarray/DataArray only
per-variable mapping {var name: str} – Dataset only
parameterize_by (Parameterizer or mapping, optional) – The following forms are valid:
Parameterizer
per-feature mapping {feature idx: Parameterizer} – ndarray/DataArray only
per-variable mapping {var name: Parameterizer} – Dataset only
- Returns
None
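As an illustration of the three forms listed for `types`, the values below show what each could look like. The feature indices and variable names are hypothetical and chosen only for this example.

```python
# Form 1: a single type string, applied to all features.
types_all = "cont"

# Form 2: per-feature mapping {feature idx: str}, for ndarray/DataArray input.
types_by_feature = {0: "cat", 1: "cont", 2: "disc"}

# Form 3: per-variable mapping {var name: str}, for Dataset input.
# "temperature" and "cloud_flag" are made-up variable names.
types_by_variable = {"temperature": "cont", "cloud_flag": "cat"}
```

The same three shapes apply to `parameterize_by`, with Parameterizer objects in place of the type strings.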
- generate(n_samples, uniformization_ratio=0, stretch_factor=1, **copula_kws)¶
Generate synthetic data from the model.
- Parameters
n_samples (int) – Number of samples to generate.
uniformization_ratio (float or mapping, optional) – The following forms are valid:
ratio
per-feature mapping {feature idx: ratio} – ndarray/DataArray only
per-variable mapping {var name: ratio} – Dataset only
stretch_factor (float or mapping, optional) – The following forms are valid:
stretch factor
per-feature mapping {feature idx: stretch factor} – ndarray/DataArray only
per-variable mapping {var name: stretch factor} – Dataset only
- Returns
Synthetic samples in the form of the input data
- class synthia.FPCADataGenerator¶
Estimates the characteristics of a set of multi-feature samples and generates synthetic samples with the same or modified characteristics, based on (functional) principal component analysis.
The input can be a numpy array or xarray DataArray of shape (sample, feature), or an xarray Dataset where all variables have shapes like (sample[, …]). For Datasets, all extra dimensions except the first are treated as features.
The output is in the same form as the input.
Algorithm:
Fitting phase
Compute principal component vectors and, for every input sample, corresponding principal component scores.
Fit a distribution model for each score (using all samples).
Generation phase
Generate new samples of principal component scores from the fitted distributions.
Transform scores into synthetic data on the feature scale by multiplying with principal component vectors.
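The two phases above can be sketched with numpy as follows. This is a minimal illustration, not synthia's implementation: computing the components by SVD of the centered data and modelling each score with an independent normal distribution are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input of shape (sample, feature): 500 noisy curves on 20 grid points.
grid = np.linspace(0.0, 1.0, 20)
amp = rng.normal(1.0, 0.3, size=(500, 1))
x = amp * np.sin(2 * np.pi * grid) + rng.normal(0.0, 0.05, size=(500, 20))

# --- Fitting phase ---
n_components = 3
mean = x.mean(axis=0)
# Rows of vt are the principal component vectors of the centered data.
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
components = vt[:n_components]            # shape (component, feature)
scores = (x - mean) @ components.T        # shape (sample, component)
# Assumed distribution model: an independent normal per score.
score_mean = scores.mean(axis=0)
score_std = scores.std(axis=0, ddof=1)

# --- Generation phase ---
n_new = 1000
new_scores = rng.normal(score_mean, score_std, size=(n_new, n_components))
# Multiply scores with the component vectors and add the mean back.
synth = mean + new_scores @ components
```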
- fit(data, n_fpca_components, n_samples_reduced=None)¶
Find the first n_fpca_components principal component vectors and fit marginal distributions to the corresponding scores.
- Parameters
data (ndarray or DataArray or Dataset) – The input data, either a 2D array of shape (sample, feature) or a dataset where all variables have the shape (sample[, …]).
n_fpca_components (int) – Number of principal components to keep. Reduces the number of features.
n_samples_reduced (int, optional) – Reduces the number of samples
- Returns
None
- generate(n_samples, scaling_factor=None)¶
Generate synthetic data from the model.
- Parameters
n_samples (int) – Number of samples to generate.
scaling_factor (float, optional) – tbd
- Returns
Synthetic samples in the form of the input data
- reconstruct(eig_scores)¶
Copulas¶
- class synthia.GaussianCopula¶
A Gaussian copula.
- fit(rank_standardized)¶
Fit a Gaussian copula to data.
- Parameters
rank_standardized (ndarray) – 2D array of shape (sample, feature) with values in range [0,1]
- Returns
None
- generate(n_samples, qrng=False, seed=None)¶
Generate n_samples Gaussian copula entries.
- Parameters
n_samples (int) – Number of samples to generate.
qrng (bool, optional) – If True, quasirandom numbers are generated using pyvinecopulib.
- Returns
2D array of shape (n_samples, feature) with Gaussian copula entries.
- class synthia.VineCopula(controls=None)¶
A Vine copula.
- __init__(controls=None)¶
- Parameters
controls (pyvinecopulib.FitControlsVinecop, optional) – Controls for fitting vine copula models.
- fit(rank_standardized)¶
Fit a Vine copula to continuous data.
- Parameters
rank_standardized (ndarray) – 2D array of shape (sample, feature) with values in range [0,1]
- Returns
None
- fit_with_discrete(rank_standardized, is_discrete)¶
Fit a Vine copula to mixed continuous/discrete data.
- Parameters
rank_standardized (ndarray) – 2D array of shape (sample, feature) with values in range [0,1]
is_discrete (List[bool]) – 1D list of booleans of shape (feature) indicating whether features are discrete or continuous
- Returns
None
- generate(n_samples, qrng=False, num_threads=1, seed=None)¶
Generate n_samples Vine copula entries.
- Parameters
n_samples (int) – Number of samples to generate.
- Returns
2D array of shape (sample, feature) with Vine copula entries.
Parameterizers¶
- class synthia.ConstParameterizer(val)¶
Preserves the size of the original data. No downsampling is performed at fitting and no interpolation is performed at generation.
- __init__(val)¶
- fit(data)¶
- generate(n_samples)¶
Returns original samples without interpolation.
- class synthia.QuantileParameterizer(n_quantiles)¶
Compresses the original data. Downsampling is performed at fitting using quantiles, and samples are produced at generation using cubic interpolation.
- __init__(n_quantiles)¶
- fit(data)¶
Downsamples by fitting quantile vector of user-defined size.
- generate(n_samples)¶
Generate samples vector using scipy.interpolate.PchipInterpolator.
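The quantile-based compression can be sketched as follows for a single feature. This is an illustrative sketch, not synthia's implementation: the evenly spaced probability levels and the evaluation grid used at generation are assumptions; only the use of `scipy.interpolate.PchipInterpolator` is taken from the description above.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.5, size=10_000)  # one feature, 1D

# Fitting: compress the feature to n_quantiles evenly spaced quantiles.
n_quantiles = 20
probs = np.linspace(0.0, 1.0, n_quantiles)
quantiles = np.quantile(data, probs)

# Generation: interpolate the stored quantile function at n_samples points.
n_samples = 500
quantile_fn = PchipInterpolator(probs, quantiles)
samples = quantile_fn(np.linspace(0.0, 1.0, n_samples))
```

PCHIP interpolation preserves monotonicity, so the reconstructed quantile function stays non-decreasing even though only 20 of the 10,000 original values are stored.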
- class synthia.DistributionParameterizer(dist_names=None, verbose=False)¶
Compresses the original data. Downsampling is performed by fitting the “best” parametric distribution from those supported in scipy. Generation is performed using the respective random variates method in scipy.
- __init__(dist_names=None, verbose=False)¶
- Parameters
dist_names (sequence of str, optional) – Names of scipy distributions to use.
verbose (bool, optional) – If True, print progress messages.
- fit(data)¶
Fits multiple distributions to the data and keeps the best fit according to the Kolmogorov-Smirnov test.
- Parameters
data (ndarray) – 1D array
- generate(n_samples)¶
Generate n_samples random samples from the best-fitting distribution.
- Parameters
n_samples (int) – Number of samples to generate.
- Returns
1D array of shape (n_samples,)
- static get_dist_names()¶
Returns a list of names of all supported scipy distributions.
- Returns
List of distribution names
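Selecting a best-fitting distribution with the Kolmogorov-Smirnov test can be sketched with `scipy.stats` as below. This is a hedged illustration rather than synthia's code: the helper `fit_best_distribution` is hypothetical, and ranking candidates by the smallest KS statistic (rather than, say, the largest p-value) is an assumption.

```python
import numpy as np
from scipy import stats


def fit_best_distribution(data, dist_names):
    """Fit each named scipy distribution and keep the smallest KS statistic."""
    best = None
    for name in dist_names:
        dist = getattr(stats, name)
        params = dist.fit(data)  # maximum-likelihood parameter estimates
        ks_stat = stats.kstest(data, name, args=params).statistic
        if best is None or ks_stat < best[2]:
            best = (name, params, ks_stat)
    return best


rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

name, params, ks_stat = fit_best_distribution(data, ["norm", "expon", "uniform"])
# Generation draws random variates from the selected distribution.
samples = getattr(stats, name).rvs(*params, size=100, random_state=rng)
```

Only the distribution name and its fitted parameters need to be stored, which is what makes this form of parameterization compact.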
Transformers¶
- class synthia.ArcTanhTransformer(var_names)¶
Preview: Performs inverse hyperbolic tangent transformations on original data. Assumption: the first dimension is samples and all others are features. That is, transformations are applied per feature.
- __init__(var_names)¶
- apply(ds)¶
- revert(ds)¶
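The underlying math of the apply/revert pair can be sketched with numpy. This shows only the transformation itself on a plain array, assuming all values lie strictly inside (-1, 1); it does not reproduce the class's handling of `var_names` or datasets.

```python
import numpy as np

# Toy (sample, feature) data with all values strictly inside (-1, 1).
x = np.array([[-0.9, 0.0], [0.5, 0.99], [0.1, -0.3]])

transformed = np.arctanh(x)      # "apply": maps (-1, 1) to an unbounded scale
restored = np.tanh(transformed)  # "revert": maps back to (-1, 1)
```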
- class synthia.BoxCoxTransformer(var_names, lmbda, boundary_location='left')¶
Preview: Performs Box-Cox transformations on original data. Assumption: the first dimension is samples and all others are features. That is, transformations are applied per feature.
- __init__(var_names, lmbda, boundary_location='left')¶
- apply(ds)¶
- revert(ds)¶
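For reference, the standard Box-Cox transform and its inverse can be written in a few lines of numpy. The helper names `boxcox` and `inv_boxcox` are defined here for illustration and assume strictly positive input; the role of `boundary_location` in the class above is not covered by this sketch.

```python
import numpy as np


def boxcox(x, lmbda):
    """Box-Cox transform: (x**lmbda - 1) / lmbda, or log(x) when lmbda == 0."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x**lmbda - 1.0) / lmbda


def inv_boxcox(y, lmbda):
    """Inverse of the Box-Cox transform."""
    y = np.asarray(y, dtype=float)
    if lmbda == 0:
        return np.exp(y)
    return (lmbda * y + 1.0) ** (1.0 / lmbda)


x = np.array([0.5, 1.0, 2.0, 10.0])
y = boxcox(x, 0.5)
x_back = inv_boxcox(y, 0.5)
```

Equivalent functions are also available as `scipy.special.boxcox` and `scipy.special.inv_boxcox`.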
Utilities¶
- synthia.util.load_dataset(name='SAF-Synthetic')¶
Return a dataset of 25,000 synthetic temperature profiles from the SAF dataset (http://dx.doi.org/10.13140/2.1.4476.8963). These were fitted with 6 fPCA components in Synthia version 0.2.0.