
Frequently Asked Questions

Synthetic Data Generation FAQ

  • What is synthetic data generation (SDG)?

    There are many ways to generate synthetic data, but in essence synthetic data is created by fitting a model to real data and then sampling from that model to generate synthetic records.

    The real data is the original personally identifiable dataset that is available, for example health records, financial transaction data, or sales data. This is the data that we want to create non-identifiable, synthetic versions of.

    We take that data and build a model which captures the statistical properties and the patterns in that data. This is typically a machine learning model, and these types of models have become quite good at capturing the subtle patterns in complex datasets. Such models are also called generative models or data simulators.

    Once a model is built, we can use it to generate new data. This is the synthetic data: the new records come from the fitted model, so there is no one-to-one mapping between the synthetic records and the records in the source data. This is why synthetic data can be quite protective against identity disclosure and other privacy risks.

    You have probably already seen synthetic data in the context of deep fakes: realistic computer-generated images of people who do not exist. As those images show, the technology has become quite good.

    The same principles apply to structured data, which is the type of data that we are interested in.

    The goal with synthetic data, then, is to generate new records that look realistic. In the context of datasets that include personal information, we are also interested in preserving the privacy of the individuals in the real data. Synthetic data generation therefore needs to balance preserving privacy with generating high-utility data that looks realistic.
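    The fit-then-sample idea can be illustrated with a minimal sketch. Everything here is a toy assumption: the variables, their distributions, and the single multivariate Gaussian stand in for the far richer machine learning models used in practice.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical "real" data: age and systolic blood pressure
    # for 500 patients, with a positive correlation between them.
    age = rng.normal(55.0, 10.0, 500)
    bp = 90.0 + 0.6 * age + rng.normal(0.0, 8.0, 500)
    real = np.column_stack([age, bp])

    # Step 1: fit a (deliberately simple) generative model -- here a
    # single multivariate Gaussian estimated from the real data.
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)

    # Step 2: sample new, synthetic records from the fitted model.
    synthetic = rng.multivariate_normal(mean, cov, size=500)

    # The synthetic rows come from the model, not from the source data,
    # so there is no one-to-one mapping back to individual patients.
    print(synthetic.shape)  # (500, 2)
    # The age/blood-pressure correlation in `synthetic` will be close
    # to the correlation in `real`, which is what "high utility" means.
    ```

    A real SDG tool would use a much more expressive model, but the two steps (fit, then sample) are the same.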

  • Do you need to know how the data will be used for SDG?

    A common question about synthetic data generation is whether you need to know how the data will be used in advance. We assume here that "use" means some form of data analysis (e.g., a regression or machine learning algorithm). In other words, is the data synthesis specific to the analysis that will be performed on the dataset?

    The general answer is no. The basic idea is that the data synthesis process captures many, or almost all, of the patterns in the data, so that the synthetic data can be used in arbitrary ways. Of course there are limitations: for example, very rare events may not be captured well by the generative model. In general, however, generative models have been good at capturing the patterns in the data.

    Therefore, when synthetic data is generated no a priori assumptions are made about the eventual data uses.

  • How large can the source data be?

    One other question that often comes up is about how large the source datasets need to be or can be. This question can be interpreted in two different ways:

    What is the minimal dataset size for SDG to work?

    Because at Replica Analytics many of our initial projects were with clinical trial data, we have developed generative models that work well with small datasets, meaning datasets with, say, 100 to 200 patients. If there are fewer than 100 patients or so, it will be difficult to build a generative model.

    An important factor to consider is the number of variables in the dataset. It is easier to create a generative model with a dataset having 500 patients and 10 variables than a dataset having 500 patients and 500 variables. In practice for the latter case, many of these variables will be redundant, derived, or have many missing values. Therefore, it is often not as extreme as it sounds. But the point is that the number of variables is an important consideration.

    The statistical machine learning techniques that work very well for large datasets, like deep learning methods, will generally not do well on small datasets. Therefore we need to have a toolbox of different techniques that we apply depending on the nature of the data.

    What is the largest source dataset that can be synthesized?

    The answer to this question will partially depend on the computing capacity that is available to the SDG software. The creation of a generative model is computationally demanding, and depending on the methods used, the availability of GPUs will have an impact on the answer.

    There are techniques to handle large datasets in SDG, covering both algorithmic and implementation approaches. These have been implemented in our Replica Synthesis software.

    Another consideration is the use of simulators. A simulator is built once: after it is created, it can be used repeatedly to generate many datasets for multiple users and use cases. The behavior of a simulator does not depend on the size of the source dataset.

  • What is a simulator exchange?

    Taking the concept of synthetic data one step further, Replica Synthesis also implements a simulator exchange. This is another quite powerful way to share non-identifiable data with a very broad community, except that instead of sharing data we share the generative models, or simulators.

    Note that when we say that we share the generative models, we mean that we provide access to the generative models so that users can use them to generate synthetic data. It does not mean that we share the actual models with the users.

    A simulator exchange is a catalogue of generative models. A data user searches the metadata for the type of data that they want and finds the appropriate simulator. They can then generate as much data as they want from that simulator, whether 100 records or 1 million records, regardless of how large the source data was.

    We have found that data custodians are much more comfortable sharing simulators than sharing data. Simulators are one step removed from data, and the data generated from a simulator is not considered personally identifying information. This can also be empirically verified.

    Under this scenario the data consumers are not accessing any real data. They only access the simulators, which means that the simulator exchange can be quite open and available to many people within an organization with few constraints (at least for privacy reasons).

    Other advantages of simulators are that you can track how much data has been consumed and who is consuming the most data.

    The metadata that accompanies a simulator includes reports on its utility and privacy, license information, and a data dictionary. In addition, the user can add arbitrary information (such as articles or other reference documentation).
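    The build-once, generate-many idea behind a simulator can be sketched as follows. The `Simulator` class, its fields, and the Gaussian model here are purely hypothetical illustrations of the concept, not the actual Replica Synthesis implementation.

    ```python
    import numpy as np

    class Simulator:
        """Hypothetical simulator: a fitted generative model packaged
        with metadata, built once and then reused to generate data."""

        def __init__(self, mean, cov, metadata):
            self.mean = np.asarray(mean)
            self.cov = np.asarray(cov)
            self.metadata = metadata       # e.g., data dictionary, license
            self.records_generated = 0     # consumption can be tracked

        def generate(self, n, seed=None):
            """Draw n synthetic records from the packaged model."""
            rng = np.random.default_rng(seed)
            self.records_generated += n
            return rng.multivariate_normal(self.mean, self.cov, size=n)

    # Build the simulator once (the model parameters would normally
    # come from fitting real data, as in the earlier fit-then-sample step).
    sim = Simulator(mean=[55.0, 123.0],
                    cov=[[100.0, 60.0], [60.0, 100.0]],
                    metadata={"name": "clinic_demo", "license": "internal"})

    # ...then any consumer can draw as many records as they need,
    # regardless of how large the source dataset was.
    small = sim.generate(100, seed=1)
    large = sim.generate(1_000_000, seed=2)
    print(small.shape, large.shape)   # (100, 2) (1000000, 2)
    print(sim.records_generated)      # 1000100
    ```

    Because consumers only ever call `generate`, they never touch the real data, which is the property that makes an exchange of simulators easier to open up than an exchange of datasets.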

Replica Synthesis FAQ

  • What is Replica Synthesis?

    Replica Synthesis is the main software product from Replica Analytics. It is a scalable enterprise application that allows users to create synthetic data and to build simulators. The main features of the product are as follows:

    SDG The main functionality is that Replica Synthesis allows users to create synthetic datasets from source datasets.

    SDG Workflows A powerful capability in the software is the workflow designer, which allows complex pipelines to be defined, including joins, pooling, defining cohorts, and a powerful scripting capability for pre- and post-processing datasets.

    Simulators Users can also package their generative models as simulators and share them with their team or with others in their organization.

    Simulator Exchange The data sharing policy designer allows simulator exchanges to be defined. A user can search the catalogue of simulators to find the ones that meet their needs and generate data from them.

    Privacy Assurance The privacy risks in synthetic data can be evaluated using our unified privacy assurance model.

    APIs and SDKs The Replica Synthesis engine can be accessed programmatically through a REST API, an R package, and a Python library. This allows easy integration into analytics pipelines.

Privacy Assurance FAQ

  • What is Privacy Assurance?

    One of the benefits of synthetic data is that it has low privacy risks. Privacy risks can be measured in a number of different ways.

    Replica Analytics has developed a unified privacy model for synthetic data that considers attribute disclosure conditional on identity disclosure, and membership disclosure. This is a comprehensive way to think about the privacy risks in synthetic datasets.

    The privacy risk assessment can be performed on every dataset that is generated by Replica Synthesis. A report is generated summarizing the risks in the data. (coming in Replica Synthesis 2.0)

    Attribute disclosure conditional on identity disclosure is the result of two tests: 

    1. Can an adversary match a synthetic record with a real person?
    2. If a match is successful, will the adversary learn something new and correct from that match?

    The assessment produces an estimate of how likely these two tests are to pass.
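    The two-test logic can be sketched with a toy nearest-neighbour check. The variables, the single quasi-identifier, and the matching tolerance below are illustrative assumptions only, not the actual unified privacy assurance model.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: one quasi-identifier (e.g., age) plus one sensitive
    # attribute (e.g., a diagnosis code), for real and synthetic records.
    real_qi = rng.integers(20, 80, size=200).astype(float)
    real_sens = rng.integers(0, 5, size=200)
    syn_qi = real_qi + rng.normal(0.0, 5.0, size=200)  # perturbed ages
    syn_sens = rng.integers(0, 5, size=200)

    matches = 0
    learned = 0
    for qi, sens in zip(syn_qi, syn_sens):
        # Test 1: can an adversary match this synthetic record to a
        # real person? Here: nearest real record within a tolerance.
        dists = np.abs(real_qi - qi)
        j = int(np.argmin(dists))
        if dists[j] <= 2.0:
            matches += 1
            # Test 2: if matched, does the adversary learn something
            # new and correct, i.e., does the sensitive value agree?
            if real_sens[j] == sens:
                learned += 1

    # Risk estimate: fraction of synthetic records that both match a
    # real record AND leak the correct sensitive value.
    risk = learned / len(syn_qi)
    print(round(risk, 3))
    ```

    A production assessment would use many quasi-identifiers and a more careful matching and learning model, but the conditional structure (test 2 only counts when test 1 succeeds) is the same.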