There are many ways to generate synthetic data, but in essence synthetic data is created by fitting a model to real data and then sampling from that model to generate synthetic records.
The real data would be the original personally identifiable datasets, for example health records, financial transaction data, or sales data. This is the data that we want to create non-identifiable, synthetic versions of.
We take that data and build a model that captures its statistical properties and patterns. This is typically a machine learning model, and such models have become quite good at capturing the subtle patterns in complex datasets. These models are also called generative models or data simulators.
Once a model is built, we can use it to generate new data: this is the synthetic data. Because the new data comes from the fitted model rather than directly from the source records, there will not be a one-to-one mapping between the synthetic records and the records in the source data. This is why synthetic data can be quite protective against identity disclosure and other privacy risks.
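The fit-then-sample idea can be sketched in a few lines. A minimal illustration, using a Gaussian mixture as a stand-in generative model (real SDG engines use far richer models; the variable names and data are invented for the example):

```python
# A minimal sketch of fit-then-sample synthetic data generation.
# The Gaussian mixture is only a stand-in for a real generative model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for "real" data: 500 records, 2 numeric variables.
real = np.column_stack([
    rng.normal(50, 10, 500),    # e.g. age
    rng.normal(120, 15, 500),   # e.g. systolic blood pressure
])

# Step 1: fit a generative model to the real data.
model = GaussianMixture(n_components=3, random_state=0).fit(real)

# Step 2: sample from the model -- these are the synthetic records.
# There is no one-to-one mapping back to the real records; each row
# is a fresh draw from the fitted distribution.
synthetic, _ = model.sample(1000)
print(synthetic.shape)  # (1000, 2)
```

Note that the synthetic dataset can even have a different number of records than the source, since the records are draws from the model rather than transformed copies of real rows.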
You have probably already seen synthetic data in the context of deep fakes, which are realistic computer-generated images. The fake people they depict look quite convincing, which shows how good the technology has become.
The same principles apply to structured data, which is the type of data that we are interested in.
The goal with synthetic data is therefore to generate new records that look realistic. In the context of datasets that include personal information, we are also interested in preserving the privacy of the individuals in the real data, so synthetic data generation needs to balance privacy protection with generating high-utility data that looks realistic.
A common question about synthetic data generation is whether you need to know how the data will be used in advance. We assume here that "use" means some form of data analysis (e.g., a regression or machine learning algorithm). In other words, is the data synthesis specific to the analysis that will be performed on the dataset?
The general answer is no. The basic idea is that the data synthesis process captures so many of the patterns in the data (never "all", since that cannot be guaranteed except in trivial cases) that the synthetic data can be used in arbitrary ways. Of course there are limitations: for example, very rare events may not be captured well by the generative model. In general, however, generative models have been good at capturing the patterns in the data.
Therefore, when synthetic data is generated no a priori assumptions are made about the eventual data uses.
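One way to see why no a priori use needs to be assumed: if the generative model captures the joint distribution, then an analysis run on the synthetic data should give similar answers to the same analysis run on the real data. A sketch of that check, again using a Gaussian mixture purely as a stand-in model (the data and parameter values are invented for the illustration):

```python
# If the model captures the joint distribution of (x, y), a regression
# fitted on synthetic data should recover a slope close to the one
# fitted on real data. The Gaussian mixture is only a stand-in model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 2000)
y = 2.0 * x + rng.normal(0, 0.5, 2000)   # true slope is 2.0
real = np.column_stack([x, y])

# Fit a generative model to the joint distribution, then sample.
gm = GaussianMixture(n_components=5, random_state=0).fit(real)
synthetic, _ = gm.sample(2000)

# Run the *same* downstream analysis on both datasets.
slope_real = LinearRegression().fit(real[:, :1], real[:, 1]).coef_[0]
slope_syn = LinearRegression().fit(synthetic[:, :1], synthetic[:, 1]).coef_[0]
print(slope_real, slope_syn)  # both should be close to 2.0
```

The regression here was not known at synthesis time; the same synthetic dataset could equally be used for a correlation analysis or a classifier.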
One other question that often comes up is about how large the source datasets need to be or can be. This question can be interpreted in two different ways:
What is the minimal dataset size for SDG to work?
Because at Replica Analytics many of our initial projects were with clinical trial data, we have developed generative models that work well with small datasets. This means datasets with, say, 100 to 200 patients. If there are fewer than 100 patients or so, it will be difficult to build a generative model.
An important factor to consider is the number of variables in the dataset. It is easier to create a generative model from a dataset having 500 patients and 10 variables than from one having 500 patients and 500 variables. In practice, in the latter case many of the variables will be redundant, derived, or have many missing values, so the situation is often not as extreme as it sounds. But the point is that the number of variables is an important consideration.
The statistical machine learning techniques that work very well for large datasets, like deep learning methods, will generally not do well on small datasets. Therefore we need to have a toolbox of different techniques that we apply depending on the nature of the data.
What is the largest source dataset that can be synthesized?
The answer to this question will partially depend on the computing capacity that is available to the SDG software. The creation of a generative model is computationally demanding, and depending on the methods used, the availability of GPUs will have an impact on the answer. The amount of memory (on the CPUs and GPUs) will also be important. The system adapts the computations, but the hardware will place limits as well.
There are techniques to handle large datasets in SDG. They cover both algorithmic techniques and software and systems engineering techniques. These approaches have been implemented in our Replica Synthesis software. The system distributes computations across (virtual) machines and therefore can scale in that way quite easily.
Another consideration is the use of simulators. Building a simulator is a one-time action: once a simulator is created, it can be used repeatedly to generate many datasets for multiple users and use cases. The speed of executing a simulator does not depend on the number of patients or individuals in the source dataset, although it does depend on the number of variables.
Taking the concept of synthetic data one step further, Replica Synthesis also implements a simulator exchange. This is another quite powerful way to share non-identifiable data with a very broad community. Except that instead of sharing data we share the generative models or the simulators.
Note that when we say that we share the generative models, we mean that we provide access to the models so that users can generate synthetic data from them. It does not mean that we share the actual models with the users.
A simulator exchange is a catalogue of generative models. A data user searches the meta-data for the type of data that they want and finds the appropriate simulator. They can then generate as much data as they want from that simulator, whether 100 records or 1 million records; it does not matter how large the source data was.
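The simulator idea can be sketched as a thin wrapper around a fitted generative model. The Simulator class below is purely illustrative (it is not the actual Replica Synthesis implementation), again with a Gaussian mixture standing in for the real model:

```python
# Illustrative sketch of a simulator: a fitted generative model that
# can be sampled repeatedly, for any number of records, without ever
# touching the source data again. Not the actual product implementation.
import numpy as np
from sklearn.mixture import GaussianMixture

class Simulator:
    """Wraps a fitted generative model; the source data is discarded."""

    def __init__(self, source):
        self._model = GaussianMixture(n_components=3,
                                      random_state=0).fit(source)

    def generate(self, n_records):
        samples, _ = self._model.sample(n_records)
        return samples

rng = np.random.default_rng(2)
source = rng.normal(size=(300, 4))   # small source dataset

sim = Simulator(source)              # built once...
small = sim.generate(100)            # ...then sampled repeatedly:
large = sim.generate(1_000_000)      # far more records than the source
print(small.shape, large.shape)
```

The generation step only needs the fitted model parameters, which is why its speed scales with the number of variables rather than with the size of the source dataset.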
We have found that data custodians are much more comfortable sharing simulators than they are sharing data. Simulators are one step removed from data, and the data generated from the simulators is not considered personally identifying information, which can also be empirically verified.
Under this scenario the data consumers are not accessing any real data. They only access the simulators, which means that the simulator exchange can be quite open and available to many people within an organization with few constraints (at least for privacy reasons).
Other advantages of simulators are that you can track how much data has been consumed, and who is consuming the most data.
The meta-data that accompanies a simulator includes reports on its utility and privacy, license information, and a data dictionary. The user can also add arbitrary information (such as articles or other reference documentation).
Real data will have missingness in it. The missingness may be at random (a common assumption that is probably not met that often in practice). Missingness patterns are important in that they may have meaning by themselves (e.g., certain patients do not or cannot provide certain data), or they may be structural.
The way our SDG engine works is that it models missingness in the original data. This means that a specific model (or set of models) is built to understand the conditions under which missingness occurs. These learned patterns are then reproduced at the output end, so the synthetic data will have similar patterns of missingness as the original data. The analyst does not need to do anything specific to ensure that this occurs, as it is handled automatically by the software.
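A simplified sketch of what modelling missingness means in practice: learn when a field tends to be missing in the real data, then reproduce a similar missingness pattern in the synthetic output. This toy example (a real engine would do this per field and jointly with the value models) uses a logistic regression as the missingness model:

```python
# Toy sketch of missingness modelling: income is more often missing
# for older patients in the "real" data, and that conditional pattern
# is learned and reapplied to the synthetic records.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
age = rng.normal(50, 15, 1000)
# Missingness probability rises steeply above age 60.
miss = rng.random(1000) < 1.0 / (1.0 + np.exp(-(age - 60) / 5))

# Step 1: model the conditions under which missingness occurs.
miss_model = LogisticRegression().fit(age.reshape(-1, 1), miss)

# Step 2: when generating synthetic records, draw a missingness flag
# from the learned model and blank the field accordingly.
syn_age = rng.normal(50, 15, 1000)
p_miss = miss_model.predict_proba(syn_age.reshape(-1, 1))[:, 1]
syn_income = rng.normal(60000, 10000, 1000)
syn_income[rng.random(1000) < p_miss] = np.nan

# Older synthetic patients now show more missing income, as in the source.
print(np.isnan(syn_income).mean())
```

The result is that an analyst who studies missingness patterns in the synthetic data sees structure similar to the original, rather than missingness scattered uniformly at random.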
The original data structure means the field types, field names, tables, and relationships among the tables in the source data. These relationships are retained in the output synthetic data. This means that the synthetic data will look very similar to the original dataset structurally.
The format of the dataset is determined by the end-user of the software. For example, the end user may send the synthetic data to a different type of database or file format than the original dataset. Of course, the format of the synthetic data can be set to be the same as the original dataset.
Such structural similarity is very important for some use cases. For example, if the use case is to perform software testing then the exact structure and format as the original data is an important requirement.
Replica Synthesis is the main software product from Replica Analytics. It is a scalable enterprise application that allows users to create synthetic data and to build simulators. The main features of the product are as follows:
SDG: The main functionality is that Replica Synthesis allows users to create synthetic datasets from source datasets.
SDG Workflows: A powerful capability in the software is the workflow designer, which allows complex pipelines to be defined, including joins, pooling, defining cohorts, and a powerful scripting capability for pre- and post-processing datasets.
Simulators: Users can also package their generative models as simulators and share them with their team or with others in their organization.
Simulator Exchange: The data sharing policy designer means that simulator exchanges can be defined. A user can search the catalogue of simulators to find the ones that meet their needs and generate data from there.
Privacy Assurance: The privacy risks in synthetic data can be evaluated using our unified privacy assurance model.
APIs and SDKs: The Replica Synthesis engine can be accessed programmatically through a REST API, an R package, and a Python library. This allows easy integration into analytics pipelines.
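As a rough sketch of what programmatic access could look like, the snippet below builds an HTTP request to a generation endpoint. The host, endpoint path, parameter names, and response shape are all illustrative assumptions, not the actual Replica Synthesis REST API:

```python
# Hypothetical sketch of calling an SDG service over REST.
# The URL, route, and JSON field names are invented for illustration.
import json
from urllib import request

BASE_URL = "https://replica.example.com/api"   # placeholder host

def build_generate_request(simulator_id, n_records, token):
    """Build a (hypothetical) request asking a simulator for records."""
    payload = json.dumps({"simulatorId": simulator_id,
                          "records": n_records}).encode()
    return request.Request(
        f"{BASE_URL}/simulators/{simulator_id}/generate",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )

req = build_generate_request("sim-123", 1000, "MY_TOKEN")
# with request.urlopen(req) as resp:          # would send the request
#     synthetic_records = json.load(resp)
print(req.full_url)
```

Consult the product's API documentation for the real endpoints; the point here is simply that generation can be driven from code rather than the GUI.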
The overall architecture of the product is shown below. The software allows the computations to scale in a cluster to accommodate larger and more complex datasets.
Yes, it is possible to run the Replica Synthesis software on-prem.
There are healthcare and life sciences organizations that have not moved their computing workloads to multi-tenant cloud services, partly due to hesitation about having sensitive data reside in an environment that is not under their control, although we are seeing more and more workloads moving to the cloud.
An on-prem installation, for the purposes of this response, is an installation on hardware that is operated by the data custodian / data controller. By installing Replica Synthesis on-prem, the SDG computations are all within the organization’s direct control and no sensitive data needs to leave the organizational boundary. This can be on actual servers or in a virtual private cloud.
To support that, Replica Analytics provides documentation and support for:
Defining the appropriate hardware / virtual machine and system requirements for common deployments.
Instructions on the software installation, which utilizes containers.
For an air-gapped computing environment, or where no external communication is permitted, additional steps will be needed to activate licenses through the license server and to access the on-line help.
This question has multiple layers and it is best to parse them out and address them separately.
Is manual intervention needed to synthesize data?
We have worked very hard to maximize the automation in Replica Synthesis. The software does quite a bit of automated discovery of the data characteristics and data shaping to make it ready for synthesis, and then reverses any shaping at the back-end of the whole process.
The user of course has to load data or connect to data sources, and any cohorts that are needed must be defined. However, in many situations the synthesis process itself is automated, including all of the necessary hyperparameter tuning for training the generative models. This applies whether the GUI or the different APIs (R or Python) are used to perform SDG.
However, there is also an option for advanced users to tweak this automated process. While the automated pre-processing works very well, there may be cases where some adjustments are needed. Replica Synthesis provides this capability, but we hope you never have to use it.
How much knowledge about SDG is needed to use Replica Synthesis?
By design, very little knowledge about SDG is needed to use Replica Synthesis. Of course, the user needs to know the data and how to access their data sources, and will need to understand the data domain to be able to define meaningful cohorts. But training on SDG is not necessary, as that complexity is hidden from the user in Replica Synthesis.
The on-line help is also a great resource for using the software.
Can Replica Synthesis be included in automated data provisioning pipelines?
We have clients who have done exactly that. Because of the high level of automation, Replica Synthesis can be inserted in data pipelines to convert the original datasets into synthetic variants. This can be done by training a generative model for every data cut that comes through the pipeline. For example, when a dataset request is approved, the original dataset can be sent to Replica Synthesis, and the resultant synthetic dataset then forwarded to the analyst to work on.
Alternatively, if a data simulator is created, then that can just act as a permanent source of data into a pipeline.
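The provisioning flow described above can be sketched as three pipeline stages. Every function here is a placeholder (the `synthesize` step stands in for a call into an SDG engine, e.g. via its Python SDK); the names and data are invented for illustration:

```python
# Illustrative pipeline: approved request -> real cut -> synthetic cut
# -> analyst. All three stages are placeholders for real systems.
import numpy as np
from sklearn.mixture import GaussianMixture

def extract(request_id):
    """Placeholder: fetch the approved data cut for a request."""
    rng = np.random.default_rng(42)
    return rng.normal(size=(500, 3))

def synthesize(data):
    """Placeholder SDG step: fit a generative model, sample a
    synthetic cut of the same size."""
    model = GaussianMixture(n_components=2, random_state=0).fit(data)
    synthetic, _ = model.sample(len(data))
    return synthetic

def deliver(data, analyst):
    """Placeholder: forward the synthetic cut to the analyst."""
    return {"analyst": analyst, "rows": len(data)}

receipt = deliver(synthesize(extract("REQ-42")), analyst="alice")
print(receipt)  # {'analyst': 'alice', 'rows': 500}
```

In the simulator variant, the `synthesize` step would be replaced by a call to an already-built simulator, so no real data flows through the pipeline at all.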
One of the benefits of synthetic data is that it has low privacy risks. Privacy risks can be measured in a number of different ways.
Replica Analytics has developed a unified privacy model for synthetic data that considers attribute disclosure conditional on identity disclosure, and membership disclosure. This is a comprehensive way to think about the privacy risks in synthetic datasets.
The privacy risk assessment can be performed on every dataset that is generated by Replica Synthesis. A report is generated summarizing the risks in the data.
Attribute disclosure conditional on identity disclosure is assessed as the result of two tests, and the assessment produces an estimate of how likely these two tests are to pass.
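As a generic illustration only (this is one simple empirical check sometimes used in the literature, not the unified privacy assurance model described above), one can compare how close synthetic records sit to real records against how close real records sit to each other:

```python
# Generic nearest-neighbor check: if synthetic records are no closer
# to real records than real records are to one another, there is no
# sign of records being memorized. Data here is synthetic/toy.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
real = rng.normal(size=(500, 3))
synthetic = rng.normal(size=(500, 3))   # stand-in for generated data

# Distance from each synthetic record to its nearest real record.
nn = NearestNeighbors(n_neighbors=1).fit(real)
dist_syn, _ = nn.kneighbors(synthetic)

# Baseline: distance from each real record to its nearest *other*
# real record (k=2, then drop the self-match at distance zero).
dist_real, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)
dist_real = dist_real[:, 1]

print(dist_syn.mean(), dist_real.mean())
```

Synthetic records sitting systematically closer to real records than the baseline would be a warning sign worth investigating with the more formal disclosure tests.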
The generation of synthetic data is a process of creating non-identifiable data. The question here is whether that process needs additional consent from patients.
We have performed a detailed legal analysis of this topic in our book with reference to specific regulations, such as the GDPR and HIPAA. Below we will summarize some of the key points:
You can get more details from the book on our legal analysis of this issue. These are extensions of the arguments that we had made in the past regarding the requirement to obtain consent for de-identification, which you can also find here.