Replica Analytics - An Aetion Company

Real Data vs Synthetic Data: Propagating Constraints


As we work through a number of data generation projects, one issue that has come up a few times is the extent to which synthetic data is constrained by the original data. Let’s take a concrete example: an original data set has been provided for a specific purpose. Can a model be constructed from this data set and then used to generate synthetic data? We argue that the answer has to be yes: this is a reasonable thing to do, even if the synthetic data will be used for a different purpose.

For example, let’s say that a researcher has published a paper in a journal, and in that paper the researcher includes a model with, say, ten variables. One can then take that published model and create synthetic data with ten variables from it, accounting for the estimation errors (which are also published). This synthesized data is then based not on the original data but on the model that was created. Whatever constraints were placed on the original data when it was given to the researcher do not carry forward to the model itself (or to any other model that can be created from the original data).
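To make the idea concrete, here is a minimal sketch of generating data from a published model alone. It assumes the paper reports a linear model with coefficient estimates and standard errors, plus summary statistics for the covariates. Every name and number below is hypothetical and made up for illustration, and only two covariates are used instead of ten to keep it short. Drawing each coefficient independently is a simplification that ignores any covariance between the estimates.

```python
import random

# Hypothetical "published" linear model: outcome = intercept + sum(beta * x) + noise.
# All values are invented for illustration; none come from a real paper.
PUBLISHED_COEFS = {            # name: (estimate, standard error)
    "intercept": (1.0, 0.2),
    "age": (0.05, 0.01),
    "dose": (-0.3, 0.1),
}
PUBLISHED_COVARIATES = {       # name: (mean, standard deviation), assumed normal
    "age": (50.0, 12.0),
    "dose": (2.0, 0.5),
}
RESIDUAL_SD = 1.5              # assumed published residual standard deviation


def generate_synthetic(n_rows, seed=0):
    """Generate rows using only the published summaries, never the original records."""
    rng = random.Random(seed)
    # Draw one plausible set of coefficients to reflect estimation error.
    coefs = {name: rng.gauss(est, se) for name, (est, se) in PUBLISHED_COEFS.items()}
    rows = []
    for _ in range(n_rows):
        # Sample covariates from their published distributions.
        row = {name: rng.gauss(mu, sd) for name, (mu, sd) in PUBLISHED_COVARIATES.items()}
        linear = coefs["intercept"] + sum(coefs[name] * row[name] for name in PUBLISHED_COVARIATES)
        # Add residual noise so the outcome is not a deterministic function.
        row["outcome"] = linear + rng.gauss(0.0, RESIDUAL_SD)
        rows.append(row)
    return rows


synthetic_rows = generate_synthetic(100, seed=42)
```

Note that the function takes no original records as input: everything it needs comes from what the paper reports.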

For instance, the data supplier may have said that the original data cannot be shared with anyone else. The researcher followed these instructions precisely and did not share that data. That restriction would not apply to the ten-variable synthetic data, however. The synthetic data can be shared more broadly.

Let’s dig further into this issue.

There are two types of synthetic data – fully synthetic and partially synthetic. Partially synthetic data retain some of the variables in the original data. Fully synthetic data have all of their values generated from the model. The argument above only applies to fully synthetic data since partially synthetic data are clearly based on the original data.
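The distinction can be illustrated with a small sketch. The records, column names, and the very simple per-column normal model below are all hypothetical and chosen only for illustration. A real synthesizer would also model the relationships between variables.

```python
import random
import statistics

# Hypothetical original records, invented for illustration.
original = [
    {"age": 34, "income": 52000},
    {"age": 45, "income": 61000},
    {"age": 29, "income": 48000},
    {"age": 52, "income": 75000},
]


def fit(records, column):
    """Fit a toy model for one column: its mean and standard deviation."""
    values = [r[column] for r in records]
    return statistics.mean(values), statistics.stdev(values)


def synthesize(records, synthetic_columns, seed=0):
    """Replace the listed columns with model draws; other columns are kept as-is."""
    rng = random.Random(seed)
    params = {c: fit(records, c) for c in synthetic_columns}
    out = []
    for r in records:
        new = dict(r)
        for c, (mu, sd) in params.items():
            new[c] = rng.gauss(mu, sd)
        out.append(new)
    return out


# Partially synthetic: "age" is copied from the original records.
partially = synthesize(original, ["income"])
# Fully synthetic: every value is generated from the fitted model.
fully = synthesize(original, ["age", "income"])
```

In the partially synthetic output the `age` values are the original ones, which is why that output is still clearly based on the original data.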

One more consideration is whether the uses of the model are deemed ethical. This question rests on whether the use of the model can cause individual or group harm. While that question generally needs to be addressed case by case, it is difficult to argue that creating synthetic data from a model causes harm by itself.

Therefore, fully synthetic data provides an analyst more flexibility to use the data for a different purpose. Whether the uses of the synthetic data need to pass an ethics test is a topic we will cover in a future article.
