We have just had a paper published in JAMIA Open, entitled Validating A Membership Disclosure Metric For Synthetic Health Data. Using several large datasets, the paper validates and demonstrates how to apply a membership disclosure metric for synthetic health data, which is important for assessing the privacy risks of synthetic data.
Synthetic data generation (SDG) is a technique for generating privacy-preserving data, and there has been growing interest in using SDG to create synthetic health data that can be shared for secondary analyses. To ensure that generated synthetic health data has appropriately mitigated privacy risks and is ready to be shared responsibly, it is essential to have reliable metrics for measuring those risks.
One increasingly common way to evaluate the privacy of synthetic data is to measure the risk of membership disclosure. Membership disclosure occurs when an adversary, using information in the synthetic data, can determine that a target individual in the real world was present in the training dataset used to generate the synthetic data. Learning that a target individual was present in the training data could reveal sensitive information if, for example, the training dataset consisted of participants in a clinical trial of an HIV treatment.
To assess the membership disclosure risk of a synthetic dataset prior to release, a data custodian can use the partitioning method. This method assumes that an adversary has access to an ‘attack dataset’: a set of records for individuals in the population who may or may not be in the real sample used during SDG. The adversary then attempts to match individuals in the attack dataset to the synthetic records. A membership disclosure occurs if a matching record was truly present in the training dataset (Figure 1).
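To make the matching step concrete, here is a minimal sketch of how an adversary might decide whether an attack record matches any synthetic record. The Hamming-distance rule and the threshold are illustrative assumptions; the paper does not prescribe this particular matching function.

```python
def hamming(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def matches(attack_record, synthetic_records, threshold=1):
    """True if some synthetic record differs on at most `threshold` attributes."""
    return any(hamming(attack_record, s) <= threshold for s in synthetic_records)

# Toy synthetic records: (sex, age, diagnosis code)
synthetic = [("F", 34, "A10"), ("M", 61, "E11")]

print(matches(("F", 34, "A10"), synthetic))  # exact match found
print(matches(("M", 70, "J45"), synthetic))  # no sufficiently close record
```

Whether such a match constitutes a true membership disclosure depends on whether the matched individual was actually in the training data, which is what the estimation procedure below checks.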
Given that the data custodian is unlikely to have an attack dataset, they must estimate this risk using a slightly different procedure (Figure 2). To estimate membership disclosure, the data custodian splits the real data into two sets, a training dataset and a holdout dataset, prior to SDG. The SDG model is then trained on the training dataset and used to generate synthetic data. To assess membership disclosure for this synthetic dataset, the custodian constructs an attack dataset from a mix of training and holdout records, and then estimates the risk empirically by attempting to match records in the attack dataset to records in the synthetic dataset.
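The steps above can be sketched in code. This is a hypothetical illustration, not the paper's implementation: the 50/50 split, the attack dataset mix, the exact-match rule, and the use of precision (the fraction of claimed matches that are true training members) as the risk summary are all assumptions made here for clarity.

```python
import random

def estimate_membership_disclosure(real_records, synthesize, attack_size=100, seed=0):
    """Partition-based estimate of membership disclosure risk (illustrative)."""
    rng = random.Random(seed)
    records = real_records[:]
    rng.shuffle(records)

    # Step 1: split the real data into training and holdout partitions.
    half = len(records) // 2
    training, holdout = records[:half], records[half:]

    # Step 2: train the SDG model on the training partition only.
    synthetic = set(synthesize(training))

    # Step 3: build an attack dataset mixing training and holdout records.
    attack = (rng.sample(training, attack_size // 2)
              + rng.sample(holdout, attack_size // 2))

    # Step 4: a "claimed match" is an attack record found in the synthetic
    # data; it is a true disclosure only if that record was in training.
    training_set = set(training)
    claimed = [r for r in attack if r in synthetic]
    true_disclosures = [r for r in claimed if r in training_set]

    return len(true_disclosures) / len(claimed) if claimed else 0.0

# Usage: a pathological "synthesizer" that copies its input yields maximal risk.
real = [(i, i % 3) for i in range(400)]       # unique toy records
print(estimate_membership_disclosure(real, lambda t: t))           # 1.0
print(estimate_membership_disclosure(real, lambda t: [(-1, -1)]))  # 0.0
```

A real SDG model sits between these two extremes, which is exactly why an empirical estimate like this is informative before a release.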
Our recent publication in JAMIA Open validated the partitioning technique for assessing membership disclosure and made recommendations on how best to parameterize the attack dataset during estimation. We conducted a simulation study on four large health datasets and two different kinds of synthesis models, comparing the risk estimates produced under different attack dataset parameterizations against the ground truth risk of membership disclosure. The paper also demonstrates how to apply this methodology using seven oncology clinical trial datasets. This application is particularly important because synthetic clinical trial datasets are increasingly being generated and shared to enable broader access. These new findings will allow data custodians to assess risk in synthetic datasets more accurately and build trust when sharing synthetic datasets.
Read the full study here.
You may also be interested in a past blog post entitled Re-identification, the wrong criterion for synthetic data, which offers an overview of privacy risks with synthetic data, and how they differ from privacy risks for real datasets.