At a PHUSE workshop in September we organised a half-day session on synthetic data and its applications. The session was hands-on: attendees used R to synthesise datasets and evaluate the utility of the generated data. The response after the workshop was positive, and there was strong interest in giving PHUSE members a broader way to learn about data synthesis.
We are now making available a data synthesis platform, as part of the PHUSE Open Data Repository (PODR), in partnership with Replica Analytics Ltd. This is available free for non-commercial purposes and allows users to gain first-hand experience with data synthesis.
Data synthesis is an analytic approach for creating “fake” data. A generative model is trained to capture the statistical properties and patterns in the original data, and that model is then used to produce new synthetic data. Because the generated records are drawn from the model, they have no one-to-one mapping to the original records, but they still retain the analytic utility of the original data.
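To make the train-then-generate idea concrete, here is a minimal sketch in Python. It is not the method used by the PODR platform; it assumes a deliberately simple generative model (a bivariate Gaussian) and made-up illustrative data with two hypothetical columns, age and systolic blood pressure. The function names fit_gaussian_model and sample_synthetic are invented for this example.

```python
import math
import random
import statistics


def fit_gaussian_model(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "sd": (sd_x, sd_y), "rho": cov / (sd_x * sd_y)}


def sample_synthetic(model, n, rng):
    """Draw n new records from the fitted model; no original record is copied."""
    (mx, my), (sx, sy), rho = model["mean"], model["sd"], model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # induces correlation rho
        records.append((x, y))
    return records


rng = random.Random(42)

# Made-up "original" data (assumption): age and a blood pressure that rises with age.
original = []
for _ in range(500):
    age = rng.gauss(55, 10)
    original.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

model = fit_gaussian_model(original)           # step 1: train the generative model
synthetic = sample_synthetic(model, 500, rng)  # step 2: generate new records

print("original correlation :", round(model["rho"], 3))
print("synthetic correlation:", round(fit_gaussian_model(synthetic)["rho"], 3))
```

The synthetic records are fresh draws from the fitted model rather than perturbed copies of individual rows, which is why there is no one-to-one mapping back to the original records. Real synthesis tools use far richer models that handle categorical variables, missing values and non-linear relationships.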
There are two data synthesis tools available on the PODR.
The first is an interactive tool for data synthesis. It allows users to upload datasets, synthesise them, and then generate comprehensive utility metrics to see how similar the generated data are to the original data. A simple workflow model defines the data sources and the transformations applied to the data as they pass through the pipeline. Those accustomed to working with data will be familiar with this general workflow modelling approach.
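As an illustration of what a utility comparison can look like, the sketch below computes a few simple similarity measures between an original and a synthetic dataset. These are not the platform's metrics, which are not described here; the measures (standardised mean difference, standard deviation ratio, and gap in correlation), the column names age and sbp, and the stand-in datasets are all assumptions made for the example.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


def utility_report(original, synthetic):
    """Compare per-column summaries and the age/sbp correlation of two datasets."""
    columns = {}
    for name in original:
        o, s = original[name], synthetic[name]
        columns[name] = {
            # Difference in means, in units of the original standard deviation.
            "std_mean_diff": abs(statistics.fmean(o) - statistics.fmean(s)) / statistics.stdev(o),
            # Values near 1.0 mean the spread is preserved.
            "sd_ratio": statistics.stdev(s) / statistics.stdev(o),
        }
    corr_gap = abs(pearson(original["age"], original["sbp"])
                   - pearson(synthetic["age"], synthetic["sbp"]))
    return {"columns": columns, "corr_gap": corr_gap}


def make_data(n, rng, correlated=True):
    """Made-up data (assumption): sbp depends on age only when correlated=True."""
    age = [rng.gauss(55, 10) for _ in range(n)]
    if correlated:
        sbp = [90 + 0.6 * a + rng.gauss(0, 8) for a in age]
    else:
        sbp = [rng.gauss(123, 10) for _ in age]
    return {"age": age, "sbp": sbp}


rng = random.Random(0)
original = make_data(400, rng)
good_synthetic = make_data(400, rng)                    # stand-in for a faithful synthesis
poor_synthetic = make_data(400, rng, correlated=False)  # marginals kept, relationship lost

good_report = utility_report(original, good_synthetic)
poor_report = utility_report(original, poor_synthetic)
print("correlation gap, good synthesis:", round(good_report["corr_gap"], 3))
print("correlation gap, poor synthesis:", round(poor_report["corr_gap"], 3))
```

The second stand-in dataset matches the original column by column but loses the relationship between the two variables, which shows why utility checks usually look at relationships between variables as well as at each variable on its own.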
The second is an R package that implements the same synthesis and utility evaluation functionality. The R package is available in a Jupyter Hub, which is configured to communicate with the synthesis engine. The interactive tool and the Jupyter Hub work together, so synthesised data can be exported to the Jupyter Hub and analysed further there. The combined toolset will allow users to synthesise data and analyse them within the same environment, moving easily between the two tools.
The ultimate objective is to demonstrate the capabilities of data synthesis and to enable the user community to learn about this technology, which is gaining interest in the health sector and beyond. The PODR synthesis platform will be updated over time to include additional capabilities and to incorporate feedback from this community.
If you are a PHUSE member and would like to get an account and receive additional information about the data synthesis platform, please email firstname.lastname@example.org.