Delivering on the Promise of Synthetic Data

Data Synthesis Platform Available for the PHUSE Community [1]

Author: Kayley Phillpott

At a PHUSE workshop in September we organised a half-day session on synthetic data and its applications. The session was hands-on: attendees used R to synthesise datasets and to evaluate the utility of the generated data. The response after the workshop was positive, and there was strong interest in giving PHUSE members a broader capability to learn more about data synthesis.

We are now making a data synthesis platform available as part of the PHUSE Open Data Repository (PODR), in partnership with Replica Analytics Ltd. The platform is free of charge for non-commercial purposes and allows users to gain first-hand experience with data synthesis.

Data synthesis is an analytic approach for creating “fake” data. A generative model is first trained to capture the statistical properties and patterns in the original data, and that model is then used to produce new synthetic data. Because the generated records are sampled from the model, they do not have a one-to-one mapping to the original records, but they still retain the analytic utility of the original data.
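To make the idea concrete, the fit-then-sample pattern can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the "generative model" here is a simple two-variable Gaussian, the toy age and blood pressure values are made up, and the function names `fit_gaussian` and `sample` are hypothetical. It is not the model or the API used by the PODR platform.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, rng):
    """Draw n new records from the fitted model; nothing is copied from the input."""
    rho = model["rho"]
    resid = math.sqrt(max(0.0, 1.0 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 and z2 reproduces the fitted correlation between the columns.
        y = model["my"] + model["sy"] * (rho * z1 + resid * z2)
        out.append((x, y))
    return out


# Made-up "original" data: (age, systolic blood pressure) pairs.
rng = random.Random(0)
original = []
for _ in range(500):
    age = rng.gauss(50, 10)
    original.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

model = fit_gaussian(original)           # step 1: train the generative model
synthetic = sample(model, 500, rng)      # step 2: generate new records from it
synthetic_model = fit_gaussian(synthetic)
```

Comparing `model` with `synthetic_model` shows that the means and the correlation are approximately preserved even though no synthetic record corresponds to a particular original record. Real synthesis tools use far richer models that handle many variables and mixed data types.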

There are two data synthesis tools available on the PODR.

The first is an interactive tool for data synthesis. It allows users to upload datasets, synthesise them, and then generate comprehensive utility metrics that show how similar the generated data are to the original data. The data sources and the transformations to be applied to them are defined as a simple workflow (pipeline) model, an approach that will be familiar to those accustomed to working with data.
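As an illustration of what a utility metric can look like, the sketch below computes the Hellinger distance between the binned distributions of one variable in an original and a synthetic dataset. This is one common way to compare distributions, chosen here as an assumption; the article does not say which metrics the platform reports. The function names and the sample values are hypothetical.

```python
import math


def bin_proportions(values, lo, hi, bins):
    """Return the share of values falling in each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        i = int((v - lo) / width) if width > 0 else 0
        counts[min(i, bins - 1)] += 1  # the maximum value goes in the last bin
    return [c / len(values) for c in counts]


def hellinger(original, synthetic, bins=10):
    """Hellinger distance between two samples, using shared bin edges."""
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    p = bin_proportions(original, lo, hi, bins)
    q = bin_proportions(synthetic, lo, hi, bins)
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)


# Made-up values for a single variable in each dataset.
original_ages = [34, 41, 45, 52, 58, 60, 63, 67, 71, 75]
synthetic_ages = [36, 40, 47, 50, 57, 61, 64, 66, 70, 77]

distance = hellinger(original_ages, synthetic_ages, bins=5)
```

The distance ranges from 0 (the binned distributions are identical) to 1 (they do not overlap at all), so smaller values suggest that the synthetic variable tracks the original more closely. A full utility assessment would also compare relationships between variables, not only single-variable distributions.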

The second is an R package that implements the same synthesis and utility evaluation functionality. The R package is available in a Jupyter Hub that is configured to communicate with the synthesis engine. The interactive tool and the Jupyter Hub work together, so synthesised data can be exported to the Jupyter Hub and analysed further there. The combined toolset allows users to synthesise and analyse data within the same environment, moving easily between the two tools.

The ultimate objective is to demonstrate the capabilities of data synthesis and to enable the user community to learn about this technology, which is gaining interest in the health sector and beyond. The PODR synthesis platform will be updated over time to include additional capabilities and to incorporate feedback from this community.

If you are a PHUSE member and would like to get an account and receive additional information about the data synthesis platform, please email


[1] Originally published as: K. Phillpott, “Data Synthesis Platform Available for the PHUSE Community,” PhUSE, Nov. 10, 2020.