Replica Analytics - An Aetion Company

Reflections on Generating Synthetic Data for the Vivli-Microsoft Data Challenge


This June, we generated synthetic clinical trial data for the inaugural Vivli-Microsoft Data Challenge. The goal of the competition was to propose innovative methods to facilitate the sharing of rare disease datasets in a manner that maintains the analytic value of the data while safeguarding participant privacy. Rare disease datasets are particularly difficult to share privately because they often contain relatively few individuals, each of whom may be uniquely identifiable from only a handful of attributes, or quasi-identifiers.
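To make the quasi-identifier problem concrete, here is a minimal sketch in Python that measures how many records in a small dataset are unique on a few attributes. The records, the attribute choices (age band, sex, region), and the function name `unique_fraction` are all made up for illustration; this is not the data or the method used in the challenge.

```python
from collections import Counter

# Hypothetical toy records: (age_band, sex, region). Not real data.
records = [
    ("30-39", "F", "North"),
    ("30-39", "F", "North"),
    ("40-49", "M", "South"),
    ("50-59", "F", "East"),
    ("40-49", "M", "South"),
    ("60-69", "M", "West"),
]


def unique_fraction(rows):
    """Return the share of rows whose attribute combination appears exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.2f} of records are unique on these attributes")
```

In a small cohort, even three coarse attributes can single out a sizeable share of participants, which is why rare disease data is hard to release as-is.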

This event gathered 60 participants on 11 teams from universities, hospitals, and pharmaceutical, biotech, and software companies. Each team had 5 hours to plan and propose a solution, then 5 minutes to present it to the judges. The solutions combined new and existing technologies in interesting ways tailored to rare disease datasets. Unsurprisingly (from our perspective!), the winning team proposed a solution built around the use of synthetic data.

Synthetic data can solve a variety of data sharing problems. In particular, synthetic data can be used in place of real data when developing and testing innovative solutions to real-world data problems. The appeal is that synthetic data will have the same structure and complexity as the real data, without risk of revealing personal information to users. This allows synthetic data to be shared broadly without requiring stringent access control. This is ideal for innovation challenges such as datathons (hackathons), as synthetic data reduces the logistical burden on both organizers and participants. With a lower burden, innovation competitions can attract more participants, increasing the likelihood of generating meaningful solutions.

Synthetic data was critical to this event’s success as it allowed all participants to ‘get their hands dirty’ with realistic clinical trial data, without needing to use costly secure computational environments or other control mechanisms. The synthetic data grounded the competition in reality by providing participants with example data that their solutions would need to be able to accommodate. Groups that built demos of their solutions were also able to apply their methods to the synthetic data as a proof of concept.

Data challenges like this depend on providing high-quality data to participants, and synthetic data is a fantastic means of doing so. We loved the opportunity to attend the data challenge and see such a diverse, motivated group use our synthetic data to work towards solutions for pressing data sharing issues.

Full results of the Vivli-Microsoft Data Challenge can be found here: 
