Reflections on Generating Synthetic Data for the Vivli-Microsoft Data Challenge

This June, we generated synthetic clinical trial data for the inaugural Vivli-Microsoft Data Challenge. The goal of the competition was to propose innovative methods for sharing rare disease datasets in a manner that maintains the analytic value of the data while safeguarding participant privacy. Rare disease datasets are particularly difficult to share safely because they often contain relatively few individuals, each of whom may be uniquely identifiable from only a handful of attributes, or quasi-identifiers.
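To make the quasi-identifier risk concrete, here is a minimal sketch that measures what fraction of records are unique on a chosen set of attributes. The function name `unique_fraction`, the column names, and the toy records are all hypothetical and invented for illustration; they are not taken from the challenge data or from any method used at the event.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the fraction of records whose combination of
    quasi-identifier values appears exactly once in the dataset."""
    if not records:
        return 0.0
    # Build one key per record from the selected attributes.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    # A record is "unique" if no other record shares its key.
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Hypothetical toy records (illustrative only, not real trial data).
records = [
    {"age_band": "30-39", "sex": "F", "region": "North"},
    {"age_band": "30-39", "sex": "F", "region": "North"},
    {"age_band": "40-49", "sex": "M", "region": "South"},
    {"age_band": "50-59", "sex": "F", "region": "East"},
]

one_attribute = unique_fraction(records, ["sex"])
three_attributes = unique_fraction(records, ["age_band", "sex", "region"])
print(one_attribute, three_attributes)
```

In this toy example, adding attributes increases the share of records that can be singled out, which is the effect that makes small rare disease cohorts hard to share.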

This event gathered 60 participants on 11 teams from universities, hospitals, and pharmaceutical, biotech and software companies. Each team had 5 hours to plan and propose a solution, then 5 minutes to present it to the judges. The solutions combined new and existing technologies in interesting ways tailored to rare disease datasets. Unsurprisingly (from our perspective!), the winning team proposed a solution built around the use of synthetic data.

Synthetic data can solve a variety of data sharing problems. In particular, synthetic data can be used in place of real data when developing and testing innovative solutions to real-world data problems. The appeal is that synthetic data has the same structure and complexity as the real data, without the risk of revealing personal information to users. This allows synthetic data to be shared broadly without requiring stringent access control. This is ideal for innovation challenges such as data-thons (hack-a-thons), as synthetic data reduces the logistical burden on the organizers and competition participants. By reducing the burden on participants, innovation competitions are able to attract more participants, increasing the likelihood of generating meaningful solutions.

Synthetic data was critical to this event’s success as it allowed all participants to ‘get their hands dirty’ with realistic clinical trial data, without needing to use costly secure computational environments or other control mechanisms. The synthetic data grounded the competition in reality by providing participants with example data that their solutions would need to be able to accommodate. Groups that built demos of their solutions were also able to apply their methods to the synthetic data as a proof of concept.  

Data challenges like this depend on providing high-quality data to participants, and synthetic data is a fantastic means of doing so. We loved the opportunity to attend the data challenge and see such a diverse, motivated group use our synthetic data to work towards solutions for pressing data sharing issues.

Full results of the Vivli-Microsoft Data Challenge can be found here: