Organisational Readiness and Perceptions of Synthetic Data Production and Dissemination in the UK: Survey Data, 2024
This collection comprises survey data gathered in 2024 as part of a project investigating how synthetic data can support secure data access and improve research workflows, particularly from the perspective of data-owning organisations. The survey targeted data-owning organisations across the UK, including those in government, academia, and the health sector. Respondents were individuals who could speak on behalf of their organisations, such as data managers, principal investigators, and information governance leads. The motivation for this collection stemmed from the growing interest in synthetic data as a tool to enhance access to sensitive data and reduce pressure on Trusted Research Environments (TREs). The study explored organisational engagement with two types of synthetic data: synthetic data generated from real data, and “data-free” synthetic data created using metadata only. The aims of the survey were to assess current practices, explore motivations and barriers to adoption, understand cost and governance models, and gather perspectives on scaling and outsourcing synthetic data production. Conditional logic was used to tailor the survey to organisations that were actively producing synthetic data, planning to produce it, or not engaging with it. This collection includes responses from 15 UK-based organisations. The survey covered eight core topics: organisational background, production practices, anticipated and realised benefits, technical and financial challenges, cost structures, data sharing models, scalability, and openness to external synthetic data generation. The data offer exploratory insights into how UK organisations are approaching synthetic data in practice and can inform future research, infrastructure development, and policy guidance in this evolving area.
The findings have informed recommendations to support the responsible and efficient scaling of synthetic data production across sectors.

The growing discourse around synthetic data underscores its potential not only for addressing data challenges in a fast-changing landscape but also for fostering innovation and accelerating advances in data analytics and artificial intelligence. From optimising data sharing and utility (James et al., 2021), to sustaining and promoting reproducibility (Burgard et al., 2017), to mitigating disclosure risk (Nikolenko, 2021), synthetic data has emerged as a solution to various complexities of the data ecosystem. The project proposes a mixed-methods approach and seeks to explore the operational, economic, and efficiency aspects of using low-fidelity synthetic data from the perspectives of data owners and Trusted Research Environments (TREs). The core challenge lies in understanding the tangible and intangible costs associated with creating and sharing low-fidelity synthetic data, alongside measuring its utility and acceptance among data producers, data owners, and TREs. The broader aim of the project is to foster a nuanced understanding that could catalyse a shift towards a more efficient and publicly acceptable model of synthetic data dissemination. The project is centred on three primary goals: 1. to evaluate the full costs incurred by data owners and TREs in the creation and ongoing maintenance of low-fidelity synthetic data, including the initial production of synthetic data and subsequent costs; 2. to assess the various models of synthetic data sharing, evaluating the implications and efficiencies for data owners and TREs, covering all aspects from pre-ingest to curation procedures, metadata sharing, and data discoverability; and 3. to measure the efficiency improvements for data owners and TREs when synthetic data is available, analysing impacts on resources, secure environment usage load, and the uptake dynamics between synthetic and real datasets by researchers.

Commencing in March 2024, the project will begin with stakeholder engagement, forming an expert panel and aligning collaborative efforts with parallel projects. Following a literature review, the project will collect data through a targeted survey with data creators, case studies with data owners and providers of synthetic data, and a focus group with TRE representatives. The insights collected from these activities will be analysed and synthesised to draft a report setting out the findings and recommendations for scaling up the production and dissemination of low-fidelity synthetic data, as applicable. The potential applications and benefits of the proposed work are diverse. The project aims to provide a solid foundation for data owners and TREs to make informed decisions regarding synthetic data production and sharing.
Geographic Coverage:
GB
Temporal Coverage:
2024-08-02/2024-09-13
Resource Type:
dataset
Available in Data Catalogs:
UK Data Service