KB and partners are developing a secure virtual research environment allowing researchers to analyse data that are sensitive, private or protected under copyright law. The data always remain safe and, depending on the specific legal restrictions, cannot be copied or even viewed. This type of research infrastructure can help substantially improve the results of humanities and social sciences research by making far more data available for analysis.
Researchers who want to analyse datasets with computational tools usually obtain a copy of the data from a data provider. This has two disadvantages. Firstly, copying, storing, and managing the data on their own computers is cumbersome and error-prone. Secondly, and more importantly, data providers cannot hand over certain sensitive datasets at all.
Governments, commercial parties, and heritage institutions hold a growing number of interesting but sensitive datasets, e.g. housing market information from real estate platforms, business data from the chamber of commerce, or e-books that are still commercially available. Unfortunately, there is no generic infrastructure that allows researchers to analyse these sensitive data while assuring data providers that personal or copyright-protected data remain safe. As a result, potential data providers are still hesitant to share their data. The projects Tools-to-data and SANE are developing a solution.
Both projects turn things around by bringing the tool to the data. The data remain in a virtual, fully shielded environment. The researcher provides a tool, algorithm or container that runs in that same environment. The researcher cannot copy the data and receives only the results. In 2022, the Tools-to-data project built and evaluated a proof of concept to demonstrate the viability of this tools-to-data environment.
The SANE project (Secure ANalysis Environment) aims to develop the proof of concept into a working service. The service allows researchers to mine sensitive data while leaving the data providers in complete control from beginning to end. Data providers control access, can screen the software, and can decide whether the data should remain hidden from the researcher. The data themselves never leave the virtual environment. Derived data can be released, again only after screening. SANE comes in two variants:
- In SANE Blind, the researcher submits a tool or script without being able to see the data, and the data provider approves both the tool and the output.
- SANE Tinker allows the researcher to see and manipulate the data.
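The SANE Blind flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual SANE implementation: the class `BlindEnvironment`, its `approve_tool` and `approve_output` callbacks, and the `word_count` tool are all hypothetical names invented for this illustration. It shows the order of the steps: the data provider screens the tool, the tool runs next to the data, and the output is screened before anything is released.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# All names here are hypothetical; none of them come from SANE or Tools-to-data.


@dataclass
class BlindEnvironment:
    """Toy stand-in for a shielded environment holding a sensitive dataset."""
    _data: list                                 # never handed to the researcher
    approve_tool: Callable[[Callable], bool]    # data provider screens the tool
    approve_output: Callable[[object], bool]    # data provider screens the result

    def run(self, tool: Callable[[list], object]) -> Optional[object]:
        # Step 1: the provider screens the submitted tool before it touches the data.
        if not self.approve_tool(tool):
            return None
        # Step 2: the tool runs inside the environment, on a copy of the data.
        result = tool(list(self._data))
        # Step 3: the provider screens the derived output before it is released.
        if not self.approve_output(result):
            return None
        return result


def word_count(texts):
    """Example researcher tool: returns an aggregate, not the texts themselves."""
    return sum(len(t.split()) for t in texts)


env = BlindEnvironment(
    _data=["a protected e-book text", "another protected text"],
    approve_tool=lambda tool: tool.__name__ == "word_count",  # assumed allow-list
    approve_output=lambda out: isinstance(out, int),          # assumed output rule
)

released = env.run(word_count)         # approved tool, aggregate output
blocked = env.run(lambda texts: texts) # unapproved tool that would leak the data
```

In this toy model the approved tool returns only an aggregate count, while the second call is rejected at the tool-screening step and returns `None`. A real environment would of course enforce isolation at the infrastructure level rather than through Python callbacks.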
SANE is being developed by the Erasmus School of Social and Behavioural Sciences, ODISSEI (Open Data Infrastructure for Social Science and Economic Innovations), Netherlands Institute for Sound and Vision, CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities), SURF and KB, National Library of the Netherlands. SANE is funded by PDI-SSH (Platform Digital Infrastructure Social Sciences & Humanities).
Tools-to-data has been developed by SURF, the collaborative organisation for IT in Dutch education and research, and KB, National Library of the Netherlands.
Tools-to-data animation: Dutch / English. Design: Studio Flix.