Archive and publish data
Data must not only be stored during the working process but should also be archived at an appropriate time, in line with good scientific practice. For example, the DFG and the University of Kassel require that research data be stored for at least 10 years.
This function can be performed by repositories in particular. These also offer the possibility of publishing data.
We recommend storing and, if necessary, publishing the data in a subject-specific repository. A targeted search for subject-specific repositories is offered by the RepositoryFinder.
The University of Kassel provides all researchers who cannot or do not wish to use a subject-specific repository with an institutional repository (DaKS), which fulfills the function of both archiving and publication. This can also be used for student projects and theses.
Publishing your data offers advantages for the scientific system, but also for you personally.
Published data are available for subsequent use in new contexts, e.g. also for interdisciplinary questions or meta-analyses. This not only creates scientific added value, but also avoids duplication of work and saves costs.
By assigning persistent identifiers, your data can be permanently referenced and cited by yourself and others. This is a prerequisite for data publications to be recognized as an independent achievement and to enter the scientific reputation system. A study by Piwowar and Vision (2013) also shows a higher citation rate for publications whose underlying research data have been published.
Last but not least, in some cases publication simply fulfills the requirements of third parties. In addition to the requirements of research funders, publishers are also increasingly demanding that the research data on which a publication is based be made available. Some examples of such requirements are:
- Public Library of Science (PLOS): Data Availability Policy / Materials and Software Sharing Policy
- Nature Publishing Group: Availability of Data, Material and Methods Policy
- Science: Data and Materials Availability Policy / Preparing Your Supplementary Materials
- BioMed Central: Availability of supporting data
- Elsevier: Research data Policy and Text and Data Mining Policy
There are subject-specific or thematic repositories as well as generic ones. Subject repositories and data centers (such as Pangaea for geoscientific data, GenBank, Protein Data Bank) are often the first choice, not least because of their visibility in the subject community and their conformity to subject-specific standards. An overview of subject repositories is provided by the Registry of Research Data Repositories (re3data.org) and the Open Access Directory to research data. A targeted search for subject repositories that also allow data storage is offered by the re3data-based RepositoryFinder.
When deciding on a particular repository, the following points can help you:
- Is it a repository that fits the subject matter? Is it established and connected to specific search portals?
- Does the repository offer the desired services (PIDs, open access, differentiated access rights (e.g. user agreements), realization of embargo periods)?
- Is the sustainability of the repository guaranteed? Is there an exit strategy or an agreement to preserve the data in case of e.g. discontinuation of funding?
- How are data transfer and data use regulated in terms of content and form?
The University of Kassel also provides all researchers who cannot or do not wish to use a subject-specific repository with an institutional repository (DaKS) (expected to be available from mid-January 2021), which fulfills both archiving and publication functions (see also "Archiving and publishing data"). This repository can also be used for student projects and theses.
In addition, interdisciplinary repositories for research data are available, such as the EU-funded ZENODO, Dryad or figshare.
Uploading your data does not equate to open access. In principle, you can also publish research data with a delay or only make the metadata accessible. In the case of actual publication, you can regulate the rights to access and edit in detail via the license or contracts (Can I then control the use of my data at all?). These possibilities can essentially be limited by:
- the specific requirements and policies of your research funders and/or publishers
- lack of/limited rights to the data
- restrictions under data protection law
- restrictions on the part of the repository
There are situations in which data should not be published or should only be published under certain conditions. The most important prerequisite for publication is that you have the right to do so (Who may decide on the disclosure and publication of data? Do I own the copyright to my data?).
In addition, the data may be confidential or personal and may then only be published after anonymization or with the consent of the persons concerned (What data protection restrictions must I observe?).
Personal data is defined as "individual information about personal or factual circumstances of a specific or identifiable natural person" (Section 3 (1) BDSG). Such data are subject to strict requirements regarding their collection, use and disclosure. For archiving, provision and publication, information that can be attributed to a specific or identifiable person should be removed from the research data. Depending on the data, different anonymization methods are suitable.
Instructions can be found at the Forschungsdatenzentrum Bildung. In addition, there are various tools for anonymizing data such as ARX, sdc-micro or the anonymization tool of the TMF.
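As a rough illustration of what such anonymization steps can look like, the following Python sketch removes direct identifiers, coarsens two quasi-identifiers, and then counts how many records share the same combination of values (the k in k-anonymity). The field names, the sample records and the chosen generalizations are assumptions made for this example only. They are not taken from any of the tools mentioned above, and a real data set needs a case-by-case assessment.

```python
from collections import Counter

# Hypothetical field names, chosen only for this illustration.
DIRECT_IDENTIFIERS = {"name", "email"}


def age_band(age, width=10):
    """Generalize an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def anonymize(records):
    """Drop direct identifiers and coarsen the quasi-identifiers age and postcode."""
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row["age"] = age_band(row["age"])
        row["postcode"] = row["postcode"][:2] + "***"
        cleaned.append(row)
    return cleaned


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented sample records.
records = [
    {"name": "A. Example", "email": "a@example.org", "age": 34, "postcode": "34117", "score": 12},
    {"name": "B. Example", "email": "b@example.org", "age": 37, "postcode": "34125", "score": 15},
    {"name": "C. Example", "email": "c@example.org", "age": 52, "postcode": "34233", "score": 9},
]

anonymized = anonymize(records)
k = smallest_group(anonymized, ["age", "postcode"])
print(anonymized)
print("smallest group size:", k)
```

In this sample the third record remains the only one in its age band, so the smallest group has size 1. Removing names alone is therefore not sufficient here, and further generalization or suppression would be needed.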
If personal data are to be processed, the consent of the data subject must usually be obtained. Among other things, the purpose must be clearly defined and the data subject must be able to assess the consequences.
In addition, research data such as company data may contain confidential information (know-how protection) or confidentiality and non-disclosure agreements may have been made that preclude publication.
Possible owners or co-owners of the rights to the data are the researchers, the employer, the client, research funders and/or (private sector) contractual partners. Who may co-decide or must be asked about the sharing or publication of research data is determined by the contractual relationship. Usually, the results of commissioned research are the property of the employer or funder. The situation is different in the case of in-house research, where researchers may decide on the use of the data themselves.
Research objects and occasionally also research data may be protected as works within the meaning of the Copyright Act. These may be works of speech, computer programs, musical works, pantomime works including works of dance, works of fine arts including works of architecture and applied arts, photographic works, cinematographic works and representations of a scientific and technical nature.
As a rule, however, research data lack the necessary level of originality and are not works. It is possible, however, that certain types of research data are covered by a related (ancillary) right, for example photographs, motion pictures or sound recordings.
Often, however, the research data of a research project are protected by copyright as part of a database work or fall under the ancillary copyright for databases.
Research data that do not fall under a property right can generally be used by anyone for any purpose without permission or obligation to pay.
If you have copyright or ancillary copyright over research data, you can regulate various aspects of use via appropriate contracts, such as the type and manner of use, user groups and time period, purpose, etc. Since contractual regulations for individual cases would be very costly in practice, there are various solutions for standardized regulations of usage rights. For example, the Leibniz Center for Psychological Information and Documentation (ZPID) offers standard contracts for the use of psychological data and GESIS regulates access restrictions for particularly sensitive social science data via user contracts. If you do not want your data to be subject to any specific access or usage restrictions, the use of standardized licenses such as Creative Commons or Open Data Commons is a good option (Which license should I choose?).
Publishing data under a specific license allows you to define in detail how the data may be used. Licenses create legal certainty for both the person providing the data and the person using it. Even if you waive all restrictions, it is therefore important to state this explicitly.
Although data themselves are not usually subject to copyright, there is a case for treating them as potentially worthy of protection, not least to express one's own ideas about further use. Various licensing models are available for this purpose. The most common of these is Creative Commons (CC). CC licenses are independent of the licensed content and cover copyrights, ancillary copyrights and, in the current version, also database producer rights where these exist.
The license package 'Open Data Commons' of Open Knowledge International (formerly Open Knowledge Foundation) has been designed especially for the publication of data. In addition to the unconditional license (Open Data Commons Public Domain Dedication and License (PDDL)), it offers three other models:
- Open Data Commons Attribution License (ODC BY) (v 1.0) (attribution condition)
- Open Data Commons Open Database License (ODbL) (v 1.0) (sharing under equal conditions)
- Database Contents License (DbCL) (distribution under the same conditions also for database contents)
Regardless of the question of legal bindingness, the CC BY license certainly comes closest to fulfilling the idea of Open Access and Open Science. By contrast, 'distribution under the same conditions' can lead to compatibility problems with other licenses, and a prohibition of editing can restrict use, e.g. for data mining, or cause problems with long-term archiving. Prohibiting commercial use makes it more difficult to include the data in commercial databases and thus potentially reduces the visibility of your research (for details see Paul Klimpel, 2012).
Whichever license you choose, you should make a conscious and informed decision. For a more detailed discussion of the issue, see Andreas Wiebe & Lucie Guibault (2013). Regardless of the terms of use, the rules of good scientific practice apply, of course, which require that the source of data used be acknowledged.
Metadata is used to describe resources, in this case research data, in order to optimize their discoverability. Basic information includes, for example, title, author/primary researcher, institution, identifier, location & time period, subject, rights, file names, formats, etc. Since this information is essential for finding, understanding, and using data, standardized metadata schemas are intended to ensure that descriptions are as uniform and comprehensible as possible.
Metadata schemas are compilations of elements for describing data. Some disciplines already have specific metadata schemas, such as
- Humanities: Text Encoding Initiative (TEI)
- Earth Sciences: ISO 19115, Darwin Core
- Natural Sciences: ICAT schema, Crystallographic Information Framework, conventions for Climate and Forecast metadata
- Social and economic sciences: Data Documentation Initiative (DDI)
Before you start documenting your data, ideally already as part of a data management plan, you should therefore check whether a suitable metadata schema already exists for your discipline. Information on this is provided, for example, by the Digital Curation Centre (DCC). If no discipline-specific schema is available, a discipline-independent one, such as Dublin Core, MARC21 or RADAR, can also be used.
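To make the idea of a discipline-independent schema more concrete, the following Python sketch writes a small record using the 15 elements of the Dublin Core Metadata Element Set. The function name, the wrapping `metadata` element and all sample values (including the DOI, which is a placeholder) are assumptions made for this illustration. A repository will usually prescribe its own input form or exchange format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core fields to an XML string.

    Values may be a single string or a list of strings (repeatable elements).
    The root element name 'metadata' is an arbitrary choice for this sketch.
    """
    root = ET.Element("metadata")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample values; the DOI is a placeholder, not a real identifier.
record_xml = dublin_core_record({
    "title": "Example survey data",
    "creator": ["Example, Alice", "Example, Bob"],
    "publisher": "Example University",
    "date": "2021",
    "identifier": "https://doi.org/10.1234/example",
    "format": "text/csv",
    "rights": "CC BY 4.0",
})
print(record_xml)
```

Rejecting unknown element names is a deliberate choice in this sketch: it catches spelling mistakes early, which matters because uniform element names are what make records comparable across data sets.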
Metadata schemas thus specify what information should be delivered. For the best possible search and use of the data, it is also important to provide this information in as uniform a format as possible. A number of discipline-specific and cross-disciplinary 'controlled vocabularies', thesauri, classifications and authority files are available for this purpose, such as:
- Standards for unique identification of individuals such as Open Researcher and Contributor ID (ORCID) or International Standard Name Identifier (ISNI, ISO 27729).
- Subject classification systems (e.g. DDC or LCC)
- Subject-specific classifications such as the Mathematics Subject Classification (MSC) or the Social Sciences Classification.
- Subject-specific thesauri such as the Thesaurus of Social Sciences (TheSoz), the Standard Thesaurus of Economics (STW) or the Getty Vocabularies (AAT, TGN, CONA, ULAN).
An overview of different systems is provided, for example, by the Basel Register of Thesauri, Ontologies & Classifications (BARTOC) and Taxonomy Warehouse.
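One practical benefit of standardized identifiers is that they can be checked automatically before metadata are submitted. As a small sketch, the following Python code validates the format of an ORCID iD, whose last character is a check digit calculated with the ISO 7064 MOD 11-2 algorithm. The function names are chosen for this example. The check only confirms that an iD is well-formed, not that it is registered or belongs to a particular person.

```python
import re


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the string is a well-formed ORCID iD with a correct check digit."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the iD of ORCID's fictitious example researcher.
print(is_valid_orcid("0000-0002-1825-0097"))
# Same iD with the last digit changed, so the check digit no longer matches.
print(is_valid_orcid("0000-0002-1825-0098"))
```

A check like this can be run over the creator fields of a metadata record to catch typing errors before the data set is deposited.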
Documentation usually goes beyond the description of data via metadata. It provides a deeper (scientific) indexing that describes, for example, the context of origin, variables, instruments and methods in detail. In many cases, such a description is indispensable for understanding, verifying and, if necessary, using the data.
Introductions to the topic of metadata are provided, for example, by the JISC Guide or the interactive Mantra course of the University of Edinburgh.
Unless otherwise noted, all texts on this site and its subpages are licensed under a Creative Commons Attribution 4.0 International License.