Glossary – Research Data Management – Leibniz University Hannover

Bitstream preservation

Digital data consist of a fixed sequence of bits (bit stream), with each bit representing either the value 1 or 0. Bit stream preservation means that this sequence stays exactly the same. Due to ageing processes, storage media tend to corrupt individual bits over time. To avoid this, the medium must be replaced on a regular basis. Copying data to a new medium is also necessary when technological advances lead to the widespread use of a new kind of media. Bit stream preservation is a basic requirement for long-term archiving of digital data.
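One common way to check that a bit stream has stayed the same after copying it to a new medium is to compare checksums. The following is a minimal sketch, assuming a SHA-256 digest was recorded when the file was first deposited; the function names are hypothetical and not part of any service described here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bitstream_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with the digest recorded at deposit time."""
    return sha256_of_file(path) == recorded_digest
```

A changed digest indicates that at least one bit differs; it does not say which one, so an intact copy is still needed to repair the file.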

Citation guidelines

Citation practices for scientific data publications vary widely depending on the subject area and research discipline. Various scientific groups are currently working on the citation of research data, so a uniform standard does not (yet) exist. As when citing text documents, you should always indicate the author(s), the name of the dataset, the year of publication and the storage location or (if one exists) a persistent identifier (such as a DOI).
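The elements listed above can be assembled into a citation string as follows. This is only an illustrative sketch: the class name, the field names and the output layout are assumptions, since no uniform citation standard exists.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    authors: List[str]
    title: str
    year: int
    location: str                     # storage location, e.g. a repository name
    identifier: Optional[str] = None  # persistent identifier, e.g. a DOI

    def format(self) -> str:
        # Prefer the persistent identifier; fall back to the storage location.
        where = self.identifier if self.identifier else self.location
        return f"{'; '.join(self.authors)} ({self.year}): {self.title}. {where}"
```

For a hypothetical dataset, `DatasetCitation(["Doe, Jane"], "Example dataset", 2020, "Example Repository", "DOI: 10.1000/123456").format()` yields a single line containing the author, the year, the title and the DOI.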

You can create machine-readable metadata files for your own publications using the free online service CFF INIT. Such files make it easier to import the data into reference management programs and help others cite you correctly.


Copyright

Some kinds of research data, such as drawings and photographs, may constitute a “work” and hence be subject to copyright laws (German). This is the case if they reach the necessary level of creativity and originality. (Measurement) data generated exclusively by machines usually do not qualify. If data are protected by copyright, all rights to use, exploit and reproduce them lie with the persons who created them, unless legal agreements reached beforehand indicate otherwise (e.g. a work contract, cooperation agreement or contract on a commissioned project). The originators may, however, cede these rights in order to enable others to re-use their works.

In the case of publicly funded research, many funders expect all data to be made accessible to everyone free of charge as long as there are no legal constraints. Wherever possible, re-use should be allowed without restrictions. To this end, it is advisable to assign a license to (possibly) copyright-protected research data. The Creative Commons licenses CC0 (no rights reserved) and CC-BY (originators must be indicated), for example, are especially well known and proven.

For additional information and points of contact for legal advice, please have a look at our subpage on legal and ethical aspects.

Data archive

A data archive is a facility for the long-term storage of digital data in their original state (bitstream preservation). This includes backup copies and regular replacement of storage media. If additional services such as migration to newer file formats or online publication are available, the facility is not a mere archive but a repository.

Data backup

Data backup means a temporary duplication of data, so that at least one backup copy exists in case of a hardware defect or unintentional data deletion. For maximum security, at least two copies should exist in different locations, with one of them being synchronised with the original data automatically and on a daily basis. When data are stored on servers of the computing centre of Leibniz Universität Hannover, this requirement is always met. Institutes may use the backup & restore service of LUIS to back up their servers.

Data journal

Data journals publish articles (so-called data papers) which document the processes of data generation, including the instruments and methods applied. These descriptions enable the best possible reusability of the data. In some cases, the journals provide an in-house repository where the described data themselves may be deposited, but it is also possible to keep the data in a different place. Generally, this place should be referenced in the data paper using a persistent identifier, such as a digital object identifier (DOI). You can find more information on this topic in this blog. Well-known data journals are compiled in this list.

Data management plan

A data management plan (DMP) is a structured document describing how research data are handled in the context of a certain project. It should state how data are generated, processed, documented, saved, archived and (if applicable) published. The DMP should also indicate the tools and infrastructure deployed (e.g. hardware and software). Ideally, a DMP is drafted at the planning stage of a research project and then updated and complemented regularly during later stages. A growing number of funders now expect a DMP as part of a grant proposal.

Data publication

Data can best be published by depositing them in a suitable repository where they are publicly accessible via the internet. Many repositories also offer the option to restrict access to certain groups (e.g. scientists only) or to attach conditions. In order to keep published data citable in the long term, they should be accessible via a permanent link, which is ensured by assigning a persistent identifier, e.g. a DOI (Digital Object Identifier).

Digital object identifier (DOI)

A Digital Object Identifier (DOI) is a persistent identifier that allows digital objects to be identified and referenced unambiguously. DOIs are especially suitable for citing articles and datasets published in a repository, for instance.

A DOI consists of a prefix indicating the institution that assigned the DOI and a suffix, separated by a "/", which refers to the object itself (e.g. DOI: 10.1000/123456). Further information on DOI registration of research data can be found at the DOI Service of the German National Library of Science and Technology.
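The prefix/suffix structure described above can be illustrated with a short sketch. The function name `split_doi` is hypothetical, and the code assumes that only the first "/" separates prefix and suffix, so that a suffix may itself contain further slashes.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI such as 'DOI: 10.1000/123456' into (prefix, suffix)."""
    text = doi.strip()
    # Drop an optional leading "DOI:" label, as used in the example above.
    if text.lower().startswith("doi:"):
        text = text[4:].strip()
    # Only the first "/" separates the prefix from the suffix.
    prefix, separator, suffix = text.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix DOI: {doi!r}")
    return prefix, suffix
```

Applied to the example from the text, `split_doi("DOI: 10.1000/123456")` returns `("10.1000", "123456")`.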

Embargo

A (temporal) embargo defines a period in which only the description (metadata) of the research data is publicly accessible, but not the associated data. An embargo can be applied if research data (e.g. as part of a peer review process) are to be published with a time delay.
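The effect of an embargo on what is publicly visible can be sketched as a simple date comparison. This is an illustration only; the function name is hypothetical, and it assumes that `embargo_until` is the last day on which the data are still withheld.

```python
from datetime import date
from typing import List, Optional


def visible_parts(embargo_until: Optional[date], today: date) -> List[str]:
    """Return which parts of a dataset record are publicly accessible."""
    # The description (metadata) is public during and after the embargo.
    parts = ["metadata"]
    # The data become public only once the embargo period has passed.
    if embargo_until is None or today > embargo_until:
        parts.append("data")
    return parts
```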

FAIR Data Principles

The "FAIR Data Principles" were formulated by the FORCE11 initiative and aim to prepare research data optimally for sustainable reuse, first and foremost by machines. To this end, data must meet the following criteria:

Findable

(meta)data are assigned a globally unique and eternally persistent identifier.
data are described with rich metadata.
(meta)data are registered or indexed in a searchable resource.
metadata specify the data identifier.

Accessible

(meta)data are retrievable by their identifier using a standardized communications protocol.
the protocol is open, free, and universally implementable.
the protocol allows for an authentication and authorization procedure, where necessary.
metadata are accessible, even when the data are no longer available.

Interoperable

(meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
(meta)data use vocabularies that follow FAIR principles.
(meta)data include qualified references to other (meta)data.

Re-usable

(meta)data have a plurality of accurate and relevant attributes.
(meta)data are released with a clear and accessible data usage license.
(meta)data are associated with their provenance.
(meta)data meet domain-relevant community standards.

Building on this, the FAIR4RS Principles apply to research software.

File format

The file format is key to the long-term readability of digital data. File formats differ in how widespread and how well documented they are. Some are “open”, meaning that their exact specifications are public. Others are proprietary and hence vendor-dependent; in these cases, specifications are often not public. The rarer a format and the less known its exact specifications, the higher the probability that, within a few years, no up-to-date software will be available that can open and read the files. If you are going to archive data for the long term, try to convert the files into open and widely used standard formats. The RADAR project and the portal forschungsdaten.info provide lists of suitable formats (in German).

Good scientific practice

The rules of good research practice define standards for scientific work. They are meant to ensure that proper methods were applied to obtain the research results and that these results are verifiable. They also prescribe sanctions in case of gross violations. The rules were first published as a memorandum by the German Research Foundation in 1998, which was updated and augmented in 2013. Since August 2019, the code “Guidelines for Safeguarding Good Research Practice” has replaced the memorandum; in many parts it is more binding and concrete than the memorandum's recommendations. Of the 19 guidelines in the code, seven explicitly refer to research data. Consequently, research data management should be regarded as an integral component of good scientific practice.

Harvesting (metadata)

Harvesting describes the systematic and automated collection and processing of metadata from databases, repositories and other digital sources by computer programmes. By joining the scattered information, it becomes possible to search across databases. The visibility, discoverability and re-usability of published research data can thus be increased.
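The idea of joining scattered metadata into one searchable index can be sketched as follows. Real harvesters fetch records over the network using standardised protocols; here the sources are in-memory stand-ins, and the record fields (`identifier`, `title`) as well as the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def harvest(sources: Dict[str, Iterable[Record]]) -> List[Record]:
    """Collect records from all sources into one list, one entry per identifier."""
    index: Dict[str, Record] = {}
    for source_name, records in sources.items():
        for record in records:
            # Remember where each record came from; keep the first copy seen.
            merged = dict(record, source=source_name)
            index.setdefault(record["identifier"], merged)
    return list(index.values())


def search(index: List[Record], term: str) -> List[Record]:
    """Case-insensitive title search across all harvested records."""
    needle = term.lower()
    return [record for record in index if needle in record["title"].lower()]
```

Because all records end up in a single index, one query covers every harvested source.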

Legal protection for databases

The legal protection for databases includes an ancillary copyright which guarantees the investors who financed the creation of a database the commercial exploitation rights for 15 years. Database protection does not protect the content of the database (which may be subject to general copyright) but its compilation. It only applies if a “major investment” in terms of money, time, labour, etc. was necessary to reach the “threshold of originality”. The ancillary copyright is based on Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases.

Licences

In some cases, the creators of research data may hold a copyright (e.g. often on photographs or drawings). Re-using such data would require the authors' explicit permission, often causing unintended complications and uncertainties when the data are to be shared with third parties. It is therefore advisable to provide the data with a licence clearly stating the conditions for re-use.

The Creative Commons (CC) licences are especially well known and established. CC licences exist in several varieties, each allowing or prohibiting certain kinds of usage. Research data are generally best licensed under CC0 (public domain), meaning that the authors waive any copyrights they may hold. When re-using the data in a scientific context, it is nevertheless mandatory to name the authors in order to comply with the rules of good scientific practice. Specifically for software, we recommend the licences MIT, MPL-2.0, Apache 2.0 or GPL-3.0-only.

You can find a complete overview of the different kinds of licences on the webpage of the Open Source Initiative.

Long-term archiving

The long-term archiving (LTA) of research data is a procedure that keeps data available and interpretable for a long period (generally >10 years) despite technological and socio-cultural changes. First, LTA requires bitstream preservation by regularly replacing defective storage devices. Second, since many file formats become technically obsolete over time and fall out of use, LTA must also ensure that the file contents can still be read in the far future. This requirement can be met either by regularly converting files into up-to-date formats or by using open and well-documented file formats right from the start.

Metadata

Metadata are information on the context and properties of data. Technical metadata may include file size, format and location. Descriptive metadata describe a file's content and the context in which it was generated or to which it is related (e.g. a scientific project or the design of an experiment). Without descriptive metadata, the data themselves are in most cases not understandable or interpretable. Metadata are essential for systematically searching for datasets as well as for referencing and re-using them.
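Technical metadata of the kind mentioned above can often be read directly from the file system. The following sketch uses a hypothetical function name and assumes that the file extension is an acceptable hint at the format, which is not always the case. Descriptive metadata, by contrast, have to be supplied by the people who created the data.

```python
from pathlib import Path
from typing import Dict, Union


def technical_metadata(path: Path) -> Dict[str, Union[str, int]]:
    """Collect file size, a format hint and the location of a file."""
    return {
        "size_bytes": path.stat().st_size,
        # The extension is only a hint; it does not prove the actual format.
        "format": path.suffix.lstrip(".").lower() or "unknown",
        "location": str(path.resolve()),
    }
```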

Protection of personal data

The protection of personal data comprises technical and organisational measures to prevent loss of, unauthorised access to and abuse of personal data. Personal data are all data that directly or indirectly enable the identification of an individual (e.g. name, address, IP address or email address). In general, collecting personal data is only allowed if the persons concerned explicitly agree, though there are restrictions and exceptions (e.g. for certain authorities and use cases).

In research, personal data arise especially in medical studies and in the social sciences. In these cases, encryption and storage in particularly well-protected places are absolutely mandatory. By retrospectively pseudonymising or anonymising the data, however, links to specific individuals may be removed to a degree that even publication of the data becomes legally possible.
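One possible way to pseudonymise a direct identifier is to replace it with a keyed hash. The sketch below is an assumption, not a prescribed method: the function name is hypothetical, the secret key must be stored separately from the data, and whether the result is sufficient for a given dataset remains a legal and methodological question.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace an identifying value with a stable, keyed pseudonym."""
    # HMAC-SHA256: without the key, the pseudonym cannot be recomputed
    # from a guessed name or address.
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    # Shorten the hex digest for readability (an arbitrary choice here).
    return digest.hexdigest()[:16]
```

The same input and key always give the same pseudonym, so records belonging to one person can still be linked to each other without revealing who that person is.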

Since 25 May 2018, the European General Data Protection Regulation has been in force as directly applicable law. The German Federal Data Protection Law and the Data Protection Law of Lower Saxony (German only) have been adapted accordingly. For further information, please refer to the website of the Data Protection Officer of Leibniz University (German only).

Repository

A repository is a facility to store, manage and distribute digital objects. Besides repositories for software or text documents there are also repositories for research data. These repositories serve to publish data and generally also to archive them long-term at the same time. Most data repositories collect metadata in a searchable database, and on upload they offer to generate a persistent identifier (e.g. a DOI) and to choose a license. A repository may get a certificate for meeting certain quality standards (e.g. adequate technical security measures). The page www.re3data.org now lists and describes more than 2,000 repositories.

Research data

Research data are all data that arise in the course of scientific work. They form the basis of current and potential future scientific findings. They include raw data, data in different stages of processing and the final versions ready for publishing. The documentation of data generation and processing as well as of a scientific project in general also form part of the research data.

Research data management

Research data management (RDM) comprises every activity related to the handling of scientific data and already starts when a research project is being planned. The planning should take into account the tools, methods and infrastructure used for the following aspects:

collecting or generating data
storing, structuring and documenting data
backing data up and protecting data
analysing data
archiving data
publishing data
