Research Data Management

Building a Data Preservation Strategy

Before sharing your data, you must define a data preservation strategy. Do this early in the project, which is why the question appears in any data management plan (DMP) submission.

Not all data should be preserved

A proper preservation strategy requires you to identify what you should conserve and what can (or should) be destroyed:

  • Data necessary to obtain your research results should be preserved. Original non-sensitive data you did not use (yet) might also be of interest.
  • If your data is sensitive, elements that could identify your research subjects should be destroyed as early as possible without compromising your research.
  • If your data is secondary (reused from an existing source), you can safely delete it as long as the source of the data is durable. In addition, there is no point in preserving data you do not own the rights to.

How long should data be preserved?

While the value of research data often decreases with time, "long-term preservation" as defined by funding agencies usually means 10+ years. A lot can happen within that timeframe: software is updated or discontinued, hardware becomes obsolete, and home-made preservation options may not last that long, so be wary of them.

To reduce vulnerability to software obsolescence, you should convert your data files to more durable formats (see below).

As for hardware, open data repositories such as Zenodo usually guarantee data preservation for a minimum of 10 years. If you must preserve your data outside such repositories (and you should have a good reason for doing so), you must transfer your backups to new hardware regularly. You must also regularly verify checksums to make sure your data has not been corrupted by bit rot.
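Checksum verification can be automated with a few lines of code. The following is a minimal sketch in Python, assuming the data lives in a single directory tree and that SHA-256 is an acceptable checksum; the function names (sha256_of, build_manifest, find_corrupted) are illustrative and do not come from any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def find_corrupted(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose checksum has changed."""
    damaged = []
    for relative_path, expected in manifest.items():
        target = data_dir / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative_path)
    return damaged
```

The idea is to call build_manifest once when the data is archived, store the result (for example as a JSON file kept apart from the backup itself), and run find_corrupted against every copy at regular intervals and after each hardware migration.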

File Formats and Preservation

Not all file formats are created equal for preservation purposes. Preparing your data for long-term (10+ years) preservation requires you to select the most appropriate solution, and that includes format conversion.

Favour open, low-tech formats

CSV (comma-separated values) files may not look as good as a native Excel file, but they have multiple advantages when preserving tabular data:

  • They are simple: they can be opened and read even with a simple text editor.
  • They are open: the development of software that can use them is not hindered by intellectual property.
  • They are portable: being open, they are not tied to a single software system and can be read by many different tools.
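To illustrate how little is needed to work with CSV, here is a short sketch that writes and reads a small table using only Python's standard library. The sample rows, column names, and helper names (write_csv, read_csv) are made up for the example.

```python
import csv
from pathlib import Path

# Hypothetical sample data; every value is stored as plain text.
SAMPLE_ROWS = [
    {"sample_id": "S1", "temperature_c": "21.5", "note": "baseline"},
    {"sample_id": "S2", "temperature_c": "22.1", "note": "after heating, 5 min"},
]


def write_csv(path: Path, rows: list[dict[str, str]]) -> None:
    """Write rows to a UTF-8 CSV file with a header line."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path: Path) -> list[dict[str, str]]:
    """Read the CSV file back into a list of dictionaries."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

The resulting file is ordinary text: the first line holds the column names, and a value that itself contains a comma (the second note above) is wrapped in double quotes. Saving the file as UTF-8 and keeping a header line are choices made for this sketch, not requirements stated by the format.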

Markdown text, while less rich in formatting than a document created in a Word-style processor, has an inherently longer lifespan because it is both open and simple. RTF is another option, as are Open Document formats.

When this is not possible, favour the most popular file format

While Word or Excel files are not open or low-tech formats, their ubiquity means that they should remain readable in the foreseeable future. You might lose some of the formatting or formulas, but some sort of compatibility should remain.

Some formats were even developed specifically for preservation purposes: PDF/A, an ISO-standardized variant of PDF, will stand the test of time better than the average PDF.

Whenever you rely on a specific format and software, always document the version of the software you used to create, use, and save the data.
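When the data is produced by a script, part of this documentation can be generated automatically. The sketch below assumes a Python workflow and writes a small JSON "sidecar" file next to the data file; the sidecar naming scheme, the field names, and the function names (describe_environment, write_sidecar) are assumptions made for illustration. For desktop software such as Excel, the version still has to be noted by hand.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def describe_environment(packages: list[str]) -> dict:
    """Collect the interpreter, platform, and package versions in use."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def write_sidecar(data_file: Path, packages: list[str]) -> Path:
    """Write <data file name>.provenance.json next to the data file."""
    sidecar = data_file.with_name(data_file.name + ".provenance.json")
    sidecar.write_text(
        json.dumps(describe_environment(packages), indent=2),
        encoding="utf-8",
    )
    return sidecar
```

For example, write_sidecar(Path("results.csv"), ["pandas"]) would create results.csv.provenance.json, recording the installed pandas version or null if pandas is absent.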

A list of recommended file formats

The ETHZ library has a list of recommended file formats for data preservation.