Research Data Management

Quantitative Research Software and Variable-Level Metadata

Research software such as SPSS and Stata usually provides built-in tools to describe your data. For each variable, you should define measurement units, missing values and more as soon as you start creating, collecting or working on a dataset.

For any type of survey, the codebook (or data dictionary) documents the original questions and how the possible answers are coded in your dataset (in Stata, SPSS or other software). This means you must decide on a coding system before entering any data into your software. If your dataset is fully documented using embedded metadata, you can often generate a final codebook from the software itself later in the process. This codebook can be integrated into your documentation (Readme file or other).
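As an illustration of what deciding on a coding system before data entry can look like, here is a minimal sketch in Python. The variable names, questions and codes are hypothetical, and the check is only one possible way to confirm that every value in a dataset is documented in the codebook.

```python
# Hypothetical codebook for two survey questions (all names and codes are illustrative).
CODEBOOK = {
    "q1_smoker": {
        "question": "Do you currently smoke?",
        "values": {1: "Yes", 2: "No", -9: "No answer (missing)"},
    },
    "q2_agree": {
        "question": "The library opening hours suit my needs.",
        "values": {
            1: "Strongly agree",
            2: "Agree",
            3: "Disagree",
            4: "Strongly disagree",
            -9: "No answer (missing)",
        },
    },
}


def undocumented_values(rows, codebook):
    """Return sorted (variable, value) pairs found in the data but not in the codebook."""
    problems = set()
    for row in rows:
        for variable, value in row.items():
            spec = codebook.get(variable)
            if spec is None or value not in spec["values"]:
                problems.add((variable, value))
    return sorted(problems)


rows = [
    {"q1_smoker": 1, "q2_agree": 4},
    {"q1_smoker": 2, "q2_agree": -9},
    {"q1_smoker": 3, "q2_agree": 2},  # 3 is not a documented code for q1_smoker
]
print(undocumented_values(rows, CODEBOOK))  # [('q1_smoker', 3)]
```

Keeping the question text next to the codes means the same structure can later be exported as part of the documentation.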

Read more about codebooks

Metadata in Stata

When sharing a Stata dataset, you should always include the following:

  • Your Stata data file
  • General documentation presenting the dataset (see Readme)
  • The codebook upon which you structured your data (can be included in the Readme file)
  • The do file(s) you used to reach your results
  • In some cases, you may wish to include logs, but they are generally not necessary as long as other files allow researchers to reproduce them

Variable-level documentation within your data file

Variables should receive a (short) name and a (detailed) label. Both should be descriptive and understandable to other researchers without the need for interpretation. For practical reasons, the variable name should not be longer than 8 to 10 characters (even though Stata allows up to 32 characters), and it cannot include spaces (e.g. R_height). The label should specify the unit of the variable if appropriate (e.g. "Self-reported height of respondent in centimeters").
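If variable names are prepared outside Stata (in a spreadsheet, for instance), the naming advice above can be checked automatically before import. The following Python sketch is an assumption about how such a check might look: the function name is hypothetical, the 10-character threshold is the practical recommendation given above, and Stata's reserved words are not checked.

```python
import re

STATA_MAX_LENGTH = 32     # hard limit mentioned above
RECOMMENDED_LENGTH = 10   # practical limit suggested above
# Letters, digits and underscores only, not starting with a digit.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def name_issues(name):
    """Return a list of problems with a proposed variable name (empty if none)."""
    issues = []
    if not VALID_NAME.fullmatch(name):
        issues.append("invalid characters")
    if len(name) > STATA_MAX_LENGTH:
        issues.append("longer than 32 characters")
    elif len(name) > RECOMMENDED_LENGTH:
        issues.append("longer than the recommended 10 characters")
    return issues


print(name_issues("R_height"))              # []
print(name_issues("R height"))              # ['invalid characters']
print(name_issues("respondent_height_cm"))  # ['longer than the recommended 10 characters']
```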

Value labels should also be created whenever the meaning of a value is not obvious: if 1 means "yes" and 2 means "no", this should be documented. The same goes for agreement values stored as numbers (from "Strongly agree" to "Strongly disagree" or vice versa) or for "male/female/non-binary" answers. It is especially important to create labels for missing values.

Variable properties in the Stata data editor also include the variable type (integer, float, string...) and format. Setting these correctly is necessary both for Stata to work as intended and for later data users to understand your dataset.

Do files

The do files your research results are based on should be shared with the data file. Do files should be commented (using * at the start of a comment line, // before a comment, or /* and */ around a comment block) so other researchers can interpret the commands they contain and how they fit together. Ideally, your do files should also have useful, structured and informative filenames.

A good strategy to keep your project understandable is to have multiple do files rather than one large file, and to reference them in a "master" do file, which collects information and comments on all other files: what they are named, what they do, etc.

Read/watch more

Chapter 2 of A Gentle Introduction to Stata by Alan C. Acock should help you learn the basics of Stata documentation. The 5th edition (2016) is available in the Library under call number 004(035), HEIA 115372.

You can also check out the first videos in this official Stata data management playlist.

Metadata in Excel

While not as powerful as Stata or SPSS, Excel is often used by researchers who do not require advanced research tools. Excel files can include basic embedded metadata similar to that in other Microsoft Office formats. Beyond that, different options are available for your Excel data.

Column headings

Your column headings should always provide the information necessary to properly assess the contents of the variable. They should indicate the measurement unit, keywords describing the contents of the column and, if appropriate, the question number.

Cell format (and a word of warning)

Excel allows you to format cells for different uses. Using these formats appropriately can be important for data manipulation, especially for dates or durations: 2.15 minutes (in decimal notation, i.e. 2 minutes and 9 seconds) is not the same as 02:15 (in sexagesimal notation, i.e. 2 minutes and 15 seconds).
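The difference between the two notations can be made concrete with a short conversion. This is a minimal Python sketch; the function names are hypothetical and it assumes durations under one hour written as "mm:ss".

```python
def decimal_minutes_to_clock(minutes):
    """Convert decimal minutes (e.g. 2.15) to an 'mm:ss' string."""
    total_seconds = round(minutes * 60)
    whole_minutes, seconds = divmod(total_seconds, 60)
    return f"{whole_minutes:02d}:{seconds:02d}"


def clock_to_decimal_minutes(clock):
    """Convert an 'mm:ss' string (e.g. '02:15') to decimal minutes."""
    whole_minutes, seconds = clock.split(":")
    return int(whole_minutes) + int(seconds) / 60


print(decimal_minutes_to_clock(2.15))     # 02:09
print(clock_to_decimal_minutes("02:15"))  # 2.25
```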

Be aware, though, that cell formatting will be lost if/when you convert your Excel file to a .CSV file for preservation and sharing. Automatic data formatting by Excel can also corrupt your data, so you should apply careful quality control throughout your project when using that software.
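One common example of such corruption is the loss of leading zeros when identifiers or postal codes are reinterpreted as numbers. A quality-control check for this can be very simple. The sketch below is in Python, with a hypothetical function name and made-up values, and it assumes the column is supposed to have a fixed width.

```python
def wrong_width(values, width):
    """Return the values whose length differs from the expected fixed width."""
    return [value for value in values if len(value) != width]


# Made-up postal codes read back as text from an exported file.
# "1200" was presumably "01200" before the leading zero was dropped.
postal_codes = ["75011", "1200", "08001"]
print(wrong_width(postal_codes, 5))  # ['1200']
```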

Embedding a codebook

Since Excel supports "sheets", you can embed your codebook on a secondary sheet in the same file. This ensures that users of your data always have access to the necessary documentation, but it is not a long-term solution: at some point, you might want to convert your file to a .CSV file, which does not support sheets.
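One possible way around this limitation is to export the data sheet and the codebook sheet as two separate .CSV files that are kept together. The Python sketch below builds both as text in memory; the column names and contents are hypothetical, and in practice each string would be written to its own file (for example a data file and a codebook file sharing a common prefix).

```python
import csv
import io


def to_csv_text(header, rows):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical data sheet.
data_csv = to_csv_text(["resp_id", "height_cm"], [[1, 172], [2, 165]])

# Hypothetical codebook sheet describing each column of the data sheet.
codebook_csv = to_csv_text(
    ["variable", "label", "unit"],
    [
        ["resp_id", "Respondent identifier", ""],
        ["height_cm", "Self-reported height of respondent", "centimeters"],
    ],
)

print(data_csv)
print(codebook_csv)
```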

Colectica for Excel

More advanced documentation of your Excel dataset (including both study-level and variable-level metadata) can be generated using Colectica for Excel. This very accessible Excel plugin allows you to describe every variable accurately, as you would in more advanced software, and to later export all the metadata you created into a DDI XML file (see below).

The basic version of Colectica for Excel is free. The professional version additionally allows you to import data from SPSS, Stata, and SAS.

Disciplinary Metadata

Data Documentation Initiative (DDI) 

In the social sciences, the most widely used metadata standard is DDI. It was designed to describe "the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences". It uses a series of controlled vocabularies.

DDI metadata can be exported into XML files from different software used in research (NVivo, etc.) or created ad hoc using specific tools (Colectica for Excel, etc.). Many open data repositories are compatible with DDI, such as Dataverse and Zenodo.
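To give an idea of what variable-level metadata looks like once expressed as XML, here is a simplified Python sketch. The element names are modeled on those used in DDI Codebook (codeBook, dataDscr, var, labl, catgry, catValu), but this is an assumption-laden illustration only: the namespace and the required study-level sections are omitted, so the result would not validate against the DDI schema. The variable and its categories are hypothetical.

```python
import xml.etree.ElementTree as ET


def variable_element(name, label, categories):
    """Build a simplified DDI-Codebook-style element for one variable."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    for value, text in categories.items():
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(value)
        ET.SubElement(catgry, "labl").text = text
    return var


code_book = ET.Element("codeBook")
data_dscr = ET.SubElement(code_book, "dataDscr")
data_dscr.append(
    variable_element("q1_smoker", "Respondent currently smokes", {1: "Yes", 2: "No"})
)

xml_text = ET.tostring(code_book, encoding="unicode")
print(xml_text)
```

In practice, dedicated tools such as those mentioned above produce the complete file, including the study-level description.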

Other disciplinary metadata

To learn more about disciplinary metadata standards, please visit the Digital Curation Centre.