Documenting your data
Why document my data?
Documenting and describing your data makes it easier for you and others to reuse the data at a later date. Imagine that you were taking over a project in the middle of a grant, but could not contact the principal researcher. What information would you need to continue the project? Here are some examples:
- File handling (naming convention, folder structure)
- Processing steps (how to get from point A to B)
- Protocols (what decisions were made and why)
- Field abbreviations/name glossary (what does ABC3130 stand for)
This information is called metadata: "data about data", or the who, what, when, where, why, and how of your research.
- Who created the data
- What the data file contains
- When the data were generated
- Where the data were generated
- Why the data were generated
- How the data were generated
What do I document and describe?
It is important to begin documenting your data at the start of your research and to continue doing so throughout the project. If you create the documentation only at the end of the project, important details may be lost or forgotten.
There are three types of documentation for a research project: study-level metadata, variable-level metadata, and catalogue metadata.
Study-level metadata
Provides context for understanding why the data were collected and how they were used. It could include:
- Rationale and context for data collection
- Data collection methods (protocols, sampling design, instruments or software used, etc.)
- Structure and organization of data files
- Secondary data sources used
- Data validation and quality assurance (checking, proofing, cleaning, calibration, etc.)
- Transformations of data from the raw data through analysis
- Information on confidentiality, access and use conditions
Variable-level metadata
Provides more granular information by describing the individual variables and values in a dataset. It could include:
- Variable names, descriptions, units
- Data type (integer, Boolean, character, etc.)
- Explanation of codes and classification schemes used
- Data processing methods, software used, scripts, codes
- Data formats (.csv, .mat, .tiff, .txt, etc.) and software (including version) used
This information can be embedded in a data file. For example, variable, value and code labels can be added in an SPSS file. Interview transcripts can embed metadata in a header.
Find out more (documentation for quantitative, qualitative, and secondary data)
Catalogue metadata
When sharing data in a repository, the information added during data upload typically describes the content, context and provenance of the dataset(s) in a standardized and structured manner. This helps users find data and judge whether they are suitable for their research, and it provides a bibliographic record for citing the data.
The metadata in these data records often use international standards or schemes, consisting of mandatory and optional elements. Example schemes include Dublin Core (example) or the Data Documentation Initiative (DDI) (example).
Example catalogue metadata could include:
- Name of the project
- Dataset title
- Project description
- Dataset abstract
- Principal investigator and collaborators
- Contact information
- Dataset handle (DOI or URL)
- Dataset citation
- Data publication date
- Geographic description
- Time period of data collection
- Subject/keywords
- Project sponsor
- Dataset usage rights
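As a rough illustration, some of the catalogue fields above could be written as a simple Dublin Core record in XML. The sketch below uses Python's standard library and the Dublin Core element set namespace. The `metadata` wrapper element, the `build_record` helper, and every value shown are hypothetical placeholders; a real repository will define its own required elements and wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Return a wrapper element holding one dc:* child per (element, value) pair."""
    root = ET.Element("metadata")  # wrapper name is an assumption, not a standard
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Every value below is a made-up placeholder for illustration only.
example_fields = [
    ("title", "Example dataset title"),
    ("creator", "Example, Researcher"),
    ("description", "Short abstract of the dataset."),
    ("subject", "example keyword"),
    ("date", "2024-01-31"),
    ("coverage", "Example region, 2020-2023"),
    ("identifier", "https://doi.org/10.0000/example"),
    ("rights", "CC BY 4.0"),
]

record = build_record(example_fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

In practice, most repositories generate this kind of record for you from the upload form, so a sketch like this is mainly useful for seeing how the fields map onto a standard.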
How do I document my data?
Documentation can take many forms. It can be written in free text, such as a readme file, or the metadata can be captured in a structured, machine-readable file, encoded in a format such as XML.
Structured, discipline-specific metadata is preferable, but if no standard exists, writing "readme" style files is the simplest way of recording metadata.
Readme files
A readme file provides information about a data file. It allows you and others to understand and reuse the data at a later date.
Best practices:
Follow the Cornell guide to writing "readme" files.
- Start writing the readme files at the beginning of the research project.
- Record the information in a plain text file (.txt).
- Use a template to help guide you, but tailor it to the needs of the project and kind of data that is being documented. Template examples:
- Update the file as the research progresses.
- When the research is complete and ready to be shared, deposit the readme file alongside the data in a repository.
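One possible way to follow these practices is to generate a starter readme from a small template and then fill it in as the project progresses. The sketch below is only an illustration: the template headings are drawn from the who, what, when, where, why, how questions earlier in this guide, and the names `README_TEMPLATE`, `render_readme`, and `write_readme` are hypothetical, not part of any published template.

```python
from datetime import date
from pathlib import Path

# Hypothetical template; adapt the headings to your project and data.
README_TEMPLATE = """\
README for: {dataset_title}
Last updated: {updated}

WHO created the data: {who}
WHAT the data file contains: {what}
WHEN the data were generated: {when}
WHERE the data were generated: {where}
WHY the data were generated: {why}
HOW the data were generated: {how}

File naming convention: {naming}
Folder structure: {folders}
Abbreviations and codes: {glossary}
"""

FIELDS = ["dataset_title", "who", "what", "when", "where",
          "why", "how", "naming", "folders", "glossary"]


def render_readme(info, updated=None):
    """Fill the template; fields not supplied are marked TODO so gaps stay visible."""
    values = {key: info.get(key, "TODO") for key in FIELDS}
    values["updated"] = (updated or date.today()).isoformat()
    return README_TEMPLATE.format(**values)


def write_readme(folder, info):
    """Write README.txt into the given folder and return its path."""
    path = Path(folder) / "README.txt"
    path.write_text(render_readme(info), encoding="utf-8")
    return path


# Placeholder values for illustration only.
text = render_readme(
    {"dataset_title": "Example survey data", "who": "Example Researcher"},
    updated=date(2024, 1, 31),
)
print(text)
```

Marking unfilled fields as TODO makes it easy to see what still needs documenting each time the file is updated.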
Data dictionaries & codebooks
Data dictionaries and codebooks provide variable-level metadata. These two types of documents may provide overlapping information.
- Data dictionaries: describe the names, definitions, and attributes of the data elements in a file. Find out more:
- How to make a data dictionary (OSF)
- Describing your data with data dictionaries (Smithsonian Libraries)
- USGS Data Management guide on data dictionaries
- Codebooks: used by survey researchers to provide information about the data from a survey instrument. Find out more.
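A first draft of a data dictionary can sometimes be generated from the data file itself and then completed by hand. The sketch below, which assumes a CSV file with a header row, lists each variable with a guessed data type and a count of missing values. The function names, the type labels, and the sample data are all hypothetical; descriptions and units still have to be written by someone who knows the data.

```python
import csv
import io


def infer_type(values):
    """Guess a coarse data type from the non-empty string values of one column."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "empty"
    for type_name, caster in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue  # try the next, more permissive type
    return "character"


def draft_data_dictionary(csv_file):
    """Return one dictionary entry per column of a CSV file object."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        # Short rows yield None for missing cells; treat those as empty.
        column = [row[name] or "" for row in rows]
        entries.append({
            "variable": name,
            "type": infer_type(column),
            "missing": sum(1 for v in column if v.strip() == ""),
            "description": "",  # to be filled in by hand
            "units": "",        # to be filled in by hand
        })
    return entries


# Made-up sample data for illustration only.
sample = io.StringIO("site_id,temp_c,species\n1,12.5,ABC\n2,,DEF\n")
dictionary = draft_data_dictionary(sample)
for entry in dictionary:
    print(entry)
```

The guessed types are only a starting point: a column of numeric codes, for example, would be reported as an integer even though it needs an explanation of what each code means.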
Lab notebooks
Lab notebooks (print or online) are also a great way to document your research. They include methodology, results, calculations, etc. They are helpful for publishing, sharing, or reproducing your research.
- Information on lab notebook best practices
- Information on choosing an electronic lab notebook:
Metadata standards
Find out if your discipline uses a metadata standard to describe data. Some disciplinary data repositories require a formal standard. These metadata files are often saved in a machine-readable format, such as XML. There are tools that can help with the creation of these metadata files. See the Tools section for more information.
To find an appropriate metadata standard for your discipline, consult the following resources:
- Disciplinary metadata guide (Digital Curation Centre)
- Open directory of metadata standards (Research Data Alliance)
- Metadata standards catalog (Research Data Alliance)
Tools to document my data
Creating standardized metadata can be difficult and time-consuming. There are tools that can help. Some help you select controlled vocabularies to include in your documentation. Others help you complete the metadata schema.
Use this comparison chart, created by Stanford, to select the tool that is right for you.