Data storage and file formats
At the beginning of your research project, you should devise a back-up strategy for all your data and associated files. The following elements should be considered:
What files should you back up?
Ideally, you should back up all the data files and associated documentation files, e.g. metadata files, files describing the methodology and/or the instruments used to obtain the data, files describing any manipulation or transformation of the dataset.
What should be the backup frequency?
There are no absolute rules prescribing how often data files should be backed up. However, critical files, especially datasets under construction, should be backed up every time they are modified. Less crucial files can be backed up at regular intervals, for instance daily or weekly. Use a software or hardware solution that automates your backup plan and can handle incremental backups, that is, backups that copy only the files that have changed since the previous run.
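To illustrate what an incremental backup does, here is a minimal Python sketch. It is not a replacement for a dedicated backup tool. The function name `incremental_backup` and the rule used to detect changes (a different size or a newer modification time) are assumptions made for this example.

```python
import shutil
from pathlib import Path


def incremental_backup(source: Path, destination: Path) -> list[Path]:
    """Copy files that are new or changed since the last run.

    Returns the list of backup copies that were written.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        dst_file = destination / src_file.relative_to(source)
        if dst_file.exists():
            src_stat, dst_stat = src_file.stat(), dst_file.stat()
            # Assumption: a file is unchanged when its size matches and
            # the backup copy is at least as recent as the original.
            if (src_stat.st_size == dst_stat.st_size
                    and src_stat.st_mtime <= dst_stat.st_mtime):
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        # copy2 also copies the modification time, which the check above relies on.
        shutil.copy2(src_file, dst_file)
        copied.append(dst_file)
    return copied
```

Calling `incremental_backup(Path("project"), Path("backup"))` twice in a row (both folder names are hypothetical) copies everything the first time and nothing the second time, unless a file was modified in between. Files deleted from the source are not removed from the backup in this sketch.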
What type of storage should you use?
No storage medium is perfect; you should use multiple backup media and store at least one copy in a remote location. Here are some storage solutions (adapted from Mantra -- CC-BY):
Networked drives
Generally managed at the university or departmental level, these drives are regularly backed up (usually on tape) and provide easy and secure access to your data.
PC or laptop hard drive
This is a flexible solution while you are working on a dataset, but it should not be your only storage solution. Hard drives can fail and computers can be stolen.
External storage devices (USB flash drives, CDs, DVDs)
Although this is a common and affordable backup solution, there are several issues and precautions to be aware of:
- Depending on the size of your dataset, you may have to use multiple devices
- The longevity of these media is questionable
- Follow the care and handling instructions carefully
- Regularly check your files to see if they are accessible and complete
- Make sure to “refresh” your data periodically by making new copies on a new CD or USB drive
- Encrypt any confidential data or protect it with a password
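The advice above to check regularly that your files are accessible and complete can be partly automated with checksums. The following Python sketch shows one possible approach: it records a SHA-256 checksum for every file in a folder, then later reports which files are missing or have changed. The function names `sha256_of`, `build_manifest` and `find_problems` are hypothetical names chosen for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def find_problems(folder: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare a folder against an earlier manifest.

    Returns {relative path: "missing" or "changed"}; empty if all is well.
    """
    problems = {}
    for name, expected in manifest.items():
        path = folder / name
        if not path.is_file():
            problems[name] = "missing"
        elif sha256_of(path) != expected:
            problems[name] = "changed"
    return problems
```

You would build the manifest when you create the copy, keep it with your documentation, and run `find_problems` on the copy from time to time. An empty result means every recorded file is still present and unchanged. A file flagged as "changed" may simply have been edited on purpose, so the result still needs your judgment.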
Cloud storage
Services like Dropbox, Google Drive or OneDrive provide some free storage space on remote servers (more space can be obtained on a subscription basis). Most cloud-storage solutions provide automated syncing and data encryption. However, remember that there are drawbacks to using third-party online storage:
- Legal issues (copyright, data protection, licensing) can be complicated or unsatisfactory, especially if the server is located outside of Canada. It is generally not recommended to use cloud storage for sensitive data that includes identifying information on human subjects.
- Bandwidth may be a concern, especially if you have large datasets
- You are at the mercy of changes in policies and commercial terms
When storing your data, it is important to choose the right file format in order to avoid obsolescence and facilitate reuse. As much as possible:
- Prefer open formats (txt, csv, tab, mp3, flac, xml) to proprietary formats (doc, xls, aiff)
- If you need to use proprietary formats, indicate the appropriate software needed to open the files
For quantitative social science datasets:
- Text files (txt, csv, tsv) are recommended, ideally with a Unicode encoding such as UTF-8
- If your dataset includes extensive metadata, use a delimited or structured text file with an accompanying setup file (for SPSS, SAS or Stata) containing metadata information
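As an illustration of the text-file recommendation above, the following Python sketch writes a small table to a CSV file with an explicit UTF-8 encoding and reads it back. The function names and the example variables are hypothetical. The plain-text codebook mentioned afterwards is a simpler stand-in for a real SPSS, SAS or Stata setup file, not an example of one.

```python
import csv
from pathlib import Path


def write_utf8_csv(path: Path, fieldnames: list[str], rows: list[dict]) -> None:
    """Write rows to a CSV file, stating the UTF-8 encoding explicitly."""
    # newline="" lets the csv module control line endings itself.
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_utf8_csv(path: Path) -> list[dict]:
    """Read a UTF-8 CSV file back into a list of dictionaries."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]
```

The same function can write a simple codebook alongside the data, for example a file with one row per variable and the columns `variable` and `label`. Stating the encoding explicitly matters because the default encoding differs between systems, and accented characters such as those in "Montréal" can be corrupted when a file is opened with the wrong one.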
For more detailed information on file formats for different types of data (audio, video, geospatial, chemistry, images, quantitative and qualitative data), consult the excellent guide created by the University of Oregon’s library.