File formats and organization
File formats
Before starting a project, it is important to think about file formats, because the choice of format affects how long the data remain usable. Follow these best practices to reduce the chance of losing data when software or formats become obsolete.
Best practices
File formats when collecting data
- Open, non-proprietary formats are preferable
- If data must be in a proprietary format, ensure that it can easily be converted to an open, non-proprietary format
- Select formats commonly used by your research community
File formats when sharing / preserving data
- Formats should be open, non-proprietary, and machine-readable
- Share multiple formats if the format used by your research community is typically proprietary (eg. MonaLisa_v1.psd AND MonaLisa_v1.tiff)
- For proprietary files, indicate (using a readme file) the software/hardware needed to open the files
- If compression is necessary, use a lossless format
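As an illustration of the lossless-compression advice, here is a minimal Python sketch that compresses a file with gzip (one lossless option among several) and confirms that decompressing it reproduces the original bytes. The function names `sha256_of` and `compress_losslessly` are hypothetical, chosen for this example only.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_losslessly(source: Path) -> Path:
    """Write '<name>.gz' next to the original and verify the round trip."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Decompress in chunks and compare the checksum with the original file.
    digest = hashlib.sha256()
    with gzip.open(target, "rb") as packed:
        for chunk in iter(lambda: packed.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() != sha256_of(source):
        raise ValueError(f"Round trip changed the data in {source}")
    return target
```

The original file is left in place; the checksum comparison is what shows that nothing was lost.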
Recommended formats for sharing, reuse and preservation
Sources: UK Data Service, Oregon State University
Type of data | Recommended formats | Acceptable formats |
---|---|---|
Tabular data with extensive metadata (variable labels, code labels, defined missing values) | .por (SPSS portable format) | .sav, .dta, .mdb, .accdb |
Tabular data with minimal metadata (column headings, variable names) | .csv, .tab | .txt, .xls, .xlsx, .mdb, .accdb, .dbf, .ods |
Geospatial data (vector and raster data) | .shp, .shx, .dbf, .prj, .sbx, .sbn, .tif, .tfw, .dwg, .gml | .mdb, .mif, .kml, .dxf, .svg |
Textual data | .rtf, .txt, .xml | .html, .doc, .docx |
Image data | .tif (TIFF 6.0) | .jpeg, .jpg, .jp2, .gif, .tif, .tiff, .raw, .psd, .bmp, .png, .pdf |
Audio data | .flac | .mp3, .aif, .wav |
Video data | .mp4, .ogv, .ogg, .mj2 | .avchd |
Documentation and scripts | .rtf, .pdf, .xhtml, .htm, .odt | .txt, .doc, .docx, .xls, .xlsx, .xml |
Chemistry data (spectroscopy) | .jdx (JCAMP) | |
Consult the annually updated Library of Congress Recommended Formats Statement for more information on recommended file formats.
File naming, organization, versioning
Before starting a project, it is important to plan file management strategies. This will help you save time later on. When developing file organization conventions, be consistent, document them, and share them with anyone who may access the data.
Directory structure
Consider creating a readme.txt file, in your project's main file folder, that gives an explanation of the directory structure and describes the contents of the major folders. See MIT's README file & folder schema example.
Best practices
- The main folder should have an informative name, for example one that combines the title, a unique identifier, and the date (year).
- Subfolders should be divided by common theme. For example:
- research activity (interviews, surveys, experiment)
- parameter assessed
- data type (images, text, databases)
- kind of material (publications, deliverables, documentation)
- Consider limiting folders to three or four levels deep, with no more than ten items in each folder.
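The depth and item-count guideline can be checked automatically. The sketch below is one possible approach, not a prescribed tool: it walks a project folder and reports any folder nested more than four levels below the root or holding more than ten items. The name `check_structure` and the two constants are assumptions made for this example.

```python
from pathlib import Path

MAX_DEPTH = 4   # folder levels allowed below the project root (assumed limit)
MAX_ITEMS = 10  # files plus subfolders allowed in any one folder (assumed limit)


def check_structure(root: Path) -> list[str]:
    """Return a warning for each folder that breaks the guidelines."""
    warnings = []
    folders = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    for folder in folders:
        # Depth is the number of path components between root and folder.
        depth = len(folder.relative_to(root).parts)
        if depth > MAX_DEPTH:
            warnings.append(f"{folder}: nested {depth} levels deep")
        item_count = sum(1 for _ in folder.iterdir())
        if item_count > MAX_ITEMS:
            warnings.append(f"{folder}: contains {item_count} items")
    return warnings
```

Calling `check_structure(Path("MyProject"))` returns an empty list when every folder is within both limits.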
Directory structure examples:
Psychology example | Marketing example |
---|---|
Source: Berenson, K.R. 2018. Managing your research data and documentation. American Psychological Association. | Source: UK Data Service |
File naming
Consider creating a readme.txt file, in your project's main file folder, that explains your file naming convention, as well as any abbreviations or codes.
Best practices
Common elements in folder or file names:
- Project or experiment name or acronym
- File creator's name/initials
- Date
- Version number
- Data characteristics. For example:
- Location/spatial coordinates
- Type of data (eg. Survey)
- Conditions (eg. Lab instrument, Solvent, Temperature, etc.)
Rules of thumb for file names:
- Keep file names as short as possible while including all necessary information.
- Do not use spaces, full stops (.), or special characters (eg. &, *%#;()!@$^~'{}[]?<>)
- Use hyphens (-), underscores (_), or camel case (FileName) to separate elements in a file name
- Dates should use consistent formatting (eg. YYYYMMDD)
- Version numbers should have leading zeros to allow for multi-digit versions (eg. v_05, v_023)
Examples of useful file names | Examples of poor file names |
---|---|
FG1_CONS_20100212.rtf (interview transcript of the first focus group with consumers, which took place on 12 February 2010) | |
Int024_AP_20080605.doc (interview with participant 024, interviewed by Anne Parsons on 5 June 2008) | Focus group consumers 12 Feb?.doc |
BDHSurveyProcedures_v04.pdf (version 4 of the survey procedures for the British Dental Health Survey) | Health&Safety Procedures1 |
Source: UK Data Service
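The naming rules above lend themselves to a small helper. This Python sketch shows one way to build a name from the common elements and to test whether an existing name follows the rules of thumb. `is_safe_name` and `build_name` are hypothetical helpers, and the element order (project, creator, date, version) is only one possible convention.

```python
import re
from datetime import date

# Letters, digits, hyphens and underscores only, followed by one extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_safe_name(file_name: str) -> bool:
    """True if the name has no spaces, extra full stops, or special characters."""
    return SAFE_NAME.fullmatch(file_name) is not None


def build_name(project: str, creator: str, when: date,
               version: int, extension: str) -> str:
    """Join the elements with underscores, using YYYYMMDD and a zero-padded version."""
    stem = "_".join([project, creator, when.strftime("%Y%m%d"), f"v{version:02d}"])
    name = f"{stem}.{extension}"
    if not is_safe_name(name):
        raise ValueError(f"Name breaks the naming rules: {name!r}")
    return name
```

With this pattern, `FG1_CONS_20100212.rtf` passes the check, while `Focus group consumers 12 Feb?.doc` and `Health&Safety Procedures1` do not.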
File renaming
Software is available for batch renaming multiple files using an automated process. Examples include Renamer (Mac) and Bulk Rename Utility (Windows).
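For those comfortable with scripting, batch renaming can also be done in a few lines of Python. The following is a rough sketch rather than a replacement for the tools named above: it replaces spaces, full stops, and special characters in file names with underscores, and by default only reports what it would change. `clean_name` and `batch_rename` are names invented for this example.

```python
import re
from pathlib import Path


def clean_name(file_name: str) -> str:
    """Replace runs of disallowed characters in the stem with one underscore."""
    path = Path(file_name)
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", path.stem).strip("_")
    # Fall back to a placeholder if nothing usable is left (assumed behaviour).
    return (stem or "unnamed") + path.suffix


def batch_rename(folder: Path, dry_run: bool = True) -> list[tuple[str, str]]:
    """Return (old, new) name pairs; rename on disk only when dry_run is False."""
    changes = []
    planned = set()
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        new_name = clean_name(path.name)
        if new_name == path.name:
            continue
        target = path.with_name(new_name)
        # Refuse to overwrite an existing file or a name already planned.
        if target.exists() or new_name in planned:
            raise FileExistsError(f"{target} would be overwritten; rename by hand")
        planned.add(new_name)
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes
```

Running `batch_rename(folder)` first and reviewing the returned pairs before calling it again with `dry_run=False` keeps the process reversible up to the last step.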
File versioning
Version control strategies depend on whether files are being accessed by multiple users and in multiple locations. Consider using these best practices to keep track of file versions.
Best practices
- Keep a copy of the 'master' data, and never edit it.
- Include version information in the file naming convention (eg. creation or modification date OR version number)
- Use tools or software to help track file versioning. This could include:
- Tools that automatically assign version numbers (eg. Electronic Lab Notebooks)
- File sharing services (eg. Dropbox, Google Docs)
- Version control software (eg. Subversion, Git)
- Version control tables (see below)
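Where dedicated version control software is not in use, the first two practices can be approximated with a short script. The sketch below assumes file names end in a tag such as `_v05`; it copies the current file to the next version number and marks the older file read-only so it is not edited by accident. `next_version_name` and `start_new_version` are hypothetical helpers.

```python
import re
import shutil
import stat
from pathlib import Path

# Assumed convention: the stem ends with "_v" followed by digits.
VERSION_TAG = re.compile(r"_v(\d+)$")


def next_version_name(file_name: str) -> str:
    """Increase the version number, keeping the leading zeros."""
    path = Path(file_name)
    match = VERSION_TAG.search(path.stem)
    if match is None:
        return f"{path.stem}_v01{path.suffix}"
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))
    return f"{path.stem[:match.start()]}_v{bumped}{path.suffix}"


def start_new_version(current: Path) -> Path:
    """Copy the current file to the next version and lock the old one."""
    new_path = current.with_name(next_version_name(current.name))
    if new_path.exists():
        raise FileExistsError(new_path)
    shutil.copy2(current, new_path)
    # Remove write permission from the older file after the copy is made.
    current.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    return new_path
```

For example, `next_version_name("HearingScreenResults_v05.csv")` gives `HearingScreenResults_v06.csv`.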
Version control table example
Source: UK Data Service
Title: | Hearing screening tests in Montreal daycares | ||
---|---|---|---|
File name: | HearingScreenResults_v05.csv | ||
Description: | Results data of 120 Hearing Screen Tests carried out in 7 daycares in Montreal during June 2017 | ||
Created by: | Kate Smith | ||
Maintained by: | Mandy Watson | ||
Created: | 04/07/2017 | ||
Last modified: | 25/11/2017 | ||
Version | Responsible | Notes | Last amended |
05 | Mandy Watson | Version 03 and 04 compared and merged by MW | 25/11/2017 |
04 | Alex Thakor | Entries checked by AT, independent from SK | 17/10/2017 |
03 | Steve Knight | Entries checked by SK | 29/07/2017 |
02 | Karen Miller | Test results 81-120 entered | 05/07/2017 |
01 | Mandy Watson | Test results 1-80 entered | 04/07/2017 |
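A version control table like the one above can also be kept as a plain CSV file alongside the data. The following sketch, which assumes the same four columns and the DD/MM/YYYY date style used in the example, appends one row per version; `log_version` and the `COLUMNS` list are illustrative names, not part of any standard.

```python
import csv
from datetime import date
from pathlib import Path

COLUMNS = ["Version", "Responsible", "Notes", "Last amended"]


def log_version(table: Path, version: int, responsible: str,
                notes: str, amended: date) -> None:
    """Append one row to the table, writing the header if the file is new."""
    is_new = not table.exists()
    with table.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow([
            f"{version:02d}",               # zero-padded, as in the example table
            responsible,
            notes,
            amended.strftime("%d/%m/%Y"),   # date style taken from the example table
        ])
```

A call such as `log_version(Path("versions.csv"), 1, "Mandy Watson", "Test results 1-80 entered", date(2017, 7, 4))` would record the first row of the example.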