Archive and share data
Backing up your data is different from archiving it. Backups can be discarded after a certain amount of time, whereas archiving preserves a file as-is at the end of a project and serves as a static, final record. (Source: Oregon State University)
What data to keep?
Data can be archived and preserved locally or shared in a public data repository. Note that archiving can be costly and there may not be enough space to archive everything. Researchers should carefully identify which data to preserve. Consider the following:
- Do the data support published research?
- Are the data likely to be reused?
- Are the data unique or historically significant?
- Are there funder or institutional requirements?
- Are the data difficult to reproduce?
Examples, by discipline, of data that should be kept (from Stanford University).
Best practices when preparing data for archiving and sharing
|File formats||Choose file formats suited to long-term storage, preferably non-proprietary ones, to guard against software obsolescence.|
|Documentation||Add documentation alongside your data to make the data understandable and reusable.|
|Ownership and privacy||If sharing data, make sure that:|
|Data integrity||If keeping a local copy, guard against bit rot through refreshment (copying the data to a new drive every 2-5 years) and replication (maintaining three copies of the data, on two forms of storage, with one copy in an external location).|
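One way to check that a refreshed or replicated copy still matches the original is to compare checksums. The sketch below is only an illustration, not a prescribed procedure: it assumes the data live in ordinary directories, uses SHA-256 as the checksum, and the function names (`sha256_of`, `build_manifest`, `compare_copies`) are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original: Path, copy: Path) -> list[str]:
    """List relative paths that are missing, extra, or different in the copy."""
    expected = build_manifest(original)
    actual = build_manifest(copy)
    return sorted(
        name
        for name in expected.keys() | actual.keys()
        if expected.get(name) != actual.get(name)
    )
```

For example, `compare_copies(Path("project_data"), Path("/mnt/backup/project_data"))` (both paths are placeholders) returns an empty list when the two copies are identical, and otherwise the names of the files that need attention.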
Preparing sensitive data for sharing
Some data cannot be shared for legal or ethical reasons. However, if sharing the dataset is required, ensure that this has been stated in consent forms and cleared with the Research Ethics Unit.
It may also be possible to retain multiple versions of the data: one for public release that has been de-identified, and one that is available on a highly restricted basis.
De-identification is the process used to remove identifying data. Identifiers can be direct, which point directly to an individual, or indirect, which point to an individual when combined with other data (see examples from Stanford University).
Below are two methods of data de-identification. (Source: UBC)
- Anonymization (removing identifiers altogether)
- Pseudonymization (replacing identifiers with pseudonyms or other identifiers)
Further reading:
- Anonymisation: managing data protection risk code of practice (UK Information Commissioner's Office)
- Anonymisation (UK Data Service)
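The two de-identification methods listed above can be sketched on a single record as follows. This is a minimal illustration under stated assumptions: the field names (`name`, `email`, `participant_id`) are hypothetical, and a keyed hash (HMAC-SHA-256) is just one possible way to generate pseudonyms, not a method the sources prescribe. Removing direct identifiers does not by itself deal with indirect identifiers, which still need to be reviewed separately.

```python
import hashlib
import hmac

# Hypothetical direct identifiers; adjust to the dataset at hand.
DIRECT_IDENTIFIERS = frozenset({"name", "email"})


def anonymize(record: dict, identifiers=DIRECT_IDENTIFIERS) -> dict:
    """Anonymization: drop the identifier fields altogether."""
    return {k: v for k, v in record.items() if k not in identifiers}


def pseudonymize(record: dict, key: bytes, id_field: str = "email",
                 identifiers=DIRECT_IDENTIFIERS) -> dict:
    """Pseudonymization: replace identifiers with a keyed-hash pseudonym.

    The same key and id_field value always yield the same pseudonym, so
    records from one person stay linkable. Anyone holding the key can
    recompute the mapping, so the key must be stored securely.
    """
    value = str(record[id_field]).encode("utf-8")
    token = hmac.new(key, value, hashlib.sha256).hexdigest()[:16]
    cleaned = anonymize(record, identifiers)
    cleaned["participant_id"] = token  # hypothetical output field name
    return cleaned
```

With a record such as `{"name": "A", "email": "a@example.org", "age": 30}`, `anonymize` keeps only the `age` field, while `pseudonymize` keeps `age` and adds a stable `participant_id`. The pseudonymized version, together with the securely stored key, could serve as the restricted-access copy, and the anonymized version as the public release.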