Archive and share data


Why archive and share your data?

A growing number of granting agencies and journals require data deposit. In addition, depositing your data in an appropriate repository safeguards it for reuse by your team or other researchers. It also supports the validation or replication of your research.

Here is what Emad Shihab, Associate Professor of Computer Science and Software Engineering at Concordia, has to say about data sharing:

"Sharing datasets for our research has significantly improved its impact and allowed for more transparency. Our most cited and reproduced work is work where we made our datasets publicly available. It requires some work to prepare and support these datasets, but it is well worth it and appreciated by the wider research community!"

What data to keep?

Data can be archived and preserved locally or shared in a public data repository. Note that archiving can be costly and there may not be enough space to archive everything. Researchers should carefully identify which data to preserve. Consider the following:

Do the data support published research?
Are the data likely to be reused?
Are the data unique or historically significant?
Are there funder or institutional requirements?
Are the data difficult to reproduce?
Are there any ethical issues to consider?
Are the data in support of a patent application?

Examples of data that should be kept by discipline (from Stanford University).

Best practices when preparing data for archiving and sharing

File formats	Choose long-term storage file-formats, preferably non-proprietary, to overcome software obsolescence. More information
Documentation	Add it alongside your data to make the data understandable and reusable. More information
Ownership and privacy	If sharing data, make sure that you or your organization own the data (more information on data licenses) and that all ethical requirements are followed (more information on ethics).
Data integrity	If keeping a local copy, avoid bit rot through refreshment (copy data on a new drive every 2-5 years) and replication (maintain three copies of the data, on two forms of storage with one in an external location).
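A refreshed or replicated copy only protects the data if it is still identical to the original. The sketch below shows one possible way to check this with SHA-256 checksums from Python's standard library. The function names (`sha256_of`, `build_manifest`, `compare_copies`) are illustrative assumptions, not part of any repository tool or Library service.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original: Path, copy: Path) -> list[str]:
    """List relative paths that are missing from, or differ in, the copy."""
    reference = build_manifest(original)
    candidate = build_manifest(copy)
    return sorted(
        name for name in reference
        if candidate.get(name) != reference[name]
    )
```

Running `compare_copies` after each refreshment, and against each of the three replicas, returns an empty list when every file in the original is present and unchanged in the copy.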

Preparing sensitive data for sharing

Consent forms are key to data sharing:
- Some data cannot be shared for legal or ethical reasons. However, if sharing the dataset is required, ensure that this has been stated in consent forms and cleared with the Research Ethics Unit. Find out more.
De-identification allows sharing of sensitive data:
- De-identification is the process used to remove identifying data. Identifiers can be direct, which point directly to an individual, or indirect, which point to an individual when combined with other data.
  - Examples of direct and indirect identifiers
- De-identification guidance:
  - De-identification guide (Portage)
  - De-identification guidelines for structured data (Information and Privacy Commissioner of Ontario)
  - Methods of data de-identification:
    - Anonymization (removing identifiers altogether)
    - Pseudonymization (replacing identifiers with pseudonyms or other identifiers)
Protecting sensitive species data:
- Guidance exists on how to make this type of data available without exposing species to harm. Find out more
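The two de-identification methods listed above (anonymization and pseudonymization) can be sketched for tabular records as follows. This is a minimal illustration, not a complete procedure: the column names (`name`, `email`, `participant_id`, `age_group`) and the key are hypothetical, and indirect identifiers still need to be assessed separately using the guidance listed above.

```python
import hashlib
import hmac

# Hypothetical column names, chosen only for this illustration.
DIRECT_IDENTIFIERS = {"name", "email"}
PSEUDONYMIZE = {"participant_id"}


def pseudonym(value: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a value using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record: dict[str, str], secret_key: bytes) -> dict[str, str]:
    """Drop direct identifiers and replace linking IDs with pseudonyms."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # anonymization: remove the identifier altogether
        if field in PSEUDONYMIZE:
            value = pseudonym(value, secret_key)  # pseudonymization
        cleaned[field] = value
    return cleaned


record = {
    "participant_id": "P-001",
    "name": "Jane Doe",
    "email": "jane@example.org",
    "age_group": "30-39",
}
print(deidentify(record, secret_key=b"store-this-key-separately"))
```

A keyed hash is used here so that the same participant always receives the same pseudonym, while someone without the key cannot recompute it from a known ID. The key must be stored apart from the shared dataset.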

See also:

Can I share my data? Decision tree (Portage)
Data Deposit & Access section of the Human Participant Research Data Risk Matrix (p. 8) (Portage)
Anonymisation: managing data protection risk code of practice (UK Information Commissioner's Office).
Anonymisation: Guide from the UK Data Service
McGill Data Anonymization Workshop Series 2023: recordings and slides from a workshop series providing theoretical and practical knowledge in data anonymization and de-identification of sensitive data to promote and facilitate data deposit and data sharing.

Where to share data

There are many ways that researchers can share their data. These include:

Depositing in a discipline-specific data repository

Depositing in a general purpose repository

Depositing in an institutional or recommended repository

Publishing a data paper

Criteria in selecting a data repository

Source: University of Iowa

FAIR Principles: FAIR means that data publishing platforms should enable data to be Findable, Accessible, Interoperable, and Re-usable. The FORCE11 FAIR Principles (simplified here) are:
1. To be Findable, any Data Object should be uniquely and persistently identifiable (have an identifier, such as a DOI).
2. Data is Accessible in that it can always be obtained by machines and humans, upon authorization, through a well-defined protocol.
3. Data Objects are Interoperable (i.e. interpretable by a computer, so that they can be automatically combined with other data) if metadata and data use community agreed formats, language, vocabularies, and standards.
4. Data Objects are Re-Usable if the above are met, if the data can be automatically linked or integrated with other data sources, with proper citation of the source, and have a clear machine-readable licence.
Cost: Is there a cost to depositing data? Is it ongoing? Are these costs budgeted for?
Discoverability: Are there adequate metadata fields to describe your data? Is the repository indexed by Google?
Persistent identifiers: Does the repository register your data to create a persistent identifier (e.g. a DOI)? These are necessary for citing your data.
Policies and licenses: Are data use agreements and/or licensing (Creative Commons) clearly presented, to allow depositors to state explicitly up front what uses they would be willing to allow?
Scholarly impact: Does it track data citation or download?
Certification: Repositories can obtain certification (e.g. CoreTrustSeal), which indicates how well they preserve digital content. Although certification is good to have, note that very few repositories have achieved it.

Discipline-specific data repositories

Discipline-specific or domain repositories accept datasets related to either a specific discipline (e.g. genomics) or a broad subject area (e.g. social sciences). Some repositories allow for self-archiving and provide limited or no curation service; others, like ICPSR, provide in-depth curation services to subscribing institutions (Concordia is an ICPSR member), provided that the data fit within their collection development policy.

Search for a disciplinary data repository:
- re3data.org (Registry of Data Repositories)
- PLoS ONE Recommended repositories (by discipline)
- Springer Nature Recommended repositories (by discipline)

General-purpose repositories

If a discipline-specific repository is not available, general-purpose repositories are the next best option. They typically accept a wide range of data types, and are suitable for cross-disciplinary data. Below are some examples:

Canadian general-purpose repositories

Concordia University Dataverse (from Borealis)
Description

Why should you use this repository?
- Support from Concordia Library (see "Concordia Library service offer" below)
- Available for free to all Concordia researchers
- Useful for small to medium sized datasets (files ≤ 5GB, datasets ≤ 10GB)
- Data hosted on Canadian servers
- Embargo options available

Ready to deposit? Consult the following:
- Deposit Checklist
- Deposit Quick Guide
- How to deposit data in Borealis (video)
- Concordia University Dataverse Policy
- Borealis Statement on Sensitive and Confidential Data
- Borealis Sensitive and Confidential Data Deposit Checklist

Need help?: lib-research.data@concordia.ca

Concordia Library service offer

What we offer:
- Provide guidance on the deposit process and the preparation of the dataset for deposit
- Create subDataverses
- Assign permissions to different members of a team within a subDataverse
- Review deposited datasets
- Review all metadata for completeness
- Review adequacy of file naming convention
- Review preservation friendliness of file formats
- Check to see if files can be opened
- Publish deposited datasets
- Provide workshops on preparing and depositing data in Dataverse

What we do not offer:
- Create metadata (readme files, data dictionaries, code books, catalogue metadata)
- Review data for scientific quality
- Data cleaning
- File conversions
- Data anonymization or de-identification
- Data deposit

Federated Research Data Repository (FRDR):
- Free to all Canadian researchers
- Useful for very large datasets (files greater than 5GB)
- Long-term preservation of data is assured upon data ingestion with the help of the Archivematica preservation software
- Data hosted on Canadian servers

Other commonly used general-purpose repositories

Dryad	Non-profit repository allowing a total storage space of 50GB for US$120.
Figshare	Commercial repository allowing a total storage space of 20GB for free.
Open ICPSR	Accepts social and behavioural science research data. Different levels of curation services (from none to complete) are offered at varying prices.
OSF	Open Science Framework (OSF) is a free and open source project management repository that supports researchers across their entire project lifecycle.
Zenodo	A multidisciplinary platform hosted by CERN. Accepts all research outputs from all fields of science.

Institutional or recommended repository

Institutional repositories

Depositing in discipline-specific or general-purpose repositories is encouraged, as they are generally better suited for data curation and dissemination. However, if there is no suitable discipline-specific repository for your dataset, and you do not wish to deposit in the Concordia University Dataverse repository, consider using Spectrum, Concordia University's institutional repository.

Recommended repositories

Some journals require that researchers make the data associated with their papers publicly available to facilitate verification and replication of results. These publishers may recommend a data repository or, in some cases, require that authors deposit their data in a specific repository. Note that if there is a cost to depositing data, it may be covered either by the submitter or by the publisher.

Below are examples of publisher recommended data repositories:

Nature
PLoS (including Criteria for recommended data repositories)

Data papers

Data papers describe datasets, and do not typically include any interpretation or discussion. Data papers are published either in a journal’s “Data Papers” section, or in a journal that exclusively publishes data papers (for example, see Nature’s Scientific Data).

According to Oregon State University:

"The purpose of a data journal is to provide quick access to high-quality datasets that are of broad interest to the scientific community. They are intended to facilitate reuse of the dataset, which increases its original value and impact, and speeds the pace of research by avoiding unintentional duplication of effort."

Data licenses

A license defines what others may or may not do with your data.

When sharing your data in a repository, for example, you can choose to apply a broad license that allows anyone to do whatever they like with the data. Alternatively, you can choose a narrower license that restricts use to strictly non-commercial activities and requires attribution to the data creator when the data is used.

There are two primary sources for data licenses:

Creative Commons (CC)
Open Data Commons (ODC)

If you deposit your data in Concordia's Dataverse, the default license is CC0; however, all the other CC licenses are available as well.

NOTE: Be sure you own the data! You can only publish data that you own or for which you've received permission to share. More information on data ownership and licenses.

Cite data

FORCE11's Data Citation Principles indicate that data should be considered legitimate, citable products of research and should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

A data citation should try to include the same elements as a publication citation:

Author
Publication date
Title
Publisher (this is often the archive where it is housed.)
Edition or version
Resource type (e.g. dataset or database)
Access information (a URL or other persistent identifier)

Data Citation Generator:

If you have a dataset's DOI, use CrossCite's DOI Citation Formatter to create a data citation for you based on your selected style.
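The same kind of lookup can also be scripted. The sketch below assumes the DOI content negotiation service at doi.org, which accepts an `Accept: text/x-bibliography` header with a citation style name; whether a given DOI returns a formatted citation depends on its registration agency. The function names are illustrative, and `fetch_citation` needs network access.

```python
import urllib.request


def citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a formatted citation."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )


def fetch_citation(doi: str, style: str = "apa", timeout: float = 10.0) -> str:
    """Send the request and return the citation text (requires network access)."""
    request = citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

For example, `fetch_citation("10.5284/1000389")` would request an APA-style citation for a dataset DOI. The returned text should still be checked against the style guide you are following.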

Examples:

Source: DCC

APA	Cool, H. E. M., & Bell, M. (2011). Excavations at St Peter's Church, Barton-upon-Humber [Data set]. doi:10.5284/1000389
Chicago	(Footnote) H. E. M. Cool and Mark Bell, Excavations at St Peter’s Church, Barton-upon-Humber (accessed May 1, 2011), doi:10.5284/1000389. (Bibliography) Cool, H. E. M., and Mark Bell. Excavations at St Peter’s Church, Barton-upon-Humber (accessed May 1, 2011). doi:10.5284/1000389
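When no DOI is available, the citation elements listed earlier can be assembled by hand. The following sketch, with a hypothetical `DataCitation` class, shows one way to combine them into a string shaped like the APA example above. It is an approximation of that one pattern, not a full implementation of any citation style.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    authors: str                      # already formatted, e.g. "Cool, H. E. M., & Bell, M."
    year: int                         # publication date (year only in this sketch)
    title: str
    identifier: str                   # DOI or other persistent identifier
    resource_type: str = "Data set"
    publisher: str = ""               # often the archive where the data is housed
    version: str = ""                 # edition or version, if any

    def apa_like(self) -> str:
        """Join the elements in the order used by the APA example."""
        title = self.title
        if self.version:
            title += f" (Version {self.version})"
        pieces = [f"{self.authors} ({self.year}).", f"{title} [{self.resource_type}]."]
        if self.publisher:
            pieces.append(f"{self.publisher}.")
        pieces.append(self.identifier)
        return " ".join(pieces)


example = DataCitation(
    authors="Cool, H. E. M., & Bell, M.",
    year=2011,
    title="Excavations at St Peter's Church, Barton-upon-Humber",
    identifier="doi:10.5284/1000389",
)
print(example.apa_like())
```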

Citing software:

Proper attribution and credit of software can also help reproducibility, collaboration and reuse. For more on how to cite software, consult the following:

Katz DS, Chue Hong NP, Clark T et al. Recognizing the value of software: a software citation guide [version 2; peer review: 2 approved]. F1000Research 2021, 9:1257 (https://doi.org/10.12688/f1000research.26932.2)

Research data metrics

Research datasets can be cited like other research outputs, such as articles and books. Find out how to measure the impact of research data on the Library's Research data metrics guide.
