Data sharing

Why should I share my data?

There are many reasons to share one’s data:

  1. Collaboration and connection.  Sharing data allows one to become a visible member of the open science community, and can lead to new collaborations and recognition of one’s contributions by the broader community.

  2. Improving reproducibility. Without access to the raw data, it is impossible to fully reproduce the results from a published study.  Data sharing allows researchers to confirm published findings and test their generalizability (for example, using different analysis methods).  In addition, by allowing researchers to combine shared datasets, data sharing improves the statistical power of research, which is directly related to its reproducibility.

  3. Receiving credit for data generation.  It is increasingly common for the generation and sharing of high-value datasets to be viewed as an important scientific contribution in its own right.  This is particularly evident in the advent of “data papers”, which are publications that describe a shared dataset and can provide citation credit for shared data.

  4. Responsibility to research participants.  One of the fundamental principles of human subjects research outlined in the Belmont Report is the principle of beneficence: “maximize possible benefits and minimize possible harms”.  Because failing to share data necessarily reduces their possible benefits, there is an argument that researchers are ethically obligated to share data unless the risks to participant confidentiality and privacy cannot be adequately reduced.

  5. Responsibility to research funders.  Most research is funded by taxpayers or private foundations, who expect researchers to maximize the potential benefit of that investment.  When researchers fail to share data effectively, they limit the potential impact of the funding by preventing others from using those data to generate additional knowledge or test new hypotheses.  In addition, some funding agencies (such as the National Institute of Mental Health and the Wellcome Trust) require the sharing of data from research that they fund.

  6. Improving power through data aggregation.  There are many cases in which an individual researcher cannot feasibly obtain enough data to robustly test a particular scientific hypothesis; this is particularly the case for rare diseases, or for finding effects such as genetic associations that require very large samples.  In such cases, the sharing of data across sites can allow a larger consortium of researchers to combine data in order to more robustly ask particular scientific questions.

Why doesn’t everyone share their data?

There are a number of reasons why researchers may not wish to share data.

  1. Added effort and time.  Organizing one’s data for sharing can require a significant time commitment, depending on how they were initially organized.

    • This concern can be mitigated by following good organizational practices from the beginning of a project.

  2. Lack of incentives.  Many researchers feel that they will not receive suitable credit for sharing their data (for example, in the context of hiring or promotion), compared to other activities that they could instead engage in.

    • This concern is mitigated to some degree by the changing landscape of publication, including the growing prevalence of “data papers”.

  3. Potential to be “scooped”.  Some researchers worry that if their data are shared, other researchers may be able to ask the same questions that they wish to ask, and thus rob them of priority in publishing those findings.

    • “Scooping” is certainly possible, but in domains where data sharing is common, it has occurred relatively rarely in practice.

  4. Concerns about errors being found.  Researchers sometimes worry that sharing data could open them up to the possibility of others finding errors in their research.

  5. Concerns about “weaponization”.  In some highly politicized domains of science (such as climate science), politically motivated actors may use shared data in an attempt to discredit published work that contradicts their agenda.

What are “metadata” and why are they important?

Metadata refers to information that describes the content or structure of a dataset, along with any related information needed to properly interpret the data.  This can include information about the provenance of the data (e.g. who created them, when, and how), the structure of the data (for example, the units in which a particular variable is specified), and other annotations (for example, ratings of data quality).  Metadata are important because they allow the data to be interpreted properly, and also make the dataset findable, as outlined in the FAIR principles.  They should always be shared alongside the data.

The FAIR Principles for open data

The FAIR principles describe a set of features that shared data should have in order to be maximally useful. Shared data should be:

  • Findable: discoverable with metadata, identifiable and locatable by means of a standard identification mechanism

  • Accessible: always available and obtainable; even if the data are restricted, the metadata are open

  • Interoperable: both parseable and understandable, allowing data exchange and reuse between researchers, institutions, organisations or countries

  • Reusable: sufficiently described and shared with the least restrictive licences, allowing the widest reuse possible and the least cumbersome integration with other data sources.

See here for more on how to make one’s data FAIR.

Getting started with data sharing

Step 1: Plan for sharing (prior to data collection):

  • In the United States, the sharing of data is governed by federal regulations, including the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule.  This rule states that data that have been deidentified are not treated as “protected health information” (PHI), so their sharing is not restricted.

  • De-identification according to HIPAA requires the removal of a set of 18 identifiers, which are further detailed below.

  • Sharing of deidentified data in the US does not require explicit consent of the participant, but it is good practice to notify the subject of one’s intentions for data sharing.

    • An example of how to include language regarding data sharing in the consent form is available at the Open Brain Consent.

  • Deidentification should be built into data collection procedures

    • Never use any of the HIPAA identifiers to label datasets.

    • Best practice is to label participants using an arbitrary identifier, with a separate key that relates these identifiers to the subject identity. This key can later be destroyed in order to ensure complete de-identification.

      • For example, one might label subjects using a scheme such as “study-<studyname>_sub-<subcode>” where the subject code is simply an incremental counter of subjects who have participated. Thus, the third subject to participate in the study called “stoptask” would be “study-stoptask_sub-003”.

      • Any numbers should be zero-padded (i.e. “sub-003” rather than “sub-3”), so that a listing in alphabetical order will match the listing in numerical order.

    • Facial structure should be removed from brain imaging data
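
The labeling scheme above can be sketched in a few lines of Python. The function names and the key-file format here are hypothetical; the key file linking labels to identities must be stored securely, kept separate from the shared data, and can later be destroyed for complete de-identification:

```python
import csv

def make_subject_id(study: str, counter: int, width: int = 3) -> str:
    """Build an arbitrary, zero-padded subject label containing no HIPAA identifiers."""
    return f"study-{study}_sub-{counter:0{width}d}"

def record_key(key_path: str, subject_id: str, identity: str) -> None:
    """Append one row to a SEPARATE key file relating labels to identities.

    This file is never shared; destroy it to ensure complete de-identification.
    """
    with open(key_path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([subject_id, identity])

# The third participant in the "stoptask" study:
print(make_subject_id("stoptask", 3))  # study-stoptask_sub-003
```

Note that zero-padding is built into the label format, so an alphabetical file listing matches the numerical order of participants.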

Step 2: Organize your data and metadata

  • It is best to use a consistent organizational scheme for all of one’s projects that involve similar data types.  This will allow reusability of analysis code, as well as allowing one to more easily find data of interest long after they have been used.

  • In many subfields there are existing or emerging standards that describe how to organize data; whenever such a standard exists, it should be used.

    • Psych-DS is an emerging standard for behavioral datasets

    • Brain Imaging Data Structure is a widely adopted community standard for neuroimaging data (including MRI, EEG, and other modalities)

      • BIDS can also be used for some simpler behavioral datasets

    • Neurodata Without Borders is an emerging standard for electrophysiology data

  • The article Nine simple ways to make it easier to (re)use your data provides some very useful suggestions on how to organize data in ways that make them easier to share and reuse:

    • Provide clear metadata

    • Provide an unprocessed form of the data along with any processed forms

    • Use standard file formats (preferably non-proprietary; e.g. use tab-delimited text rather than Excel format for data tables)

    • Use standard table formats (e.g. following the tidy data format)

    • Use standard formats within cells

      • Be consistent regarding capitalization, naming conventions, and delimiters (e.g. dashes vs underscores)

      • Each cell should contain only a single value

      • Avoid special characters

      • Avoid using the delimiter in the data itself (e.g. don’t use commas as delimiters if the data also includes commas)

        • This is a good argument in favor of tab-delimited versus comma-delimited text

    • Use standard formats when possible

      • e.g. for dates, use the standard format YYYY-MM-DD (ISO 8601)

    • Use good null values

      • Missing values should be denoted using a notation such as “NA” or “NULL”.

      • Be consistent in your null values across the entire dataset.

      • Never use a numerical value (such as 0 or -999) to denote missing or null values

  • Additional metadata can be provided in separate files (“sidecar” files) that provide information that cannot be stored within the data files themselves

    • E.g. within the BIDS framework, data files can have an associated sidecar file in JSON format that provides detailed metadata.
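
The recommendations above can be sketched as follows: a small tidy, tab-delimited data table with consistent “NA” null values, plus a JSON sidecar carrying metadata that the table itself cannot hold. The file names, column names, and units are made up for illustration:

```python
import csv
import json

rows = [
    {"subject_id": "sub-001", "reaction_time": 0.43, "accuracy": 0.91},
    # Missing value denoted "NA" -- never a numerical code such as -999
    {"subject_id": "sub-002", "reaction_time": "NA", "accuracy": 0.88},
]

# Tab-delimited text: non-proprietary, one value per cell, consistent naming
with open("task_data.tsv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["subject_id", "reaction_time", "accuracy"], delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(rows)

# JSON "sidecar" file describing each column (provenance, units, etc.)
sidecar = {
    "reaction_time": {"Description": "Mean response time", "Units": "s"},
    "accuracy": {"Description": "Proportion of correct responses"},
}
with open("task_data.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```

Using tabs rather than commas as the delimiter also sidesteps the problem of commas appearing inside data values.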

Step 3: Determine the most appropriate repository

  • It is best to choose a repository that follows the FAIR principles

  • The repository should provide a persistent identifier for the data (such as a digital object identifier [DOI] or permanent URL [PURL]).  These identifiers are meant to remain stable over time, making the data findable and citable.

    • The persistent identifier is provided by the repository when the data are deposited

  • The repository should provide the ability to denote specific versions of the data

    • This allows the user of the data to specify a particular version used for a particular analysis, and allows the data owner to update the dataset (e.g. to fix problems that are identified or add additional metadata or derived data)

  • At Stanford:

    • Anyone can use the Stanford Digital Repository

    • Anyone can deposit data to Dryad for free. Depositing data in Dryad fulfills funder and publisher data availability requirements and deposited data can be cited and found through Google Dataset Search, PubMed, and other platforms.

      • Users should sign up using their ORCID and verify the institutional membership using SUNet (which is only required once).

  • Outside of Stanford, there are both general-purpose and domain-specific repositories that one can use

    • When a domain-specific repository exists, using it can make one’s data more easily available to others in the field.

    • Examples include:

  • Some funding agencies have specific requirements regarding the sharing of data:

  • There are also a number of general-purpose repositories that one can choose from:

  • One can also share data through GitHub, but this doesn’t provide the same level of FAIR support (e.g. it doesn’t provide a persistent identifier)

    • One can configure Zenodo to create a DOI for any release on GitHub; see the section on Code Sharing for more on this.

Step 4: Determine the terms of sharing for your data

  • All shared data should be accompanied by a set of terms under which the data are released.  This makes it clear to any potential user of the data what rights and obligations they have, and prevents them from needing to contact you with questions about these issues.

    • This is often referred to as a “license”, though the concept of licensing does not generally apply to data, since under US law they are considered “facts” rather than “original creations”.

  • Can the data be shared openly?

    • If so, then a public domain dedication (also known as the Creative Commons Zero or CC0) is generally the best way to release data, as it provides the maximal potential for re-use by others.

    • For more details, see https://www.openaire.eu/research-data-how-to-license/

    • If not, then a more restrictive data use agreement must be devised

      • This would generally be done in collaboration with the University legal team.

Step 5: Upload the data

  • Many sites provide simple uploading via a web browser

  • For larger datasets, it can often be useful to use a command-line uploader when available

Step 6 (optional): Publish a data descriptor

  • For datasets that are likely to be re-used, publishing a data descriptor can provide the data owner with a mechanism to receive academic credit (via citations) for their contribution.

  • A number of journals support the publication of “data papers”:

    • The premier outlet for data descriptors is Scientific Data

    • There are many others, both domain-specific and domain-general.  See a list here

Frequently Asked Questions

How should my files be named?

  • NEVER use any identifying information in file names.  See section on deidentification below for more on the particular details that should be excluded.

  • Data should be named consistently across a project

  • File names should be as concise as possible, while avoiding any potential naming collisions

    • Additional metadata can be stored separately, in addition to source files

  • Consider using a key-value scheme for file names, like that used in the BIDS project

    • Each key is joined to its value by a dash

    • Key-value pairs are separated by underscores

      • E.g. study-1_sub-005_task-stroop_data.tsv would contain data for subject 5 in study 1 on the Stroop task.

      • This allows the file name to be automatically parsed

  • Always zero-pad numerical values

    • This ensures that the alphabetical and numerical listing orders are identical

    • Pad to one order of magnitude more than you expect

      • E.g. if you expect to collect data from 125 subjects, then use four digits (e.g. sub-0125).

  • Avoid using spaces in file names

    • These can make parsing of file names more difficult on some systems.

  • Stick with lower-case letters

    • Computer systems differ in whether they are case-sensitive or case-insensitive (even within the same operating system; for example, there are both case-sensitive and case-insensitive versions of the Mac OS file system).

    • Using upper case letters can cause confusion, e.g. whereby one system would treat “Data” and “data” as the same while another would not.

    • For these reasons, snake case (e.g. “my_large_data_file”) is preferable to camel case (“myLargeDataFile”).
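
One benefit of the key-value scheme described above is that file names can be parsed automatically. A minimal sketch, assuming BIDS-style names in which lower-case keys are joined to values by dashes and pairs are separated by underscores (the function name is hypothetical):

```python
import re

def parse_keyvalue_name(filename: str) -> dict:
    """Parse a BIDS-style file name into a dict of key-value pairs.

    E.g. "study-1_sub-005_task-stroop_data.tsv" yields keys study, sub, task.
    The trailing suffix ("data.tsv") contains no dash and is ignored.
    """
    stem = filename.rsplit("/", 1)[-1]  # drop any directory path
    return dict(re.findall(r"([a-z0-9]+)-([a-zA-Z0-9]+)", stem))

print(parse_keyvalue_name("study-1_sub-005_task-stroop_data.tsv"))
# {'study': '1', 'sub': '005', 'task': 'stroop'}
```

Because the values are zero-padded and lower-case, the parsed labels also sort identically whether treated as strings or numbers.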

What does “deidentification” mean and how can I do it?

  • Deidentification refers to the removal of any information by which the identity of a human subject could potentially be recovered from the data.

  • The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule specifies that a dataset can be considered “deidentified” (and thus no longer treated as Protected Health Information) if it meets the following criteria:

    • The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:

      1. Names

      2. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:

        • The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and

        • The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

      3. All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

      4. Telephone numbers

      5. Fax numbers

      6. Email addresses

      7. Social security numbers

      8. Medical record numbers

      9. Health plan beneficiary numbers

      10. Account numbers

      11. Certificate/license numbers

      12. Vehicle identifiers and serial numbers, including license plate numbers

      13. Device identifiers and serial numbers

      14. Web Universal Resource Locators (URLs)

      15. Internet Protocol (IP) addresses

      16. Biometric identifiers, including finger and voice prints

      17. Full-face photographs and any comparable images

      18. Any other unique identifying number, characteristic, or code

    • The researcher does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

  • Some datasets may be identifiable even if they do not contain any of the 18 identifiers. For example, a dataset that contains only families with 4 or more children within a particular age range from a particular state could allow those individuals to be re-identified. In this case, one would need additional protections for data sharing.
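
A minimal sketch of table-level deidentification, assuming each record is a Python dict; the column names here are hypothetical, the identifier list is deliberately incomplete, and a real pipeline must be checked against the full list of 18 HIPAA identifiers (and reviewed with your IRB):

```python
# Hypothetical direct-identifier columns; a real pipeline must cover all 18
# HIPAA identifiers, not just these.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "mrn", "street_address"}

def deidentify_record(record: dict) -> dict:
    """Drop direct identifiers and generalize dates and ages per HIPAA Safe Harbor."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Dates directly related to an individual: keep only the year
    if "admission_date" in out:  # e.g. "2019-03-14"
        out["admission_year"] = out.pop("admission_date")[:4]
    # Ages over 89 must be aggregated into a single "90 or older" category
    if isinstance(out.get("age"), int) and out["age"] > 89:
        out["age"] = "90+"
    return out

record = {"name": "Jane Doe", "age": 93, "admission_date": "2019-03-14", "diagnosis": "X"}
print(deidentify_record(record))
# {'age': '90+', 'diagnosis': 'X', 'admission_year': '2019'}
```

As the example in the text shows, removing these fields is necessary but not always sufficient: rare combinations of remaining attributes can still permit re-identification.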

How can I maximize the impact of my shared dataset?

  • Publish a data descriptor (see Step 6 above)

  • Provide a top level description of the dataset, covering information such as:

    • Population that the data were acquired from (e.g., human adults)

    • Number of subjects / samples

    • Modalities of data (e.g., structural and functional MRI)

    • Demographics

  • Provide a README file

    • High-level description of the dataset tree

    • Brief description of the data

  • Provide contact information

    • Allows data users to clarify any questions or discrepancies

  • Structure your metadata so that it can be indexed by Google Dataset Search

What if I discover an error in my shared dataset?

  • The need to fix errors highlights the importance of using a repository that allows versioning of the data.

  • Some repositories allow one to simply upload a revised version of the data

    • Consider adding a description of the error and its resolution to the README file.

    • Alternatively, include a CHANGES file that details all changes made to the data for each version.

  • For repositories that involve manual submission, reach out promptly to the repository hosting the data, describing what the error is and how you would like to proceed with remedying it.