Data sharing¶
What are “metadata” and why are they important?¶
Metadata are information that describes the content or structure of a dataset, or related information needed to interpret the data properly. This can include the provenance of the data (e.g. who created them, when, and how), the structure of the data (e.g. the units in which a particular variable is specified), and other annotations (e.g. ratings of data quality). Metadata are important because they allow us to interpret the data properly, and they also make the dataset easier to find, as outlined by the FAIR principles. For this reason, metadata should always accompany the data.
The FAIR Principles for open data¶
The FAIR principles describe a set of features that shared data should have in order to be maximally useful. Shared data should be:
Findable: discoverable with metadata, identifiable and locatable by means of a standard identification mechanism
Accessible: always available and obtainable; even if the data are restricted, the metadata are open
Interoperable: both parseable and understandable, allowing data exchange and reuse between researchers, institutions, organisations or countries
Reusable: sufficiently described and shared with the least restrictive licences, allowing the widest reuse possible and the least cumbersome integration with other data sources.
See here for more on how to make one’s data FAIR.
Getting started with data sharing¶
Step 1: Plan for sharing (prior to data collection)¶
In the United States, the sharing of data is governed by federal regulations, including the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule. This rule states that data that have been deidentified are not treated as “protected health information” (PHI), and their sharing is not restricted.
De-identification according to HIPAA requires the removal of a set of 18 identifiers, which are further detailed below.
Sharing of deidentified data in the US does not require explicit consent of the participant, but it is good practice to notify the subject of one’s intentions for data sharing.
An example of how to include language regarding data sharing in the consent form is available from the Open Brain Consent project.
Deidentification should be built into data collection procedures
Never use any of the HIPAA identifiers to label datasets.
Best practice is to label participants using an arbitrary identifier, with a separate key that relates these identifiers to the subject identity. This key can later be destroyed in order to ensure complete de-identification.
For example, one might label subjects using a scheme such as “study-<studyname>_sub-<subcode>” where the subject code is simply an incremental counter of subjects who have participated. Thus, the third subject to participate in the study called “stoptask” would be “study-stoptask_sub-003”.
Any numbers should be zero-padded (e.g. “sub-003” rather than “sub-3”), so that a listing in alphabetical order will match the listing in numerical order.
Facial structure should be removed from brain imaging data
Recommended defacing algorithm: https://github.com/poldracklab/pydeface
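The labeling scheme described above (arbitrary, zero-padded identifiers plus a separate key) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name make_subject_label, the default width of three digits, and the placeholder identities are all assumptions made for the example.

```python
def make_subject_label(study_name: str, subject_number: int, width: int = 3) -> str:
    """Build an arbitrary, zero-padded label such as 'study-stoptask_sub-003'."""
    return f"study-{study_name}_sub-{subject_number:0{width}d}"


# The key that links labels to identities is stored separately from the data
# and can later be destroyed. These identities are placeholders for illustration.
enrollment_order = ["Participant A", "Participant B", "Participant C"]
key = {
    make_subject_label("stoptask", number): identity
    for number, identity in enumerate(enrollment_order, start=1)
}

# Only the labels (never the identities) are used to name datasets.
labels = sorted(key)
```

Because the counter is zero-padded, sorting the labels alphabetically gives the same order as sorting them by subject number.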
Step 2: Organize your data and metadata¶
It is best to use a consistent organizational scheme for all of one’s projects that involve similar data types. This makes analysis code reusable across projects and makes it easier to find data of interest long after they were last used.
In many subfields there are existing or emerging standards that describe how to organize data; whenever such a standard exists, it should be used.
Psych-DS is an emerging standard for behavioral datasets
Brain Imaging Data Structure is a widely adopted community standard for neuroimaging data (including MRI, EEG, and other modalities)
BIDS can also be used for some simpler behavioral datasets
Neurodata Without Borders is an emerging standard for electrophysiology data
The article Nine simple ways to make it easier to (re)use your data provides some very useful suggestions on how to organize data in ways that make them easier to share and reuse:
Provide clear metadata
Provide an unprocessed form of the data along with any processed forms
Use standard file formats (preferably non-proprietary; e.g. use tab-delimited text rather than Excel format for data tables)
Use standard table formats (e.g. following the tidy data format)
Use standard formats within cells
Be consistent regarding capitalization, naming conventions, and delimiters (e.g. dashes vs underscores)
Each cell should contain only a single value
Avoid special characters
Avoid using the delimiter in the data itself (e.g. don’t use commas as delimiters if the data also includes commas)
This is a good argument in favor of tab-delimited versus comma-delimited text
Use standard formats when possible
E.g. for dates, use the standard format YYYY-MM-DD
Use good null values
Missing values should be denoted using a notation such as “NA” or “NULL”.
Be consistent in your null values across the entire dataset.
Never use a numerical value (such as 0 or -999) to denote missing or null values
Additional metadata can be provided in separate files (“sidecar” files) that provide information that cannot be stored within the data files themselves
E.g. within the BIDS framework, data files can have an associated sidecar file in JSON format that provides detailed metadata.
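Several of the suggestions above (tab-delimited text, one value per cell, YYYY-MM-DD dates, a consistent “NA” for missing values, and a JSON sidecar for extra metadata) can be combined in a short sketch. The rows, the column names, and the sidecar fields (“Description”, “Units”) are invented for illustration and only loosely modeled on the kind of sidecar described above; consult the relevant standard for the fields it actually requires.

```python
import csv
import io
import json
from datetime import date

# Made-up example rows; None marks a missing measurement.
rows = [
    {"participant_id": "sub-001", "session_date": date(2023, 1, 9), "rt_ms": 512.4},
    {"participant_id": "sub-002", "session_date": date(2023, 1, 10), "rt_ms": None},
]


def to_cell(value):
    """Convert one value to its text form for the table."""
    if value is None:
        return "NA"  # one consistent null marker, never 0 or -999
    if isinstance(value, date):
        return value.isoformat()  # YYYY-MM-DD
    return str(value)


def rows_to_tsv(rows):
    """Serialize a list of dicts as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({name: to_cell(value) for name, value in row.items()})
    return buffer.getvalue()


# Hypothetical sidecar content describing each column of the table.
sidecar = {
    "participant_id": {"Description": "Arbitrary participant label"},
    "session_date": {"Description": "Date of the session (YYYY-MM-DD)"},
    "rt_ms": {"Description": "Response time", "Units": "ms"},
}

tsv_text = rows_to_tsv(rows)
sidecar_text = json.dumps(sidecar, indent=2)
```

In practice tsv_text and sidecar_text would be written to two files that share a name and differ only in extension (for example a .tsv file and a .json file), so that the metadata travel with the data.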
Step 3: Determine the most appropriate repository¶
It is best to choose a repository that follows the FAIR principles
The repository should provide a persistent identifier for the data (such as a digital object identifier [DOI] or permanent URL [PURL]). These identifiers are meant to remain stable over time, making the data findable and citable.
The persistent identifier is provided by the repository when the data are deposited
The repository should provide the ability to denote specific versions of the data
This allows the user of the data to specify a particular version used for a particular analysis, and allows the data owner to update the dataset (e.g. to fix problems that are identified or add additional metadata or derived data)
At Stanford:
Anyone can use the Stanford Digital Repository
Anyone can deposit data to Dryad for free. Depositing data in Dryad fulfills funder and publisher data availability requirements and deposited data can be cited and found through Google Dataset Search, PubMed, and other platforms.
Users should sign up using their ORCID and verify their institutional membership using SUNet (this verification is only required once).
Outside of Stanford, there are both general-purpose and domain-specific repositories that one can use
When a domain-specific repository exists, using it can make one’s data more easily available to others in the field.
Examples include:
OpenNeuro.org (neuroimaging data)
Databrary (audio/video files)
Wordbank (children’s vocabulary data)
Talkbank (spoken language data)
DANDI (neurophysiology data)
Some funding agencies have specific requirements regarding the sharing of data:
NIMH requires submission to the NIMH Data Archive
There are also a number of general-purpose repositories that one can choose from:
One can also share data through GitHub, but this doesn’t provide the same level of FAIR support (e.g. it doesn’t provide a unique identifier)
One can configure Zenodo to create a DOI for any release on GitHub; see the section on Code Sharing for more on this.
Step 4: Determine the terms of sharing for your data¶
All shared data should be accompanied by a set of terms under which the data are released. This makes it clear to any potential user of the data what rights and obligations they have, and prevents them from needing to contact you with questions about these issues.
This is often referred to as a “license”, though the concept of licensing does not generally apply to data, because data are considered “facts” rather than “original creations” under US law.
Can the data be shared openly?
If so, then a public domain dedication (such as Creative Commons Zero, or CC0) is generally the best way to release data, as it provides the maximal potential for re-use by others.
For more details, see https://www.openaire.eu/research-data-how-to-license/
If not, then a more restrictive data use agreement must be devised
This would generally be done in collaboration with the University legal team.
Step 5: Upload the data¶
Many sites provide simple uploading via a web browser
For larger datasets, it can often be useful to use a command-line uploader when available
E.g. OpenNeuro.org provides a command line tool for uploading
Step 6 (optional): Publish a data descriptor¶
For datasets that are likely to be re-used, publishing a data descriptor can provide the data owner with a mechanism to receive academic credit (via citations) for their contribution.
A number of journals support the publication of “data papers”:
The premier outlet for data descriptors is Scientific Data
There are many others, both domain-specific and domain-general. See a list here
Frequently Asked Questions¶
How should my files be named?¶
NEVER use any identifying information in file names. See the section on deidentification below for details on the information that should be excluded.
Data should be named consistently across a project
File names should be as concise as possible, while avoiding any potential naming collisions
Additional metadata can be stored separately, in addition to source files
Consider using a key-value scheme for file names, like that used in the BIDS project
Each key is connected to its value by a dash
Key-value pairs are separated by underscores
E.g. study-1_sub-005_task-stroop_data.tsv would contain data for subject 5 in study 1 on the Stroop task.
This allows the file name to be automatically parsed
Always zero-pad numerical values
This ensures that the alphabetical and numerical listing orders are identical
Pad to one order of magnitude more than you expect
E.g. if you expect to collect data from 125 subjects, then use four digits (e.g. sub-0125).
Avoid using spaces in file names
These can make parsing of file names more difficult on some systems.
Stick with lower-case letters
Computer systems differ in whether they are case-sensitive or case-insensitive (even within the same operating system; for example, there are both case-sensitive and case-insensitive versions of the Mac OS file system).
Using upper-case letters can cause confusion: one system may treat “Data” and “data” as the same name while another does not.
For these reasons, snake case (e.g. “my_large_data_file”) is preferable to camel case (“myLargeDataFile”).
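The key-value naming scheme and the padding rule above can be sketched in a few lines. This is an illustrative sketch only: the helper names pad_width, build_filename, and parse_filename are hypothetical, and the parser assumes a single extension and a final underscore-separated part (here “data”) that is not a key-value pair.

```python
def pad_width(expected_count: int) -> int:
    """Digits needed for expected_count, plus one extra order of magnitude."""
    return len(str(expected_count)) + 1


def build_filename(entities: dict, suffix: str, extension: str) -> str:
    """Join key-value pairs with dashes and underscores, then add the suffix."""
    parts = [f"{key}-{value}" for key, value in entities.items()]
    return "_".join(parts + [suffix]) + extension


def parse_filename(filename: str) -> dict:
    """Recover the key-value pairs, suffix, and extension from a file name."""
    stem, _, extension = filename.partition(".")
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return {"entities": entities, "suffix": suffix, "extension": "." + extension}


# Expecting about 125 subjects, so pad subject numbers to four digits.
width = pad_width(125)
subject_code = f"{125:0{width}d}"

parsed = parse_filename("study-1_sub-005_task-stroop_data.tsv")
rebuilt = build_filename(parsed["entities"], parsed["suffix"], parsed["extension"])
```

Because keys, values, and pairs use fixed delimiters, the same file name can be taken apart and reassembled without loss, which is what makes automatic parsing possible.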
What does “deidentification” mean and how can I do it?¶
Deidentification refers to the removal of any information by which the identity of a human subject could potentially be recovered from the data.
The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule specifies that a dataset can be considered “deidentified” (and thus no longer treated as Protected Health Information) if it meets the following criteria:
The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
Names
All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
Telephone numbers
Fax numbers
Email addresses
Social security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers and serial numbers, including license plate numbers
Device identifiers and serial numbers
Web Universal Resource Locators (URLs)
Internet Protocol (IP) addresses
Biometric identifiers, including finger and voice prints
Full-face photographs and any comparable images
Any other unique identifying number, characteristic, or code
The researcher does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
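Two of the rules above lend themselves to a small sketch: the three-digit ZIP code rule and the aggregation of ages over 89. The code below is a hedged illustration rather than a compliance tool. The function names are hypothetical, the population table is a placeholder with invented figures (real values would come from current Census data), and the remaining identifiers are not handled at all.

```python
from typing import Mapping


def generalize_zip(zip_code: str, zip3_population: Mapping[str, int]) -> str:
    """Keep the first three ZIP digits only if that area has more than 20,000 people."""
    prefix = zip_code[:3]
    # Unknown prefixes are treated as small areas, the conservative choice.
    if zip3_population.get(prefix, 0) > 20_000:
        return prefix
    return "000"


def generalize_age(age: int) -> str:
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age >= 90 else str(age)


# Placeholder populations for two made-up three-digit areas (not real data).
zip3_population = {"123": 250_000, "456": 12_000}

large_area = generalize_zip("12345", zip3_population)
small_area = generalize_zip("45678", zip3_population)
```

Here large_area keeps its three-digit prefix, while small_area is replaced with 000 because its placeholder population is 20,000 or fewer.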
Some datasets may be identifiable even if they do not contain any of the 18 identifiers. For example, a dataset that contains only families with 4 or more children within a particular age range from a particular state could allow those individuals to be re-identified. In this case, one would need additional protections for data sharing.
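One rough way to spot this kind of risk is to count how many participants share each combination of indirectly identifying variables and to flag combinations that are rare. The sketch below is an assumption-laden illustration: the function name small_groups, the example records, the choice of variables, and the minimum group size are all invented here, and passing such a check does not by itself make a dataset safe to share.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, min_size):
    """Return combinations of values shared by fewer than min_size records."""
    counts = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < min_size}


# Made-up records: three common families and one unusually large one.
records = [
    {"state": "CA", "n_children": 2, "age_band": "30-39"},
    {"state": "CA", "n_children": 2, "age_band": "30-39"},
    {"state": "CA", "n_children": 2, "age_band": "30-39"},
    {"state": "CA", "n_children": 5, "age_band": "30-39"},
]

# The threshold of 2 is an arbitrary value chosen for this example.
flagged = small_groups(records, ["state", "n_children", "age_band"], min_size=2)
```

In this toy dataset the single family with five children is flagged, mirroring the example above in which an unusual family size within one state could allow re-identification.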