
Data Organization and Management

Research data are like the water of science: When they stop flowing and dry up, everything withers and ultimately dies. In this chapter we discuss the principles and practices for good research data management and organization.

Principles of data management

Research data vary in value, and in some cases can be highly valuable. For example, the Large Hadron Collider at CERN, which produced the data supporting the discovery of the Higgs Boson in 2012, has annual computing costs alone estimated at about $286 million, so the loss of the data resulting from those computations would be enormously costly. And in cases where unique events are detected, such as gravitational wave detections from LIGO or telescope images of transient cosmic events, the data cannot be recreated if they are lost, making them immensely valuable. For this reason, scientific agencies have long focused on developing frameworks for research data management. In the US, the National Institute of Standards and Technology (NIST) has developed a Research Data Framework that provides researchers with a detailed set of best practices for research data management.

The FAIR Principles

The FAIR Principles Wilkinson et al., 2016 are a set of guiding principles for the effective sharing of research objects, including but not limited to research data. The FAIR acronym refers to four features of research objects that are essential to effective sharing:

Findable

Data are findable if they can reasonably be found by another researcher, usually via standard web searches or database queries. Making data findable involves:

Accessible

Data are accessible if they can be accessed via clear procedures once they have been found. Making data accessible involves:

Note that “accessible” doesn’t necessarily imply “open”; in many cases, access to the data themselves may require additional data usage agreements between institutions. Accessibility in the FAIR sense simply requires that there is a clear process by which the data can be accessed.

Interoperable

Data are interoperable if they can be integrated with other data or processed automatically after they have been accessed. Making data interoperable primarily involves:

Reusable

Data are reusable if the requirements for reuse are clearly specified. Making data reusable involves:

The FAIR principles are relatively abstract, in the sense that they don’t provide specific guidance about what FAIR means in any particular domain. However, there are numerous resources that can help implement these principles, such as RDMKit and the FAIR Cookbook, both generated by the European ELIXIR organization.

The data lifecycle

An important concept in research data management is the data lifecycle, which describes the role of data management in each of the different stages of a research project. Figure 1 shows an example of how the RDMkit project outlines the stages of the data lifecycle. This figure highlights the fact that data management should be part of the discussion at each stage in a project. In this chapter we will discuss several of the stages in the data lifecycle in detail, though we leave in-depth discussion of data processing and analysis workflows to a later chapter.


Figure 1: A depiction of the data management lifecycle, from the RDMkit project by ELIXIR, CC-BY-4.0.

Planning a study

It is essential to think about data management from the inception of a study, in order to ensure that early decisions don’t lead to pain down the road. Here are some examples of problems that might arise:

These are just a few of the problems that can arise, making it important to have a plan for data management at the onset of a research study.

Data Management Plans

Research funding agencies are increasingly requiring data management plans (DMPs) for grant submissions. In the US, both the National Science Foundation (NSF) and National Institutes of Health (NIH) require a DMP to accompany grant proposals, and the European Research Council (ERC) requires that funded projects submit a DMP within their first six months. Creating a data management plan in advance of a project can be very helpful, as it helps encourage the early integration of methods that can help make data management and sharing easier as the project matures. We will not go into detail regarding these plans, which will vary in their requirements depending on the funding agency. However, there are online tools available to assist with generation of a DMP:

I recommend always creating a data management plan at the start of a project, even if it’s not required by your institution or funding agency.

Collecting data

Once data collection starts, it is essential to ensure that the data are properly saved and readable. One should never assume that the code responsible for saving the data has actually worked properly; instead, the data should be separately loaded and checked to make sure that the data are being correctly stored and are properly readable. One good practice is to develop a checklist for new studies to ensure that the data are being collected properly; the details of such a checklist will depend on the specific area of study, but an example from our lab can be found here.

Storing data

There are several different ways that one can store research data, which vary in their ease of use, speed, reliability, and resilience. One major distinction is between the use of file systems (either physical or cloud systems) or database systems.

Before discussing different options, it is useful to lay out the important considerations regarding different data storage solutions. These are the dimensions across which different options will vary:

It’s also important to point out that most projects end up using multiple storage solutions for different portions of the data lifecycle.

File system storage

A file system is an organized system for naming and locating computer files on a storage system such as a hard disk. Readers of this book will undoubtedly be familiar with the file systems present on Mac, Windows, or UNIX/Linux systems, which represent a hierarchical tree of folders and files. Here is an example of the file tree for the source code folder in the book project:

➤  tree -L 2 src
src
└── BetterCodeBetterScience
    ├── __init__.py
    ├── __pycache__
    ├── bug_driven_testing.py
    ├── claudecode
    ├── constants.py
    ├── data_management.ipynb
    ├── distance.py
    ├── distance_testing
    ├── docker-example
    ├── escape_velocity.py
    ├── formatting_example.py
    ├── formatting_example_ai.py
    ├── formatting_example_ruff.py
    ├── incontext_learning_example.ipynb
    ├── language_model_api_prompting.ipynb
    ├── llm_utils.py
    ├── modify_data.py
    ├── my_linear_regression.py
    ├── simple_testing.py
    ├── simpleScaler.py
    ├── test_independence.py
    └── textmining

We often use spatial metaphors to describe file systems; we say that a file is “inside” a folder, or that we are going to “move” a file from one folder to another. Working effectively and efficiently with data stored on a file system requires a solid knowledge of the tools one can use to interact with it. In the examples throughout the book I will focus on POSIX-compliant operating systems like MacOS and Linux, but most of the same concepts also apply to other operating systems such as Windows.

Storage on a PC/laptop hard drive

The simplest way to store data is on a hard drive on a researcher’s personal computer workstation or laptop. While this is easy and relatively cheap for smaller datasets, it is also fraught for numerous reasons:

For these reasons, I generally recommend to researchers in my lab that they should never rely solely on their own computer as the storage solution for research data.

Storage on a network drive

Research data is often stored on network drives. These can vary from a network-attached storage system dedicated to one or more users within a research group, to large-scale network drives managed by an institutional computing center. One common feature of network storage is the use of redundant drive systems, such as RAID (Redundant Array of Independent Disks). These systems combine multiple individual hard drives in ways that provide some degree of redundancy, such that the system can withstand the loss of one or more individual disks (depending on the setup) with no data loss. However, it is critically important to remember that while RAID does provide some degree of fault-tolerance, it does not provide the disaster recovery benefits of a true backup.

Many researchers run and manage their own RAID systems for their group’s use, either attached to a single workstation or to a network. This can be a cost-effective solution for large data storage, especially in situations where institutional data storage resources are not available. However, I think that the apparent robustness of RAID systems can provide a false sense of security to their users. Take the most common RAID setup for redundant storage, RAID 5, in which the system is robust to the failure of one of its hard drives. When a drive fails, the system enters a “degraded” mode, often providing a notice to the user such as a flashing red light or beeping sounds. If this failure goes unnoticed, or the system administrator puts off fixing it, the failure of a second drive during degraded mode or during the rebuilding of the array after replacing the first failed drive (which can often take many hours) can lead to complete data loss. Similarly, if the rebuilding of the array fails (for example, due to power loss during the rebuild or an unrecoverable error in reading from the other drives), this can compromise the data. Safe use of RAID arrays requires consistent attention (including email notifications of failure if possible) and a strong backup strategy.

Most research institutions now offer large network-attached storage systems for research data, often connected directly to high-performance computing systems. We have used systems like these for our research data for more than 15 years, and I personally would never go back to running my group’s own RAID system (which we did for years beforehand). Foremost, the system administration and hardware management resources of an institutional computing center will almost always outstrip those of most research groups. These large systems will have monitoring and repair procedures in place to ensure against data loss, and in the 15 years that we have used these systems (at Stanford and the University of Texas), we have never experienced data loss due to hardware failure. However, they are still liable to potential disasters. These systems are also highly performant, providing parallel access to the data through high-speed interconnections with the compute system.

Backing up one’s data from a large network drive is a great idea in theory, but in our experience it has often been either impossible or too costly, given the many terabytes of research data that we store on the systems. Given the relatively low likelihood of failure, we have adopted a more risk-tolerant strategy to big data storage:

In this way, we have redundant copies of the code and original data that could be used to recreate the processed data if necessary. This is a risky strategy, but the more risk-averse alternative of continuously backing up our entire 250TB partition would be cost-prohibitive.

Cloud drives

Cloud drives, such as Dropbox or Google Drive, have become very popular for storage and sharing of data. I personally keep all of my code and documents synced to Dropbox from my laptop, and the file recovery capabilities of Dropbox have saved me from myself more than once after accidentally deleting the wrong files. I also regularly share files with other researchers using the Dropbox file sharing features. Because of the potential impact of loss of my laptop, I also keep a “hot spare” laptop that is constantly kept in sync with my primary laptop via Dropbox. Thus, cloud drives are essential for my own research and productivity workflow. However, cloud drives on their own are unlikely to be a primary solution for data storage with large datasets, for several reasons including:

In addition, many institutions have specific restrictions regarding the use of specific cloud drives, especially with regard to private or protected information.

Cloud object storage

An increasingly common storage option, especially for very large datasets, is the use of cloud-based object stores, such as Amazon’s Simple Storage Service (S3) or Google Cloud Storage. In some ways object storage is similar to a standard file system, in that it allows the storage of arbitrary types of files, which can be retrieved using a key that functions like a file path on a file system. However, there are also important differences between object storage and file systems. Most importantly, cloud object stores are accessed via web API calls rather than by operations on a local storage system. Cloud object stores have several features that can make them very attractive for research data storage:

These systems are most effective when they are accessed directly using computing resources hosted by the same cloud provider. If they are located within the same datacenter, then the network connectivity can be substantially faster. It rarely makes sense to access data directly on a cloud object store from a local computing system, both because of the potentially high cost of reading and writing data from these systems and because of the relatively long latency and low bandwidth of connections between a local system and a cloud provider.

Database storage

In some areas of science, such as genomics, it is common to store data using database systems rather than files on a filesystem. A database system is a software system that stores data records and allows the user to query the records based on specific features and to add, modify, or delete records. A database system can run locally on one’s own computer, or can be accessed remotely via the Internet; most cloud computing providers offer hosted database systems, providing access to storage space that is limited only by one’s budget.

There are many potential benefits to the use of database storage that will be outlined below. However, one important factor in the choice of database versus flat file storage is what software tools will be used to analyze the data. If the analyses are primarily being performed using custom code in Python or R, then it is relatively easy to either retrieve information from a database or load data from a flat file. However, in some fields (including the field of neuroimaging where I work) it is common to use software packages that are built to process flat files, which strongly drives researchers in the field towards that approach.

I will first briefly outline several of the most common forms of database systems, and then show an example that employs each of them.

Relational databases

The best known form of database is the relational database, which organizes tabular data into a set of tables with well-defined relationships between them. They also enable queries using a query language, of which Structured Query Language (SQL) is a well-known example. For me, SQL has always been one of those things that I use just infrequently enough that I never actually learn it. Fortunately, LLMs are very good at translating natural language into SQL queries, lowering the barrier of entry for researchers who want to try out database storage.
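
To make this concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are hypothetical, chosen only to illustrate how records are created and queried with SQL.

```python
import sqlite3

# an in-memory database for illustration; a real project would use a file path
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (id INTEGER PRIMARY KEY, site TEXT, stage INTEGER)"
)
conn.executemany(
    "INSERT INTO samples (site, stage) VALUES (?, ?)",
    [("H1", 1), ("H1", 2), ("H2", 1)],
)
conn.commit()

# query the records: count the samples at each site
for row in conn.execute("SELECT site, COUNT(*) FROM samples GROUP BY site"):
    print(row)
conn.close()
```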

ACID

One important characteristic of relational databases is that they generally implement mechanisms to ensure data integrity and reliability. These are often referred to as the ACID properties:

The adherence of relational database systems to these principles helps ensure the integrity of scientific data, in comparison to the use of flat files which do not necessarily achieve these goals.

Analytic databases

There is a particular kind of relational database known as an analytic database that is specialized for operations that work across many rows in the database, rather than the focus on individual records in a standard relational database. One widely-used analytic database in the Python ecosystem is DuckDB, which supports very fast operations on large datasets, and integrates well with Pandas and other tools. Unlike traditional relational database systems, it doesn’t require any specialized server setup.
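
As a rough sketch of what this looks like in practice (assuming a recent version of the duckdb Python package), DuckDB can run SQL directly over an in-memory pandas data frame; the data frame here is hypothetical.

```python
import duckdb
import pandas as pd

# a hypothetical data frame to query
df = pd.DataFrame({"site": ["H1", "H2", "H3"], "freq": [46, 37, 25]})

# DuckDB can reference the local pandas data frame by name inside the query
result = duckdb.sql("SELECT site, freq FROM df WHERE freq > 30").df()
print(result)
```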

NoSQL databases

While relational databases were the only game in town for many years, there are now a number of other kinds of database, collectively referred to as NoSQL databases because they use non-relational data models (like documents, graphs, or key-value pairs) rather than the tables with fixed schemas that define a standard relational database. Each of these can be very useful for specific problems that match the database’s strengths. Some, but not all, NoSQL databases are ACID compliant. It’s important to ensure that one has the right safeguards in place when using a non-compliant database system.

Here I provide a brief overview of several popular forms of NoSQL databases; see the Appendix to this chapter for a full-stack example that combines several of these.

Document stores

A document store is basically what it sounds like: a system into which one can dump documents and then query them. I think of this as in some ways the opposite of a SQL database. In the SQL database, most of the work comes in designing the database schema, which will determine up front how the data are represented; after that, querying is fairly straightforward. In a document store, one can insert documents with varying structure into the database without the need for a predefined schema. The hard work in a document store comes in figuring out how to structure queries and indexes effectively, especially when the structure of the data varies. For most of the tasks where I have used databases I have chosen document stores over relational databases because of the flexibility that they offer.
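
As an illustration of the schema-free style of a document store, here is a minimal sketch using the lightweight, file-based TinyDB package (a hypothetical choice for illustration; larger projects often use a server-based document store such as MongoDB). Note that the two documents do not share the same fields.

```python
from tinydb import TinyDB, Query

# a file-based document store; documents need not share a schema
db = TinyDB("experiments.json")  # hypothetical file name
db.insert({"subject": "001", "task": "stroop", "accuracy": 0.95})
db.insert({"subject": "002", "task": "stroop", "rt_ms": 540, "notes": "pilot"})

# query on whatever fields the documents happen to have
Doc = Query()
print(db.search(Doc.task == "stroop"))
```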

Graph databases

A graph database is built to efficiently store and query graph-structured data. These are data where the primary features of interest for querying are the relationships between entities, rather than the entities themselves. Scientific examples could include social network relationships, protein-protein interactions, or connections between neurons or brain areas. Graph databases are particularly good at finding multi-step relationships within the graph, which are much more difficult to find using a relational database or document store. A commonly used graph database is Neo4j, which has its own query language called Cypher that is specifically designed for queries on graph structure.

Vector databases

A relatively recent entry into the database field is the vector database, which is optimized for finding similar numerical vectors. These have become essential in the context of AI, because they can be used to quickly find similar items that are embedded in a vector space, typically using neural networks. These items can include text documents, images, molecular structures, or any other kind of data that can be embedded in a vector space. Vector databases differ from the other types in that they can return ranked similarity scores rather than only a discrete set of matches, which makes them well suited for analyses involving similarity-based search.
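
The core operation that a vector database optimizes can be sketched with a brute-force numpy example: given a query embedding, rank all stored embeddings by cosine similarity. Real vector databases use approximate indexes to make this fast at scale; the embeddings below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical embeddings: 1000 items in a 128-dimensional vector space
embeddings = rng.normal(size=(1000, 128))
query = rng.normal(size=128)

# cosine similarity between the query and every stored vector
sims = embeddings @ query / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
)

# indices of the five most similar items, in ranked order
top5 = np.argsort(sims)[::-1][:5]
print(top5, sims[top5])
```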

Managing original data

By original data I mean data that were obtained by the researcher from another source, such as from a recording device or by manual recording. These are the data that, if lost, would constitute an irretrievable loss of the scientific record. As such, these data should be protected in a way that is different from derivative data that are obtained via processing of the original data.

Immutable storage

It is important to ensure that the original data are not modified, either accidentally or intentionally. This can be achieved by setting the file permissions to read-only, though it is important to note that a user with superuser privileges could still make changes to the data. For this reason, it is also important to store information that allows each file to be validated against its original. This can be easily achieved by computing a checksum for each file and storing it separately. A checksum is a mathematical function of the file contents that changes if any of the contents change, and thus can serve as a “fingerprint” for the file:

➤  echo "file contents" > file1.txt
➤  echo "file content" > file2.txt
➤  diff file1.txt file2.txt
1c1
< file contents
---
> file content

➤  md5sum file1.txt
081404b3d2ae5bf599add15b7445ac07  file1.txt
➤  md5sum file2.txt
186e4cee4e00635b35be4236193f33fb  file2.txt

There are various types of checksums, such as MD5 (used in the example above) or SHA-1, any of which should be suitable for validating file identity. However, the different methods produce different checksums, so it’s important to document which method was used to create them.
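
The same idea can be applied programmatically; here is a minimal sketch using Python’s hashlib to compute checksums for a hypothetical folder of data files, which could be stored separately and recomputed later to verify that the files are unchanged.

```python
import hashlib
from pathlib import Path

def file_checksum(path, method="sha256"):
    """Compute a checksum of a file's contents, reading in chunks."""
    h = hashlib.new(method)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# hypothetical data folder; store the results separately from the data
checksums = {
    p.name: file_checksum(p)
    for p in Path("original_data").glob("*")
    if p.is_file()
}
```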

Backup

Disasters can occur that can render data irretrievable, from a major flood or earthquake to a cup of coffee spilled onto a laptop. It is essential to maintain a backup of the original data for disaster recovery purposes. This backup should be located on a system that is geographically separate from the main storage, such as a cloud server.

As noted above, many researchers store their data using RAID systems, which can provide some degree of resilience against disk failure. Depending on the RAID configuration, these systems can survive the failure of one or more individual disks without data loss. However, storage on a RAID system is not a suitable replacement for backup, as these systems can still fail (e.g. due to a fire or flood that damages multiple disks).

Data access

Access to research data within a research group should operate by the Principle of Least Privilege, which is a general approach to computer security that states that any user should have only enough privileges to perform their intended actions, and no more. While this is most relevant to multi-user systems, it is often relevant even to systems with only a single user. With regard to data access, this principle has two important implications. First, any particular dataset should be accessible only to users who have permission and need to use those data. Second, if data are meant to be accessed by a user but not modified by that user, then the user should have read-only access to the data.

Even when a user may have a need to modify data, it often makes sense to set the data as read-only, so that any modifications require an explicit permissions change before the modifications can be made. This can help prevent accidental changes or data deletion, even on a single-user system.
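
As a minimal sketch of how one might enforce this from Python (the folder name here is hypothetical), write permission can be removed from each file; as noted above, a superuser could still override this, which is why checksums remain important.

```python
import stat
from pathlib import Path

# remove write permission for owner, group, and others on each data file
for path in Path("original_data").glob("*"):  # hypothetical folder
    if path.is_file():
        mode = path.stat().st_mode
        path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```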

Data formats and file types

There is a wide range of different data formats and file types used across science. Here I will cover some of the most widely used formats and file types, realizing that there are many areas of science that use highly specialized formats/file types that I can’t cover in depth here.

Tabular data

Tabular data are loosely defined as data stored in rows and columns, as in a spreadsheet. A data frame is a particular representation of tabular data, in which each column in the dataset has a label and each row has an index value that refers to that row. The packages supporting data frames (such as pandas in Python) generally provide a set of operations that can be performed on the data frame, such as filtering, sorting, merging, or pivoting.
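
For readers less familiar with data frames, here is a brief sketch of those operations in pandas, using small hypothetical data frames.

```python
import pandas as pd

measurements = pd.DataFrame(
    {"id": [1, 2, 3], "height": [170, 180, 175], "weight": [70, 80, 75]}
)
groups = pd.DataFrame(
    {"id": [1, 2, 3], "group": ["control", "patient", "control"]}
)

tall = measurements[measurements.height > 172]   # filtering rows
by_weight = measurements.sort_values("weight")   # sorting
merged = measurements.merge(groups, on="id")     # merging on a shared key
```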

Long, wide, and tidy tabular data

There are multiple ways to organize data within a tabular dataset, reflecting what each column and row refers to. This distinction is often referred to as long versus wide data, though in reality there is a spectrum of organization between these extremes. Wide data generally refers to data where each row refers to a single observational unit (such as a single site, person, or planet), and each column refers to a different variable measured for that unit. For example, let’s say that we had measurements of height, weight, and blood pressure from three individuals. A wide representation of these data would have one row per individual, with an identifier column for the individual and separate columns for each of the measurements:

| id | height | weight | blood_pressure |
|----|--------|--------|----------------|
| 1  | 170    | 70     | 120            |
| 2  | 180    | 80     | 130            |
| 3  | 175    | 75     | 125            |

Long data generally refers to data where the different variables measured for a unit are spread across rows. In this case, separate columns specify the variable being measured, the value of that variable, and the identity of the unit being measured.

| id | measurement    | value |
|----|----------------|-------|
| 1  | height         | 170   |
| 2  | height         | 180   |
| 3  | height         | 175   |
| 1  | weight         | 70    |
| 2  | weight         | 80    |
| 3  | weight         | 75    |
| 1  | blood_pressure | 120   |
| 2  | blood_pressure | 130   |
| 3  | blood_pressure | 125   |

A related concept is the notion of tidy data, popularized by Hadley Wickham, a leader in the R language community. While it is often associated with long data, the concept of tidy data actually refers to the meaning of the data rather than its shape. Tidy data has three features, as specified by Wickham, 2014:

It’s easiest to understand these concepts by looking at some examples of datasets that are not tidy, following the examples laid out by Wickham, 2014.

Column headers are values, not variable names

Sometimes data are spread across columns where each column refers to a different value of the variable. For example, the following table shows cancer pathology data from three hospitals, with different columns quantifying the number of samples falling into each of four different tumor stages:

| site | Stage1 | Stage2 | Stage3 | Stage4 |
|------|--------|--------|--------|--------|
| H1   | 46     | 27     | 38     | 32     |
| H2   | 37     | 48     | 31     | 27     |
| H3   | 25     | 37     | 33     | 23     |

There are really three different variables represented in this table: site, Stage, and Frequency. What we really want is three columns representing those three variables. We can achieve this using the melt function from pandas:

df_tidy = pd.melt(df, id_vars=["site"], 
    var_name="Stage", value_name="Frequency")
# make stage an integer
df_tidy.Stage = df_tidy.Stage.str.replace("Stage", "").astype(int)
print(df_tidy.to_markdown(index=False))
| site | Stage | Frequency |
|------|-------|-----------|
| H1   | 1     | 46        |
| H2   | 1     | 37        |
| H3   | 1     | 25        |
| H1   | 2     | 27        |
| H2   | 2     | 48        |
| H3   | 2     | 37        |
| H1   | 3     | 38        |
| H2   | 3     | 31        |
| H3   | 3     | 33        |
| H1   | 4     | 32        |
| H2   | 4     | 27        |
| H3   | 4     | 23        |

Multiple variables are stored in one column

This pattern takes the previous one a step further, by defining columns based on the values of more than one variable. For example, let’s say that there are data for both lung and prostate cancer:

| site | Stg1Lng | Stg2Lng | Stg3Lng | Stg4Lng | Stg1Prs | Stg2Prs | Stg3Prs | Stg4Prs |
|------|---------|---------|---------|---------|---------|---------|---------|---------|
| H1   | 44      | 32      | 21      | 28      | 48      | 24      | 44      | 34      |
| H2   | 22      | 30      | 22      | 45      | 26      | 49      | 31      | 32      |
| H3   | 48      | 40      | 26      | 33      | 46      | 33      | 24      | 25      |

In this example, each value column represents a combination of stage and type of cancer. We can tidy this by first melting the data frame, and then splitting the combined column names into separate variables for Stage and Cancer type:

# tidy this, first by melting
df_tidy = pd.melt(df, id_vars=["site"], 
    var_name="Stage_Cancer", value_name="Freq")
# then split Stage_Cancer into two columns
df_tidy[["Stage", "Cancer"]] = df_tidy.Stage_Cancer.str.extract(r'Stg(\d)(\w{3})')
del df_tidy["Stage_Cancer"]
# make Stage an integer
df_tidy.Stage = df_tidy.Stage.astype(int)
# reorder columns
df_tidy = df_tidy[["site", "Stage", "Cancer", "Freq"]]
print(df_tidy.to_markdown(index=False))
| site | Stage | Cancer | Freq |
|------|-------|--------|------|
| H1   | 1     | Lng    | 44   |
| H2   | 1     | Lng    | 22   |
| H3   | 1     | Lng    | 48   |
| H1   | 2     | Lng    | 32   |
| H2   | 2     | Lng    | 30   |
| H3   | 2     | Lng    | 40   |
| H1   | 3     | Lng    | 21   |
| H2   | 3     | Lng    | 22   |
| H3   | 3     | Lng    | 26   |
| H1   | 4     | Lng    | 28   |
| H2   | 4     | Lng    | 45   |
| H3   | 4     | Lng    | 33   |
| H1   | 1     | Prs    | 48   |
| H2   | 1     | Prs    | 26   |
| H3   | 1     | Prs    | 46   |
| H1   | 2     | Prs    | 24   |
| H2   | 2     | Prs    | 49   |
| H3   | 2     | Prs    | 33   |
| H1   | 3     | Prs    | 44   |
| H2   | 3     | Prs    | 31   |
| H3   | 3     | Prs    | 24   |
| H1   | 4     | Prs    | 34   |
| H2   | 4     | Prs    | 32   |
| H3   | 4     | Prs    | 25   |

Variables are stored in both rows and columns

We could also have some variables denoted by their own column with others split across columns:

| site | Cancer | Stage1 | Stage2 | Stage3 | Stage4 |
|------|--------|--------|--------|--------|--------|
| H1   | Lng    | 44     | 32     | 21     | 28     |
| H1   | Prs    | 48     | 24     | 44     | 34     |
| H2   | Lng    | 22     | 30     | 22     | 45     |
| H2   | Prs    | 26     | 49     | 31     | 32     |
| H3   | Lng    | 48     | 40     | 26     | 33     |
| H3   | Prs    | 46     | 33     | 24     | 25     |

Here we can melt the data frame to collect the Stage columns:

# gather to make it tidy
df_both_tidy = pd.melt(df_both, id_vars=["site", "Cancer"], 
    var_name="Stage", value_name="Frequency")
# make Stage an integer
df_both_tidy.Stage = df_both_tidy.Stage.str.replace("Stage", "").astype(int)
print(df_both_tidy.to_markdown(index=False))
| site | Cancer | Stage | Frequency |
|------|--------|-------|-----------|
| H1   | Lng    | 1     | 44        |
| H1   | Prs    | 1     | 48        |
| H2   | Lng    | 1     | 22        |
| H2   | Prs    | 1     | 26        |
| H3   | Lng    | 1     | 48        |
| H3   | Prs    | 1     | 46        |
| H1   | Lng    | 2     | 32        |
| H1   | Prs    | 2     | 24        |
| H2   | Lng    | 2     | 30        |
| H2   | Prs    | 2     | 49        |
| H3   | Lng    | 2     | 40        |
| H3   | Prs    | 2     | 33        |
| H1   | Lng    | 3     | 21        |
| H1   | Prs    | 3     | 44        |
| H2   | Lng    | 3     | 22        |
| H2   | Prs    | 3     | 31        |
| H3   | Lng    | 3     | 26        |
| H3   | Prs    | 3     | 24        |
| H1   | Lng    | 4     | 28        |
| H1   | Prs    | 4     | 34        |
| H2   | Lng    | 4     | 45        |
| H2   | Prs    | 4     | 32        |
| H3   | Lng    | 4     | 33        |
| H3   | Prs    | 4     | 25        |

Single type of observational unit spread across multiple tables

Sometimes data for a single type of observational unit might be spread across multiple data frames, such as a separate data frame for each hospital in our example. To fix this we can simply merge the data frames by concatenating them:

df_merged = pd.concat([df1, df2, df3], ignore_index=True)

Tabular file formats

The most common tabular file formats are comma-separated value (CSV) and tab-separated value (TSV) files. Both have the benefit of being plain text, so their contents can be easily examined without any special software. I generally prefer to use tabs rather than commas as the separator (or delimiter), primarily because tabs more naturally accommodate longer pieces of text that may include commas. Such text can also be represented in CSV, but it requires additional processing to escape or quote the embedded commas so that they are not interpreted as delimiters.
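
In pandas the choice between the two formats comes down to the sep argument; here is a small sketch (with a hypothetical file path):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "note": ["simple text", "text, with a comma"]})

# tab-separated output avoids having to quote or escape the embedded comma
df.to_csv("/tmp/example.tsv", sep="\t", index=False)
df_roundtrip = pd.read_csv("/tmp/example.tsv", sep="\t")
```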

Text file formats like CSV and TSV are nice for their ease of interpretability, but they are highly inefficient for large data compared to optimized file formats, such as the Parquet format. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see the management notebook for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:

➤  du -sk /tmp/brain_tabular.*
19464	/tmp/brain_tabular.csv
3804	/tmp/brain_tabular.parquet

When we look at the amount of time needed to load these files, we see an even stronger edge for the Parquet format. Because the loading times can vary due to other activity on the system, we load each 100 times to get an average load time:

import time
# time loading of each format
# load 100 times to get average loading time of each format

nreps = 100
start = time.time()
for _ in range(nreps):
    df_csv = pd.read_csv('/tmp/brain_tabular.csv')
end = time.time()
csv_time = (end - start)/nreps
print(f"CSV load time: {csv_time:.4f} seconds")

start = time.time()
for _ in range(nreps):
    df_parquet = pd.read_parquet('/tmp/brain_tabular.parquet')
end = time.time()
parquet_time = (end - start)/nreps
print(f"Parquet load time: {parquet_time:.4f} seconds")
print(f'ratio {csv_time/parquet_time:.2f}')
CSV load time: 0.0719 seconds
Parquet load time: 0.0082 seconds
ratio 8.77

Here we see that loading the CSV file takes almost 9 times as long as the Parquet file. For a single file of this size this is not a meaningful difference in times, but for projects involving many files or much larger files the difference in loading times could become a real game-changer.

Russ’s First Law of Tabular Data Management

“Don’t use spreadsheets to manage scientific data.”

Spreadsheet software such as Microsoft Excel is commonly used by researchers for all sorts of data management and processing operations. Why are spreadsheets problematic?

I will occasionally use Microsoft Excel to examine a data file, but I think that spreadsheet tools should almost never be used as part of a scientific data workflow.

Multidimensional array data

Many forms of data are represented as multidimensional arrays. For example, many forms of microscopy data are represented as either two-dimensional (slice) or three-dimensional (volume) arrays, while dynamic imaging modalities are often represented as three- or four-dimensional data, with the first two or three dimensions representing space and the final dimension reflecting time. Within Python these are generally processed using numpy, which efficiently processes large multidimensional arrays. As an example, I loaded a four-dimensional brain template, where the first three dimensions are spatial and the final dimension reflects 512 different components that define a probabilistic atlas of the brain. After loading the data into a numpy array, we can see its shape:

data.shape
(91, 109, 91, 512)

In this case, the first three dimensions are spatial (representing the left/right, front/back, and up/down axes of the brain) and the final dimension indexes the 512 atlas components. We can also use the imshow function from matplotlib.pyplot to view a two-dimensional slice of the image for a particular component:

import matplotlib.pyplot as plt
# view the slice at z=50 for component 5
plt.imshow(data[:, :, 50, 5], cmap='gray')

[Output: a grayscale two-dimensional slice of the brain image]

Multidimensional array file formats

A common file format for array data in Python is the native Numpy format, which is known by its extension, .npy. This file format has a number of benefits for Python users:

However, the .npy format also has a number of drawbacks for scientific data:

There are several other commonly used standard file formats for array data; we will focus on HDF5 and Zarr. HDF5 is a longstanding format for the storage of large datasets, and is supported by nearly all programming languages. It has built-in support for compression, and allows access to specific chunks of the data without loading the entire dataset. However, many researchers (at least within the Python ecosystem) are moving towards the Zarr format, which stores data in a set of smaller chunk files rather than a single large file as in HDF5. Zarr has several advantages over HDF5:

Let’s use the data from above as an example. This is a dataset that is highly sparse; after thresholding to remove very small values, only about 0.1% of the values are nonzero. This means that we should be able to get a high degree of compression for this dataset, given the redundancy of the data. Let’s first save the data to a .npy file and look at its (uncompressed) size:

np.save('/tmp/difumo.npy', data)
!du -sm /tmp/difumo.npy
3526	/tmp/difumo.npy

That’s about 3.5 gigabytes. Now let’s save it to HDF5:

import h5py
with h5py.File('/tmp/difumo.h5', 'w') as f:
    f.create_dataset('difumo', data=data, compression='gzip')
!du -sm /tmp/difumo.h5
10	/tmp/difumo.h5

Due to HDF5’s built-in compression, the data file is about 350 times smaller! We can do the same with Zarr:

import zarr
zarr_data = zarr.open('/tmp/difumo.zarr', mode='w', shape=data.shape, dtype=data.dtype)
zarr_data[:] = data
!du -sm /tmp/difumo.zarr
5	/tmp/difumo.zarr

This shows that Zarr obtains even more compression than HDF5, each using its default compression method; note that it might be possible to get better compression using custom compression methods for each. We can also compare the time needed to load each of the files; because of the relatively longer loading time for these files, we only perform 10 repetitions in this example:

nreps = 10

for ext in ['npy', 'h5', 'zarr']:
    start_time = time.time()
    filename = f'/tmp/difumo.{ext}'
    if ext == 'npy':
        for _ in range(nreps):
            data_loaded = np.load(filename)
    elif ext == 'h5':
        for _ in range(nreps):
            with h5py.File(filename, 'r') as f:
                data_loaded = f['difumo'][:]
    elif ext == 'zarr':
        for _ in range(nreps):
            zarr_data = zarr.open(filename, mode='r')
            data_loaded = zarr_data[:]
    end_time = time.time()
    avg_load_time = (end_time - start_time) / nreps
    print(f"Average loading time for {ext}: {avg_load_time:.6f} seconds")
Average loading time for npy: 0.451627 seconds
Average loading time for h5: 3.138907 seconds
Average loading time for zarr: 0.745648 seconds

This shows that Zarr is slightly slower than the native numpy file load, but much faster than HDF5.

Unless you have a strong need for HDF5 compatibility, I would suggest using Zarr files to store large binary data such as matrices.

Symmetrical matrices

It’s not uncommon to work with symmetrical matrices, such as correlation matrices. Since the upper triangle and lower triangle of a symmetric matrix are simply transposed versions of one another, there is no need to save both - we can simply save either the upper or lower triangle. Depending on the application, this may be a useful way to save space and loading time.
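
Here is a minimal numpy sketch of this idea, saving only the upper triangle of a correlation matrix and reconstructing the full matrix on load (the matrix and file path are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
corr = np.corrcoef(rng.normal(size=(100, 5)))   # a 100 x 100 symmetric matrix

# save only the upper triangle (including the diagonal)
iu = np.triu_indices(corr.shape[0])
np.save("/tmp/corr_upper.npy", corr[iu])

# reconstruct the full matrix later
n = corr.shape[0]
restored = np.zeros((n, n))
restored[iu] = np.load("/tmp/corr_upper.npy")
restored = restored + restored.T - np.diag(np.diag(restored))
assert np.allclose(restored, corr)
```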

Network/graph data

A graph is a representation that includes a set of nodes along with a set of edges that represent connections between those nodes. In some cases the edges are unweighted (binary), such as whether two individuals are friends on a social network. In other cases they are weighted, such as the number of common friends that two people have on a social network. Graphs also vary in being either undirected, meaning that the relationship between two nodes is bidirectional, or directed, meaning that the relationship has a direction. For example, friendship on a social network is an undirected relationship, whereas following on a social network is a directed relationship, since one person can follow the other, vice versa, or both.

One way to represent a graph is through a set of adjacencies, or connections. Here is an example of a small social network, in which we generate a graph within the networkx package using a set of adjacencies:

import networkx as nx

friends = [
    ('Bill', 'Sally'),
    ('Bill', 'Mark'),
    ('Bill', 'Elise'),
    ('Mark', 'Elise'),
    ('Mark', 'Lisa')
]
G = nx.Graph()
G.add_edges_from(friends)
G.edges
EdgeView([('Bill', 'Sally'), ('Bill', 'Mark'), ('Bill', 'Elise'), ('Mark', 'Elise'), ('Mark', 'Lisa')])

We can view this as a graph using the plotting functions in networkx:

[Figure: a drawing of the friendship graph, with a node for each person and an edge for each friendship]

Another way to represent a graph is via an adjacency matrix, which numerically represents the relations between nodes in the graph. We can generate this from the networkx graph:

adj_matrix = nx.adjacency_matrix(G).todense()
labels = list(G.nodes)
adj_df = pd.DataFrame(adj_matrix, index=labels, columns=labels)
print(adj_df.to_markdown())
|       | Bill | Sally | Mark | Elise | Lisa |
|-------|------|-------|------|-------|------|
| Bill  | 0    | 1     | 1    | 1     | 0    |
| Sally | 1    | 0     | 0    | 0     | 0    |
| Mark  | 1    | 0     | 0    | 1     | 1    |
| Elise | 1    | 0     | 1    | 0     | 0    |
| Lisa  | 0    | 0     | 1    | 0     | 0    |

Graph data file formats

There are a number of different file formats for graph data, used variously across different software packages. For example, the networkx package supports 13 different graph formats! The decision about which format to use will likely depend on the specific software packages that you plan to use. I have had good luck with both GraphML and Pajek formats.

For large networks that are dense (i.e. networks where most connections are nonzero) or weighted (i.e. the edges have continuous weights) it might make more sense to save the network as an adjacency matrix, making sure to also save the associated node labels. This allows taking advantage of the efficiency of formats like Zarr, though it will incur the overhead of having to load the adjacency matrix into the graph engine.

Specialized data formats

Many subfields of science have developed specific file formats for the data in their domain, such as:

There is always a tension between using a domain-specific standard versus a general standard; the choice for any particular project will depend heavily on which software packages are being used for data analysis and what data formats they support, as well as the efficiency of the file formats.

Data organization schemes

It is essential to use a consistent data organization scheme for one’s research data. This is obvious when the data are shared with other researchers, but even if the data will never be shared with anyone else, good organization is essential when one looks back at one’s own data in the future. Thus, good data organization is a gift to your future self!

In this section we discuss data organization. The most important principle of data organization is that the scheme should be easily understood and consistently applied. If a standard scheme exists in one’s field of research, I would strongly suggest using that scheme, or at least adapting it to one’s local requirements. A second important principle is that file and folder names should be machine-readable. Increasingly we want to use automated tools to parse large datasets, and a machine-readable organization scheme (as I discuss below) is essential to doing this effectively.

File granularity

One common decision in data management is whether to save data in many smaller files or in fewer larger files. The right answer depends in part on how the data will be accessed. If we only need to access a small portion of the data and we can easily determine which file to open to obtain it, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the data management notebook there is an example where we create a large (10000 x 100000) matrix of random numbers and save it either to a single file or to a separate file for each row. When loading these data, loading the single file is about 5 times faster than loading the individual files.

Another consideration about the number of files has to do with storage systems that are commonly used on high-performance computing systems. On these systems, it is common to have separate quotas for total space used (e.g. in terabytes) as well as for the number of inodes, which are structures that store information about files and folders on a UNIX filesystem. Thus, generating many small files (e.g. millions) can sometimes cause problems on these systems. For this reason, we generally err on the side of generating fewer larger files versus more smaller files when working on high-performance computing systems.

Data file/folder naming conventions

From my standpoint, the most important consideration for naming files and folders is the ability to automatically parse the file/folder names. While there are many possible ways to do this, I prefer the approach used in the Brain Imaging Data Structure (BIDS), which our group was involved in developing Gorgolewski et al., 2016. This is a standard for organizing a wide range of brain imaging data types, but the strategy behind the standard is applicable to almost any scientific data type. The basic idea is to embed a set of key-value pairs in the name, along with a suffix that defines the data type and a file type extension. The schema looks like this:

<key>-<value>_<key>-<value>_suffix.extension

This is useful because it is very easy to automatically parse such a file name. For example, let’s say we have a file called ‘sub-001_sess-1A_desc-Diffusion_fa.nii.gz’. We can easily parse file names like this as follows:

filename = 'sub-001_sess-1A_desc-Diffusion_fa.nii.gz'

def split_filename(filename):
    extension = '.'.join(filename.split('.')[1:])
    name = filename.split('.')[0]
    key_values = {k:v for k,v in (item.split('-') for item in name.split('_')[:-1])}
    key_values['suffix'] = name.split('_')[-1]
    return extension, key_values

extension, key_values = split_filename(filename)
print(key_values)
{'desc': 'Diffusion', 'sess': '1A', 'sub': '001', 'suffix': 'fa'}

This is very useful because it allows one to easily query a large set of files for particular key-value pairs, and also allows one to easily parse the key-value pairs for a particular file.

It’s worth noting that using a naming scheme like this requires strict attention to naming hygiene. In particular, it’s essential to ensure that the delimiter characters (“-” and “_”) don’t accidentally get used within the values. For example, if one were using an analysis called “IS-RSA”, using this for the description (e.g. “sub-001_sess-1A_desc-IS-RSA_corr.zarr”) would cause file parsing to fail.
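
One simple safeguard is to validate entity keys and values before they are used to build a file name; here is a hypothetical check (not part of the BIDS specification itself) that only allows alphanumeric characters, so that the delimiters remain unambiguous.

```python
import re

def validate_entities(entities: dict) -> None:
    """Raise an error if any key or value would break key-value parsing."""
    for key, value in entities.items():
        for part in (key, str(value)):
            if not re.fullmatch(r"[A-Za-z0-9]+", part):
                raise ValueError(f"invalid entity component: {part!r}")

validate_entities({"sub": "001", "sess": "1A", "desc": "Diffusion"})  # passes
# validate_entities({"desc": "IS-RSA"})  # would raise ValueError
```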

Metadata

Metadata refers to “data about data”, and generally is meant to contain the information that is needed to interpret a dataset. In principle, someone who obtains a dataset should be able to understand and reuse the data using only the metadata provided alongside the dataset. There are many different types of metadata that might be associated with a study, and it is usually necessary to decide how comprehensive to be in providing detailed metadata. This will often rely upon the scientific expertise and judgment of the researcher, to determine which particular metadata would be essential for others to usefully interpret and reuse the data.

An important concept in metadata is the ontology. In the context of bioinformatics, an ontology is a structured representation of the entities that exist in a domain (defined by a controlled vocabulary) and the relationships between those entities. One of the best known examples is the Gene Ontology, which represents classes of biological entities including Molecular Functions, Cellular Components, and Biological Processes. As an example, Figure 2 shows a hierarchical depiction of a specific cellular component.


Figure 2: An example of a Gene Ontology graph for the entity “node of Ranvier”, which is a component of a neuron. Obtained from https://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0033268.

Ontologies are very useful for specifying metadata, because they allow us to know exactly what a particular entry in the metadata means, and thus allow us to establish links between equivalent entities across datasets. For example, let’s say that a researcher wants to query a database for datasets related to insulin signaling in pancreatic beta cells in Type II diabetes, and that there are three relevant datasets in the database. Without an ontology, each of the teams might use different terms to refer to these cells (such as “pancreatic beta cells”, “insulin-producing cells”, and “islet beta cells”), making it difficult to link the datasets. However, if each of the datasets were to include metadata linked to a specific ontology (in this case, the identifier CL:0000169 from the Cell Ontology, which refers to “type B pancreatic cell”), then it becomes much easier to find and link these datasets. A broad range of ontologies is now available for nearly every scientific domain; the BioPortal project provides a tool to search across a wide range of existing ontologies.

Metadata file formats

An important feature of metadata is that it needs to be machine-readable, meaning that it is provided in a structured format that can be automatically parsed by a computer. Common formats are Extensible Markup Language (XML) and JavaScript Object Notation (JSON). JSON is generally simpler and more human-readable, but it doesn’t natively provide the ability to define attributes for particular entries (such as the units of measurement) or to link to ontologies. An extension of JSON known as JSON-LD (JSON for Linked Data) provides support for the latter by allowing links to controlled vocabularies.

For example, let’s say that I wanted to represent information about an author (myself) in JSON, which I might do like this:

{
  "name": "Russell Poldrack",
  "affiliation": "Stanford University",
  "email": "russpold@stanford.edu"
}

Now let’s say that someone else wanted to search across datasets to find researchers from Stanford University. They would have no way of knowing that I used the term “affiliation” as opposed to “organization”, “institution”, or other terms. We could instead represent this using JSON-LD, which is more verbose but allows us to link to a vocabulary (in this case schema.org) that defines these entities, by providing a @context tag:

{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Russell Poldrack",
  "affiliation": {
    "@type": "Organization",
    "name": "Stanford University"
  },
  "email": "russpold@stanford.edu"
}

Data documentation

While metadata is generally meant to be used by computers, it is also important to provide human-readable documentation for a dataset, so that other researchers (or one’s future self) can understand and reuse the data successfully. There are two forms of documentation that can be important to provide.

Data dictionaries

A data dictionary provides information about each of the variables in a dataset. These are meant to be human readable, though it can often be useful to share them in a machine-readable format (such as JSON) so that they can also be used in programmatic ways. A data dictionary includes information such as:

For example, a study of immune system function in human participants might include the following in its data dictionary:

| Variable Name | Data Type | Allowable Values         | Description                           |
|---------------|-----------|--------------------------|---------------------------------------|
| age           | Integer   | 0-120                    | Age of the participant in years       |
| gender        | String    | M, W, O                  | Participant’s self-identified gender  |
| crp           | Numeric   | 0.1-50.0, -90, -98, -99  | C-reactive protein level (mg/L)       |
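
The same information can be stored in a machine-readable form; here is one possible JSON rendering of part of the data dictionary above (the exact schema is a hypothetical choice; some communities define their own standard formats).

```python
import json

data_dictionary = {
    "age": {
        "type": "integer",
        "allowable_values": "0-120",
        "description": "Age of the participant in years",
    },
    "crp": {
        "type": "numeric",
        "allowable_values": "0.1-50.0, -90, -98, -99",
        "description": "C-reactive protein level (mg/L)",
    },
}

# write alongside the dataset so it can be parsed programmatically
with open("data_dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)
```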

Codebooks

A codebook is meant to be a more human-friendly description of the content of the dataset, focusing on how the data were collected and coded. It often includes a detailed description of each variable that is meant to help understand and interpret the data. For the example above, the codebook might include the following:

Variable Information

Measurement and Coding

Collection Protocol and Provenance

It is essential to generate data dictionaries and codebooks upon the generation of the dataset; otherwise important details may be lost.

Provenance

Provenance refers to metadata about the history of processes and inputs that gave rise to a particular file. Tracking provenance is essential to ensure that one knows exactly how a particular file was created. This includes:

Tracking provenance is non-trivial. The World Wide Web Consortium (W3C) has developed a framework called PROV, which defines a model for the representation of provenance information. This framework provides an overview of the many features of provenance that one might want to record for information shared online. The PROV data model defines three main concepts:

In addition, the model defines a set of relationships between these concepts, as seen in Figure 3:


Figure 3: A depiction of the PROV data model entities and relations. Copyright © [2013] World Wide Web Consortium.

This data model highlights the breadth of information that needs to be represented in order to accurately record provenance.

There are several different ways to track provenance in practice, which vary in their complexity, comprehensiveness, and ease of use. We will discuss this in much more detail in a later chapter on workflows.

Handling of sensitive data

Researchers in some fields, particularly those who work with data obtained from human subjects, often handle data that are sensitive, meaning that the data may require a higher degree of security and/or additional procedures to protect the privacy and confidentiality of the research subjects.

Data security

Sensitive data often require additional protections from potential breach. The minimum requirement is generally that the data are housed on an encrypted file system and any transfers are made via an encrypted channel, and that access to the system is controlled. Some datasets include more stringent security measures in their Data Use Agreement. For example, the Adolescent Brain Cognitive Development (ABCD) study, a widely used dataset on brain and cognitive development, requires that any systems used to house or process the data must meet a specific standard for sensitive information known as NIST SP 800-171. This standard comprises 17 “families” of security requirements that a system must meet to be compliant:

In general this level of security certification will be limited to computer systems run by an organizational IT group rather than by an individual investigator, due to the stringency of the requirements.

Deidentification

Deidentification generally involves the removal of specific identifying information that could potentially be used to reidentify a human subject. In the US, this generally relies upon the Safe Harbor provision in the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which states the following criteria for rendering a dataset deidentified:

(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
- (A) Names
- (B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
- (C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
- (D) Telephone numbers
- (E) Fax numbers
- (F) Email addresses
- (G) Social security numbers
- (H) Medical record numbers
- (I) Health plan beneficiary numbers
- (J) Account numbers
- (K) Certificate/license numbers
- (L) Vehicle identifiers and serial numbers, including license plate numbers
- (M) Device identifiers and serial numbers
- (N) Web Universal Resource Locators (URLs)
- (O) Internet Protocol (IP) addresses
- (P) Biometric identifiers, including finger and voice prints
- (Q) Full-face photographs and any comparable images
- (R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section [Paragraph (c) is presented below in the section “Re-identification”]; and

(ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

In the US, deidentification of data is generally sufficient to render them non-sensitive, whereas this is generally not the case in European countries covered by the General Data Protection Regulation (GDPR).

Anonymization

Anonymization refers to the modification of data in a way that can essentially guarantee that the subjects cannot be reidentified. For example, one might report ages in ranges (such as 20-25 years old) instead of as exact values. These methods change the data in ways that could potentially affect downstream analyses, and thus many researchers shy away from using anonymized data unless absolutely necessary.
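
A minimal pandas sketch of this kind of coarsening, converting exact ages (hypothetical values) into 5-year bins:

```python
import pandas as pd

ages = pd.DataFrame({"age": [21, 24, 37, 52, 49]})  # hypothetical data

# replace exact ages with 5-year ranges such as [20, 25)
ages["age_range"] = pd.cut(ages.age, bins=range(20, 60, 5), right=False)
ages = ages.drop(columns="age")
print(ages)
```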

One method that is often used for large datasets is known as differential privacy, which involves adding noise to analytic results in a way that provably limits the risk of reidentification. For example, this method is now used by the US Census Bureau to protect individuals. It has the benefit of providing a mathematical guarantee of privacy by quantifying the maximum degree of privacy loss given a particular amount of added noise. However, this method may have adverse effects on the data, such as by disparately impacting small sub-populations within a larger dataset Santos-Lozada et al., 2020.

Version control for data

In the case of original data we never want to allow any changes, but for derived data we will often end up making changes to our workflows that result in changes to the data. As an example, let’s say that we are analyzing RNA-sequencing data, and we receive a notice that a bug was found in the specific version of STAR that we had used for sequence alignment. We would like to be able to track these changes, so that we know which data we are working with at any point in time. In many laboratories, this is achieved through file naming, resulting in a menagerie of files with names like “dataset_new_fixed_v2.tsv”. This can make it difficult to determine exactly which data were used in any given analysis. In Chapter 2 we discussed the many reasons why we use version control for code, and many of those apply to data as well. In the case of data, it is particularly important to be able to track the what, when, and why of any changes, which is exactly the purpose of version control systems.

Using git for data version control

When the relevant data are small (e.g. smaller than a few megabytes) and stored in a text format (such as CSV/TSV), one can simply use git to track changes in the data. (We will discuss in a later chapter why Github is not an optimal platform for sharing data, at least not on its own.)

However, git does not work well for larger datasets stored in binary file formats. Git can efficiently store version information for code because it tracks the specific differences between versions (known as a diff), and only stores those differences. Thus, if one has a very large code file and changes one line, only that one-line difference is stored in the git database. With binary data this strategy is not effective, and git has to store the entire new file each time, leading to bloated repositories and very slow performance.

Using Datalad for version control on larger datasets

A solution to this problem is to use a version control tool that is specifically designed for large data. There are several tools that address this problem; we will focus on Datalad, which is a data management system that functions very similarly to git. It is based on a tool called git-annex, but provides much greater ease of use for researchers. (Full disclosure: Our group collaborates with the Datalad group and our grants have supported some of their development work.)

An important note: Datalad is quite powerful but has a significant learning curve, and takes a bit of time to get accustomed to. In particular, its use of symbolic links can sometimes confuse new users. Having said that, let’s look at some simple examples.

Creating a local Datalad dataset

Let’s say that we want to create a new dataset on our local computer that will be tracked by Datalad. We first create a new repository:

➤  datalad create my_datalad_repo

create(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo (dataset)

This creates a new directory, called my_datalad_repo and sets it up as a Datalad dataset. We then go into the directory and create a subdirectory called data, and then download some data files from another project. We do this using the datalad download-url function, which will both download the data and save them to the datalad dataset:

➤  cd my_datalad_repo
➤  mkdir data
➤  datalad download-url --dataset . -O data/ \
  https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv

[INFO   ] Downloading 'https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv' into '/Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/'
download_url(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/meaningful_variables_clean.csv (file)
add(ok): data/meaningful_variables_clean.csv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

➤  datalad download-url --dataset . -O data/ \
  https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv

[INFO   ] Downloading 'https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv' into '/Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/'
download_url(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/demographics.csv (file)
add(ok): data/demographics.csv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

A Datalad dataset is also a git repository, which we can see if we use the git log command:

➤  git log

commit a5696c1a32d69dd24781652d04902047c5d3df50 (HEAD -> main)
Author: Russell Poldrack <poldrack@gmail.com>
Date:   Sun Nov 16 12:13:22 2025 -0800

    [DATALAD] Download URLs

    URLs:
      https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv

commit 2603a1fb98f00d6cdd029194f010a845d73cdc7c
Author: Russell Poldrack <poldrack@gmail.com>
Date:   Sun Nov 16 12:13:20 2025 -0800

    [DATALAD] Download URLs

    URLs:
      https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv

commit 95b9016840e1d35bef0edf3afa06e41fdaaefce4
Author: Russell Poldrack <poldrack@gmail.com>
Date:   Sun Nov 16 12:12:08 2025 -0800

    [DATALAD] new dataset

Here we see the commit messages that were automatically created by Datalad, first for creating the new dataset and then for downloading the URLs. The datalad download-url function adds the URL to the commit message, which is useful for provenance tracking.

Modifying files

Now let’s say that we want to make a change to one of the files and save the changes to the dataset. Files tracked by Datalad are read-only (“locked”) by default. If we want to edit them, then we need to use datalad unlock to unlock the file:

➤  datalad unlock data/demographics.csv

unlock(ok): data/demographics.csv (file)

We then use a Python script to make the change, which in this case is removing some columns from the dataset:

➤  python ../src/BetterCodeBetterScience/modify_data.py data/demographics.csv

We can now use datalad status to see that the file has been modified:

➤  datalad status
 modified: data/demographics.csv (file)

And we can then save it using datalad save:

➤  datalad save -m "removed Motivation variables from demographics.csv"

add(ok): data/demographics.csv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Datalad doesn’t have a staging area like git does, so there is no need to first add and then commit the file; datalad save is equivalent to adding and then committing the changes. If we then check the status we see that there are no changes waiting to be saved:

➤  datalad status

nothing to save, working tree clean

Pushing data to a remote repository

Datalad is a particularly powerful tool for sharing data across systems. It allows one to push or pull data from a number of different remote storage systems; in this example we will use the Open Science Framework (OSF) as our storage location, because it is particularly easy to use with Datalad.

We first need to install and set up the datalad-osf Python package, per the Datalad documentation. We also need to create an account on the OSF site, and obtain a Personal Access Token for login. We can then use Datalad to authenticate with OSF:

➤  datalad osf-credentials
You need to authenticate with 'https://osf.io' credentials. https://osf.io/settings/tokens provides information on how to gain access
token:
osf_credentials(ok): [authenticated as Russell Poldrack <poldrack@stanford.edu>]

Having authenticated with OSF, we can now create a new OSF project using Datalad:

➤  datalad create-sibling-osf --title datalad-test-project -s osf
create-sibling-osf(ok): https://osf.io/htprk/
[INFO   ] Configure additional publication dependency on "osf-storage"
configure-sibling(ok): . (sibling)

Once the project is created, we can push the contents of our dataset to our OSF project:

➤  datalad push --to osf
copy(ok): data/demographics.csv (file) [to osf-storage...]
copy(ok): data/meaningful_variables_clean.csv (file) [to osf-storage...]
publish(ok): . (dataset) [refs/heads/main->osf:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->osf:refs/heads/git-annex [new branch]]
action summary:
  copy (ok: 2)
  publish (ok: 2)

These data now exist on OSF, and the dataset can be cloned (here, into a different local directory) using datalad clone:

➤  datalad clone osf://htprk/
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore
install(ok): /Users/poldrack/Downloads/htprk (dataset)

➤  tree htprk
htprk
└── data
    ├── demographics.csv -> ../.git/annex/objects/f7/Mm/MD5E-s58237--dc5b157fb9937eae2166d73ee943c766.csv/MD5E-s58237--dc5b157fb9937eae2166d73ee943c766.csv
    └── meaningful_variables_clean.csv -> ../.git/annex/objects/J5/X6/MD5E-s1248729--e4fbac610f1f5e25e04474e55209ef56.csv/MD5E-s1248729--e4fbac610f1f5e25e04474e55209ef56.csv

Notice that the files in the cloned dataset directory are actually symbolic links; the actual file contents are not downloaded when the dataset is cloned. We can see this if we try to look at the size of the datafile:

➤  wc data/demographics.csv
wc: data/demographics.csv: open: No such file or directory

To actually download the file contents, we can use datalad get, after which we will see that the file contents are available:

➤  datalad get .
get(ok): data/demographics.csv (file) [from web...]
get(ok): data/meaningful_variables_clean.csv (file) [from web...]
action summary:
  get (ok: 2)

➤  wc data/demographics.csv
     523    1276   58237 data/demographics.csv

One can also push data using Datalad to a range of other remote hosts; see the Datalad documentation for more on this.

Archiving data

At the end of a project the data may seem like they are no longer needed, but in many cases there are reasons to retain the data beyond the end of the project. Funding agencies often have a required data retention period beyond the end of the grant. For example, the US National Institutes of Health (NIH) requires that records be retained for three years beyond the end of the funding. Some universities also have their own data retention requirements; for example, my institution (Stanford University) also has a three-year data retention requirement, whereas Johns Hopkins University has a five-year retention requirement. In my opinion it is preferable to retain data, at least in archival form, as long as possible. I have received requests to share data more than 15 years after the original study completion, and it was only due to long-term retention of these data that we were able to honor these requests.

Archiving of research data can take several forms.

As we will discuss in more detail in our later chapter on sharing of research objects, it is generally preferable to archive data in a location that has a long-term preservation policy and verifiability. This can include institutional repositories (usually run by librarians, who have deep expertise in archiving), general-purpose repositories (like OSF or Zenodo), or domain-specific repositories.

Appendix: An example of database usage

Here I will work through an example of a real scientific question using several database systems. I will focus on NoSQL databases, for two reasons:

  1. They are less well known amongst scientists compared to relational databases

  2. I personally prefer the NoSQL approach in most cases

Note that while many researchers may never use all of the different databases employed here, I am intentionally presenting a full-stack example to show how they can be combined.

The question that I will ask is as follows: How well can the biological similarity between traits (including diseases as well as other phenotypes) be estimated from the semantic similarity of publications that refer to those traits? We will use two primary datasets to assess this: gene–trait associations from genome-wide association studies (GWAS), and PubMed abstracts related to each trait.

I will not present all of the code for each step; this can be found here and here. Rather, I will show portions that are particularly relevant to the databases being used.

Adding GWAS data to a document store

We start by uploading the GWAS data and adding them to a document store, using MongoDB, which I installed on my local machine. The first step is to read the CSV file containing the data. Looking at those data, we see that they are not properly normalized. Normalization is a concept that is essential for relational databases but can also be very helpful for document stores. We won’t go into the details of normalization here (for more, see here); rather, we will simply outline the primary requirements of the first normal form, which is the most basic form of normalization. This requires that:

  1. There are no duplicate records

  2. The columns contain scalar values (and thus do not contain composite values such as sets of values)

  3. There are not multiple columns that contain the same kind of data

In this case, looking at the data we see that several columns contain multiple values for genes, because some single nucleotide polymorphisms (SNPs) are located in intergenic space and thus can be mapped to multiple genes. These values are separated by commas (e.g. “FADS2, FADS1”). To normalize this, we can explode the data frame, separating these values into their own rows that are otherwise identical in content.

gwas_data = get_exploded_gwas_data()
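
The get_exploded_gwas_data function is defined in the linked code; the core of the exploding step using pandas might look roughly like the following sketch, which assumes that the comma-separated gene identifiers live in a column named GENE_ID (as in the code below):

import pandas as pd

def explode_gene_column(gwas_data: pd.DataFrame,
                        gene_col: str = 'GENE_ID') -> pd.DataFrame:
    """Split comma-separated gene lists into one row per gene."""
    gwas_data = gwas_data.copy()
    # turn "FADS2, FADS1" into ["FADS2", " FADS1"]; whitespace is stripped later
    gwas_data[gene_col] = gwas_data[gene_col].str.split(',')
    # explode so that each gene gets its own row; all other columns are duplicated
    return gwas_data.explode(gene_col).reset_index(drop=True)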

We can now import the data from this data frame into a MongoDB collection, mapping each unique trait to the genes that are reported as being associated with it. First I generated a separate function that sets up a MongoDB collection (see setup_mongo_collection here). We can then use that function to set up our gene set collection:

# since the collection names are shared across functions, 
# create global variable with their names
COLLECTION_GENESET = 'geneset_annotations_by_trait'

def import_genesets_by_trait(
    gwas_data_melted: pd.DataFrame
) -> None:

    geneset_annotations_by_trait = setup_mongo_collection(COLLECTION_GENESET)
    # "mapped_trait_uri" is the field that contains the identifier for each trait
    geneset_annotations_by_trait.create_index('mapped_trait_uri', unique=True)

    # first get a mapping from MAPPED_TRAIT_URI to TRAIT_NAME
    trait_name_mapping = gwas_data_melted.set_index('MAPPED_TRAIT_URI')[
        'MAPPED_TRAIT'
    ].to_dict()

    # loop through each unique MAPPED_TRAIT_URI in gwas_data data frame 
    # add all of its gene sets to the mongo collection 
    for mapped_trait_uri in tqdm(
        gwas_data_melted['MAPPED_TRAIT_URI'].unique()
    ):
        gene_sets = (
            gwas_data_melted[
                gwas_data_melted['MAPPED_TRAIT_URI'] == mapped_trait_uri
            ]['GENE_ID']
            .unique()
            .tolist()
        )
        # clean up gene names
        gene_sets = [gene.strip() for gene in gene_sets]
        # add the item to the mongodb collection, also
        # including the trait name
        geneset_annotations_by_trait.update_one(
            {'mapped_trait_uri': str(mapped_trait_uri)},
            {
                '$set': {
                    'mapped_trait_uri': str(mapped_trait_uri),
                    'gene_sets': gene_sets,
                    'trait_name': trait_name_mapping.get(mapped_trait_uri, ''),
                }
            },
            upsert=True,
        )


import_genesets_by_trait(gwas_data)

Note that this function uses an upsert operation, which is a combination of insertion (if a document with this key doesn’t already exist) and updating (if it does). This results in a collection with 3,047 records, each of which is indexed by the trait identifier and includes a list of all of the genes associated with that trait across the GWAS studies. We also include the trait names to make the records human-readable, but we rely upon the unique identifiers, which in this case are defined as URLs (such as “http://www.ebi.ac.uk/efo/EFO_0004309”, which maps to the trait of “platelet count”).
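
To get a sense of the structure of the resulting documents, we can query the collection directly; here is a sketch (assuming that setup_mongo_collection accepts a clear_existing flag, as in the later examples, so that the existing collection is preserved):

# peek at one document: each record contains the trait URI, the trait name,
# and the list of genes associated with that trait
geneset_collection = setup_mongo_collection(COLLECTION_GENESET,
                                            clear_existing=False)
example_doc = geneset_collection.find_one({})
print(example_doc['trait_name'])
print(example_doc['gene_sets'][:5])  # first few associated genes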

Annotating gene sets

Remember that our goal in this analysis is to identify the biological overlap between traits. We could do this by assessing the degree to which they are associated with the same genes, but this would miss the fact that genes work together in networks that are often associated with the function of specific biological pathways. Given a particular set of genes, we can use bioinformatics tools to identify the biological processes that are associated with that set of genes. In this case I used the g:Profiler tool from ELIXIR, which comes with a handy Python package. This tool returns a set of pathways that are statistically enriched for each gene set, each of which is defined by a unique identifier that refers to a particular ontology such as the Gene Ontology. For example, the 877 genes associated with the “platelet count” trait are involved in biological processes and pathways ranging from the very general (e.g. GO:0005515, referring to “protein binding”) to the very specific (e.g. GO:0007599, referring to “hemostasis”, the stopping of bleeding).

We can use g:Profiler to obtain the annotation for the gene set associated with each trait:


def annotate_genesets_by_trait() -> None:
    # loop over all entries in the geneset_annotations_by_trait collection
    # and do functional annotation of the gene sets

    geneset_annotations_by_trait = setup_mongo_collection(COLLECTION_GENESET)

    gp = GProfiler(return_dataframe=True)

    # use a list here so that we can use tqdm to show progress
    # skip any entries that already have functional_annotation
    annotations = [
        i
        for i in geneset_annotations_by_trait.find({})
        if 'functional_annotation' not in i
    ]

    for entry in tqdm(annotations):
        mapped_trait_uri = entry['mapped_trait_uri']
        gene_sets = entry['gene_sets']
        if len(gene_sets) == 0:
            continue
        # do functional annotation
        try:
            annotation_results = gp.profile(
                organism='hsapiens',
                query=gene_sets,
                sources=['GO:BP', 'GO:MF', 'GO:CC', 'KEGG', 'REAC'],
            )
        except Exception as e:
            # catch any exception to avoid breaking the loop
            print(f'Error annotating {mapped_trait_uri}: {e}')
            continue

        # convert the dataframe to a dictionary
        annotation_results_dict = annotation_results.to_dict(orient='records')
        # update the entry in the mongo collection with the annotation results
        geneset_annotations_by_trait.update_one(
            {'mapped_trait_uri': str(mapped_trait_uri)},
            {'$set': {'functional_annotation': annotation_results_dict}},
        )
    # drop members of geneset_annotations_by_trait with empty annotation
    geneset_annotations_by_trait.delete_many(
        {'functional_annotation': {'$in': [None, [], {}]}}
    )

    n_annotated = geneset_annotations_by_trait.count_documents({})
    print(f'Remaining entries with functional annotation: {n_annotated}')


annotate_genesets_by_trait()
Number of documents in geneset_annotations_by_trait: 3047
100%|██████████| 3047/3047 [56:31<00:00,  1.11s/it] 
Remaining entries with functional annotation: 1845

Having now annotated the gene sets with information about their biological functions, we can move on to assessing the similarity of traits based on those functions.

Mapping pathway information to traits

The previous analysis added a functional annotation element to each trait, which includes information about the associated pathways. While we could use this collection to obtain the mappings from traits to pathways, the deeply embedded nature of the annotation data would make the queries somewhat complicated. Next we will use that information to generate a collection including all unique pathways, which we will then use to compute biological similarity between traits:

COLLECTION_PATHWAYS = 'pathways'

def get_pathway_info_by_trait() -> None:
    geneset_collection = setup_mongo_collection(
        COLLECTION_GENESET, clear_existing=False
    )
    pathway_collection = setup_mongo_collection(
        COLLECTION_PATHWAYS, clear_existing=False
    )

    # loop through traits and add pathway information to the database
    traits = [
        i
        for i in geneset_collection.find()
        if 'functional_annotation' in i and len(i['functional_annotation']) > 0
    ]

    for trait in tqdm(traits):
        annotations = trait['functional_annotation']
        for pathway in annotations:
            # change key for clarity
            pathway['pathway_id'] = pathway.pop('native')

            pathway_collection.update_one(
                {'pathway_id': pathway['pathway_id']},
                {
                    '$set': {
                        'name': pathway.get('name', ''),
                        'source': pathway.get('source', ''),
                        'description': pathway.get('description', ''),
                    }
                },
                upsert=True,
            )


get_pathway_info_by_trait()
Number of documents in geneset_annotations_by_trait: 1845
Number of documents in pathways: 0
100%|██████████| 1845/1845 [00:18<00:00, 98.22it/s] 

Generating the graph database linking pathways to traits

Graph databases are designed to store and query relational information, making them perfect for our dataset linking pathways to traits. In the next step I created a graph database using the Neo4j database package, which involves creating a Neo4j session and then issuing a set of Cypher commands:

def get_neo4j_session():
    assert 'NEO4J_PASSWORD' in os.environ, 'NEO4J_PASSWORD should be set in .env'
    neo4j_driver = GraphDatabase.driver(
        'bolt://localhost:7687', auth=('neo4j', os.environ['NEO4J_PASSWORD'])
    )
    return neo4j_driver.session()


def build_neo4j_graph() -> None:
    geneset_collection = setup_mongo_collection(COLLECTION_GENESET)
    pathway_collection = setup_mongo_collection(COLLECTION_PATHWAYS)

    with get_neo4j_session() as session:
        # Clear DB
        session.run('MATCH (n) DETACH DELETE n')
        
        # Create Indexes
        session.run((
          'CREATE CONSTRAINT IF NOT EXISTS FOR (p:Phenotype) '
          'REQUIRE p.id IS UNIQUE'))
        session.run(
          ('CREATE CONSTRAINT IF NOT EXISTS FOR (p:Pathway) '
          'REQUIRE p.id IS UNIQUE'))

        # 1. Batch Import Pathways
        # Fetch all pathways into a list of dicts
        print("Loading pathways...")
        pathways_data = list(pathway_collection.find({},
          {'_id': 0, 'pathway_id': 1, 'name': 1, 'source': 1, 'description': 1}))
        
        # Use UNWIND to insert all pathways in one transaction
        session.run("""
            UNWIND $batch AS row
            MERGE (pw:Pathway {id: row.pathway_id})
            SET pw.name = row.name,
                pw.source = row.source,
                pw.description = row.description
        """, batch=pathways_data)

        # 2. Batch Import Phenotypes and Relationships
        print("Loading phenotypes and relationships...")
        # We need to restructure the Mongo data slightly for Neo4j consumption
        pheno_batch = []
        for doc in geneset_collection.find():
            pheno_id = str(doc['mapped_trait_uri'])
            # Extract list of pathway IDs
            pathway_ids = [
                i['native'] 
                for i in doc.get('functional_annotation', []) 
                if 'native' in i]
             
            if pathway_ids:
                pheno_batch.append({
                    'id': pheno_id,
                    'name': doc.get('trait_name', ''),
                    'pathway_ids': pathway_ids
                })

        # Insert Phenotypes and create edges to Pathways
        session.run("""
            UNWIND $batch AS row
            MERGE (p:Phenotype {id: row.id})
            SET p.name = row.name
            
            WITH p, row
            UNWIND row.pathway_ids AS pw_id
            MATCH (pw:Pathway {id: pw_id})
            MERGE (p)-[:MAPPED_TO]->(pw)
        """, batch=pheno_batch)

    print("Graph build complete.")


build_neo4j_graph()
Number of documents in geneset_annotations_by_trait: 1845
Number of documents in pathways: 6051
Loading pathways...
Loading phenotypes and relationships...
Graph build complete.

Having populated the database, we can then use the gds.nodeSimilarity function from the Neo4j Graph Data Science library to compute the similarity of each pair of traits in terms of their pathway overlap. This function takes a bipartite graph (a graph that contains two sets of nodes; in our case, pathways and traits) and returns the Jaccard similarity coefficient, which in this case reflects the ratio of shared pathways to the total number of unique pathways for each pair of nodes. We filter out nodes that have fewer than two pathways associated with them to make the estimates more stable. (Note that here we use the term “phenotype” interchangeably with “trait”.)
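
To make the measure concrete, the Jaccard coefficient for a pair of traits is the number of pathways they share divided by the total number of unique pathways across the two; a small illustration in Python, using made-up pathway sets:

def jaccard(pathways_a: set, pathways_b: set) -> float:
    """Jaccard similarity: shared pathways / total unique pathways."""
    if not pathways_a and not pathways_b:
        return 0.0
    return len(pathways_a & pathways_b) / len(pathways_a | pathways_b)

# two illustrative traits that share one of their three unique pathways
trait1_pathways = {'GO:0007599', 'GO:0005515'}
trait2_pathways = {'GO:0007599', 'GO:0030168'}
print(jaccard(trait1_pathways, trait2_pathways))  # 1 / 3 = 0.33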


def compute_phenotype_similarities(
    graph_name: str = 'phenotype-pathway-graph',
    min_pathways: int = 2
) -> pd.DataFrame:

    with get_neo4j_session() as session:
        # 1. Clean up any existing graph with the same name
        try:
            session.run("CALL gds.graph.drop($name)", name=graph_name)
        except ClientError:
            # Graph did not exist, safe to ignore
            pass

        # 2. Project the Graph
        # We project Phenotypes, Pathways, and the relationship between them.
        # 'UNDIRECTED' orientation allows the algorithm to see the shared connections.
        session.run("""
            CALL gds.graph.project(
                $graph_name,
                ['Phenotype', 'Pathway'],
                {
                    MAPPED_TO: {
                        orientation: 'UNDIRECTED'
                    }
                }
            )
        """, graph_name=graph_name)

        # 3. Run Node Similarity
        # We use 'degreeCutoff' to filter out phenotypes with too few pathways 
        # BEFORE the calculation - much faster than filtering the results.
        result = session.run("""
            CALL gds.nodeSimilarity.stream($graph_name, {
                degreeCutoff: $min_pathways
            })
            YIELD node1, node2, similarity
            
            // Map internal IDs back to our Phenotype IDs
            WITH node1, node2, similarity
            MATCH (p1:Phenotype), (p2:Phenotype)
            WHERE id(p1) = node1 AND id(p2) = node2
            
            RETURN p1.id AS phenotype1, p2.id AS phenotype2, similarity
            ORDER BY similarity DESC
        """, graph_name=graph_name, min_pathways=min_pathways)

        # 4. Convert to DataFrame
        # result.data() fetches all records into a list of dictionaries
        df = pd.DataFrame(result.data())
        
        # 5. Cleanup: Drop the graph to free up memory
        try:
            session.run("CALL gds.graph.drop($name)", name=graph_name)
        except ClientError:
            pass

        return df


similarity_result_df = compute_phenotype_similarities()

Now that we have the similarity between phenotypes based on their biological functions, we can move on to assessing their similarity based on associated publications.

To assess semantic similarity between traits we need to obtain abstracts related to each trait. To do this, we first wish to obtain any synonyms for the traits, to maximize the effectiveness of the search. We can obtain these from the EBI Ontology Search API:

COLLECTION_TRAIT_INFO = 'trait_info_by_trait'

def get_trait_info_from_ols(
    client_url: str = 'http://www.ebi.ac.uk/ols',
) -> None:
    # use EBI OLS API to get trait information for all traits
    trait_info_by_trait = setup_mongo_collection(
        collection_name=COLLECTION_TRAIT_INFO
    )

    trait_info_by_trait.create_index('trait_uri', unique=True)

    geneset_annotations_by_trait = setup_mongo_collection(
        collection_name=COLLECTION_GENESET
    )

    # get all unique trait URIs that are not already in the collection
    # use lstrip because many have a leading space
    unique_trait_uris = [
        i.lstrip()
        for i in geneset_annotations_by_trait.distinct('mapped_trait_uri')
        if trait_info_by_trait.count_documents({'trait_uri': i.lstrip()}) == 0
    ]
    print(f'Found {len(unique_trait_uris)} un-annotated trait URIs.')

    client = ols_client.Client(client_url)

    for trait_uri in tqdm(unique_trait_uris):
        trait_id = trait_uri.split('/')[-1]
        trait_uri = str(trait_uri)
        # skip if already in the collection
        if trait_info_by_trait.count_documents({'trait_uri': trait_uri}) > 0:
            continue
        try:
            term_data = get_info_from_ols(trait_id, client)
        except HTTPError:
            print(f'HTTPError for {trait_uri}')
            continue
        if term_data is None:
            print(f'No data returned for {trait_uri}')
            continue
        trait_info_by_trait.update_one(
            {'trait_uri': str(trait_uri)},
            {'$set': {'trait_uri': str(trait_uri), 'trait_info': term_data}},
            upsert=True,
        )


get_trait_info_from_ols()

We then use the Entrez module from Biopython to search PubMed for each of the traits, along with any synonyms, retrieving up to 100 matching PubMed IDs for each query:

COLLECTION_PMID_BY_TRAIT = 'pmids_by_trait'

def get_pmids_for_traits(
    n_abstracts_per_trait: int = 100, verbose: bool = False
) -> None:

    pmid_collection = setup_mongo_collection(COLLECTION_PMID_BY_TRAIT)
    _ = pmid_collection.create_index(
        [('trait_uri', pymongo.ASCENDING)], unique=True
    )

    # get all entries from the trait_info_by_trait collection and pull out 
    # the label and synonyms to use as pubmed search terms
    trait_info_collection = setup_mongo_collection(COLLECTION_TRAIT_INFO)
    db_result = list(trait_info_collection.find({}))

    for result in tqdm(db_result, desc='Searching PubMed'):
        trait_uri = result['trait_uri']
        lbl = result['trait_info']['label']
        synonyms = result['trait_info'].get('synonyms', [])
        # create a pubmed query using the label and synonyms
        query_terms = [lbl] + synonyms
        query = ' OR '.join([f'"{term}"' for term in query_terms])

        existing_entry = pmid_collection.find_one({'trait_uri': trait_uri})
        # Skip existing entry if pmid entry is not empty
        if (
            existing_entry is not None
            and existing_entry.get('pmids')
            and len(existing_entry.get('pmids')) > 0
        ):
            if verbose:
                print(f'PMIDs already exist for {lbl}, skipping...')
            continue
        # run pubmed search - retry up to 3 times if it fails
        pmids = run_pubmed_search(query, retmax=n_abstracts_per_trait,
          max_retries=3)
        if pmids is None or len(pmids) == 0:
          continue
        pmid_collection.update_one(
            {'trait_uri': trait_uri},
            {'$set': {'label': lbl, 'pmids': pmids, 'search_query': query}},
            upsert=True,
        )


get_pmids_for_traits()
Number of documents in pmids_by_trait: 0
Number of documents in trait_info_by_trait: 1590
Searching PubMed: 100%|██████████| 1590/1590 [13:58<00:00,  1.90it/s]
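
The run_pubmed_search helper is defined in the linked code; a stripped-down version of such a search using Biopython's Entrez module (ignoring the retry logic, and with a hypothetical email address and query) might look like this:

from Bio import Entrez

# NCBI requests that users identify themselves when using the Entrez API
Entrez.email = 'your.email@example.org'

def simple_pubmed_search(query: str, retmax: int = 100) -> list:
    """Return up to retmax PubMed IDs matching the query."""
    handle = Entrez.esearch(db='pubmed', term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record['IdList']

pmids = simple_pubmed_search('"platelet count" OR "blood platelet count"')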

We then identify all of the unique Pubmed IDs (PMIDs) across these queries and fetch the full Pubmed record (including the title and abstract, which we will use in our semantic analysis) for each of them, placing them in a new collection:

COLLECTION_PUBMED = 'pubmed_abstracts'


def fetch_and_store_pubmed_abstracts(
    batch_size: int = 500
) -> None:

    pubmed_collection = setup_mongo_collection(
        COLLECTION_PUBMED, clear_existing=False
    )
    pubmed_collection.create_index([('PMID', pymongo.ASCENDING)], unique=True)

    # remove any PMIDs already in the pubmed_collection
    existing_pmids = set()
    for entry in pubmed_collection.find({}, {'PMID': 1}):
        existing_pmids.add(entry['PMID'])

    pmids_to_fetch = [
        pmid
        for pmid in get_unique_pmids_from_trait_collection()
        if pmid not in existing_pmids
    ]

    if len(pmids_to_fetch) == 0:
        print('All PMIDs are already fetched. Skipping.')
    else:
        print(f'Fetching {len(pmids_to_fetch)} PMIDs...')

    for i in tqdm(
        range(0, len(pmids_to_fetch), batch_size),
        desc='Fetching PubMed abstracts',
    ):
        batch = pmids_to_fetch[i : i + batch_size]
        pubmed_records = fetch_pubmed_records(batch, retmax=batch_size)
        parsed_records = parse_pubmed_query_result(pubmed_records)
        if not parsed_records:
            print(f'No new records to insert for batch {i // batch_size + 1}.')
            continue
        pubmed_collection.insert_many(parsed_records.values())
        # print(f"Inserted {len(parsed_records)} abstracts")


fetch_and_store_pubmed_abstracts()
Number of documents in pubmed_abstracts: 0
Number of documents in pmids_by_trait: 1590
Fetching 106387 PMIDs...

Adding documents to a vector database

We want to use the documents downloaded from PubMed for each trait to compute the semantic similarity between traits. This is a good application for a vector database, which can ingest documents and embed them into a vector space, allowing us to perform similarity computations between documents based on their vector embeddings. We will use ChromaDB, a popular open-source vector database. By default ChromaDB uses the all-MiniLM-L6-v2 embedding model from the Sentence Transformers package, but we will instead use the more powerful text-embedding-3-large model via the OpenAI API.

def get_chromadb_collection(collection_name='pubmed_docs', 
    path='../../data/chroma_data',
    embedding: str = "text-embedding-3-large"):

    assert 'OPENAI' in os.environ, 'OPENAI API key should be set in .env'
    if embedding is not None:
        embedding_function = embedding_functions.OpenAIEmbeddingFunction(
                    api_key=os.getenv('OPENAI'),
                    model_name=embedding
            )
    else:
        embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="all-MiniLM-L6-v2")
    client = PersistentClient(path=path)
    # check if collection exists, if not create it
    if collection_name in [col.name for col in client.list_collections()]:
        return client.get_collection(name=collection_name)
    else:
        print(f'Created new collection: {collection_name}')
        return client.create_collection(name=collection_name, embedding_function=embedding_function)


def add_pubmed_abstracts_to_chromadb(batch_size: int = 5000) -> None:

    pubmed_collection = setup_mongo_collection(
        COLLECTION_PUBMED, clear_existing=False
    )

    collection = get_chromadb_collection()
    # get ids (pmid) and documents (title + abstract) from pubmed_collection
    ids = []
    documents = []
    for entry in pubmed_collection.find({}):
        full_text = entry.get('title', '') + ' ' + entry.get('abstract', '')
        documents.append(full_text)
        ids.append(str(entry['PMID']))

    # exclude ids that are already in the chromadb collection
    existing_ids = set(collection.get(include=[])['ids'])
    ids_to_add = []
    documents_to_add = []
    for i, id_ in enumerate(ids):
        if id_ not in existing_ids:
            ids_to_add.append(id_)
            documents_to_add.append(documents[i])

    # add in batches
    for i in range(0, len(ids_to_add), batch_size):
        batch_ids = ids_to_add[i : i + batch_size]
        batch_documents = documents_to_add[i : i + batch_size]
        collection.add(ids=batch_ids, documents=batch_documents)
        print(f'Added {len(batch_ids)} documents to chromadb collection')


add_pubmed_abstracts_to_chromadb()
Number of documents in pubmed_abstracts: 106120
Using existing collection: pubmed_docs

We then compute the vector similarity between the document embeddings associated with each pair of traits: we compute the Euclidean distance between each pair of documents (one from each trait) and then take the mean across all document pairs. These values are added to the data frame that also contains the pathway similarities, for use in subsequent analyses:

text_similarity_df = compute_text_similarities(similarity_result_df)
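
The compute_text_similarities function can be found in the linked code; the core of the computation for a single pair of traits, pulling embeddings from the ChromaDB collection and averaging the pairwise Euclidean distances, might look roughly like this sketch (pmids_trait1 and pmids_trait2 are hypothetical lists of PMIDs for the two traits):

import numpy as np
from scipy.spatial.distance import cdist

def mean_embedding_distance(pmids_a: list, pmids_b: list, collection) -> float:
    """Mean pairwise Euclidean distance between two traits' abstract embeddings."""
    emb_a = np.array(collection.get(ids=pmids_a, include=['embeddings'])['embeddings'])
    emb_b = np.array(collection.get(ids=pmids_b, include=['embeddings'])['embeddings'])
    return float(cdist(emb_a, emb_b, metric='euclidean').mean())

# usage for one pair of traits:
# collection = get_chromadb_collection()
# dist = mean_embedding_distance(pmids_trait1, pmids_trait2, collection)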

We now have the semantic and biological similarity values for each pair of traits in a single data frame, which we can use for our statistical analysis.

Analyzing and visualizing the results

We can first visualize the relationship between semantic and biological similarity:

sns.scatterplot(
    data=text_similarity_df,
    y='pathway_similarity',
    x='text_similarity',
    alpha=0.5,
    size=1
)
corr_value = text_similarity_df["pathway_similarity"].corr(
    text_similarity_df["text_similarity"])
plt.title(
  f'Pathway Similarity vs Text Similarity (r={corr_value:.2f})'
)
A scatterplot of biological similarity (estimated as overlap in pathways) versus semantic similarity (estimated as embedding distance of Pubmed abstracts) on the GWAS dataset.

Figure 4:A scatterplot of biological similarity (estimated as overlap in pathways) versus semantic similarity (estimated as embedding distance of Pubmed abstracts) on the GWAS dataset.

There is a small but robust correlation between these two similarity measures. In order to more accurately estimate this association, we need to take into account the fact that traits vary in their overall similarity to other traits, by including a random effect for each trait in a mixed effects model. We use the lmer() function from the R lme4 package, via the R magic within Jupyter:

%%R -i text_similarity_df

if (!requireNamespace("lme4", quietly = TRUE)) {
    install.packages("lme4")
}
if (!requireNamespace("lmerTest", quietly = TRUE)) {
    install.packages("lmerTest")
}
library(lme4)
library(lmerTest)

model <- lmer(pathway_similarity ~ text_similarity  + (1 | phenotype1) + (1 | phenotype2), data = text_similarity_df)
summary(model)  
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: pathway_similarity ~ text_similarity + (1 | phenotype1) + (1 |  
    phenotype2)
   Data: text_similarity_df

REML criterion at convergence: -9301.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.4960 -0.4077 -0.0685  0.2873  8.9175 

Random effects:
 Groups     Name        Variance Std.Dev.
 phenotype1 (Intercept) 0.040351 0.20087 
 phenotype2 (Intercept) 0.009485 0.09739 
 Residual               0.007039 0.08390 
Number of obs: 6904, groups:  phenotype1, 981; phenotype2, 905

Fixed effects:
                 Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)     2.484e-01  7.802e-03 1.507e+03   31.83   <2e-16 ***
text_similarity 3.666e-01  2.532e-02 5.858e+03   14.48   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
text_smlrty -0.334

This result is statistically significant, but we also want to ask how practically significant it is by looking at the amount of variance accounted for by the model:

%%R

if (!requireNamespace("MuMIn", quietly = TRUE)) {
    install.packages("MuMIn")
}

library(MuMIn)
r.squaredGLMM(model)
            R2m       R2c
[1,] 0.01068788 0.8775626

For a mixed effects model there are two R-squared values: the conditional R-squared (R2c), which refers to the total variance accounted for by both fixed and random effects, and the marginal R-squared (R2m), which refers to the variance accounted for by the fixed effects alone and is the figure of interest here. This shows that while the association is highly statistically significant, semantic similarity only accounts for about 1% of the variability in biological similarity, and thus is not a particularly strong predictor in practice. The high conditional R-squared demonstrates that the variability in the data is dominated by trait-level differences, such that traits vary in the degree to which they are generally similar to other traits on average.
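
In terms of the variance components reported in the model summary above, with $\sigma^2_f$ denoting the variance attributable to the fixed-effect (text similarity) predictions, these quantities can be written as:

$$
R^2_m = \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_{phenotype1} + \sigma^2_{phenotype2} + \sigma^2_{residual}}
\qquad
R^2_c = \frac{\sigma^2_f + \sigma^2_{phenotype1} + \sigma^2_{phenotype2}}{\sigma^2_f + \sigma^2_{phenotype1} + \sigma^2_{phenotype2} + \sigma^2_{residual}}
$$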

Footnotes
  1. Note that the term “license” is often used to describe these data usage agreements, but this terminology is technically inappropriate in jurisdictions such as the U.S. where most data are treated as “facts” and thus are not subject to intellectual property laws (such as copyright laws).

References
  1. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Sci Data, 3, 160018. 10.1038/sdata.2016.18
  2. Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. 10.18637/jss.v059.i10
  3. Ziemann, M., Eren, Y., & El-Osta, A. (2016). Gene name errors are widespread in the scientific literature. Genome Biol, 17(1), 177. 10.1186/s13059-016-1044-7
  4. Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., Flandin, G., Ghosh, S. S., Glatard, T., Halchenko, Y. O., Handwerker, D. A., Hanke, M., Keator, D., Li, X., Michael, Z., Maumet, C., Nichols, B. N., Nichols, T. E., Pellman, J., … Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci Data, 3, 160044. 10.1038/sdata.2016.44
  5. Santos-Lozada, A. R., Howard, J. T., & Verdery, A. M. (2020). How differential privacy will affect our understanding of health disparities in the United States. Proc Natl Acad Sci U S A, 117(24), 13405–13412. 10.1073/pnas.2003714117