Loading Studysets from parquet

NiMARE can load a Studyset from a directory of parquet files. This table-backed format is useful for large NeuroStore Studysets because it avoids parsing a single large nested JSON file and keeps the Studyset lazy until nested Study/Analysis objects are explicitly needed.

The main intended use of this format is loading Studyset releases distributed from https://www.neurostore.org/api/neurostore-studyset-releases/. Release archives are expected to contain the same manifest and table layout demonstrated here.

from pathlib import Path

import pandas as pd

from nimare.nimads import Studyset
from nimare.utils import get_resource_path

Find the example parquet Studyset

A parquet Studyset directory contains a studyset.json manifest and one parquet file per canonical Studyset table. This example uses a small packaged slice of a NeuroStore release.

parquet_dir = Path(get_resource_path()) / "neurostore_parquet_studyset"
if not parquet_dir.exists():
    # Support running this example directly from a source checkout before the
    # new packaged resource has been installed into the active environment.
    parquet_dir = (
        Path(__file__).resolve().parents[2]
        / "nimare"
        / "resources"
        / "neurostore_parquet_studyset"
    )
print(sorted(path.name for path in parquet_dir.iterdir()))

['analyses.parquet', 'annotations.parquet', 'coordinates.parquet', 'images.parquet', 'metadata.parquet', 'studies.parquet', 'studyset.json', 'texts.parquet']

Inspect the manifest

The manifest records the format identifier, the Studyset id and name, the schema version, the annotation ids, and the filename of each table.

print((parquet_dir / "studyset.json").read_text())

{
  "annotations": [
    {
      "id": "test-neurostore-annotation"
    }
  ],
  "format": "nimare-studyset-parquet",
  "id": "test-neurostore-parquet-studyset",
  "name": "test-neurostore-parquet-studyset",
  "tables": {
    "analyses": "analyses.parquet",
    "annotations": "annotations.parquet",
    "coordinates": "coordinates.parquet",
    "images": "images.parquet",
    "metadata": "metadata.parquet",
    "studies": "studies.parquet",
    "texts": "texts.parquet"
  },
  "version": 1
}
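Because the manifest maps each table name to a filename, a quick consistency check before loading can catch an incomplete download. The following is a minimal sketch, not part of NiMARE: the helper name missing_tables is hypothetical, and the demo directory contains only empty placeholder files, not real parquet data. It relies only on the "tables" key shown in the manifest above.

```python
import json
import tempfile
from pathlib import Path


def missing_tables(parquet_dir):
    """Return manifest table names whose files are absent from the directory.

    Hypothetical helper for illustration; it only checks that files exist,
    not that they are valid parquet.
    """
    parquet_dir = Path(parquet_dir)
    manifest = json.loads((parquet_dir / "studyset.json").read_text())
    return sorted(
        name
        for name, filename in manifest["tables"].items()
        if not (parquet_dir / filename).is_file()
    )


# Demo on a throwaway directory with a cut-down, made-up manifest.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    demo_manifest = {
        "format": "nimare-studyset-parquet",
        "tables": {
            "studies": "studies.parquet",
            "analyses": "analyses.parquet",
        },
        "version": 1,
    }
    (demo_dir / "studyset.json").write_text(json.dumps(demo_manifest))
    # Create only one of the two tables; analyses.parquet is left out on purpose.
    (demo_dir / "studies.parquet").touch()
    demo_missing = missing_tables(demo_dir)

print(demo_missing)
```

In practice the same helper could be pointed at parquet_dir from this example, where it should return an empty list if every table listed in the manifest is present.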

Inspect the parquet table shapes

The table layout is:

studies.parquet: one row per study, with study_id, name, description, authors, and publication.
analyses.parquet: one row per analysis, with the full analysis id.
coordinates.parquet: coordinate rows keyed by analysis id.
metadata.parquet: one row per analysis with metadata descriptors.
annotations.parquet: one row per analysis with annotation feature columns.
images.parquet: image references keyed by analysis id.
texts.parquet: text fields keyed by analysis id.

for table_file in sorted(parquet_dir.glob("*.parquet")):
    table = pd.read_parquet(table_file)
    print(f"{table_file.name}: {table.shape}")

analyses.parquet: (8, 4)
annotations.parquet: (8, 502)
coordinates.parquet: (47, 15)
images.parquet: (8, 5)
metadata.parquet: (8, 155)
studies.parquet: (4, 5)
texts.parquet: (8, 3)

Load the Studyset

The constructor recognizes a parquet Studyset directory and returns a table-backed Studyset. The nested Study/Analysis object graph is not materialized during loading.

studyset = Studyset(parquet_dir)
print(studyset)
print(f"Studyset ID: {studyset.id}")
print(f"Number of studies: {len(studyset.study_ids)}")
print(f"Number of analyses: {len(studyset.ids)}")
print(f"Materialized nested objects? {studyset.is_materialized}")

Studyset: test-neurostore-parquet-studyset :: studies: 4
Studyset ID: test-neurostore-parquet-studyset
Number of studies: 4
Number of analyses: 8
Materialized nested objects? False

Work with table-backed views

The standard Studyset table views are available immediately after loading, without materializing any nested objects.

print(studyset.coordinates.head())
print(studyset.metadata.head())

annotation_columns = [
    column
    for column in studyset.annotations_df.columns
    if column not in {"id", "study_id", "contrast_id"}
]
print(f"Annotation feature columns: {len(annotation_columns)}")
print(studyset.annotations_df[["id"] + annotation_columns[:5]].head())

                          id      study_id   contrast_id  ...  value_f   p  value_r
0  22XctM7fX2Dw-D68jH5p6HXSj  22XctM7fX2Dw  D68jH5p6HXSj  ...      NaN NaN      NaN
1  22XctM7fX2Dw-D68jH5p6HXSj  22XctM7fX2Dw  D68jH5p6HXSj  ...      NaN NaN      NaN
2  22XctM7fX2Dw-D68jH5p6HXSj  22XctM7fX2Dw  D68jH5p6HXSj  ...      NaN NaN      NaN
3  22XctM7fX2Dw-D68jH5p6HXSj  22XctM7fX2Dw  D68jH5p6HXSj  ...      NaN NaN      NaN
4  22XctM7fX2Dw-D68jH5p6HXSj  22XctM7fX2Dw  D68jH5p6HXSj  ...      NaN NaN      NaN

[5 rows x 15 columns]
                          id      study_id  ... ADNI als_diagnostic_criteria
0  22XctM7fX2Dw-D68jH5p6HXSj  22XctM7fX2Dw  ...  NaN                     NaN
1  22XctM7fX2Dw-Ghcw82nz5KLD  22XctM7fX2Dw  ...  NaN                     NaN
2  22iyhNgni5Du-fhbM8khTqcVx  22iyhNgni5Du  ...  NaN                     NaN
3  22iyhNgni5Du-hzpLkdj5mGWX  22iyhNgni5Du  ...  NaN                     NaN
4  22iyhNgni5Du-nAc8LAP6RwTB  22iyhNgni5Du  ...  NaN                     NaN

[5 rows x 155 columns]
Annotation feature columns: 499
                          id  ... ParticipantDemographicsExtractor.groups[0].age_median
0  22XctM7fX2Dw-D68jH5p6HXSj  ...                                                NaN
1  22XctM7fX2Dw-Ghcw82nz5KLD  ...                                                NaN
2  22iyhNgni5Du-fhbM8khTqcVx  ...                                                NaN
3  22iyhNgni5Du-hzpLkdj5mGWX  ...                                                NaN
4  22iyhNgni5Du-nAc8LAP6RwTB  ...                                                NaN

[5 rows x 6 columns]
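Since these views are ordinary pandas DataFrames, common summaries need nothing beyond pandas. The sketch below uses a small invented frame as a stand-in for studyset.coordinates, keeping only the id, study_id, and contrast_id columns visible in the output above. The ids are made up, and the hyphen-joined id format is inferred from that output rather than from a documented guarantee.

```python
import pandas as pd

# Toy stand-in for studyset.coordinates: one row per coordinate.
# All ids below are invented for illustration.
coordinates = pd.DataFrame(
    {
        "id": ["s1-a1", "s1-a1", "s1-a2", "s2-a1"],
        "study_id": ["s1", "s1", "s1", "s2"],
        "contrast_id": ["a1", "a1", "a2", "a1"],
    }
)

# Number of coordinate rows reported by each analysis.
foci_per_analysis = coordinates.groupby("id").size()

# Number of distinct analyses contributing coordinates in each study.
analyses_per_study = coordinates.groupby("study_id")["id"].nunique()

# Check the assumed id convention: "<study_id>-<contrast_id>".
joined = coordinates["study_id"] + "-" + coordinates["contrast_id"]
ids_follow_convention = bool((joined == coordinates["id"]).all())

print(foci_per_analysis)
print(analyses_per_study)
print(ids_follow_convention)
```

The same groupby calls should work unchanged on the real view, provided it exposes the columns used here.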

Materialize only when needed

Accessing studyset.studies reconstructs nested Study, Analysis, and Point objects from the parquet-backed tables. Most Studyset-aware NiMARE workflows can use the table-backed views without this step.

first_study = studyset.studies[0]
print(f"First study: {first_study.id}")
print(f"Analyses in first study: {len(first_study.analyses)}")
print(f"Materialized nested objects? {studyset.is_materialized}")

First study: 22XctM7fX2Dw
Analyses in first study: 2
Materialized nested objects? True
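The flip from False to True above reflects a build-on-first-access pattern. The toy class below sketches that general idea in plain Python. It is not NiMARE's implementation: the class name LazyCollection and its internals are hypothetical, and only the attribute names studies and is_materialized mirror the example.

```python
class LazyCollection:
    """Toy container that builds nested objects from flat rows on first access."""

    def __init__(self, rows):
        self._rows = rows  # flat (study_id, analysis_id) pairs
        self._studies = None  # nested view, built only when requested

    @property
    def is_materialized(self):
        """True once the nested view has been built."""
        return self._studies is not None

    @property
    def studies(self):
        """Group the flat rows by study on first access, then reuse the result."""
        if self._studies is None:
            grouped = {}
            for study_id, analysis_id in self._rows:
                grouped.setdefault(study_id, []).append(analysis_id)
            self._studies = [
                {"id": study_id, "analyses": analyses}
                for study_id, analyses in grouped.items()
            ]
        return self._studies


# Invented rows for illustration.
lazy = LazyCollection([("s1", "a1"), ("s1", "a2"), ("s2", "a1")])
before = lazy.is_materialized
first = lazy.studies[0]
after = lazy.is_materialized

print(before, first, after)
```

The design choice this illustrates is that table-based operations never pay the cost of building the nested objects, while code that needs them still gets them through a single attribute access.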
