Loading Studysets from parquet
NiMARE can load a Studyset from a directory of
parquet files. This table-backed format is useful for large NeuroStore
Studysets because it avoids parsing a single large nested JSON file and keeps
the Studyset lazy until nested Study/Analysis objects are explicitly needed.
The main use case for this format will be distributed Studyset releases from https://www.neurostore.org/api/neurostore-studyset-releases/. Release archives are expected to contain the same manifest and table layout demonstrated here.
from pathlib import Path
import pandas as pd
from nimare.nimads import Studyset
from nimare.utils import get_resource_path
Find the example parquet Studyset
A parquet Studyset directory contains a studyset.json manifest and one
parquet file per canonical Studyset table. This example uses a small packaged
slice of a NeuroStore release.
parquet_dir = Path(get_resource_path()) / "neurostore_parquet_studyset"
if not parquet_dir.exists():
    # Support running this example directly from a source checkout before the
    # new packaged resource has been installed into the active environment.
    parquet_dir = (
        Path(__file__).resolve().parents[2]
        / "nimare"
        / "resources"
        / "neurostore_parquet_studyset"
    )
print(sorted(path.name for path in parquet_dir.iterdir()))
['analyses.parquet', 'annotations.parquet', 'coordinates.parquet', 'images.parquet', 'metadata.parquet', 'studies.parquet', 'studyset.json', 'texts.parquet']
Inspect the manifest
The manifest records the format identifier, the Studyset id and name, the schema version, the annotation ids, and the filename of each table.
print((parquet_dir / "studyset.json").read_text())
{
  "annotations": [
    {
      "id": "test-neurostore-annotation"
    }
  ],
  "format": "nimare-studyset-parquet",
  "id": "test-neurostore-parquet-studyset",
  "name": "test-neurostore-parquet-studyset",
  "tables": {
    "analyses": "analyses.parquet",
    "annotations": "annotations.parquet",
    "coordinates": "coordinates.parquet",
    "images": "images.parquet",
    "metadata": "metadata.parquet",
    "studies": "studies.parquet",
    "texts": "texts.parquet"
  },
  "version": 1
}
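Because the manifest maps each table name to a filename, a directory can be checked for completeness before it is loaded. The following is a minimal sketch using only the standard library. It assumes the manifest layout printed above; `missing_tables` is a hypothetical helper written for illustration and is not part of NiMARE.

```python
import json
import tempfile
from pathlib import Path


def missing_tables(studyset_dir):
    """Return table names listed in studyset.json whose files are absent.

    Hypothetical helper for illustration; not a NiMARE function.
    """
    studyset_dir = Path(studyset_dir)
    manifest = json.loads((studyset_dir / "studyset.json").read_text())
    return sorted(
        name
        for name, filename in manifest.get("tables", {}).items()
        if not (studyset_dir / filename).is_file()
    )


# Demonstrate on a throwaway directory with one table deliberately missing.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    demo_manifest = {
        "format": "nimare-studyset-parquet",
        "tables": {
            "studies": "studies.parquet",
            "analyses": "analyses.parquet",
        },
        "version": 1,
    }
    (demo_dir / "studyset.json").write_text(json.dumps(demo_manifest))
    (demo_dir / "studies.parquet").touch()  # empty placeholder, not real parquet
    demo_missing = missing_tables(demo_dir)

print(demo_missing)
```

The helper only checks that the files exist. It does not open them or validate their columns.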
Inspect the parquet table shapes
The table layout is:
studies.parquet: one row per study, with study_id, name, description, authors, and publication.
analyses.parquet: one row per analysis, with the full analysis id.
coordinates.parquet: coordinate rows keyed by analysis id.
metadata.parquet: one row per analysis with metadata descriptors.
annotations.parquet: one row per analysis with annotation feature columns.
images.parquet: image references keyed by analysis id.
texts.parquet: text fields keyed by analysis id.
for table_file in sorted(parquet_dir.glob("*.parquet")):
    table = pd.read_parquet(table_file)
    print(f"{table_file.name}: {table.shape}")
analyses.parquet: (8, 4)
annotations.parquet: (8, 502)
coordinates.parquet: (47, 15)
images.parquet: (8, 5)
metadata.parquet: (8, 155)
studies.parquet: (4, 5)
texts.parquet: (8, 3)
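The printed shapes can be checked against the layout described above. This is a small sketch that hard-codes the shapes from this output rather than reading the files. The variable names are illustrative, and only the tables described as "one row per analysis" are compared against analyses.parquet.

```python
# Shapes copied from the output printed above: (rows, columns).
shapes = {
    "analyses.parquet": (8, 4),
    "annotations.parquet": (8, 502),
    "coordinates.parquet": (47, 15),
    "images.parquet": (8, 5),
    "metadata.parquet": (8, 155),
    "studies.parquet": (4, 5),
    "texts.parquet": (8, 3),
}

n_analyses = shapes["analyses.parquet"][0]

# Tables described as "one row per analysis" should match analyses.parquet.
per_analysis_tables = ["metadata.parquet", "annotations.parquet"]
mismatched = [
    name for name in per_analysis_tables if shapes[name][0] != n_analyses
]

# Coordinates are keyed by analysis id, so the mean is a simple ratio.
mean_coordinates = shapes["coordinates.parquet"][0] / n_analyses

print(f"Mismatched per-analysis tables: {mismatched}")
print(f"Mean coordinate rows per analysis: {mean_coordinates}")
```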
Load the Studyset
The constructor recognizes a parquet Studyset directory and returns a table-backed Studyset. The nested Study/Analysis object graph is not materialized during loading.
studyset = Studyset(parquet_dir)
print(studyset)
print(f"Studyset ID: {studyset.id}")
print(f"Number of studies: {len(studyset.study_ids)}")
print(f"Number of analyses: {len(studyset.ids)}")
print(f"Materialized nested objects? {studyset.is_materialized}")
Studyset: test-neurostore-parquet-studyset :: studies: 4
Studyset ID: test-neurostore-parquet-studyset
Number of studies: 4
Number of analyses: 8
Materialized nested objects? False
Work with table-backed views
The standard Studyset table views are available immediately.
print(studyset.coordinates.head())
print(studyset.metadata.head())
annotation_columns = [
    column
    for column in studyset.annotations_df.columns
    if column not in {"id", "study_id", "contrast_id"}
]
print(f"Annotation feature columns: {len(annotation_columns)}")
print(studyset.annotations_df[["id"] + annotation_columns[:5]].head())
id study_id contrast_id ... value_f p value_r
0 22XctM7fX2Dw-D68jH5p6HXSj 22XctM7fX2Dw D68jH5p6HXSj ... NaN NaN NaN
1 22XctM7fX2Dw-D68jH5p6HXSj 22XctM7fX2Dw D68jH5p6HXSj ... NaN NaN NaN
2 22XctM7fX2Dw-D68jH5p6HXSj 22XctM7fX2Dw D68jH5p6HXSj ... NaN NaN NaN
3 22XctM7fX2Dw-D68jH5p6HXSj 22XctM7fX2Dw D68jH5p6HXSj ... NaN NaN NaN
4 22XctM7fX2Dw-D68jH5p6HXSj 22XctM7fX2Dw D68jH5p6HXSj ... NaN NaN NaN
[5 rows x 15 columns]
id study_id ... ADNI als_diagnostic_criteria
0 22XctM7fX2Dw-D68jH5p6HXSj 22XctM7fX2Dw ... NaN NaN
1 22XctM7fX2Dw-Ghcw82nz5KLD 22XctM7fX2Dw ... NaN NaN
2 22iyhNgni5Du-fhbM8khTqcVx 22iyhNgni5Du ... NaN NaN
3 22iyhNgni5Du-hzpLkdj5mGWX 22iyhNgni5Du ... NaN NaN
4 22iyhNgni5Du-nAc8LAP6RwTB 22iyhNgni5Du ... NaN NaN
[5 rows x 155 columns]
Annotation feature columns: 499
id ... ParticipantDemographicsExtractor.groups[0].age_median
0 22XctM7fX2Dw-D68jH5p6HXSj ... NaN
1 22XctM7fX2Dw-Ghcw82nz5KLD ... NaN
2 22iyhNgni5Du-fhbM8khTqcVx ... NaN
3 22iyhNgni5Du-hzpLkdj5mGWX ... NaN
4 22iyhNgni5Du-nAc8LAP6RwTB ... NaN
[5 rows x 6 columns]
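In the rows printed above, each id reads as the study_id and contrast_id joined by a hyphen. The sketch below assumes that pattern holds and that contrast_id contains no hyphen. This is an inference from the output, not a documented guarantee, and `split_analysis_id` is a hypothetical helper rather than a NiMARE function.

```python
def split_analysis_id(analysis_id):
    """Split a full analysis id into (study_id, contrast_id).

    Assumes the "<study_id>-<contrast_id>" pattern seen in the printed rows
    and splits on the last hyphen. Hypothetical helper, not a NiMARE function.
    """
    study_id, sep, contrast_id = analysis_id.rpartition("-")
    if not sep or not study_id or not contrast_id:
        raise ValueError(f"Unexpected analysis id format: {analysis_id!r}")
    return study_id, contrast_id


# Ids copied from the output above.
example_ids = [
    "22XctM7fX2Dw-D68jH5p6HXSj",
    "22XctM7fX2Dw-Ghcw82nz5KLD",
    "22iyhNgni5Du-fhbM8khTqcVx",
]

# Group contrast ids under their study id.
analyses_by_study = {}
for analysis_id in example_ids:
    study_id, contrast_id = split_analysis_id(analysis_id)
    analyses_by_study.setdefault(study_id, []).append(contrast_id)

print(analyses_by_study)
```

Where the tables already expose study_id and contrast_id columns, as the coordinates table does here, reading those columns directly is safer than parsing the id string.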
Materialize only when needed
Accessing studyset.studies reconstructs nested Study, Analysis, and Point
objects from the parquet-backed tables. Most Studyset-aware NiMARE workflows
can use the table-backed views without this step.
first_study = studyset.studies[0]
print(f"First study: {first_study.id}")
print(f"Analyses in first study: {len(first_study.analyses)}")
print(f"Materialized nested objects? {studyset.is_materialized}")
First study: 22XctM7fX2Dw
Analyses in first study: 2
Materialized nested objects? True
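The behavior shown above, where flat views are available immediately and nested objects are built on first access, follows a common lazy-materialization pattern. The toy class below is a sketch of that pattern only. `LazyContainer` and its attributes are invented for illustration and do not describe how NiMARE's Studyset is implemented.

```python
class LazyContainer:
    """Toy table-backed container that builds nested objects on first access.

    Illustrative only; this is not NiMARE's Studyset implementation.
    """

    def __init__(self, rows):
        self._rows = rows      # flat "table" of (study_id, analysis_id) pairs
        self._studies = None   # nested view, built lazily

    @property
    def is_materialized(self):
        return self._studies is not None

    @property
    def ids(self):
        # Answerable from the flat rows without building nested objects.
        return [analysis_id for _, analysis_id in self._rows]

    @property
    def studies(self):
        # Build the nested mapping once, then reuse it on later accesses.
        if self._studies is None:
            nested = {}
            for study_id, analysis_id in self._rows:
                nested.setdefault(study_id, []).append(analysis_id)
            self._studies = nested
        return self._studies


container = LazyContainer([("s1", "s1-a"), ("s1", "s1-b"), ("s2", "s2-a")])
before = container.is_materialized
n_ids = len(container.ids)            # flat access keeps the container lazy
still_lazy = container.is_materialized
n_studies = len(container.studies)    # nested access triggers materialization
after = container.is_materialized

print(before, still_lazy, after)
```

The design choice is that flat queries never pay for building the nested structure, which matters most when the tables are large.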