Train a GCLDA model and use it

This example trains a generalized correspondence latent Dirichlet allocation model using abstracts from Neurosynth and then uses it for decoding.

Warning

The model in this example is trained using (1) a very small, nonrepresentative dataset and (2) very few iterations. As such, it will not provide useful results. If you are interested in using GCLDA, we recommend using a large dataset like Neurosynth, and training with at least 10k iterations.

import os

import nibabel as nib
import numpy as np
from nilearn import image, masking, plotting

import nimare
from nimare import annotate, decode
from nimare.tests.utils import get_test_data_path

Load dataset with abstracts

We’ll load a small dataset composed only of studies in Neurosynth with Angela Laird as a coauthor, for the sake of speed.

dset = nimare.dataset.Dataset(os.path.join(get_test_data_path(), "neurosynth_laird_studies.json"))
dset.texts.head(2)

	id	study_id	contrast_id	abstract
0	17029760-1	17029760	1	Repetitive transcranial magnetic stimulation (...
1	18760263-1	18760263	1	In an effort to clarify how deductive reasonin...

Generate term counts

GCLDA uses raw word counts instead of the tf-idf values generated by Neurosynth.

counts_df = annotate.text.generate_counts(
    dset.texts,
    text_column="abstract",
    tfidf=False,
    max_df=0.99,
    min_df=0.01,
)
counts_df.head(5)

Out:

/home/docs/checkouts/readthedocs.org/user_builds/nimare/envs/0.0.10/lib/python3.7/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)

	10	10 brains	10 located	11	11 published	11 showing	17	17 sca17	2005	2005 major	2012	2012 evidence	aberrant	aberrant hotspots	abilities	abilities action	abnormal	abnormal sexual	abnormal structure	abstract	abstract cognitive	abstract emulation	accessible	accessible ensuing	accompanied	accompanied differential	accomplished	accomplished substrates	account	account common	accurate	accurate robust	acetylcholine	acetylcholine receptor	acquired	acquired standard	action	action cognition	action selection	activating	...	versus	versus baseline	vi	vi extent	vi ix	viewed	viewed problem	viib	viib viiia	viiia	viiia viiib	viiib	viiib cerebellar	vmpfc	vmpfc pcc	vmpfc posterior	vocalization	vocalization altered	voice	voice control	voice network	voice perturbation	vowel	vowel phonation	voxel	voxel applying	voxel morphometry	voxel syllable	voxels	voxels fp	way	way disrupted	weaknesses	weaknesses conventional	wernicke	wernicke responded	widespread	widespread functional	working	working memory
id
17029760-1	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
18760263-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0
19162389-1	0	0	0	2	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
19603407-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0
20197097-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

5 rows × 2520 columns

Run model

Five iterations will take ~10 minutes with the full Neurosynth dataset. It’s much faster with this reduced example dataset. Note that we’re using only 10 topics here. This is because there are only 13 studies in the dataset. If the number of topics is higher than the number of studies in the dataset, errors can occur during training.

model = annotate.gclda.GCLDAModel(
    counts_df,
    dset.coordinates,
    mask=dset.masker.mask_img,
    n_topics=10,
    n_regions=4,
    symmetric=True,
)
model.fit(n_iters=100, loglikely_freq=20)
model.save("gclda_model.pkl.gz")

# Let's remove the model now that you know how to generate it.
os.remove("gclda_model.pkl.gz")

Out:

/home/docs/checkouts/readthedocs.org/user_builds/nimare/checkouts/0.0.10/nimare/annotate/gclda.py:180: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
  count_df = count_df.drop("id", 1)

Look at topics

topic_img_4d = masking.unmask(model.p_voxel_g_topic_.T, model.mask)
for i_topic in range(5):
    topic_img_3d = image.index_img(topic_img_4d, i_topic)
    plotting.plot_stat_map(
        topic_img_3d,
        draw_cross=False,
        colorbar=False,
        annotate=False,
        title=f"Topic {i_topic + 1}",
    )

Generate a pseudo-statistic image from text

text = "dorsal anterior cingulate cortex"
encoded_img, _ = decode.encode.gclda_encode(model, text)
plotting.plot_stat_map(encoded_img, draw_cross=False)

Out:

<nilearn.plotting.displays.OrthoSlicer object at 0x7f1889abba50>

Decode an unthresholded statistical map

For the sake of simplicity, we will use the pseudo-statistic map generated in the previous step.

# Run the decoder
decoded_df, _ = decode.continuous.gclda_decode_map(model, encoded_img)
decoded_df.sort_values(by="Weight", ascending=False).head(10)

	Weight
Term
cognition	0.053079
functions	0.053079
pmd	0.047181
cortex	0.042285
networks	0.037038
anterior	0.035490
fp	0.030304
frontal	0.029671
cbp	0.029488
functionally	0.029488

Decode an ROI image

First we’ll make an ROI

arr = np.zeros(dset.masker.mask_img.shape, int)
arr[65:75, 50:60, 50:60] = 1
mask_img = nib.Nifti1Image(arr, dset.masker.mask_img.affine)
plotting.plot_roi(mask_img, draw_cross=False)

Out:

<nilearn.plotting.displays.OrthoSlicer object at 0x7f1889a7f190>

Run the decoder

decoded_df, _ = decode.discrete.gclda_decode_roi(model, mask_img)
decoded_df.sort_values(by="Weight", ascending=False).head(10)

	Weight
Term
motor	26.539804
seed	21.714385
structural	19.301676
cortex	16.896085
behavioral	16.888966
methods	12.063550
speech	12.063549
connectivity modeling	12.063547
sca17	12.063547
tms pet	9.650838

Total running time of the script: ( 0 minutes 41.164 seconds)

Gallery generated by Sphinx-Gallery