LDA topic modeling

This example trains a latent Dirichlet allocation (LDA) topic model on abstracts from Neurosynth, using NiMARE's LDAModel, which relies on scikit-learn.

import os

import pandas as pd

from nimare import annotate
from nimare.dataset import Dataset
from nimare.utils import get_resource_path

Load dataset with abstracts

dset = Dataset(os.path.join(get_resource_path(), "neurosynth_laird_studies.json"))

Initialize LDA model

model = annotate.lda.LDAModel(n_topics=5, max_iter=1000, text_column="abstract")

Run model

new_dset = model.fit(dset)

View results

The annotations DataFrame of the fitted Dataset is very large, so we will only show a slice of it: the first few columns and the first 10 rows.

new_dset.annotations[new_dset.annotations.columns[:10]].head(10)

	id	study_id	contrast_id	Neurosynth_TFIDF__10	Neurosynth_TFIDF__11
0	17029760-1	17029760	1	0.000000	0.000000
1	18760263-1	18760263	1	0.000000	0.000000
2	19162389-1	19162389	1	0.000000	0.176321
3	19603407-1	19603407	1	0.000000	0.000000
4	20197097-1	20197097	1	0.000000	0.000000
5	22569543-1	22569543	1	0.000000	0.000000
6	22659444-1	22659444	1	0.000000	0.000000
7	23042731-1	23042731	1	0.000000	0.000000
8	23702412-1	23702412	1	0.061006	0.000000
9	24681401-1	24681401	1	0.000000	0.000000
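The slice above shows only identifier columns and Neurosynth TF-IDF columns. As a hedged sketch of how the topic columns could be pulled out instead, the snippet below filters a toy annotations table by column prefix. The "LDA5__" prefix is an assumption based on the topic names shown elsewhere in this example, and the weights in the toy table are made up.

```python
import pandas as pd

# Toy stand-in for new_dset.annotations; the LDA weights are hypothetical
toy_annotations = pd.DataFrame(
    {
        "id": ["17029760-1", "18760263-1"],
        "study_id": ["17029760", "18760263"],
        "contrast_id": ["1", "1"],
        "Neurosynth_TFIDF__10": [0.0, 0.0],
        "LDA5__1_functional_cbp_literature": [0.9, 0.1],
        "LDA5__2_cortex_prefrontal_lateral": [0.1, 0.9],
    }
)

# Keep only the columns whose names start with the assumed LDA prefix
lda_columns = [c for c in toy_annotations.columns if c.startswith("LDA5__")]
lda_weights = toy_annotations.set_index("id")[lda_columns]
print(lda_weights)
```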

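The topic-by-term weights are stored in model.distributions_["p_topic_g_word_df"]. Because this DataFrame is very wide (one column per term), we transpose it before presenting it, so that terms are rows and topics are columns.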

model.distributions_["p_topic_g_word_df"].T.head(10)

	LDA5__1_functional_cbp_literature	LDA5__2_cortex_prefrontal_lateral	LDA5__3_connectivity_functional_anterior	LDA5__4_motor_cortex_functional	LDA5__5_social_functional_maps
10	1.001059	0.001000	0.001000	0.001	1.000941
abstract	0.001000	0.001000	1.000677	0.001	1.001323
action	0.001000	0.001000	0.001000	0.001	2.001000
active	0.001000	3.001603	1.000397	0.001	0.001000
addition	1.967435	1.679096	1.356469	0.001	0.001000
additionally	0.001000	2.001000	0.001000	0.001	0.001000
affective	1.000970	0.001000	2.000340	0.001	3.001691
affective processes	0.001000	0.001000	2.001000	0.001	0.001000
ale	1.001000	0.001000	0.001000	0.001	1.001000
altered	0.001000	1.000236	0.001000	0.001	3.001764
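Several values in the table above exceed 1, so they read as unnormalized topic-term weights rather than probabilities; that reading is an assumption drawn from the output alone. Under that assumption, a minimal sketch of turning weights into per-topic proportions is to divide each topic column by its total. The rows below are copied from the table, the topic names are shortened, and normalizing over only three terms is purely illustrative.

```python
import pandas as pd

# Three rows copied from the table above; column names shortened for brevity
weights = pd.DataFrame(
    {
        "topic_1": [0.001, 0.001, 1.967435],
        "topic_2": [0.001, 3.001603, 1.679096],
    },
    index=["action", "active", "addition"],
)

# Divide every topic column by its sum so that each column adds up to 1
probs = weights / weights.sum(axis=0)
print(probs.round(3))
```

On the real table, the same division would run over all terms rather than this three-term subset.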

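# Number of highest-weighted terms to list for each topic
n_top_terms = 10
# Transpose so that terms are rows and topics are columns
top_term_df = model.distributions_["p_topic_g_word_df"].T
temp_df = top_term_df.copy()
# Empty table with one column per topic and one row per rank
top_term_df = pd.DataFrame(columns=top_term_df.columns, index=range(n_top_terms))
top_term_df.index.name = "Token"
for col in top_term_df.columns:
    # Sort terms by their weight in this topic and keep the top n_top_terms
    top_tokens = temp_df.sort_values(by=col, ascending=False).index.tolist()[:n_top_terms]
    top_term_df.loc[:, col] = top_tokens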

top_term_df

	LDA5__1_functional_cbp_literature	LDA5__2_cortex_prefrontal_lateral	LDA5__3_connectivity_functional_anterior	LDA5__4_motor_cortex_functional	LDA5__5_social_functional_maps
Token
0	functional	cortex	connectivity	motor	social
1	cbp	prefrontal	functional	cortex	functional
2	literature	lateral	anterior	functional	maps
3	parcellation	stimulation	functional connectivity	reflecting	human
4	clusters	identified	macm	primary	connectivity
5	higher	cognition	posterior	shown	function
6	talairach	prefrontal cortex	cognitive	role	structural
7	frontal	network	insula	presence	functionally
8	ventromedial	control	seed	task	structure function
9	analytic	frontal	approaches	voxel	altered
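The top-term extraction above can also be written more compactly with pandas' Series.nlargest. The sketch below uses a small toy table with made-up weights and hypothetical topic names, not the fitted model's output.

```python
import pandas as pd

# Toy term-by-topic weights (hypothetical values)
term_weights = pd.DataFrame(
    {
        "topic_a": [3.0, 0.5, 2.0, 1.0],
        "topic_b": [0.1, 4.0, 0.2, 2.5],
    },
    index=["motor", "social", "cortex", "maps"],
)

n_top = 2
# For each topic, take the index labels (terms) of the n_top largest weights
top_terms = pd.DataFrame(
    {
        topic: term_weights[topic].nlargest(n_top).index.tolist()
        for topic in term_weights.columns
    }
)
top_terms.index.name = "Token"
print(top_terms)
```

Because nlargest returns values in descending order, row 0 holds each topic's highest-weighted term, matching the layout of the table above.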

Total running time of the script: ( 0 minutes 3.523 seconds)
