Simple annotation from text

Extract simple term counts or tf-idf values from texts stored in a Studyset.

import os

from nimare import annotate, utils
from nimare.nimads import Studyset

Load Studyset with abstracts

studyset = Studyset(
    os.path.join(utils.get_resource_path(), "neurosynth_laird_studyset.json"),
    target="mni152_2mm",
)
studyset.texts.head(2)

	id	study_id	contrast_id	abstract
0	17029760-1	17029760	1	Repetitive transcranial magnetic stimulation (...
1	18760263-1	18760263	1	In an effort to clarify how deductive reasonin...

Generate term counts

Let’s start by extracting terms and their associated counts from article abstracts.

counts_df = annotate.text.generate_counts(
    studyset.texts,
    text_column="abstract",
    tfidf=False,
    max_df=0.99,
    min_df=0.01,
)
counts_df.head(5)

	10	10 brains	10 located	11	11 published	11 showing	17	17 sca17	2005	2005 major	2012	2012 evidence	aberrant	aberrant hotspots	abilities	abilities action	abnormal	abnormal sexual	abnormal structure	abstract	abstract cognitive	abstract emulation	accessible	accessible ensuing	accompanied	accompanied differential	accomplished	accomplished substrates	account	account common	accurate	accurate robust	acetylcholine	acetylcholine receptor	acquired	acquired standard	action	action cognition	action selection	activating	...	versus	versus baseline	vi	vi extent	vi ix	viewed	viewed problem	viib	viib viiia	viiia	viiia viiib	viiib	viiib cerebellar	vmpfc	vmpfc pcc	vmpfc posterior	vocalization	vocalization altered	voice	voice control	voice network	voice perturbation	vowel	vowel phonation	voxel	voxel applying	voxel morphometry	voxel syllable	voxels	voxels fp	way	way disrupted	weaknesses	weaknesses conventional	wernicke	wernicke responded	widespread	widespread functional	working	working memory
id
17029760-1	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
18760263-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0
19162389-1	0	0	0	2	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
19603407-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0
20197097-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

5 rows × 2520 columns
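To make the shape of this table concrete, here is a minimal pure-Python sketch of the same idea: count unigrams and bigrams per document, then keep only terms whose document frequency falls between min_df and max_df. It is an illustration, not NiMARE's implementation. The helper names (ngrams, count_terms) and the toy texts are hypothetical, tokenization is a plain whitespace split, stop-word removal is omitted, and the thresholds are assumed to be proportions of documents, as they are in scikit-learn's vectorizers.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def count_terms(texts, min_df=0.0, max_df=1.0):
    """Count n-grams per document (hypothetical helper, not a NiMARE function).

    A term is kept only if the share of documents containing it lies in
    [min_df, max_df].
    """
    per_doc = {
        doc_id: Counter(ngrams(text.lower().split()))
        for doc_id, text in texts.items()
    }
    n_docs = len(per_doc)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for counts in per_doc.values() for term in counts)
    vocab = sorted(
        term for term, df in doc_freq.items() if min_df <= df / n_docs <= max_df
    )
    # One row per document, one column per retained term (0 when absent).
    table = {
        doc_id: {term: counts[term] for term in vocab}
        for doc_id, counts in per_doc.items()
    }
    return table, vocab


# Toy abstracts keyed by a made-up "study-contrast" id.
toy_texts = {
    "a-1": "working memory load",
    "b-1": "working memory network",
    "c-1": "voice network",
}
counts, vocab = count_terms(toy_texts, min_df=0.5, max_df=1.0)
print(vocab)
print(counts["c-1"])
```

With min_df=0.5, only terms that occur in at least half of the toy documents survive, which is why rare bigrams such as "memory load" are dropped here.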

Generate tf-idf values

We can also extract term frequency-inverse document frequency (tf-idf) values from text using the same function. While the terms and values will differ based on the dataset provided and the settings used, this is the same general approach used to generate Neurosynth’s standard features.

tfidf_df = annotate.text.generate_counts(
    studyset.texts,
    text_column="abstract",
    tfidf=True,
    max_df=0.99,
    min_df=0.01,
)
tfidf_df.head(5)

	10	10 brains	10 located	11	11 published	11 showing	17	17 sca17	2005	2005 major	2012	2012 evidence	aberrant	aberrant hotspots	abilities	abilities action	abnormal	abnormal sexual	abnormal structure	abstract	abstract cognitive	abstract emulation	accessible	accessible ensuing	accompanied	accompanied differential	accomplished	accomplished substrates	account	account common	accurate	accurate robust	acetylcholine	acetylcholine receptor	acquired	acquired standard	action	action cognition	action selection	activating	...	versus	versus baseline	vi	vi extent	vi ix	viewed	viewed problem	viib	viib viiia	viiia	viiia viiib	viiib	viiib cerebellar	vmpfc	vmpfc pcc	vmpfc posterior	vocalization	vocalization altered	voice	voice control	voice network	voice perturbation	vowel	vowel phonation	voxel	voxel applying	voxel morphometry	voxel syllable	voxels	voxels fp	way	way disrupted	weaknesses	weaknesses conventional	wernicke	wernicke responded	widespread	widespread functional	working	working memory
id
17029760-1	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.063355	0.063355	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.000000	0.00000	0.00000	0.0	0.0	0.0	0.0
18760263-1	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.06714	0.06714	0.0	0.0	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.06714	0.06714	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.000000	0.06714	0.06714	0.0	0.0	0.0	0.0
19162389-1	0.0	0.0	0.0	0.130638	0.065319	0.065319	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.065319	0.065319	0.046599	0.0	0.0	0.065319	0.0	0.0	0.0	0.0	0.000000	0.000000	0.00000	0.00000	0.0	0.0	0.0	0.0
19603407-1	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.078713	0.078713	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.078713	0.078713	0.00000	0.00000	0.0	0.0	0.0	0.0
20197097-1	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.054908	0.054908	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00000	0.00000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.0	0.000000	0.0	0.0	0.0	0.0	0.000000	0.000000	0.00000	0.00000	0.0	0.0	0.0	0.0

5 rows × 2520 columns
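For intuition about where these values come from, the sketch below computes tf-idf weights from a small count table in pure Python. It assumes the weighting that scikit-learn's TfidfVectorizer applies by default: a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. Whether generate_counts uses exactly these settings is an assumption here, and the function name tfidf_weights and the toy counts are hypothetical.

```python
import math


def tfidf_weights(counts):
    """Turn {doc_id: {term: count}} into L2-normalized tf-idf weights.

    Hypothetical helper using a smoothed idf: ln((1 + n) / (1 + df)) + 1.
    """
    n_docs = len(counts)
    terms = sorted({term for row in counts.values() for term in row})
    # Document frequency: number of documents with a nonzero count.
    df = {
        term: sum(1 for row in counts.values() if row.get(term, 0) > 0)
        for term in terms
    }
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in terms}

    weights = {}
    for doc_id, row in counts.items():
        raw = {term: row.get(term, 0) * idf[term] for term in terms}
        norm = math.sqrt(sum(value * value for value in raw.values()))
        # Scale each row to unit Euclidean length; all-zero rows stay zero.
        weights[doc_id] = {
            term: (value / norm if norm else 0.0) for term, value in raw.items()
        }
    return weights


# Toy count table: "memory" appears in both documents, "voice" in only one.
toy_counts = {
    "a-1": {"memory": 2, "voice": 0},
    "b-1": {"memory": 1, "voice": 1},
}
weights = tfidf_weights(toy_counts)
print(weights["b-1"])
```

In document "b-1" both terms occur once, but "voice" receives the larger weight because it appears in fewer documents. Row normalization is also consistent with the table above, where terms with equal counts in the same abstract share identical values.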

Add annotations to the Studyset

Now we can add the generated annotations back into the Studyset object. The annotation functions return DataFrames with ‘id’ as the index, so we need to reset the index to make ‘id’ a column before assigning to the Studyset.

The assignment below replaces any existing annotations. To add to existing annotations instead of replacing them, merge the DataFrames: studyset.annotations_df = studyset.annotations_df.merge(tfidf_df.reset_index(), on='id', how='left')

studyset.annotations_df = tfidf_df.reset_index()

# Now the Studyset has the new annotations
print(f"Studyset now has {len(studyset.annotations_df.columns)} annotation columns")
studyset.annotations_df.head(5)

Studyset now has 2523 annotation columns

	id	study_id	contrast_id	11	11 published	11 showing	2005	2005 major	accomplished	accomplished substrates	accurate	accurate robust	acquired	acquired standard	...	viewed	viewed problem	vowel	vowel phonation	voxel	voxel syllable	weaknesses	weaknesses conventional	wernicke	wernicke responded
0	17029760-1	17029760	1	0.000000	0.000000	0.000000	0.063355	0.063355	0.00000	0.00000	0.000000	0.000000	0.000000	0.000000	...	0.00000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000
1	18760263-1	18760263	1	0.000000	0.000000	0.000000	0.000000	0.000000	0.06714	0.06714	0.000000	0.000000	0.000000	0.000000	...	0.06714	0.06714	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.06714	0.06714
2	19162389-1	19162389	1	0.130638	0.065319	0.065319	0.000000	0.000000	0.00000	0.00000	0.000000	0.000000	0.000000	0.000000	...	0.00000	0.00000	0.065319	0.065319	0.046599	0.065319	0.000000	0.000000	0.00000	0.00000
3	19603407-1	19603407	1	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000	0.078713	0.078713	0.000000	0.000000	...	0.00000	0.00000	0.000000	0.000000	0.000000	0.000000	0.078713	0.078713	0.00000	0.00000
4	20197097-1	20197097	1	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000	0.000000	0.000000	0.054908	0.054908	...	0.00000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000

5 rows × 2523 columns
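The merge alternative described in this section can be tried on toy pandas DataFrames before touching a real Studyset. In this sketch, existing stands in for a pre-existing studyset.annotations_df and new_terms stands in for tfidf_df. Both frames, their ids, and the manual_label column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for an existing studyset.annotations_df.
existing = pd.DataFrame(
    {
        "id": ["s1-1", "s2-1", "s3-1"],
        "study_id": ["s1", "s2", "s3"],
        "contrast_id": ["1", "1", "1"],
        "manual_label": [1, 0, 1],
    }
)

# Hypothetical stand-in for tfidf_df: "id" is the index, terms are columns.
new_terms = pd.DataFrame(
    {"memory": [0.8, 0.0], "voice": [0.0, 0.6]},
    index=pd.Index(["s1-1", "s2-1"], name="id"),
)

# reset_index() turns the "id" index into a column so it can act as merge key.
merged = existing.merge(new_terms.reset_index(), on="id", how="left")
print(merged)
```

Because how="left" keeps every row of the existing frame, ids with no new values (here "s3-1") get NaN in the added columns; calling fillna(0) on those columns is one way to handle that. If a term column shares its name with an existing column, pandas appends the suffixes _x and _y rather than overwriting.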
