Simple annotation from text
Perform simple term count or tf-idf value extraction from texts stored in a Studyset.
import os
from nimare import annotate, utils
from nimare.nimads import Studyset
Load Studyset with abstracts
studyset = Studyset(
os.path.join(utils.get_resource_path(), "neurosynth_laird_studyset.json"),
target="mni152_2mm",
)
studyset.texts.head(2)
| | id | study_id | contrast_id | abstract |
|---|---|---|---|---|
| 0 | 17029760-1 | 17029760 | 1 | Repetitive transcranial magnetic stimulation (... |
| 1 | 18760263-1 | 18760263 | 1 | In an effort to clarify how deductive reasonin... |
Generate term counts
Let’s start by extracting terms and their associated counts from article abstracts.
counts_df = annotate.text.generate_counts(
studyset.texts,
text_column="abstract",
tfidf=False,
max_df=0.99,
min_df=0.01,
)
counts_df.head(5)
| 10 | 10 brains | 10 located | 11 | 11 published | 11 showing | 17 | 17 sca17 | 2005 | 2005 major | 2012 | 2012 evidence | aberrant | aberrant hotspots | abilities | abilities action | abnormal | abnormal sexual | abnormal structure | abstract | abstract cognitive | abstract emulation | accessible | accessible ensuing | accompanied | accompanied differential | accomplished | accomplished substrates | account | account common | accurate | accurate robust | acetylcholine | acetylcholine receptor | acquired | acquired standard | action | action cognition | action selection | activating | ... | versus | versus baseline | vi | vi extent | vi ix | viewed | viewed problem | viib | viib viiia | viiia | viiia viiib | viiib | viiib cerebellar | vmpfc | vmpfc pcc | vmpfc posterior | vocalization | vocalization altered | voice | voice control | voice network | voice perturbation | vowel | vowel phonation | voxel | voxel applying | voxel morphometry | voxel syllable | voxels | voxels fp | way | way disrupted | weaknesses | weaknesses conventional | wernicke | wernicke responded | widespread | widespread functional | working | working memory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17029760-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18760263-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 19162389-1 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19603407-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20197097-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2520 columns
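To make the role of `min_df` and `max_df` concrete, here is a minimal pure-Python sketch of term counting with document-frequency filtering. It is not NiMARE's implementation: the helper names (`tokenize`, `ngrams`, `count_terms`) and the toy texts are hypothetical, stop-word removal is omitted, and the thresholds are assumed to be proportions of documents, as float values are in scikit-learn's vectorizers.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return [w for w in words if w]


def ngrams(words, n_max=2):
    """Return all 1..n_max-grams as space-joined strings."""
    return [
        " ".join(words[i : i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def count_terms(texts, min_df=0.01, max_df=0.99):
    """Count terms per document, keeping only terms whose document
    frequency (share of documents containing the term) lies in
    [min_df, max_df]."""
    per_doc = {doc_id: Counter(ngrams(tokenize(t))) for doc_id, t in texts.items()}
    # Each document contributes at most 1 to a term's document frequency.
    doc_freq = Counter(term for counts in per_doc.values() for term in counts)
    n_docs = len(texts)
    vocab = sorted(t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df)
    table = {doc_id: {t: counts[t] for t in vocab} for doc_id, counts in per_doc.items()}
    return table, vocab


# Hypothetical two-document corpus.
toy_texts = {
    "study-a": "working memory load modulates working memory networks",
    "study-b": "voice perturbation alters voice control networks",
}
counts, vocab = count_terms(toy_texts)
print(counts["study-a"]["working memory"])  # 2
print("networks" in vocab)  # False: it occurs in every document, above max_df
```

In this toy corpus, "networks" appears in both documents, so its document frequency of 1.0 exceeds `max_df=0.99` and it is dropped, while the bigram "working memory" is kept with a count of 2.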
Generate tf-idf values
We can also extract term frequency-inverse document frequency (tf-idf) values from text using the same function. While the terms and values will differ based on the dataset provided and the settings used, this is the same general approach used to generate Neurosynth’s standard features.
tfidf_df = annotate.text.generate_counts(
studyset.texts,
text_column="abstract",
tfidf=True,
max_df=0.99,
min_df=0.01,
)
tfidf_df.head(5)
| 10 | 10 brains | 10 located | 11 | 11 published | 11 showing | 17 | 17 sca17 | 2005 | 2005 major | 2012 | 2012 evidence | aberrant | aberrant hotspots | abilities | abilities action | abnormal | abnormal sexual | abnormal structure | abstract | abstract cognitive | abstract emulation | accessible | accessible ensuing | accompanied | accompanied differential | accomplished | accomplished substrates | account | account common | accurate | accurate robust | acetylcholine | acetylcholine receptor | acquired | acquired standard | action | action cognition | action selection | activating | ... | versus | versus baseline | vi | vi extent | vi ix | viewed | viewed problem | viib | viib viiia | viiia | viiia viiib | viiib | viiib cerebellar | vmpfc | vmpfc pcc | vmpfc posterior | vocalization | vocalization altered | voice | voice control | voice network | voice perturbation | vowel | vowel phonation | voxel | voxel applying | voxel morphometry | voxel syllable | voxels | voxels fp | way | way disrupted | weaknesses | weaknesses conventional | wernicke | wernicke responded | widespread | widespread functional | working | working memory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17029760-1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.063355 | 0.063355 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18760263-1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06714 | 0.06714 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06714 | 0.06714 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.06714 | 0.06714 | 0.0 | 0.0 | 0.0 | 0.0 |
| 19162389-1 | 0.0 | 0.0 | 0.0 | 0.130638 | 0.065319 | 0.065319 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.065319 | 0.065319 | 0.046599 | 0.0 | 0.0 | 0.065319 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 19603407-1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.078713 | 0.078713 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078713 | 0.078713 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20197097-1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.054908 | 0.054908 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 2520 columns
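As a rough illustration of how tf-idf weights relate to raw counts, the sketch below multiplies each count by a smoothed inverse document frequency and then L2-normalizes each document. The smoothing and normalization are one common textbook variant and an assumption here; the exact weighting NiMARE applies may differ. The `tfidf` helper and the toy counts are hypothetical.

```python
import math


def tfidf(counts_by_doc):
    """Weight raw counts by a smoothed idf, then L2-normalize each document."""
    n_docs = len(counts_by_doc)
    terms = sorted({t for counts in counts_by_doc.values() for t in counts})
    # Document frequency: number of documents in which the term occurs.
    df = {t: sum(1 for c in counts_by_doc.values() if c.get(t, 0) > 0) for t in terms}
    # Smoothed idf (assumed variant): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}
    weighted = {}
    for doc_id, counts in counts_by_doc.items():
        raw = {t: counts.get(t, 0) * idf[t] for t in terms}
        # Guard against an all-zero document to avoid dividing by zero.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted[doc_id] = {t: v / norm for t, v in raw.items()}
    return weighted


# Hypothetical counts for two documents.
toy_counts = {
    "study-a": {"memory": 2, "networks": 1},
    "study-b": {"voice": 2, "networks": 1},
}
weights = tfidf(toy_counts)
print(weights["study-a"])
```

Within a document, terms that share the same idf keep weights proportional to their counts. That is consistent with the table above, where the term `11` (counted twice in study 19162389-1) has twice the weight of `11 published` and `11 showing` (each counted once).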
Add annotations to the Studyset
Now we can add the generated annotations back into the Studyset object. The annotation functions return DataFrames with ‘id’ as the index, so we need to reset the index to make ‘id’ a column before assigning to the Studyset.
# Assigning directly (as done below) replaces any existing annotations.
# To add to existing annotations instead of replacing them, merge the DataFrames:
# studyset.annotations_df = studyset.annotations_df.merge(tfidf_df.reset_index(), on='id', how='left')
studyset.annotations_df = tfidf_df.reset_index()
# Now the Studyset has the new annotations
print(f"Studyset now has {len(studyset.annotations_df.columns)} annotation columns")
studyset.annotations_df.head(5)
Studyset now has 2523 annotation columns
| | id | study_id | contrast_id | 10 | 10 brains | 10 located | 11 | 11 published | 11 showing | 17 | 17 sca17 | 2005 | 2005 major | 2012 | 2012 evidence | aberrant | aberrant hotspots | abilities | abilities action | abnormal | abnormal sexual | abnormal structure | abstract | abstract cognitive | abstract emulation | accessible | accessible ensuing | accompanied | accompanied differential | accomplished | accomplished substrates | account | account common | accurate | accurate robust | acetylcholine | acetylcholine receptor | acquired | acquired standard | action | ... | versus | versus baseline | vi | vi extent | vi ix | viewed | viewed problem | viib | viib viiia | viiia | viiia viiib | viiib | viiib cerebellar | vmpfc | vmpfc pcc | vmpfc posterior | vocalization | vocalization altered | voice | voice control | voice network | voice perturbation | vowel | vowel phonation | voxel | voxel applying | voxel morphometry | voxel syllable | voxels | voxels fp | way | way disrupted | weaknesses | weaknesses conventional | wernicke | wernicke responded | widespread | widespread functional | working | working memory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17029760-1 | 17029760 | 1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.063355 | 0.063355 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 18760263-1 | 18760263 | 1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06714 | 0.06714 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06714 | 0.06714 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.06714 | 0.06714 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 19162389-1 | 19162389 | 1 | 0.0 | 0.0 | 0.0 | 0.130638 | 0.065319 | 0.065319 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.065319 | 0.065319 | 0.046599 | 0.0 | 0.0 | 0.065319 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 19603407-1 | 19603407 | 1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.078713 | 0.078713 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078713 | 0.078713 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 20197097-1 | 20197097 | 1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.054908 | 0.054908 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 2523 columns
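The difference between replacing and merging annotations can be tried on small pandas DataFrames before touching a real Studyset. In this sketch, `existing` and `new_terms` are hypothetical stand-ins for `studyset.annotations_df` and `tfidf_df`, and the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical existing annotations, with 'id' as an ordinary column.
existing = pd.DataFrame(
    {"id": ["s1-1", "s2-1"], "study_id": ["s1", "s2"], "manual_label": [1, 0]}
)

# Hypothetical new annotations, indexed by 'id' like the generate_counts output.
new_terms = pd.DataFrame(
    {"memory": [0.9, 0.0], "voice": [0.0, 0.8]},
    index=pd.Index(["s1-1", "s2-1"], name="id"),
)

# reset_index() moves the named index into an ordinary 'id' column.
as_columns = new_terms.reset_index()

# A left merge keeps every existing row and column and appends the new ones.
merged = existing.merge(as_columns, on="id", how="left")
print(list(merged.columns))
# ['id', 'study_id', 'manual_label', 'memory', 'voice']
```

With `how="left"`, rows in `existing` that have no match in the new annotations are kept and filled with NaN, so no study is silently dropped.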