Simple annotation from text

Extract simple term counts or tf-idf values from texts stored in a Studyset.

import os

from nimare import annotate, utils
from nimare.nimads import Studyset

Load Studyset with abstracts

studyset = Studyset(
    os.path.join(utils.get_resource_path(), "neurosynth_laird_studyset.json"),
    target="mni152_2mm",
)
studyset.texts.head(2)
           id  study_id  contrast_id                                           abstract
0  17029760-1  17029760            1  Repetitive transcranial magnetic stimulation (...
1  18760263-1  18760263            1  In an effort to clarify how deductive reasonin...


Generate term counts

Let’s start by extracting terms and their associated counts from article abstracts.

counts_df = annotate.text.generate_counts(
    studyset.texts,
    text_column="abstract",
    tfidf=False,
    max_df=0.99,
    min_df=0.01,
)
counts_df.head(5)
10 10 brains 10 located 11 11 published 11 showing 17 17 sca17 2005 2005 major 2012 2012 evidence aberrant aberrant hotspots abilities abilities action abnormal abnormal sexual abnormal structure abstract abstract cognitive abstract emulation accessible accessible ensuing accompanied accompanied differential accomplished accomplished substrates account account common accurate accurate robust acetylcholine acetylcholine receptor acquired acquired standard action action cognition action selection activating ... versus versus baseline vi vi extent vi ix viewed viewed problem viib viib viiia viiia viiia viiib viiib viiib cerebellar vmpfc vmpfc pcc vmpfc posterior vocalization vocalization altered voice voice control voice network voice perturbation vowel vowel phonation voxel voxel applying voxel morphometry voxel syllable voxels voxels fp way way disrupted weaknesses weaknesses conventional wernicke wernicke responded widespread widespread functional working working memory
id
17029760-1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18760263-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
19162389-1 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
19603407-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
20197097-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 2520 columns
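To make the role of min_df and max_df concrete, here is a minimal pure-Python sketch of term counting on toy texts. It is not NiMARE's implementation: the helper names (ngrams, count_terms) and the toy documents are hypothetical, tokenization is a plain whitespace split with no stop-word removal, and the thresholds are assumed to be proportions of documents containing a term, as in scikit-learn's vectorizers.

```python
from collections import Counter


def ngrams(text, max_n=2):
    """Return lowercase unigrams and bigrams from whitespace-split text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_terms(docs, min_df=0.0, max_df=1.0):
    """Count terms per document (hypothetical helper, not part of NiMARE).

    A term is kept only if the fraction of documents containing it
    lies in the closed interval [min_df, max_df].
    """
    per_doc = {doc_id: Counter(ngrams(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter()
    for counts in per_doc.values():
        doc_freq.update(counts.keys())

    vocab = sorted(
        term for term, df in doc_freq.items() if min_df <= df / n_docs <= max_df
    )
    rows = {
        doc_id: [counts[term] for term in vocab]
        for doc_id, counts in per_doc.items()
    }
    return vocab, rows


# Toy texts standing in for abstracts; the ids mimic the "<study>-<contrast>" pattern.
toy_docs = {
    "a-1": "working memory load",
    "b-1": "working memory and voice control",
    "c-1": "voice control network",
}
vocab, rows = count_terms(toy_docs, min_df=0.5, max_df=1.0)
print(vocab)
for doc_id, counts in rows.items():
    print(doc_id, counts)
```

With min_df=0.5, terms that occur in only one of the three toy documents (such as "load" or "network") are dropped, which is the same kind of pruning that keeps the real table above to 2520 columns.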



Generate tf-idf values

We can also extract term frequency-inverse document frequency (tf-idf) values from text using the same function. While the terms and values will differ based on the dataset provided and the settings used, this is the same general approach used to generate Neurosynth’s standard features.

tfidf_df = annotate.text.generate_counts(
    studyset.texts,
    text_column="abstract",
    tfidf=True,
    max_df=0.99,
    min_df=0.01,
)
tfidf_df.head(5)
10 10 brains 10 located 11 11 published 11 showing 17 17 sca17 2005 2005 major 2012 2012 evidence aberrant aberrant hotspots abilities abilities action abnormal abnormal sexual abnormal structure abstract abstract cognitive abstract emulation accessible accessible ensuing accompanied accompanied differential accomplished accomplished substrates account account common accurate accurate robust acetylcholine acetylcholine receptor acquired acquired standard action action cognition action selection activating ... versus versus baseline vi vi extent vi ix viewed viewed problem viib viib viiia viiia viiia viiib viiib viiib cerebellar vmpfc vmpfc pcc vmpfc posterior vocalization vocalization altered voice voice control voice network voice perturbation vowel vowel phonation voxel voxel applying voxel morphometry voxel syllable voxels voxels fp way way disrupted weaknesses weaknesses conventional wernicke wernicke responded widespread widespread functional working working memory
id
17029760-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.063355 0.063355 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0
18760263-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.06714 0.06714 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.06714 0.06714 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.06714 0.06714 0.0 0.0 0.0 0.0
19162389-1 0.0 0.0 0.0 0.130638 0.065319 0.065319 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.065319 0.065319 0.046599 0.0 0.0 0.065319 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0
19603407-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.078713 0.078713 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.078713 0.078713 0.00000 0.00000 0.0 0.0 0.0 0.0
20197097-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.054908 0.054908 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0

5 rows × 2520 columns
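As a rough illustration of what a tf-idf value measures, the sketch below computes the textbook form, tf × ln(N / df), on toy texts. This is an assumption-laden simplification: the tfidf helper and the toy documents are hypothetical, and library implementations typically add smoothing and normalization, so the numbers will not match the table above.

```python
import math
from collections import Counter


def tfidf(docs):
    """Textbook tf-idf per document (hypothetical helper, not NiMARE's code).

    tf  = term count / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency of each term.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


toy_docs = {
    "a-1": "brain working memory",
    "b-1": "brain working voice",
    "c-1": "brain voice control",
}
weights = tfidf(toy_docs)
for term, value in sorted(weights["a-1"].items()):
    print(f"{term}: {value:.4f}")
```

In this toy corpus "brain" appears in every document, so its weight is zero, while "memory", which appears in only one document, receives the largest weight in "a-1". That is the intuition behind using tf-idf rather than raw counts: terms that distinguish a study are weighted up, and ubiquitous terms are weighted down.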



Add annotations to the Studyset

Now we can add the generated annotations back into the Studyset object. The annotation functions return DataFrames with ‘id’ as the index, so we need to reset the index to make ‘id’ a column before assigning to the Studyset.

Assigning to annotations_df replaces any existing annotations. To add to the existing annotations instead of replacing them, merge the two DataFrames on 'id': studyset.annotations_df = studyset.annotations_df.merge(tfidf_df.reset_index(), on='id', how='left')

studyset.annotations_df = tfidf_df.reset_index()

# Now the Studyset has the new annotations
print(f"Studyset now has {len(studyset.annotations_df.columns)} annotation columns")
studyset.annotations_df.head(5)
Studyset now has 2523 annotation columns
id study_id contrast_id 10 10 brains 10 located 11 11 published 11 showing 17 17 sca17 2005 2005 major 2012 2012 evidence aberrant aberrant hotspots abilities abilities action abnormal abnormal sexual abnormal structure abstract abstract cognitive abstract emulation accessible accessible ensuing accompanied accompanied differential accomplished accomplished substrates account account common accurate accurate robust acetylcholine acetylcholine receptor acquired acquired standard action ... versus versus baseline vi vi extent vi ix viewed viewed problem viib viib viiia viiia viiia viiib viiib viiib cerebellar vmpfc vmpfc pcc vmpfc posterior vocalization vocalization altered voice voice control voice network voice perturbation vowel vowel phonation voxel voxel applying voxel morphometry voxel syllable voxels voxels fp way way disrupted weaknesses weaknesses conventional wernicke wernicke responded widespread widespread functional working working memory
0 17029760-1 17029760 1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.063355 0.063355 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0
1 18760263-1 18760263 1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.06714 0.06714 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.06714 0.06714 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.06714 0.06714 0.0 0.0 0.0 0.0
2 19162389-1 19162389 1 0.0 0.0 0.0 0.130638 0.065319 0.065319 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.065319 0.065319 0.046599 0.0 0.0 0.065319 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0
3 19603407-1 19603407 1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.078713 0.078713 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.078713 0.078713 0.00000 0.00000 0.0 0.0 0.0 0.0
4 20197097-1 20197097 1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.054908 0.054908 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0

5 rows × 2523 columns
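The merge alternative mentioned above can be sketched with small stand-in DataFrames. The column "manual_label" and all values here are made up for illustration; only the pandas calls (reset_index and merge with on="id", how="left") mirror the approach described in the text.

```python
import pandas as pd

# Stand-in for an existing annotations table with one row per analysis.
existing = pd.DataFrame(
    {
        "id": ["a-1", "b-1", "c-1"],
        "study_id": ["a", "b", "c"],
        "contrast_id": ["1", "1", "1"],
        "manual_label": [1, 0, 1],  # hypothetical pre-existing annotation
    }
)

# Stand-in for the output of an annotation function: 'id' is the index,
# and one analysis ("c-1") is missing.
new_terms = pd.DataFrame(
    {"working memory": [0.2, 0.0]},
    index=pd.Index(["a-1", "b-1"], name="id"),
)

# reset_index() turns 'id' into a column so it can serve as the merge key.
# how="left" keeps every row of the existing table.
merged = existing.merge(new_terms.reset_index(), on="id", how="left")
print(merged)
```

With a left merge, analyses that have no row in the new table keep their existing annotations and get NaN in the new columns, so it may be worth checking for missing values before using the merged annotations downstream.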


