Simple annotation from text

Perform simple term count or tf-idf value extraction from texts stored in a Dataset.

import os

from nimare import annotate, dataset, utils

Load dataset with abstracts

We’ll load a small dataset composed only of studies in Neurosynth with Angela Laird as a coauthor, for the sake of speed.

dset = dataset.Dataset(os.path.join(utils.get_resource_path(), "neurosynth_laird_studies.json"))
dset.texts.head(2)
id study_id contrast_id abstract
0 17029760-1 17029760 1 Repetitive transcranial magnetic stimulation (...
1 18760263-1 18760263 1 In an effort to clarify how deductive reasonin...


Generate term counts

Let’s start by extracting terms and their associated counts from article abstracts.

counts_df = annotate.text.generate_counts(
    dset.texts,
    text_column="abstract",
    tfidf=False,
    max_df=0.99,
    min_df=0.01,
)
counts_df.head(5)
10 10 brains 10 located 11 11 published 11 showing 17 17 sca17 2005 2005 major 2012 2012 evidence aberrant aberrant hotspots abilities abilities action abnormal abnormal sexual abnormal structure abstract abstract cognitive abstract emulation accessible accessible ensuing accompanied accompanied differential accomplished accomplished substrates account account common accurate accurate robust acetylcholine acetylcholine receptor acquired acquired standard action action cognition action selection activating ... versus versus baseline vi vi extent vi ix viewed viewed problem viib viib viiia viiia viiia viiib viiib viiib cerebellar vmpfc vmpfc pcc vmpfc posterior vocalization vocalization altered voice voice control voice network voice perturbation vowel vowel phonation voxel voxel applying voxel morphometry voxel syllable voxels voxels fp way way disrupted weaknesses weaknesses conventional wernicke wernicke responded widespread widespread functional working working memory
id
17029760-1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18760263-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
19162389-1 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
19603407-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
20197097-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 2520 columns



Generate term counts

We can also extract term frequency-inverse document frequency (tf-idf) values from text using the same function. While the terms and values will differ based on the dataset provided and the settings used, this is the same general approach used to generate Neurosynth’s standard features.

tfidf_df = annotate.text.generate_counts(
    dset.texts,
    text_column="abstract",
    tfidf=True,
    max_df=0.99,
    min_df=0.01,
)
tfidf_df.head(5)
10 10 brains 10 located 11 11 published 11 showing 17 17 sca17 2005 2005 major 2012 2012 evidence aberrant aberrant hotspots abilities abilities action abnormal abnormal sexual abnormal structure abstract abstract cognitive abstract emulation accessible accessible ensuing accompanied accompanied differential accomplished accomplished substrates account account common accurate accurate robust acetylcholine acetylcholine receptor acquired acquired standard action action cognition action selection activating ... versus versus baseline vi vi extent vi ix viewed viewed problem viib viib viiia viiia viiia viiib viiib viiib cerebellar vmpfc vmpfc pcc vmpfc posterior vocalization vocalization altered voice voice control voice network voice perturbation vowel vowel phonation voxel voxel applying voxel morphometry voxel syllable voxels voxels fp way way disrupted weaknesses weaknesses conventional wernicke wernicke responded widespread widespread functional working working memory
id
17029760-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.063355 0.063355 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0
18760263-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.06714 0.06714 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.06714 0.06714 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.06714 0.06714 0.0 0.0 0.0 0.0
19162389-1 0.0 0.0 0.0 0.130638 0.065319 0.065319 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.065319 0.065319 0.046599 0.0 0.0 0.065319 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0
19603407-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.078713 0.078713 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.078713 0.078713 0.00000 0.00000 0.0 0.0 0.0 0.0
20197097-1 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.054908 0.054908 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.0 0.0 0.0 0.0

5 rows × 2520 columns



Total running time of the script: (0 minutes 0.534 seconds)

Gallery generated by Sphinx-Gallery