nimare.annotate.text

Text extraction tools.

Functions

`download_abstracts`(dataset, email)	Download the abstracts for a list of PubMed IDs.
`generate_cooccurrence`(text_df[, …])	Build co-occurrence matrix from documents.
`generate_counts`(text_df[, text_column, tfidf])	Generate tf-idf weights for unigrams/bigrams derived from textual data.
`uk_to_us`(text)	Convert UK spellings to US based on a converter.

download_abstracts(dataset, email)

Download the abstracts for a list of PubMed IDs. Uses the Biopython package.

Parameters:	dataset (`nimare.dataset.Dataset` or `list` of `str`) – A Dataset object where IDs are in the form PMID-EXPID, or a list of PubMed IDs.
	email (`str`) – Email address to use when calling the PubMed API.
Returns:	dataset – Dataset with abstracts added.
Return type:	`nimare.dataset.Dataset` or `list` of `str`
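
The PMID-EXPID convention described above can be illustrated without touching the network. This is a minimal sketch, assuming the PubMed ID is everything before the first hyphen; the helper name `extract_pmids` and the sample IDs are hypothetical and not part of NiMARE.

```python
def extract_pmids(study_ids):
    """Return the unique PubMed IDs found in PMID-EXPID strings, sorted numerically."""
    # Assumption: the PMID is the part before the first hyphen.
    pmids = {study_id.split("-", 1)[0] for study_id in study_ids}
    return sorted(pmids, key=int)


# Hypothetical IDs: two experiments from one paper, one from another.
ids = ["12345678-1", "12345678-2", "23456789-1"]
print(extract_pmids(ids))  # prints ['12345678', '23456789']
```

Deduplicating first means each abstract would only need to be requested once, even when a paper contributes several experiments.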

generate_cooccurrence(text_df, text_column='abstract', vocabulary=None, window=5)

Build a co-occurrence matrix from documents. This is not the same approach as the one used by the GloVe model.

Parameters:	text_df ((D x 2) `pandas.DataFrame`) – A DataFrame with two columns: ‘id’ and the text column named by text_column. D = document.
	text_column (`str`, optional) – Name of the column in text_df that contains the text. Default is ‘abstract’.
	vocabulary (`list`, optional) – List of words in the vocabulary to extract from the text.
	window (`int`, optional) – Window size for co-occurrence. Words that appear within window words of one another co-occur. Default is 5.
Returns:	df – One co-occurrence matrix per document in text_df.
Return type:	(V, V, D) `pandas.Panel`
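
The window rule can be sketched for a single document in plain Python. This is an illustration of windowed co-occurrence counting in general, not NiMARE's implementation: the function name `cooccurrence_counts`, the regex tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import defaultdict


def cooccurrence_counts(text, vocabulary, window=5):
    """Count how often pairs of vocabulary words fall within `window` tokens."""
    # Assumption: lowercase alphabetic tokens; the real tokenizer may differ.
    tokens = re.findall(r"[a-z]+", text.lower())
    vocab = set(vocabulary)
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        if word not in vocab:
            continue
        # Look ahead at most `window` tokens so each pair is visited once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other in vocab and other != word:
                counts[(word, other)] += 1
                counts[(other, word)] += 1  # keep the matrix symmetric
    return dict(counts)


doc = "Pain activates the insula and the amygdala responds to fear"
vocab = ["pain", "insula", "amygdala", "fear"]
counts = cooccurrence_counts(doc, vocab, window=5)
print(counts[("pain", "insula")])        # prints 1 (three tokens apart)
print(("pain", "amygdala") in counts)    # prints False (six tokens apart)
```

Running this once per document and stacking the resulting V x V tables would give one matrix per document, matching the shape described above.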

generate_counts(text_df, text_column='abstract', tfidf=True)

Generate tf-idf weights for unigrams/bigrams derived from textual data.

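Parameters:	text_df ((D x 2) `pandas.DataFrame`) – A DataFrame with two columns: ‘id’ and the text column named by text_column. D = document.
	text_column (`str`, optional) – Name of the column in text_df that contains the text. Default is ‘abstract’.
	tfidf (`bool`, optional) – Whether to compute tf-idf weights. Default is True.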
Returns:	weights_df – A DataFrame where the index is ‘id’ and the columns are the unigrams/bigrams derived from the data. D = document. T = term.
Return type:	(D x T) `pandas.DataFrame`
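
To make the D x T layout concrete, here is a standard-library sketch of tf-idf weighting over unigrams and bigrams. It assumes one common variant of the formula, weight = tf * log(D / df), which may differ from the weighting NiMARE applies; the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def ngram_counts(text):
    """Count unigrams and bigrams in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def tfidf_weights(docs):
    """Map each document id to {term: tf * idf}, with idf = log(D / df)."""
    counts = {doc_id: ngram_counts(text) for doc_id, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        doc_id: {term: n * math.log(n_docs / doc_freq[term]) for term, n in c.items()}
        for doc_id, c in counts.items()
    }


# Hypothetical two-document corpus keyed by 'id'.
docs = {
    "study-1": "pain network",
    "study-2": "fear network",
}
weights = tfidf_weights(docs)
print(weights["study-1"]["network"])  # prints 0.0, since the term is in every document
```

Each outer key corresponds to a row (document) and each inner key to a column (term). Under this variant, terms shared by all documents receive zero weight, while terms unique to one document receive the highest.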

uk_to_us(text)

Convert UK spellings to US based on a converter.
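
A dictionary-based converter of this kind might look like the sketch below. The three-word mapping and the name `convert_spelling` are hypothetical stand-ins; the converter that `uk_to_us` actually uses is not described here.

```python
import re

# Hypothetical mapping for illustration only.
UK_TO_US = {
    "behaviour": "behavior",
    "colour": "color",
    "lateralised": "lateralized",
}

# Match any mapped word, but only as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, UK_TO_US)) + r")\b")


def convert_spelling(text):
    """Replace whole-word UK spellings with their US equivalents."""
    return _PATTERN.sub(lambda match: UK_TO_US[match.group(0)], text)


print(convert_spelling("colour and behaviour are lateralised"))
# prints: color and behavior are lateralized
```

This sketch is case-sensitive and only matches whole words, so derived forms such as "behavioural" pass through unchanged unless they are added to the mapping.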