nimare.annotate.text

Text extraction tools.

Functions

download_abstracts(dataset, email) Download the abstracts for a list of PubMed IDs.
generate_cooccurrence(text_df[, …]) Build co-occurrence matrix from documents.
generate_counts(text_df[, text_column, tfidf]) Generate tf-idf weights for unigrams/bigrams derived from textual data.
uk_to_us(text) Convert UK spellings to US based on a converter.
download_abstracts(dataset, email)

Download the abstracts for a list of PubMed IDs. Uses the Biopython package.

Parameters:
  • dataset (nimare.dataset.Dataset or list of str) – Either a Dataset object whose IDs are in the form PMID-EXPID, or a list of PubMed IDs.
  • email (str) – Email address to supply when calling the PubMed API.
Returns:

dataset – Dataset with abstracts added.

Return type:

nimare.dataset.Dataset or list of str
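
Because Dataset IDs take the form PMID-EXPID, several IDs can share one PubMed ID. The sketch below shows one way to reduce such IDs to unique PubMed IDs before querying PubMed. It is only an illustration: extract_pmids is a hypothetical helper, not part of nimare, and it assumes the PMID is everything before the first hyphen.

```python
def extract_pmids(study_ids):
    """Return unique PubMed IDs, in first-seen order, from 'PMID-EXPID' strings.

    Hypothetical helper for illustration; not a nimare function.
    Assumes the PMID is the part before the first hyphen.
    """
    pmids = []
    seen = set()
    for study_id in study_ids:
        pmid = study_id.split("-", 1)[0]
        if pmid not in seen:
            seen.add(pmid)
            pmids.append(pmid)
    return pmids


# Made-up IDs: two experiments from one paper, one from another.
study_ids = ["12345678-1", "12345678-2", "23456789-1"]
print(extract_pmids(study_ids))
```

An ID without a hyphen passes through unchanged, so the same helper also accepts a plain list of PubMed IDs.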

generate_cooccurrence(text_df, text_column='abstract', vocabulary=None, window=5)

Build co-occurrence matrix from documents. This is not the same approach as the one used by the GloVe model.

Parameters:
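  • text_df ((D x 2) pandas.DataFrame) – A DataFrame with two columns: ‘id’ and a text column (named by text_column). D = document.
  • text_column (str, optional) – Name of the column in text_df that contains the text. Default is ‘abstract’.
  • vocabulary (list, optional) – List of words in the vocabulary to extract from the text. Default is None.
  • window (int, optional) – Window size for co-occurrence. Two words co-occur if they appear within window words of one another. Default is 5.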
Returns:

df – One co-occurrence matrix per document in text_df.

Return type:

(V, V, D) pandas.Panel
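
To illustrate what window-based co-occurrence means, here is a minimal sketch for a single document. It is not the nimare implementation: cooccurrence_counts is a hypothetical function, tokenization is plain lowercased whitespace splitting, and "within window words" is assumed to mean that two token positions differ by at most window.

```python
from collections import Counter


def cooccurrence_counts(text, vocabulary=None, window=5):
    """Count unordered word pairs whose positions differ by at most `window`.

    Hypothetical sketch; tokenization and the window rule are assumptions.
    """
    tokens = text.lower().split()
    vocab = set(vocabulary) if vocabulary is not None else None
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only ahead so that each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if vocab is not None and (word not in vocab or other not in vocab):
                continue
            # Sort the pair so (a, b) and (b, a) share one key.
            counts[tuple(sorted((word, other)))] += 1
    return counts


# One count table per document, keyed by document id (made-up data).
documents = {"doc1": "pain memory pain reward", "doc2": "reward memory"}
per_document = {
    doc_id: cooccurrence_counts(text, window=2)
    for doc_id, text in documents.items()
}
print(per_document["doc1"])
```

Storing unordered pairs keeps the table symmetric by construction; a dense V x V matrix per document could be filled from these counts if needed.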

generate_counts(text_df, text_column='abstract', tfidf=True)

Generate tf-idf weights for unigrams/bigrams derived from textual data.

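Parameters:
  • text_df ((D x 2) pandas.DataFrame) – A DataFrame with two columns: ‘id’ and a text column (named by text_column). D = document.
  • text_column (str, optional) – Name of the column in text_df that contains the text. Default is ‘abstract’.
  • tfidf (bool, optional) – Whether to apply tf-idf weighting. Default is True.
Returns: weights_df – A DataFrame where the index is ‘id’ and the columns are the unigrams/bigrams derived from the data. D = document. T = term.
Return type: (D x T) pandas.DataFrame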
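
As a rough illustration of tf-idf weights over unigrams and bigrams, the sketch below uses only the standard library. The exact weighting that generate_counts applies is not specified here, so this assumes the textbook form tf * log(N / df); tfidf_weights and unigrams_and_bigrams are hypothetical names, and tokenization is plain lowercased whitespace splitting.

```python
import math
from collections import Counter


def unigrams_and_bigrams(tokens):
    """Return the tokens plus each adjacent pair joined by a space."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_weights(docs):
    """Map each document id to {term: tf * log(N / df)}.

    Hypothetical sketch with an assumed textbook weighting.
    `docs` maps a document id to its text.
    """
    term_counts = {
        doc_id: Counter(unigrams_and_bigrams(text.lower().split()))
        for doc_id, text in docs.items()
    }
    n_docs = len(term_counts)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        for doc_id, counts in term_counts.items()
    }


# Made-up documents keyed by id.
weights = tfidf_weights({"a": "pain memory", "b": "pain reward"})
print(weights["a"])
```

Under this weighting a term that occurs in every document gets weight zero, which is why smoothed variants are common in practice.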
uk_to_us(text)

Convert UK spellings to US based on a converter.

The converter is english_spellings.csv, from http://www.tysto.com/uk-us-spelling-list.html

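Parameters: text (str) – Text in which to convert UK spellings.
Returns: text – The input text with UK spellings converted to US spellings.
Return type: str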
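
The idea of a spelling converter can be sketched with a small lookup table. This is an assumption-laden illustration, not the nimare implementation: the three-entry mapping below is a made-up stand-in for english_spellings.csv, convert_spelling is a hypothetical name, and the sketch replaces whole lowercase words only.

```python
import re

# Hypothetical stand-in for the UK-to-US spelling table.
UK_TO_US = {"colour": "color", "behaviour": "behavior", "analyse": "analyze"}


def convert_spelling(text, mapping=UK_TO_US):
    """Replace whole words found in `mapping` with their US spellings.

    Hypothetical sketch; matches lowercase whole words only.
    """
    if not mapping:
        return text
    # \b restricts matches to whole words, so 'colourful' is left alone.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in mapping) + r")\b"
    )
    return pattern.sub(lambda match: mapping[match.group(0)], text)


print(convert_spelling("we analyse colour and behaviour"))
```

Whole-word matching means inflected forms are converted only if they appear in the table themselves.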