nimare.annotate.lda.LDAModel

class LDAModel(n_topics, max_iter=1000, alpha=None, beta=0.001, text_column='abstract', n_cores=1)[source]

Bases: NiMAREBase

Generate a latent Dirichlet allocation (LDA) topic model.

This class is a light wrapper around scikit-learn tools for tokenization and LDA.

Parameters:
  • n_topics (int) – Number of topics for the topic model. This corresponds to the model’s n_components parameter. Must be an integer >= 1.

  • max_iter (int, optional) – Maximum number of iterations to use during model fitting. Default = 1000.

  • alpha (float or None, optional) – The alpha value for the model. This corresponds to the model’s doc_topic_prior parameter. Default is None, which evaluates to 1 / n_topics, as was used in Poldrack et al.[1].

  • beta (float or None, optional) – The beta value for the model. This corresponds to the model’s topic_word_prior parameter. If None, it evaluates to 1 / n_topics. Default is 0.001, which was used in Poldrack et al.[1].

  • text_column (str, optional) – The source of text to use for the model. This should correspond to an existing column in the texts attribute. Default is “abstract”.

  • n_cores (int, optional) – Number of cores to use for parallelization. If <=0, defaults to using all available cores. Default is 1.
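
A minimal instantiation sketch, included here only to illustrate the parameters above; the topic count and other values are arbitrary examples rather than recommended settings:

>>> from nimare.annotate.lda import LDAModel
>>> # alpha=None falls back to 1 / n_topics; beta keeps the 0.001 default
>>> model = LDAModel(n_topics=50, max_iter=1000, text_column="abstract", n_cores=1)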

Variables:

model (LatentDirichletAllocation) – The trained scikit-learn LatentDirichletAllocation object.

Notes

Latent Dirichlet allocation was first developed in Blei et al.[2], and was first applied to neuroimaging articles in Poldrack et al.[1].

References

[1] Poldrack, R. A., Mumford, J. A., Schonberg, T., Kalar, D., Barman, B., & Yarkoni, T. (2012). Discovering relations between mind, brain, and mental disorders using topic mapping. PLoS Computational Biology, 8(10), e1002707.

[2] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

See also

CountVectorizer

Used to build a vocabulary of terms and their associated counts from the text in the self.text_column column of the Dataset’s texts attribute.

LatentDirichletAllocation

Used to train the LDA model.

Methods

  • fit(dset) – Fit the LDA topic model to text from a Dataset.

  • get_params([deep]) – Get parameters for this estimator.

  • load(filename[, compressed]) – Load a pickled class instance from file.

  • save(filename[, compress]) – Pickle the class instance to the provided file.

  • set_params(**params) – Set the parameters of this estimator.

fit(dset)[source]

Fit the LDA topic model to text from a Dataset.

Parameters:

dset (Dataset) – A Dataset with, at minimum, text available in the self.text_column column of its texts attribute.

Returns:

dset – A new Dataset with an updated annotations attribute.

Return type:

Dataset

Variables:

distributions (dict) –

A dictionary containing additional distributions produced by the model, including:

  • p_topic_g_word: numpy.ndarray of shape (n_topics, n_tokens) containing the topic-term weights for the model.

  • p_topic_g_word_df: pandas.DataFrame of shape (n_topics, n_tokens) containing the topic-term weights for the model.
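
A hedged sketch of a typical fit call; the Dataset file name is hypothetical, and the exact names of the appended annotation columns depend on the NiMARE version:

>>> from nimare.dataset import Dataset
>>> from nimare.annotate.lda import LDAModel
>>> dset = Dataset("my_dataset.json")  # hypothetical Dataset file with abstract text available
>>> model = LDAModel(n_topics=50, max_iter=1000)
>>> new_dset = model.fit(dset)
>>> new_dset.annotations.head()  # per-study topic weights appended as annotation columns
>>> model.distributions["p_topic_g_word_df"]  # topic-term weights, per the distributions dict listed above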

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict
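
A short illustration; it assumes the returned dictionary keys mirror the constructor arguments listed above:

>>> model = LDAModel(n_topics=50)
>>> params = model.get_params()
>>> params["n_topics"]
50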

classmethod load(filename, compressed=True)[source]

Load a pickled class instance from file.

Parameters:
  • filename (str) – Name of file containing object.

  • compressed (bool, optional) – If True, the file is assumed to be gzip-compressed and will be decompressed when loading. Otherwise, the file is assumed to be uncompressed. Default = True.

Returns:

obj – Loaded class object.

Return type:

class object
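
A sketch of reloading a previously saved instance; the file names are hypothetical:

>>> model = LDAModel.load("lda_model.pkl.gz")                 # gzip-compressed pickle (default)
>>> model = LDAModel.load("lda_model.pkl", compressed=False)  # uncompressed pickle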

save(filename, compress=True)[source]

Pickle the class instance to the provided file.

Parameters:
  • filename (str) – File to which object will be saved.

  • compress (bool, optional) – If True, the file will be compressed with gzip. Otherwise, the uncompressed version will be saved. Default = True.
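
A sketch of saving a fitted instance; the output paths are hypothetical:

>>> model.save("lda_model.pkl.gz")               # gzip-compressed by default
>>> model.save("lda_model.pkl", compress=False)  # plain, uncompressed pickle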

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Return type:

self
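
A sketch of updating parameters in place; for a flat estimator like this one, plain parameter names are used (the nested <component>__<parameter> form applies only to composite estimators):

>>> model = LDAModel(n_topics=50)
>>> model = model.set_params(n_topics=100, max_iter=500)  # returns self
>>> model.get_params()["n_topics"]
100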