nimare.annotate.lda.LDAModel
- class LDAModel(n_topics, max_iter=1000, alpha=None, beta=0.001, text_column='abstract')
Bases: nimare.base.NiMAREBase
Generate a latent Dirichlet allocation (LDA) topic model.
This class is a light wrapper around scikit-learn tools for tokenization and LDA.
- Parameters
n_topics (int) – Number of topics for the topic model. This corresponds to the model’s n_components parameter. Must be an integer >= 1.
max_iter (int, optional) – Maximum number of iterations to use during model fitting. Default = 1000.
alpha (float or None, optional) – The alpha value for the model. This corresponds to the model’s doc_topic_prior parameter. Default is None, which evaluates to 1 / n_topics, as was used in [2].
beta (float or None, optional) – The beta value for the model. This corresponds to the model’s topic_word_prior parameter. If None, it evaluates to 1 / n_topics. Default is 0.001, which was used in [2].
text_column (str, optional) – The source of text to use for the model. This should correspond to an existing column in the texts attribute. Default is “abstract”.
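The parameter-to-keyword mapping described above can be sketched as a small helper. This is an illustration only: `resolve_lda_kwargs` is a hypothetical name and not part of NiMARE. The keyword names (`n_components`, `max_iter`, `doc_topic_prior`, `topic_word_prior`) are those of scikit-learn's `LatentDirichletAllocation`, and the fallback of `None` to `1 / n_topics` follows the parameter descriptions.

```python
def resolve_lda_kwargs(n_topics, max_iter=1000, alpha=None, beta=0.001):
    """Translate LDAModel-style arguments into scikit-learn keyword names.

    Hypothetical helper for illustration; not part of NiMARE.
    """
    # n_topics must be an integer >= 1 (bool is rejected explicitly,
    # because bool is a subclass of int in Python).
    if isinstance(n_topics, bool) or not isinstance(n_topics, int) or n_topics < 1:
        raise ValueError("n_topics must be an integer >= 1")

    # None falls back to 1 / n_topics for either prior, as documented.
    if alpha is None:
        alpha = 1.0 / n_topics
    if beta is None:
        beta = 1.0 / n_topics

    return {
        "n_components": n_topics,
        "max_iter": max_iter,
        "doc_topic_prior": alpha,
        "topic_word_prior": beta,
    }


kwargs = resolve_lda_kwargs(n_topics=50)
print(kwargs)
```

A dictionary built this way could be passed on as `LatentDirichletAllocation(**kwargs)`. Whether NiMARE sets any further scikit-learn options is not stated here.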
- Variables
model (LatentDirichletAllocation)
Notes
Latent Dirichlet allocation was first developed in [1], and was first applied to neuroimaging articles in [2].
References
- [1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.
- [2] Poldrack, Russell A., et al. “Discovering relations between mind, brain, and mental disorders using topic mapping.” PLoS Computational Biology 8.10 (2012): e1002707. https://doi.org/10.1371/journal.pcbi.1002707
See also
CountVectorizer: Used to build a vocabulary of terms and their associated counts from texts in the self.text_column of the Dataset’s texts attribute.
LatentDirichletAllocation: Used to train the LDA model.
- fit(dset)
Fit the LDA topic model to text from a Dataset.
- Parameters
dset (Dataset) – A Dataset with, at minimum, text available in the self.text_column column of its texts attribute.
- Returns
dset (Dataset) – A new Dataset with an updated annotations attribute.
- Variables
distributions_ (dict) – A dictionary containing additional distributions produced by the model, including:
p_topic_g_word: numpy.ndarray of shape (n_topics, n_tokens) containing the topic-term weights for the model.
p_topic_g_word_df: pandas.DataFrame of shape (n_topics, n_tokens) containing the topic-term weights for the model.
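To make the two entries of distributions_ concrete, the sketch below builds an array and a DataFrame with the documented shapes. It is only an illustration: the weights are random stand-ins rather than output from a fitted model, the vocabulary is invented, and the row labels (`topic_0`, ...) are an assumption, since the index NiMARE uses is not stated here.

```python
import numpy as np
import pandas as pd

n_topics = 3
tokens = ["memory", "reward", "cortex", "striatum"]  # hypothetical vocabulary

# Random stand-in for fitted topic-term weights, shape (n_topics, n_tokens).
rng = np.random.default_rng(0)
p_topic_g_word = rng.random((n_topics, len(tokens)))

# Same weights as a labeled DataFrame; the row labels are assumed.
p_topic_g_word_df = pd.DataFrame(
    p_topic_g_word,
    index=[f"topic_{i}" for i in range(n_topics)],
    columns=tokens,
)

# Dictionary mirroring the two documented keys.
distributions = {
    "p_topic_g_word": p_topic_g_word,
    "p_topic_g_word_df": p_topic_g_word_df,
}

# The labeled form makes lookups such as the top-weighted token per topic easy.
top_tokens = p_topic_g_word_df.idxmax(axis=1)
print(top_tokens)
```

With a fitted model, the same lookup would apply to the p_topic_g_word_df entry of the model's distributions_ attribute.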