Microblog topic mining based on a combined TF-IDF and LDA topic model

doi:10.1201/9780429468605-40

ABSTRACT

Because the keywords of a microblog topic are affected by release time, influence, and markers that emphasize particular words (such as “#” and “【】”), a standalone Latent Dirichlet Allocation (LDA) topic model cannot accurately cluster microblog keywords into topics. This paper proposes a mining method that combines Term Frequency-Inverse Document Frequency (TF-IDF) with an LDA topic model to mine microblog keywords accurately and cluster them into topics. The method effectively mitigates the data sparsity problem caused by the length restriction on microblogs. First, we collect microblogs containing given keywords (such as “Internet”) from the Internet and preprocess them. Then, keywords are extracted with the TF-IDF algorithm, together with weightings that reflect factors such as release time and influence. Finally, the set of weighted keywords is used as the input document to train an LDA topic model, which performs the topic mining. Evaluation by recall and precision shows that the topics obtained by this method align more closely with the actual topics of the microblogs.
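The keyword-weighting step can be illustrated with a minimal sketch. It is not the method as specified in the paper: the abstract gives no formula, so the multiplicative combination of TF-IDF with an emphasis boost, an exponential recency decay, and a logarithmic influence factor is an assumption. So are the constants `EMPHASIS_BOOST` and `HALF_LIFE_HOURS`, the post fields `age_hours` and `reposts`, and the helper names. The sketch assumes the text is already segmented into whitespace-separated words, and it omits the LDA training step.

```python
import math
import re
from collections import Counter

# Hypothetical constants; the abstract does not specify any values.
EMPHASIS_BOOST = 2.0     # multiplier for terms inside #...# or 【...】
HALF_LIFE_HOURS = 24.0   # recency half-life used for time decay

EMPHASIS_RE = re.compile(r"#([^#]+)#|【([^】]+)】")


def emphasized_terms(text):
    """Return the lower-cased words found inside #...# or 【...】 markers."""
    terms = set()
    for hashtag, bracket in EMPHASIS_RE.findall(text):
        terms.update((hashtag or bracket).lower().split())
    return terms


def tokenize(text):
    """Strip emphasis markers and split on whitespace (assumes pre-segmented text)."""
    return re.sub(r"[#【】]", " ", text).lower().split()


def weighted_keywords(posts, top_k=3):
    """Rank each post's terms by TF-IDF scaled by emphasis, recency, and influence."""
    docs = [tokenize(p["text"]) for p in posts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for post, doc in zip(posts, docs):
        counts = Counter(doc)
        boosted = emphasized_terms(post["text"])
        # Assumed weighting factors: newer and more reposted posts count more.
        recency = 0.5 ** (post["age_hours"] / HALF_LIFE_HOURS)
        influence = 1.0 + math.log1p(post["reposts"])
        scores = {}
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed IDF
            emphasis = EMPHASIS_BOOST if term in boosted else 1.0
            scores[term] = tf * idf * emphasis * recency * influence
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Toy posts, invented purely for illustration.
posts = [
    {"text": "#internet# security flaw found in router firmware", "age_hours": 2, "reposts": 120},
    {"text": "【internet】 speed upgrade announced by carrier", "age_hours": 30, "reposts": 15},
    {"text": "weekend football match report", "age_hours": 5, "reposts": 3},
]
keyword_sets = weighted_keywords(posts)
```

Each entry of `keyword_sets` is the top-ranked keyword list for one post. In the pipeline described above, such lists would form the input documents for LDA training. The recency and influence factors are constant within a post, so they change a term's absolute score but not its rank inside that post. They would matter only if the scores themselves, rather than the ranked lists, were passed on to the topic model.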