In recent years, with the rapid growth of social media, short texts have become prevalent on the internet. Because each short text is limited in length, word co-occurrence information in such documents is sparse, and conventional topic models that rely on word co-occurrence are unable to distill coherent topics from short texts. A state-of-the-art strategy is self-aggregated topic models, which implicitly aggregate short texts into latent long documents. However, these models have two problems.
One problem is that the number of long documents must be specified explicitly, and an inappropriate choice leads to poor performance. The other problem is that latent long documents may introduce non-semantic word co-occurrences, which produce incoherent topics. In this article, we first apply the Chinese restaurant process to automatically determine the number of long documents according to the scale of the short text collection. Then, to exclude non-semantic word co-occurrences, we propose a novel probabilistic model that generates latent long documents in a more semantic way. Specifically, our model employs a Pitman-Yor process to aggregate short texts into long documents.
This stochastic process guarantees that the assignment of short texts to long documents follows a power-law distribution, which is commonly observed in social media such as Twitter. Finally, we compare our method with several state-of-the-art methods on four real-world short text corpora. The experimental results show that our model outperforms the other methods in terms of topic coherence and text classification.
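To make the aggregation mechanism concrete, the following is a minimal sketch (not the authors' implementation) of Pitman-Yor, Chinese-restaurant-style seating of short texts into latent long documents. The discount d and concentration theta are illustrative hyperparameters chosen here for demonstration; with d = 0 the process reduces to the plain Chinese restaurant process, which is how the number of long documents can emerge from the data rather than being fixed in advance, and with d > 0 the sizes of the long documents exhibit the heavy-tailed, power-law-like behaviour mentioned above.

```python
import random

def seat_short_texts(num_texts, d=0.5, theta=1.0, seed=0):
    """Assign each short text to a latent long document (a 'table')
    via Pitman-Yor / Chinese-restaurant-style sequential seating."""
    rng = random.Random(seed)
    table_sizes = []   # number of short texts in each long document
    assignments = []   # long-document index assigned to each short text
    for n in range(num_texts):
        k = len(table_sizes)
        # probability of opening a new long document
        p_new = (theta + d * k) / (theta + n)
        # probabilities of joining each existing long document
        weights = [(size - d) / (theta + n) for size in table_sizes] + [p_new]
        choice = rng.choices(range(k + 1), weights=weights)[0]
        if choice == k:
            table_sizes.append(1)      # create a new long document
        else:
            table_sizes[choice] += 1   # join an existing long document
        assignments.append(choice)
    return assignments, table_sizes

assignments, sizes = seat_short_texts(10000)
print(len(sizes), "long documents; largest holds", max(sizes), "short texts")
```

In this sketch the sorted table sizes follow a heavy-tailed distribution: a few long documents absorb many short texts while most remain small, mirroring the power-law behaviour of social-media text; the full model would additionally couple these assignments with topic inference, which is beyond this illustration.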