A word ranking method considering the text space structure for a single document

WEI Wei, MENG Xiangzhu, GUO Chonghui

Systems Engineering - Theory & Practice ›› 2020, Vol. 40 ›› Issue (5) : 1293-1303.

Abstract

Feature selection is a fundamental task in text mining: it supplies the reliable data processing and technical support that subsequent text mining tasks depend on, and feature word ranking is its key component. In this work, we propose a word ranking method based on manifold ranking that combines the statistical and structural information of a text. Following the idea of manifold ranking, we construct a conditional co-occurrence degree word network for the text, which reflects its semantic and structural information, and treat this network as the underlying manifold structure. Taking term frequency as the initial ranking, we then re-evaluate and optimize the word weights and ranking through similarity learning over words, drawing on graph learning theory and manifold ranking theory. Numerical experiments on both public datasets and a supplementary corpus compare the proposed method with other word ranking methods and verify its effectiveness. The method also broadens the application of graph learning theory in text mining and provides a new strategy for word ranking in a single document.
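
The pipeline summarized in the abstract (a word network, term frequency as the initial ranking, then manifold ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the conditional co-occurrence degree, so the sketch substitutes plain sliding-window co-occurrence counts for the edge weights. The function name `manifold_rank_words` and the parameters `window` and `alpha` are hypothetical, and the scoring step uses the commonly cited closed form of manifold ranking, f* = (I - alpha S)^(-1) y with S = D^(-1/2) W D^(-1/2), which may differ from the variant used in the paper.

```python
import re
from collections import Counter

import numpy as np


def manifold_rank_words(text, window=3, alpha=0.85):
    """Rank the words of one document by manifold ranking on a word graph.

    ASSUMPTION: edge weights are plain co-occurrence counts inside a sliding
    window, used here as a stand-in for the paper's conditional
    co-occurrence degree, which the abstract does not define.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return []

    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)

    # Affinity matrix W: symmetric counts of distinct words that appear
    # within `window` consecutive tokens of each other.
    W = np.zeros((n, n))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                W[index[word], index[other]] += 1.0
                W[index[other], index[word]] += 1.0

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}; isolated words keep
    # a zero row and column.
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros(n)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]

    # Initial ranking y: normalized term frequency.
    counts = Counter(tokens)
    y = np.array([counts[word] for word in vocab], dtype=float)
    y /= y.sum()

    # Closed-form manifold ranking scores: solve (I - alpha * S) f = y.
    f = np.linalg.solve(np.eye(n) - alpha * S, y)

    order = np.argsort(-f)
    return [(vocab[i], float(f[i])) for i in order]


if __name__ == "__main__":
    sample = "graph ranking ranks words; words form a graph, and the graph ranks words"
    for word, score in manifold_rank_words(sample)[:5]:
        print(f"{word}\t{score:.4f}")
```

With `alpha=0` the scores reduce to normalized term frequency, so `alpha` controls how far the graph structure is allowed to move words away from their initial frequency-based ranking.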

Keywords

feature selection / word ranking / term frequency / manifold ranking / graph learning / conditional co-occurrence degree

Cite this article

WEI Wei, MENG Xiangzhu, GUO Chonghui. A word ranking method considering the text space structure for a single document. Systems Engineering - Theory & Practice, 2020, 40(5): 1293-1303. https://doi.org/10.12011/1000-6788-2018-2222-11

Funding

National Natural Science Foundation of China (71771034); Jieyang Science and Technology Planning Projects (2017xm041)