|
1. Introduction
Text classification is the task of assigning texts to categories according to a given classification scheme. In the information age, the volume of information, and with it the number of texts, keeps growing. Classifying texts so that they can be processed more effectively and useful information can be extracted has therefore become a pressing problem. Text classification is also one of the application areas of natural language processing.
2. Background
Text is the written expression of language: a sentence, or a combination of sentences, with complete and coherent meaning. A text can be a sentence, a paragraph or a chapter. As for computer applications, mathematical computing accounts for only about 10% of computer use and process control for less than 5%; the remaining 85% is devoted to language information processing, that is, processing the sound, form and meaning of human natural language. Put simply, this is the handling of general text input, from a single word up to a sentence or a paragraph, much as a publishing house turns text into a coherent publication [1]. With the development of information networks, more and more information is available, and handling it has become a major problem. Relying on manpower alone is time-consuming, labor-intensive, error-prone and inefficient, so tools are needed. Text classification uses a computer to automatically classify and label a set of texts (or other entities or objects) according to a given classification system or standard.
Given a set of labeled training documents, it learns a model of the relationship between document features and document categories, and then uses the learned model to determine the categories of new documents. Text classification has gradually shifted from knowledge-based methods to methods based on statistics and machine learning [2]. Automatic text classification can effectively help people search, query, filter and use information, and it is an important foundation of data mining. Research on text classification began abroad. In 1957, H. P. Luhn of IBM in the United States made a breakthrough in automatic classification by proposing the use of word frequency statistics. In 1960, M. E. Maron published “On Relevance, Probabilistic Indexing and Information Retrieval” in the Journal of the ACM, a well-known article on automatic classification. It caused a sensation at the time because it introduced a new element of retrieval, the keyword, and the automatic classification technique that followed opened a new era for the field. From the 1960s to the 1980s, text classification still required the participation of a large number of staff, and some shortcomings remained [3]. In the 1990s, text classification improved, and automatic text classification systems with strong adaptability appeared, such as the Construe system developed by Carnegie Group for Reuters [4]. In China, because of the difference in language, foreign research results could not be adopted directly, so text classification methods suited to Chinese were needed. Professor Hou Hanqing made an important contribution here. Later, in 1987, the experimental classification system for Chinese scientific and technological literature (computer science) developed by Zhu Lanjuan and Wang Yongcheng appeared.
By 1995, a variety of practical, high-level systems had been developed in China, in fields including oncology literature, automatic classification of Chinese corpora, and automatic classification of archives [4]. In 2000, more researchers made substantial improvements to such systems: they improved classification accuracy, grouped synonyms under the same concept word, and reduced the overall computation required for classification. This work greatly advanced text classification in China. The rest of this paper considers text classification based on natural language processing as a way to improve classification efficiency.
3. Overview of Natural Language Processing Technology
With the development of science and technology, artificial intelligence is applied ever more widely, to images, text and many other kinds of data. Natural language processing technology is the product of combining artificial intelligence with text technology. Language here means everyday languages such as Chinese and English, the tools of communication in study and daily life, rather than programming languages, so computers cannot understand it directly. Natural language processing is the use of computers to process natural languages so as to enable information exchange between humans and machines; it includes intelligent word segmentation, part-of-speech tagging, named entity recognition, text classification, and auditing of reactionary advertisements. The basic tasks of natural language processing technology fall into three categories: lexical analysis, syntactic analysis and text analysis. Lexical analysis is the foundation of natural language processing technology and includes word segmentation, part-of-speech tagging and named entity recognition. Syntactic analysis includes syntactic dependency analysis, semantic dependency analysis and text error correction.
Text analysis includes keyword extraction, sentiment analysis and text classification, of which text classification is an important direction. The general natural language processing workflow is roughly divided into corpus acquisition, corpus preprocessing, feature engineering, feature selection, model training, evaluation, and online deployment of the model.
4. Overview of text classification
4.1 Concepts
Text classification means that a computer automatically classifies and labels a set of texts according to a given classification system. From labeled training samples, it learns the relationship between text features and text categories, builds a model, and uses that model to classify new texts. In the information age, information is growing explosively. Analyzing and processing it by hand is inefficient, time-consuming and labor-intensive, and human judgment is subjective, which can bias the results. We therefore turn to computers for these tasks. Computers, however, do not understand ordinary human text, so natural language processing technology is needed to build a bridge between humans and computers. Text classification is widely used, for example in sentiment analysis, topic labeling, question answering, intent classification and natural language inference. It is therefore both practically important and worth studying.
4.2 Text classification infrastructure
Text classification has two basic components: feature representation and the classification model. The purpose of feature representation is to transform text into a form that computers can understand and process, so that computers can take over the work from people and process information efficiently. Common text feature representation methods:
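As an illustration of turning text into a machine-readable form, the sketch below builds bag-of-words count vectors. Bag-of-words is used here only as a simple, widely known representation; it is an assumption of this example rather than a method confirmed by the paper, and the function names `build_vocabulary` and `bag_of_words` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocab):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vector[vocab[tok]] = n
    return vector


# Toy corpus (hypothetical): whitespace tokenization, so English-like text is assumed.
corpus = ["the match was great", "the market fell", "great match great goal"]
vocab = build_vocabulary(corpus)
vectors = [bag_of_words(text, vocab) for text in corpus]
```

Each text becomes a fixed-length numeric vector, which is the kind of input a classification model can consume. Chinese text would first need word segmentation, which this sketch does not attempt.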
The classification model can be a shallow learning model or a deep learning model. Shallow learning models include the PGM (probabilistic graphical model), KNN (k-nearest neighbors), SVM (support vector machine), DT (decision tree) and RF (random forest) models. Deep learning models include ReNN-based models (recursive neural network), MLP-based models (multilayer perceptron), RNN-based models (recurrent neural network), CNN-based models (convolutional neural network), attention-based models (attention mechanism), Transformer-based models (pre-trained models) and GNN-based models (graph neural network).
4.3 Data sets for text classification
According to the application of text classification, text classification data sets are divided into:
5. Text classification based on natural language processing
Investigation shows that purely manual text classification cannot fully meet the needs of information processing. We therefore consider text classification based on natural language processing technology: use it to obtain more accurate features, select and extract features, weigh the complexity of each algorithm against its extraction quality, and choose a suitable feature extraction algorithm so that texts can be classified efficiently. In this paper, a convolutional neural network (CNN) is used to classify texts [5]. The implementation is based on TensorFlow and Python. A CNN is a deep learning model that can be understood as a variant of the multilayer perceptron. The neurons in each layer are arranged in three dimensions: width, height and depth. Depth here refers to the third dimension of a layer's activation volume (its channels), not to the number of layers in the network; the feature vector at each position is laid out along the depth direction. The convolutional structure has several advantages: it reduces the memory occupied by a deep network, and it reduces the number of network parameters, which helps mitigate overfitting.
5.1 Data preprocessing
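A minimal preprocessing sketch is given here, assuming the usual steps for a CNN text classifier: clean each sentence and remove symbols, split it into tokens, build a vocabulary, and pad every sentence to the same length. The special tokens `<pad>` and `<unk>`, the helper names, and the English-only cleaning rule are assumptions made for illustration; they are not taken from the paper's code, and Chinese text would additionally require word segmentation.

```python
import re

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens: padding and unknown word


def clean_text(text):
    """Lowercase, replace every run of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def tokenize(text):
    """Split a cleaned sentence on whitespace."""
    return clean_text(text).split()


def build_vocab(token_lists):
    """Assign an integer id to every token, in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, then right-pad with the PAD id."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


sentences = ["A great, GREAT film!", "Not good..."]
token_lists = [tokenize(s) for s in sentences]
vocab = build_vocab(token_lists)
max_len = max(len(tokens) for tokens in token_lists)
encoded = [encode(tokens, vocab, max_len) for tokens in token_lists]
```

Padding to a common length lets sentences of different lengths be stacked into one batch, which the convolution layer needs.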
5.2 Model implementation
The basic idea is as follows. The first layer maps words to vector representations (embeddings). The second layer is a convolution layer that applies multiple filters, each sliding over 3 to 5 words at a time. The next layer gathers the filter outputs into one long feature vector. Finally, softmax is applied to obtain the probability of each class.
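The layer sequence described above can be sketched as a forward pass in plain Python. This is not the paper's TensorFlow code: the class name `TextCNN`, the random untrained weights, the ReLU activation, the max-over-time pooling used to build the long feature vector, and all sizes are assumptions chosen to keep the example self-contained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Matrix of small random weights (untrained, for illustration only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TextCNN:
    def __init__(self, vocab_size, embed_dim, filter_sizes, num_filters,
                 num_classes, seed=0):
        rng = random.Random(seed)
        self.embedding = rand_matrix(rng, vocab_size, embed_dim)
        # One bank of filters per window size; each filter spans `size` whole word vectors.
        self.filters = {size: rand_matrix(rng, num_filters, size * embed_dim)
                        for size in filter_sizes}
        self.out_weights = rand_matrix(rng, num_classes,
                                       num_filters * len(filter_sizes))
        self.out_bias = [0.0] * num_classes

    def forward(self, token_ids):
        if len(token_ids) < max(self.filters):
            raise ValueError("sentence is shorter than the largest filter; pad it first")
        embedded = [self.embedding[i] for i in token_ids]  # layer 1: word -> vector
        features = []
        for size, bank in self.filters.items():
            # Flatten every window of `size` consecutive word vectors.
            windows = [sum(embedded[start:start + size], [])
                       for start in range(len(embedded) - size + 1)]
            for filt in bank:
                # Convolution at each position followed by ReLU (assumed activation).
                activations = [max(0.0, sum(w * x for w, x in zip(filt, win)))
                               for win in windows]
                features.append(max(activations))  # max-over-time pooling (assumed)
        # `features` is the long feature vector; a linear layer maps it to class scores.
        scores = [sum(w * f for w, f in zip(row, features)) + b
                  for row, b in zip(self.out_weights, self.out_bias)]
        return softmax(scores)


model = TextCNN(vocab_size=10, embed_dim=4, filter_sizes=(3, 4, 5),
                num_filters=2, num_classes=2)
probs = model.forward([2, 3, 3, 4, 5, 6, 0])  # one padded sentence of token ids
```

Because each filter covers whole word vectors, it slides in only one direction, along the sentence. The output `probs` holds one probability per class and sums to 1.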
5.3 Model training
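As a rough sketch of batched training, the code below iterates over shuffled mini-batches and performs one training step per batch. To stay self-contained it trains a small logistic-regression stand-in rather than the CNN, and the names `batch_iter` and `train_step`, the learning rate, and the toy data are all assumptions for illustration.

```python
import math
import random


def batch_iter(examples, batch_size, num_epochs, seed=0):
    """Yield shuffled mini-batches, covering every example once per epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(examples)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]


def train_step(weights, batch, lr=0.5):
    """One gradient-descent update on a batch; returns the mean loss before the update."""
    grads = [0.0] * len(weights)
    loss = 0.0
    for features, label in batch:
        z = sum(w * x for w, x in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
        for j, x in enumerate(features):
            grads[j] += (p - label) * x  # gradient of the cross-entropy loss
    for j in range(len(weights)):
        weights[j] -= lr * grads[j] / len(batch)
    return loss / len(batch)


# Toy, linearly separable data (hypothetical): (feature vector, label).
data = [([1.0, 0.0, 1.0], 1), ([1.0, 1.0, 0.0], 0),
        ([0.0, 0.0, 1.0], 1), ([0.0, 1.0, 0.0], 0)]
weights = [0.0, 0.0, 0.0]
losses = [train_step(weights, batch)
          for batch in batch_iter(data, batch_size=2, num_epochs=50)]
```

The same loop structure applies when `train_step` updates a CNN instead: only the body of the step changes, not the batching.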
Training proceeds in a loop, one training step at a time. Once the model has been defined, the next step is to iterate over the training data in batches, executing one training step per batch; batching saves time and effort and is efficient.
7. Conclusion
The results show that text classification with natural language processing technology achieves a small loss and good accuracy. A convolutional neural network can classify text well, and we conclude that text classification based on natural language processing is feasible and efficient.
8. Summary
In this paper, a convolutional neural network (CNN) is used to classify text. Unlike the image input of computer vision (CV), the input of natural language processing is text: articles, sentences, phrases and so on. The text is represented as a matrix of vectors by word2vec or GloVe, with each row representing one word, the number of rows equal to the length of the sentence, and the number of columns equal to the dimension of the word vectors. The sentences must of course be cleaned and stripped of symbols beforehand. In computer vision, filters of a chosen height and width slide over the whole image in both directions, whereas in natural language processing a filter covers the full width of the matrix, that is, whole word vectors, and slides only along the sentence. The filter convolves the matrix obtained from the input text to produce features, from which the best features are then extracted. Although natural language processing often uses an RNN architecture followed by attention, I personally think that a convolutional neural network is simpler and more efficient for text classification. Science and technology keep improving, and I believe more efficient methods for text classification will appear in the future; the problem is worth continued effort.
References:
Yu Shiwen, Introduction to Computational Linguistics [M], The Commercial Press, Beijing.
Sebastiani F., “A tutorial on automated text categorisation [J],” in Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires, 7–35 (1999).
Cheng Ying, Shi Jiao-Lin, “Research on the automatic classification: present situation and prospects [J],” Journal of the China Society for Scientific and Technical Information, 1, 20–27 (1999).
Sparck Jones K., Willett P., et al., Readings in Information Retrieval, Morgan Kaufmann, San Mateo, US (1997).
“Convolutional neural network for Sentence Classification.”
|