Traditional text classifiers based on supervised learning require a large number of labeled documents. Labeling documents often demands domain expertise to ensure accuracy, which makes it time-consuming and costly. Dataless text classification, which draws its supervision from a small number of easily accessible label descriptions (i.e., seed words) rather than from labeled documents, is therefore a promising alternative. However, because the seed word set is much smaller than the vocabulary of the documents, many documents contain no seed words at all, and some contain seed words that are irrelevant to their category, which limits the effectiveness of seed word supervision. The manifold assumption suggests that highly similar texts tend to belong to the same category, so we maintain a local neighborhood structure for each document and construct a manifold regularizer that spreads the limited supervised information between similar documents. We propose a Laplacian Nonnegative Matrix Factorization (LapNMF) method that incorporates the seed word prior information and the document manifold into the nonnegative matrix factorization framework, and we solve the resulting problem with the block coordinate descent method. Experiments show that, in most cases, LapNMF outperforms current weakly supervised classification methods.
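The manifold regularizer mentioned above can be illustrated with a small sketch. The abstract does not say how neighborhoods or edge weights are built, so the cosine similarity, the binary k-nearest-neighbor weights, and the function names (`knn_affinity`, `laplacian`, `manifold_penalty`) are assumptions made for illustration and not the authors' implementation. The sketch computes the standard graph smoothness penalty tr(H^T L H) with L = D - W, which is small when neighboring documents receive similar category vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def knn_affinity(docs, k):
    """Symmetric binary affinity matrix W: W[i][j] = 1 if j is among the
    k most similar documents to i, or i is among the k most similar to j."""
    n = len(docs)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((cosine(docs[i], docs[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            W[i][j] = W[j][i] = 1.0
    return W


def laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with D the degree matrix."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]


def manifold_penalty(H, L):
    """tr(H^T L H) = sum_ij L[i][j] * <h_i, h_j>, where row h_i of H is the
    (hypothetical) category membership vector of document i."""
    n = len(H)
    return sum(
        L[i][j] * sum(a * b for a, b in zip(H[i], H[j]))
        for i in range(n)
        for j in range(n)
    )


# Toy term vectors: documents 0 and 1 are similar, as are 2 and 3.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
L = laplacian(knn_affinity(docs, k=1))
smooth = [[1, 0], [1, 0], [0, 1], [0, 1]]  # neighbors share a category
rough = [[1, 0], [0, 1], [1, 0], [0, 1]]   # neighbors disagree
print(manifold_penalty(smooth, L), manifold_penalty(rough, L))
```

In a factorization objective, a term of this form would be added to the reconstruction error with a weighting coefficient, so that label information from documents containing seed words can reach similar documents that contain none.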
In drug research and development, compound activity prediction models are commonly built to screen for potentially active compounds in order to save time and cost. To become a candidate drug, a compound needs not only good biological activity but also good pharmacokinetic properties and safety in the human body, which are collectively known as ADMET (absorption, distribution, metabolism, excretion, and toxicity). This paper applies data mining techniques to the problem. First, a random forest is used to identify the main modeling variables, and their independence is verified by high correlation filtering; the 20 main variables selected include MDEC-23 and maxHsOH. Second, a five-layer BP neural network is used to build a compound bioactivity prediction model that predicts the IC50 value and the corresponding pIC50 value of a compound. Third, an improved BP neural network is used to build classification models for the compound properties Caco-2, CYP3A4, hERG, HOB, and MN. Validation gives an accuracy of 94.3% for CYP3A4, and all five models reach accuracies above or close to 90%, which indicates that the predictions of the improved BP neural network are of practical use. Finally, a genetic algorithm is applied to the main variables to optimize the value ranges in which a compound shows better inhibitory bioactivity against ERα, which has practical significance.
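Two of the steps above lend themselves to a short sketch: high correlation filtering of importance-ranked variables, and the conversion between IC50 and pIC50. The abstract gives no thresholds or units, so the greedy filtering rule, the 0.9 threshold, the assumption that IC50 is expressed in nM, and the function names (`pearson`, `high_correlation_filter`, `pic50_from_nm`) are illustrative assumptions rather than the authors' procedure. The relation pIC50 = -log10(IC50 in mol/L) is the standard definition.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def high_correlation_filter(columns, ranked_names, threshold=0.9):
    """Walk variables in importance order (for example, as ranked by a random
    forest) and keep one only if its absolute correlation with every variable
    already kept stays below the threshold."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def pic50_from_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); with IC50 in nM this is 9 - log10(IC50)."""
    return 9.0 - math.log10(ic50_nm)


# Toy descriptor columns with hypothetical names; "b" duplicates "a" up to scale.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
}
print(high_correlation_filter(columns, ["a", "b", "c"]))
print(pic50_from_nm(1000.0))
```

Filtering in importance order keeps the more informative member of each correlated pair, which is one simple way to check that the selected variables are not redundant.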