Title: Exploiting the Domain Structure for Gene Recognition from Biomedical Text

Abstract: Automatic recognition of gene and protein names is a critical first step towards information extraction and text mining in biomedical literature. State-of-the-art biomedical named entity taggers are mostly based on supervised learning methods, and their performance depends largely on the availability and quality of the training data. Because gene naming conventions differ among biological species, the performance of a gene tagger trained on text containing gene names of certain species may degrade significantly when it is used to recognize gene names of new species. We propose an approach to gene recognition (and to named entity recognition in general) that exploits the domain structure in the training data to improve the performance of the tagger on new domains. We show that by combining the feature rankings from the individual training domains into a new, generalizability-based feature ranking and imposing that ranking as a prior, we can learn patterns that generalize well across all domains. Our results from three sets of experiments show that the proposed method outperforms a baseline method that does not consider the domain structure in the training data.

Biography: Jing Jiang is a third-year Ph.D. student in the Department of Computer Science at UIUC. Her research interests are mainly in information extraction and its application to biomedical literature.