Data Mining & Bioinformatics Lab

Funded Research Projects:

Project Title: CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases
Sponsor: National Science Foundation (NSF), Award No. IIS 0448023
PI: Xiaohua Hu
Amount: $415,000
Duration: March 15, 2005 - Feb 28, 2010
Project Description:
The large number of documents in biomedical literature databases, and the lack of formal structure in their natural-language narrative, make search and processing very difficult for many scientists involved in bioinformatics research. This CAREER project investigates, within a single coherent framework for biomedical literature data mining, the efficiency and effectiveness of information retrieval procedures, the effectiveness and robustness of pattern learning methods for information extraction, and the problem of information overload in text mining. The deliverables of this project are: (1) a semantic-based query expansion method for large biomedical literature databases; (2) an automatic method for generating and evaluating patterns from unlabeled text files, based on mutual bootstrapping and dynamic programming; (3) a set of novel text mining algorithms, such as ontology-enhanced text clustering and text summarization. The project is being tested in real-world bioinformatics domains such as chromatin interaction networks and microarray data analysis. Its broader impact is a novel unified architecture for biomedical literature data mining. By integrating complementary approaches in one architecture, the project has the potential to produce a powerful tool for bioinformatics and for most text processing tasks, and to attract diverse collaborators interested in accessing complex biomedical or general scientific data and information. Students are involved in this research through hands-on projects, a Co-Op program, and courses at both the graduate and undergraduate levels.

Project Title: High Performance Rough Sets Data Analysis in Data Mining
Sponsor: National Science Foundation (NSF), Award No. CCF 0514679
PI: Xiaohua Hu
Amount: $102,300
Duration: July 15, 2005 - June 30, 2008

Project Description:
Data mining (also known as Knowledge Discovery in Databases, KDD) is the process of extracting previously unknown and potentially useful information or patterns from very large data sets. KDD is usually a multiphase process involving steps such as data preparation, data preprocessing, feature selection, rule induction, knowledge evaluation, and deployment. Many novel data mining and learning algorithms have been developed, and vigorously so, but under rather ad hoc and vague concepts. These algorithms are, in most cases, the individual creations of different researchers and share little common methodological or foundational framework. In other words, the great majority of work in data mining focuses on algorithm development while neglecting fundamental theoretical issues concerning data, inter-data relationships, and the quality of the implicit information hidden in the data or in data redundancies. It is therefore hard to fully understand and evaluate how the individual phases influence one another and how each phase affects the whole knowledge discovery process. Further development and breakthroughs in data mining and learning algorithms require a deep examination of their foundations. The central goal of the proposed research is to develop a unified rough-set-based data mining framework for exploring fundamental issues of data mining and learning algorithms. It aims to present the analytical capabilities of rough set methodology in the context of data mining methodologies, techniques, and applications, and to provide a unified framework that helps to better understand the whole KDD process.
Intellectual merit: Rough set theory is particularly suited to reasoning about imprecise or incomplete data and to discovering relationships in the data. Its simplicity and mathematical clarity make it attractive to both theoreticians and application-oriented researchers. The main advantage of rough set theory is that it does not require any preliminary or additional information about the data, such as probability distributions in statistics, basic probability assignments in Dempster-Shafer theory, or membership values in fuzzy set theory. Rough set theory constitutes a sound basis for KDD and can be used in different phases of the KDD process. In particular, its formal techniques lead to promising methods and algorithms for the discovery, analysis, and characterization of functional or partial functional dependencies among attributes, and for feature selection, feature extraction, data reduction, decision rule generation, and pattern extraction (templates, association rules), all of which are fundamental issues of the KDD process. Rough set theory represents an innovative approach and can lead to new learning algorithms and to novel uses of data mining techniques.
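
As a rough illustration of the rough set notions mentioned above, the following minimal Python sketch computes the lower and upper approximations of a target set, and the degree to which a decision attribute depends on a set of condition attributes. It is not taken from the project: the toy table, the attribute names, and the function names (`equivalence_classes`, `approximations`, `dependency_degree`) are assumptions made for illustration only. Note that the computation uses nothing beyond the data table itself.

```python
from collections import defaultdict


def equivalence_classes(table, attributes):
    """Group object ids that are indistinguishable on the given attributes."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())


def approximations(table, attributes, target):
    """Return the (lower, upper) approximation of the set `target`."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, attributes):
        if cls <= target:   # class lies entirely inside the target
            lower |= cls
        if cls & target:    # class overlaps the target
            upper |= cls
    return lower, upper


def dependency_degree(table, condition_attrs, decision_attr):
    """Fraction of objects whose decision is determined by the conditions."""
    positive = set()
    for decision_class in equivalence_classes(table, [decision_attr]):
        lower, _ = approximations(table, condition_attrs, decision_class)
        positive |= lower
    return len(positive) / len(table)


# Hypothetical toy decision table (illustrative values only).
table = {
    "p1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "p2": {"headache": "yes", "temp": "high", "flu": "no"},
    "p3": {"headache": "no", "temp": "high", "flu": "yes"},
    "p4": {"headache": "no", "temp": "normal", "flu": "no"},
}

flu_cases = {obj for obj, row in table.items() if row["flu"] == "yes"}
lower, upper = approximations(table, ["headache", "temp"], flu_cases)
degree = dependency_degree(table, ["headache", "temp"], "flu")

print(sorted(lower))  # objects that certainly have flu
print(sorted(upper))  # objects that possibly have flu
print(degree)         # partial dependency of "flu" on the conditions
```

In this toy table, p1 and p2 have identical condition values but different decisions, so neither can be classified with certainty; that is why the lower and upper approximations differ and the dependency is only partial.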

Project Title: Center for Public Health Readiness and Communication
Sponsor: PA Dept. of Health
Co-PI: Xiaohua Hu
Amount: $1.5M
Duration: 09/2004-08/2007