-
Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data
Authors: Peter D. Turney (National Research Council of Canada)
Comments: 36 pages, issued 2002
Report-no: NRC-44947
Subj-class: Learning; Information Retrieval
ACM-class: H.3.1; H.3.3; I.2.6; I.2.7
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases. I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive.
Full-text: PDF only
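The abstract above names the frequency and location of a candidate phrase as the most important features. A minimal sketch of how those two features might be computed follows; the tokenization, the n-gram candidate generation, and the function name `candidate_features` are assumptions made for illustration, not the paper's method.

```python
import re
from collections import Counter

def candidate_features(text, max_words=3):
    """Map each candidate phrase to (frequency, relative first position).

    Candidates here are simply all word n-grams up to max_words long;
    this is an illustrative assumption, not the paper's candidate filter.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    first_seen = {}
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            counts[phrase] += 1
            # Remember the word index of the first occurrence only.
            first_seen.setdefault(phrase, i)
    total = max(len(words), 1)
    return {p: (counts[p], first_seen[p] / total) for p in counts}

doc = "Data mining finds patterns. Data mining scales to large data."
features = candidate_features(doc)
print(features["data mining"])
```

A supervised learner would then take these values, together with any further features, as the input vector for classifying each candidate as keyphrase or non-keyphrase.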
-
Using Hierarchical Data Mining to Characterize Performance of Wireless System Configurations
Authors: Alex Verstak, Naren Ramakrishnan, Kyung Kyoon Bae, William H. Tranter, Layne T. Watson, Jian He, Clifford A. Shaffer, Theodore S. Rappaport
Subj-class: Computational Engineering, Finance, and Science
ACM-class: I.6.4
This paper presents a statistical framework for assessing wireless systems performance using hierarchical data mining techniques. We consider WCDMA (wideband code division multiple access) systems with two-branch STTD (space time transmit diversity) and 1/2 rate convolutional coding (forward error correction codes). Monte Carlo simulation estimates the bit error probability (BEP) of the system across a wide range of signal-to-noise ratios (SNRs). A performance database of simulation runs is collected over a targeted space of system configurations. This database is then mined to obtain regions of the configuration space that exhibit acceptable average performance. The shape of the mined regions illustrates the joint influence of configuration parameters on system performance. The role of data mining in this application is to provide explainable and statistically valid design conclusions. The research issue is to define statistically meaningful aggregation of data in a manner that permits efficient and effective data mining algorithms. We achieve a good compromise between these goals and help establish the applicability of data mining for characterizing wireless systems performance.
Full-text: PostScript, PDF, or Other formats
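To illustrate the idea of estimating bit error probability by Monte Carlo simulation across several SNRs, here is a minimal sketch. It deliberately swaps in a much simpler system, uncoded BPSK over an additive white Gaussian noise channel, rather than the WCDMA/STTD configuration studied in the paper; the function names and the bit count are assumptions.

```python
import math
import random

def simulate_bep(snr_db, n_bits=100_000, seed=0):
    """Monte Carlo estimate of the BEP of uncoded BPSK over AWGN at Eb/N0 = snr_db."""
    rng = random.Random(seed)
    ebn0 = 10 ** (snr_db / 10)
    sigma = math.sqrt(1 / (2 * ebn0))  # noise std dev for unit-energy symbols
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0
        received = symbol + rng.gauss(0.0, sigma)
        # Hard decision: a positive sample is decoded as bit 1.
        if (received > 0) != bool(bit):
            errors += 1
    return errors / n_bits

def theoretical_bep(snr_db):
    """Closed-form BEP of BPSK over AWGN, used only as a sanity check."""
    ebn0 = 10 ** (snr_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))

bep_curve = {snr: simulate_bep(snr) for snr in (0, 2, 4, 6)}
for snr, bep in bep_curve.items():
    print(snr, bep, theoretical_bep(snr))
```

A performance database in the sense of the abstract would hold many such estimates, one per simulated configuration and SNR, ready for aggregation.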
-
Symbolic Methodology in Numeric Data Mining: Relational Techniques for Financial Applications
Authors: B. Kovalerchuk, E. Vityaev, H. Yusupov
Comments: 20 pages, 1 figure, 16 tables
Subj-class: Computational Engineering, Finance, and Science
ACM-class: I.2.6
Currently, statistical and artificial neural network methods dominate financial data mining. Alternative relational (symbolic) data mining methods have shown their effectiveness in robotics, drug design, and other applications. Traditionally, symbolic methods prevail in areas with significant non-numeric (symbolic) knowledge, such as relative location in robot navigation. At first glance, stock market forecasting looks like a purely numeric area to which symbolic methods are irrelevant. One of our major goals is to show that financial time series can benefit significantly from relational data mining based on symbolic methods. The paper gives an overview of relational data mining methodology and develops these techniques for financial data mining.
Full-text: PDF only
-
Petabyte Scale Data Mining: Dream or Reality?
Authors: Alexander S. Szalay, Jim Gray, Jan vandenBerg
Comments: originals at this http URL
Report-no: MSR-TR-2002-84
Subj-class: Databases; Computational Engineering, Finance, and Science
ACM-class: H.2.8;J.2
Journal-ref: SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii
Science is becoming very data intensive [1]. Today's astronomy datasets with tens of millions of galaxies already present substantial challenges for data mining. In less than 10 years the catalogs are expected to grow to billions of objects, and image archives will reach Petabytes. Imagine having a 100GB database in 1996, when disk scanning speeds were 30MB/s, and database tools were immature. Such a task today is trivial, almost manageable with a laptop. We think that the issue of a PB database will be very similar in six years. In this paper we scale our current experiments in data archiving and analysis on the Sloan Digital Sky Survey [2,3] data six years into the future. We analyze these projections and look at the requirements of performing data mining on such data sets. We conclude that the task scales rather well: we could do the job today, although it would be expensive. There do not seem to be any show-stoppers that would prevent us from storing and using a Petabyte dataset six years from today.
Full-text: PDF only
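The 1996 comparison in the abstract above can be made concrete with a little arithmetic. The sketch below computes how long one sequential scan of a 100GB database takes at 30MB/s, and then what aggregate scan rate a Petabyte would need to be read in the same time. Decimal units (1 GB = 10^9 bytes) and the "same scan time" target are assumptions made here for illustration; the paper's own projections are not reproduced.

```python
import math

GB = 10**9   # decimal units assumed
MB = 10**6
PB = 10**15

def scan_time_seconds(size_bytes, rate_bytes_per_s):
    """Time for one sequential pass over the data at a fixed scan rate."""
    return size_bytes / rate_bytes_per_s

# The 1996 scenario from the abstract: 100GB scanned at 30MB/s.
t_1996 = scan_time_seconds(100 * GB, 30 * MB)

# Aggregate rate needed to scan 1 PB in that same time (illustrative target).
required_rate = PB / t_1996

print(round(t_1996 / 60, 1), "minutes")
print(required_rate / GB, "GB/s aggregate")
```

Under these assumptions a single scan takes just under an hour, and matching that for a Petabyte would take an aggregate rate of about 300 GB/s, which in practice means spreading the scan over many disks in parallel.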
-
Sampling Strategies for Mining in Data-Scarce Domains
Authors: Naren Ramakrishnan, Chris Bailey-Kellogg
Subj-class: Computational Engineering, Finance, and Science; Artificial Intelligence
ACM-class: D.2.6; G.1.2; G.1.3; G.3; I.2.10; I.5; H.2.8
Data mining has traditionally focused on the task of drawing inferences from large datasets. However, many scientific and engineering domains, such as fluid dynamics and aircraft design, are characterized by scarce data, due to the expense and complexity of associated experiments and simulations. In such data-scarce domains, it is advantageous to focus the data collection effort on only those regions deemed most important to support a particular data mining objective. This paper describes a mechanism that interleaves bottom-up data mining, to uncover multi-level structures in spatial data, with top-down sampling, to clarify difficult decisions in the mining process. The mechanism exploits relevant physical properties, such as continuity, correspondence, and locality, in a unified framework. This leads to effective mining and sampling decisions that are explainable in terms of domain knowledge and data characteristics. This approach is demonstrated in two diverse applications -- mining pockets in spatial data, and qualitative determination of Jordan forms of matrices.
Full-text: PostScript, PDF, or Other formats
-
Data Mining the SDSS SkyServer Database
Authors: Jim Gray, Alex S. Szalay, Ani R. Thakar, Peter Z. Kunszt, Christopher Stoughton, Don Slutz, Jan vandenBerg
Comments: 40 pages, Original source is at this http URL
Report-no: Microsoft Tech Report MSR TR 02 01
Subj-class: Databases; Digital Libraries
ACM-class: H.2.8; H.3.3; H.3.5; H.3.7; H.4.2
An earlier paper (Szalay et al., "Designing and Mining MultiTerabyte Astronomy Archives: The Sloan Digital Sky Survey," ACM SIGMOD 2000) described the Sloan Digital Sky Survey's (SDSS) data management needs by defining twenty database queries and twelve data visualization tasks that a good data management system should support. We built a database and interfaces to support both the query load and also a website for ad-hoc access. This paper reports on the database design, describes the data loading pipeline, and reports on the query implementation and performance. The queries typically translated to a single SQL statement. Most queries run in less than 20 seconds, allowing scientists to interactively explore the database. This paper is an in-depth tour of those queries. Readers should first have studied the companion overview paper, Szalay et al., "The SDSS SkyServer, Public Access to the Sloan Digital Sky Server Data," ACM SIGMOD 2002.
Full-text: PDF only
-
A Data Mining Framework for Optimal Product Selection in Retail Supermarket Data: The Generalized PROFSET Model
Authors: Tom Brijs, Bart Goethals, Gilbert Swinnen, Koen Vanhoof, Geert Wets
Subj-class: Databases; Artificial Intelligence
ACM-class: H.2.8
In recent years, data mining researchers have developed efficient association rule algorithms for retail market basket analysis. Still, retailers often complain that it is unclear how to use association rules to optimize concrete retail marketing-mix decisions. It is in this context that, in a previous paper, the authors introduced a product selection model called PROFSET. This model selects the most interesting products from a product assortment based on their cross-selling potential, given some retailer-defined constraints. However, this model suffered from an important deficiency: it could not deal effectively with supermarket data, and no provisions were made to include retail category management principles. Therefore, in this paper, the authors present an important generalization of the existing model in order to make it suitable for supermarket data as well, and to enable retailers to add category restrictions to the model. Experiments on real-world data obtained from a Belgian supermarket chain produce very promising results and demonstrate the effectiveness of the generalized PROFSET model.
Full-text: PostScript, PDF, or Other formats
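The association rules mentioned in the abstract above rest on two standard measures, support and confidence. The following minimal sketch computes both over a toy set of market baskets. The basket contents and function names are made up for illustration, and the sketch covers only these basic measures, not the PROFSET optimization model itself.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0  # rule never applies; treat confidence as zero
    return support(baskets, set(antecedent) | set(consequent)) / base

# Hypothetical transactions, one frozenset per shopping basket.
baskets = [frozenset(b) for b in (
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
)]

print(support(baskets, {"bread", "butter"}))
print(confidence(baskets, {"bread"}, {"butter"}))
```

Frequent itemsets found this way capture which products sell together, which is the kind of cross-selling signal a product selection model can build on.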
-
Data Mining in Astronomical Databases
Authors: Kirk D. Borne (1 and 2) ((1) Raytheon Information Technology and Scientific Services, (2) NASA Goddard Space Flight Center)
Comments: 3 pages. Uses eso.sty style file. Paper to appear in the proceedings of the August 2000 "Mining the Sky" conference at MPA/ESO/MPE, Garching, Germany. (For figures and demos related to sample user scenarios.) (Revised version v2 only corrected these comments.)
Subj-class: Astrophysics; Databases; Information Retrieval; Digital Libraries
A Virtual Observatory (VO) will enable transparent and efficient access, search, retrieval, and visualization of data across multiple data repositories, which are generally heterogeneous and distributed. Aspects of data mining that apply to a variety of science user scenarios with a VO are reviewed.
Full-text: PostScript, PDF, or Other formats
-
Applications of Data Mining to Electronic Commerce
Authors: Ron Kohavi, Foster Provost
Comments: Editorial for special issue
Subj-class: Learning; Databases
ACM-class: I.2.6;H.2.8
Electronic commerce is emerging as the killer domain for data mining technology.
The following are five desiderata for success. Seldom are they all present in one data mining application.
- Data with rich descriptions. For example, wide customer records with many potentially useful fields allow data mining algorithms to search beyond obvious correlations.
- A large volume of data. The large model spaces corresponding to rich data demand many training instances to build reliable models.
- Controlled and reliable data collection. Manual data entry and integration from legacy systems both are notoriously problematic; fully automated collection is considerably better.
- The ability to evaluate results. Substantial, demonstrable return on investment can be very convincing.
- Ease of integration with existing processes. Even if pilot studies show potential benefit, deploying automated solutions to previously manual processes is rife with pitfalls. Building a system to take advantage of the mined knowledge can be a substantial undertaking. Furthermore, one often must deal with social and political issues involved in the automation of a previously manual business process.
Full-text: PostScript, PDF, or Other formats
-
Science User Scenarios for a Virtual Observatory Design Reference Mission: Science Requirements for Data Mining
Authors: Kirk D. Borne
Comments: 4 pages. Paper to appear in the proceedings of the June 2000 "Virtual Observatories of the Future" conference at Caltech, edited by R. J. Brunner, S. G. Djorgovski, & A. Szalay. (For figures and demos related to sample user scenarios.)
Subj-class: Astrophysics; Information Retrieval; Databases; Digital Libraries
The knowledge discovery potential of the new large astronomical databases is vast. When these are used in conjunction with the rich legacy data archives, the opportunities for scientific discovery multiply rapidly. A Virtual Observatory (VO) framework will enable transparent and efficient access, search, retrieval, and visualization of data across multiple data repositories, which are generally heterogeneous and distributed. Aspects of data mining that apply to a variety of science user scenarios with a VO are reviewed. The development of a VO should address the data mining needs of various astronomical research constituencies. By way of example, two user scenarios are presented which invoke applications and linkages of data across the catalog and image domains in order to address specific astrophysics research problems. These illustrate a subset of the desired capabilities and power of the VO, and as such they represent potential components of a VO Design Reference Mission.
Full-text: PostScript, PDF, or Other formats
-
Data Mining to Measure and Improve the Success of Web Sites
Authors: Myra Spiliopoulou, Carsten Pohle (Institute of Information Systems, Humboldt University Berlin)
Comments: 24 pages, 4 postscript figures and 4 figures containing only text. To be published in the Journal of Data Mining and Knowledge Discovery (Kluwer Academic Publishers), Special Issue on E-Commerce; subject to some revision
Subj-class: Learning; Databases
ACM-class: I.2.6; H.2.8
For many companies, competitiveness in e-commerce requires a successful presence on the web. Web sites are used to establish the company's image, to promote and sell goods and to provide customer support. The success of a web site affects and reflects directly the success of the company in the electronic market. In this study, we propose a methodology to improve the "success" of web sites, based on the exploitation of navigation pattern discovery. In particular, we present a theory, in which success is modelled on the basis of the navigation behavior of the site's users. We then exploit WUM, a navigation pattern discovery miner, to study how the success of a site is reflected in the users' behavior. With WUM we measure the success of a site's components and obtain concrete indications of how the site should be improved. We report on our first experiments with an online catalog, the success of which we have studied. Our mining analysis has shown very promising results, on the basis of which the site is currently undergoing concrete improvements.
Full-text: PostScript, PDF, or Other formats
-
Integrating E-Commerce and Data Mining: Architecture and Challenges
Authors: Suhail Ansari, Ron Kohavi, Llew Mason, Zijian Zheng
Comments: KDD workshop: WebKDD 2000
Subj-class: Learning; Artificial Intelligence; Computer Vision and Pattern Recognition; Databases
ACM-class: I.2.6;H.2.8
Journal-ref: WEBKDD'2000 workshop: Web Mining for E-Commerce -- Challenges and Opportunities
We show that the e-commerce domain can provide all the right ingredients for successful data mining and claim that it is a killer domain for data mining. We describe an integrated architecture, based on our experience at Blue Martini Software, for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We conclude with a set of challenges.
Full-text: PostScript, PDF
-
A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data
Authors: Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Salvatore Stolfo.
To appear in Data Mining for Security Applications, Kluwer, 2002.
Most current intrusion detection systems employ signature-based methods or data mining-based methods which rely on labeled training data. This training data is typically expensive to produce. We present a new geometric framework for unsupervised anomaly detection algorithms, which are designed to process unlabeled data. In our framework, data elements are mapped to a feature space, which is typically a vector space R^d. Anomalies are detected by determining which points lie in sparse regions of the feature space. Our first map is a data-dependent normalization feature map, which we apply to network connections. Our second feature map is a spectrum kernel, which we apply to system call traces. We present three algorithms for detecting which points lie in sparse regions of the feature space. We evaluate our methods by performing experiments over network records from the KDD CUP 1999 data set and system call traces from the 1999 Lincoln Labs DARPA evaluation.
full paper: PDF
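Two ideas from the abstract above can be sketched briefly: a spectrum feature map over system call traces, and scoring points by how sparse their neighbourhood is. The sketch assumes the usual definition of a k-spectrum (counts of all contiguous length-k subsequences) and uses a k-nearest-neighbour distance sum as one common sparsity score; it is not claimed to match any of the three algorithms in the paper, and the traces and points are made up.

```python
import math
from collections import Counter

def spectrum_features(trace, k=3):
    """Count every contiguous length-k subsequence of a system call trace."""
    return Counter(tuple(trace[i:i + k]) for i in range(len(trace) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Inner product of the two k-spectrum count vectors."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    return sum(count * fb[sub] for sub, count in fa.items())

def knn_score(points, index, k=2):
    """Sum of distances from points[index] to its k nearest neighbours.

    A large score means the point sits in a sparse region.
    """
    dists = sorted(math.dist(points[index], p)
                   for j, p in enumerate(points) if j != index)
    return sum(dists[:k])

# Hypothetical system call traces.
trace_a = ["open", "read", "write", "close"]
trace_b = ["open", "read", "write", "read"]
shared = spectrum_kernel(trace_a, trace_b, k=2)

# Hypothetical feature vectors; the last one lies far from the rest.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = [knn_score(points, i) for i in range(len(points))]
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(shared, most_anomalous)
```

Because no labels are used anywhere, the only tunable choices are the feature map and the threshold on the score above which a point is flagged.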
-
Data Mining Approaches for Intrusion Detection.
Authors: Wenke Lee and Sal Stolfo.
In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998
Computer Science Department, Columbia University, 500 West 120th Street, New York, NY 10027
In this paper we discuss our research in developing general and systematic methods for intrusion detection. The key ideas are to use data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior, and use the set of relevant system features to compute (inductively learned) classifiers that can recognize anomalies and known intrusions. Using experiments on the sendmail system call data and the network tcpdump data, we demonstrate that we can construct concise and accurate classifiers to detect anomalies. We provide an overview of two general data mining algorithms that we have implemented: the association rules algorithm and the frequent episodes algorithm. These algorithms can be used to compute the intra- and inter-audit record patterns, which are essential in describing program or user behavior. The discovered patterns can guide the audit data gathering process and facilitate feature selection. To meet the challenges of both efficient learning (mining) and real-time detection, we propose an agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated (detection) models to the detection agents.
Full paper PS
-
Mining Audit Data to Build Intrusion Detection Models
Authors: Wenke Lee, Sal Stolfo, and Kui Mok.
In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD '98), New York, NY, August 1998
Columbia University, 500 West 120th Street, New York, NY 10027 {wenke,sal,mok}@cs.columbia.edu
In this paper we discuss a data mining framework for constructing intrusion detection models. The key ideas are to mine system audit data for consistent and useful patterns of program and user behavior, and use the set of relevant system features presented in the patterns to compute (inductively learned) classifiers that can recognize anomalies and known intrusions. Our past experiments showed that classifiers can be used to detect intrusions, provided that sufficient audit data is available for training and the right set of system features are selected. We propose to use the association rules and frequent episodes computed from audit data as the basis for guiding the audit data gathering and feature selection processes. We modify these two basic algorithms to use axis attribute(s) as a form of item constraints to compute only the relevant ("useful") patterns, and an iterative level-wise approximate mining procedure to uncover the low frequency (but important) patterns. We report our experiments in using these algorithms on real-world audit data.
full paper PS
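The frequent episodes idea mentioned in the abstract above can be illustrated by measuring how often a set of event types occurs together within a sliding window over an audit trail. The sketch below is a simplification made for illustration: windows are taken over record positions rather than timestamps, the episode is unordered, and it omits the axis attributes and level-wise approximate mining that the paper describes. The event names are hypothetical.

```python
import math

def episode_frequency(events, episode, window):
    """Fraction of sliding windows (by record index) containing every event type in episode."""
    episode = set(episode)
    n_windows = len(events) - window + 1
    if n_windows <= 0:
        return 0.0  # trail shorter than one window
    hits = sum(episode <= set(events[i:i + window]) for i in range(n_windows))
    return hits / n_windows

# Hypothetical audit trail of event types, in order of occurrence.
events = ["login", "su", "read", "login", "read", "su", "logout"]
freq = episode_frequency(events, {"login", "su"}, window=3)
print(freq)
```

An episode would be reported as frequent when this fraction exceeds a chosen minimum threshold; such patterns can then suggest which features to gather for a classifier.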
-
Towards Automatic Intrusion Detection using NFR
Authors: Wenke Lee, Chris Park, and Sal Stolfo.
In Proceedings of the 1st USENIX Workshop on Intrusion Detection and Network Monitoring, April 1999
Computer Science Department, Columbia University 500 West 120th Street, New York, NY 10027
There is often the need to update an installed Intrusion Detection System (IDS) due to new attack methods or upgraded computing environments. Since many current IDSs are constructed by manual encoding of expert security knowledge, changes to IDSs are expensive and require many hours of programming and debugging. We describe a data mining framework for adaptively building Intrusion Detection (ID) models specifically for use in Network Flight Recorder (NFR) [10]. The central idea is to utilize auditing programs to extract an extensive set of features that describe each network connection or host session, and apply data mining programs to learn rules that accurately capture the behavior of intrusions and normal activities. These rules can then be used for misuse detection and anomaly detection. Detection models are then incorporated into NFR through a machine translator, which produces a working detection model in the form of N-Code, NFR's powerful filtering language.
Full paper PS
-
Data Mining Methods for Detection of New Malicious Executables
Authors: Matthew G. Schultz, Eleazar Eskin, Erez Zadok, and Salvatore J. Stolfo.
To Appear in Proceedings of IEEE Symposium on Security and Privacy. Oakland, CA: May 2001.
A serious security threat today is malicious executables, especially new, unseen malicious executables. Many of these new malicious executables are undetectable by current antivirus systems because those systems do not contain signatures for these new instances of malicious programs. These new malicious executables are created every day, and thus pose a serious security threat. We present a framework that detects new, previously unseen malicious executables. Compared with a traditional signature-based method, our method more than doubles the current detection rates for new malicious executables.
Full paper: PS, PDF
-
Cleaning Financial Data
From: Numerical Algorithms Group
Increasingly, sophisticated methods are available for analyzing financial data and helping decision makers. In practice, the data that will be used can be full of errors. It is often the more sophisticated methods that seem to be particularly sensitive to the presence of bad values in the data. Therefore, it makes sense to deal with the bad data before the modeling takes place - improve the quality of the data and you are very likely to improve the quality of the results.
Full text PDF
-
Examples of the Use of Data Mining in Financial Applications
From: Numerical Algorithms Group
This article considers building mathematical models with financial data by using data mining techniques. In general, data mining methods such as neural networks and decision trees can be a useful addition to the techniques available to the financial analyst. However, the data mining techniques tend to require more historical data than the standard models and, in the case of neural networks, can be difficult to interpret.
Full text PDF
-
Numeric Components and Data Mining: Part I
From: Numerical Algorithms Group
This is the first of a three-part series that discusses how numeric components underlie the success of emerging data mining and automated knowledge discovery tools. NAG components have long been used in military and other mission-critical applications. In this interview with Tony Nilles, vice president of Sales and Marketing for the U.S. offices of NAG in Downers Grove, Illinois, the emerging trend of data mining and similar software specialists using numeric components is discussed.
Full text PDF
-
Numeric Components and Data Mining: Part II
From: Numerical Algorithms Group
The Informix NAG Datablade provides a new and exceptional way to handle time-series data. It includes 50 of the Numerical Algorithm Group's mathematical functions that assist the accurate analysis of high volumes of data. In this interview, the executive director of database business development for Informix explains why numeric components are critical to the success of this new species of data mining tools launched by Informix.
Full text PDF
-
Numeric Components and Data Mining: Part III
From: Numerical Algorithms Group
Seeking to put themselves in the "shoes" of their customers for enterprise performance management, PeopleSoft Inc. product planners asked themselves how they could best bring new analytic tools to their users. PeopleSoft concluded that seamlessly weaving the world's most tested and proven statistical algorithms into their applications would provide compelling advantages to their company. PeopleSoft's vice president of product management explains how numeric components are used in their highly acclaimed tool for automated knowledge discovery.
Full text PDF