ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

1

Electronic Resource

The Role of Statistics in IS/IT: Practical Gains from Mined Data (1999)

Dudewicz, Edward J. ; Karian, Zaven A.

Springer

Information systems frontiers 1 (1999), S. 259-266

add to mindlist on the mindlist

Details

ISSN: 1572-9419

Keywords: data mining ; statistics ; patterns in data ; fitting distributions ; lambda ; beta

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Data mining has, in the past, tended to use simplistic statistical methods (or even none at all). In this paper we show by example how cutting edge (but easy to use and comprehend) statistical methods can yield substantial gains in data mining. The role of statistics in IS/IT (information systems and information technology) in general can be substantial, yielding more nearly optimal performance of problems at the emerging frontiers in all their aspects.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1010002428579

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

2

Electronic Resource

Learning Qualitative Models of Dynamic Systems (1997)

Hau, David T. ; Coiera, Enrico W.

Springer

Machine learning 26 (1997), S. 177-211

add to mindlist on the mindlist

Details

ISSN: 0885-6125

Keywords: inductive logic programming ; qualitative modelling ; system identification ; PAC learning ; physiological modelling ; cardiovascular system ; data mining ; patient monitoring

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract The automated construction of dynamic system models is an important application area for ILP. We describe a method that learns qualitative models from time-varying physiological signals. The goal is to understand the complexity of the learning task when faced with numerical data, what signal processing techniques are required, and how this affects learning. The qualitative representation is based on Kuipers' QSIM. The learning algorithm for model construction is based on Coiera's GENMODEL. We show that QSIM models are efficiently PAC learnable from positive examples only, and that GENMODEL is an ILP algorithm for efficiently constructing a QSIM model. We describe both GENMOEL which performs RLGG on qualitative states to learn a QSIM model, and the front-end processing and segmenting stages that transform a signal into a set of qualitative states. Next we describe results of experiments on data from six cardiac bypass patients. Useful models were obtained, representing both normal and abnormal physiological states. Model variation across time and across different levels of temporal abstraction and fault tolerance is explored. The assumption made by many previous workers that the abstraction of examples from data can be separated from the learning task is not supported by this study. Firstly, the effects of noise in the numerical data manifest themselves in the qualitative examples. Secondly, the models learned are directly dependent on the initial qualitative abstraction chosen.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1007317323969

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

3

Electronic Resource

Learning to Recognize Volcanoes on Venus (1998)

Burl, Michael C. ; Asker, Lars ; Smyth, Padhraic ; [et al.]

Springer

Machine learning 30 (1998), S. 165-194

add to mindlist on the mindlist

Details

ISSN: 0885-6125

Keywords: machine learning ; pattern recognition ; learning from examples ; large image databases ; data mining ; automatic cataloging ; detection of natural objects ; Magellan SAR ; JARtool ; volcanoes ; Venus ; principal components analysis ; trainable

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Dramatic improvements in sensor and image acquisition technology have created a demand for automated tools that can aid in the analysis of large image databases. We describe the development of JARtool, a trainable software system that learns to recognize volcanoes in a large data set of Venusian imagery. A machine learning approach is used because it is much easier for geologists to identify examples of volcanoes in the imagery than it is to specify domain knowledge as a set of pixel-level constraints. This approach can also provide portability to other domains without the need for explicit reprogramming; the user simply supplies the system with a new set of training examples. We show how the development of such a system requires a completely different set of skills than are required for applying machine learning to “toy world” domains. This paper discusses important aspects of the application process not commonly encountered in the “toy world,” including obtaining labeled training data, the difficulties of working with pixel data, and the automatic extraction of higher-level features.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1007400206189

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

4

Electronic Resource

A Search for Hidden Relationships: Data Mining with Genetic Algorithms (1997)

SZPIRO, GEORGE G.

Springer

Computational economics 10 (1997), S. 267-277

add to mindlist on the mindlist

Details

ISSN: 1572-9974

Keywords: data mining ; forecasting ; genetic algorithms.

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science , Economics

Notes: Abstract This paper presents an algorithm that permits the search for dependencies among sets of data (univariate or multivariate time-series, or cross-sectional observations). The procedure is modeled after genetic theories and Darwinian concepts, such as natural selection and survival of the fittest. It permits the discovery of equations of the data-generating process in symbolic form. The genetic algorithm that is described here uses parts of equations as building blocks to breed ever better formulas. Apart from furnishing a deeper understanding of the dynamics of a process, the method also permits global predictions and forecasts. The algorithm is successfully tested with artificial and with economic time-series and also with cross-sectional data on the performance and salaries of NBA players during the 94–95 season.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008673309609

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

5

Electronic Resource

Websom for Textual Data Mining (1999)

Lagus, Krista ; Honkela, Timo ; Kaski, Samuel ; [et al.]

Springer

Artificial intelligence review 13 (1999), S. 345-364

add to mindlist on the mindlist

Details

ISSN: 1573-7462

Keywords: data mining ; document filtering ; exploratory data analysis ; information retrieval ; self-organizing map ; SOM ; text document collection ; WEBSOM

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract New methods that are user-friendly and efficient are needed for guidanceamong the masses of textual information available in the Internet and theWorld Wide Web. We have developed a method and a tool called the WEBSOMwhich utilizes the self-organizing map algorithm (SOM) for organizing largecollections of text documents onto visual document maps. The approach toprocessing text is statistically oriented, computationally feasible, andscalable – over a million text documents have been ordered on a single map.In the article we consider different kinds of information needs and tasksregarding organizing, visualizing, searching, categorizing and filteringtextual data. Furthermore, we discuss and illustrate with examples howdocument maps can aid in these situations. An example is presented wherea document map is utilized as a tool for visualizing and filtering a stream ofincoming electronic mail messages.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1006586221250

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

6

Electronic Resource

Efficient GA Based Techniques for Classification (1999)

Sharpe, Peter K. ; Glover, Robin P.

Springer

Applied intelligence 11 (1999), S. 277-284

add to mindlist on the mindlist

Details

ISSN: 1573-7497

Keywords: genetic algorithms ; classification ; data mining

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract A common approach to evaluating competing models in a classification context is via accuracy on a test set or on cross-validation sets. However, this can be computationally costly when using genetic algorithms with large datasets and the benefits of performing a wide search are compromised by the fact that estimates of the generalization abilities of competing models are subject to noise. This paper shows that clear advantages can be gained by using samples of the test set when evaluating competing models. Further, that applying statistical tests in combination with Occam's razor produces parsimonious models, matches the level of evaluation to the state of the search and retains the speed advantages of test set sampling.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008386925927

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

7

Electronic Resource

Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm (1999)

Debuse, Justin C.W. ; Rayward-Smith, Victor J.

Springer

Applied intelligence 11 (1999), S. 285-295

add to mindlist on the mindlist

Details

ISSN: 1573-7497

Keywords: discretisation ; data mining ; simulated annealing

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract An introduction to the approaches used to discretise continuous database features is given, together with a discussion of the potential benefits of such techniques. These benefits are investigated by applying discretisation algorithms to two large commercial databases; the discretisations yielded are then evaluated using a simulated annealing based data mining algorithm. The results produced suggest that dramatic reductions in problem size may be achieved, yielding improvements in the speed of the data mining algorithm. However, it is also demonstrated under certain circumstances that the discretisation produced may give an increase in problem size or allow overfitting by the data mining algorithm. Such cases, within which often only a small proportion of the database belongs to the class of interest, highlight the need both for caution when producing discretisations and for the development of more robust discretisation algorithms.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008339026836

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

8

Electronic Resource

Characteristic Rule Discovery in Aurum-3 (1999)

McSherry, David ; Roantree, Donal

Springer

Applied intelligence 11 (1999), S. 297-304

add to mindlist on the mindlist

Details

ISSN: 1573-7497

Keywords: data mining ; rule discovery ; interest measure ; distinctive features ; characteristic rules

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract One strategy for increasing the efficiency of rule discovery in data mining is to target a restricted class of rules, such as exact or almost exact rules, rules with a limited number of conditions, or rules in which each condition, on its own, eliminates a competing outcome class. An algorithm is presented for the discovery of rules in which each condition is a distinctive feature of the outcome class on its right-hand side in the subset of the data set defined by the conditions, if any, which precede it. Such a rule is said to be characteristic for the outcome class. A feature is defined as distinctive for an outcome class if it maximises a well-known measure of rule interest or is unique to the outcome class in the data set. In the special case of data mining which arises when each outcome class is represented by a single instance in the data set, a feature of an object is shown to be distinctive if and only if no other feature is shared by fewer objects in the data set.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008343110906

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

9

Electronic Resource

Knowledge discovery from structural data (1995)

Cook, Diane J. ; Holder, Lawrence B. ; Djoko, Surnjani

Springer

Journal of intelligent information systems 5 (1995), S. 229-248

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: machine discovery ; data mining ; data compression ; inexact graph match ; scene analysis ; chemical analysis

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Discovering repetitive substructure in a structural database improves the ability to interpret and compress the data. This paper describes the Subdue system that uses domain-independent and domain-dependent heuristics to find interesting and repetitive structures in structural data. This substructure discovery technique can be used to discover fuzzy concepts, compress the data description, and formulate hierarchical substructure definitions. Examples from the domains of scene analysis, chemical compound analysis, computer-aided design, and program analysis demonstrate the benefits of the discovery technique.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1007/BF00962235

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

10

Electronic Resource

Medical Data Mining on the Internet: Research on a Cancer Information System (1999)

Houston, Andrea L. ; Chen, Hsinchun ; Hubbard, Susan M. ; [et al.]

Springer

Artificial intelligence review 13 (1999), S. 437-466

add to mindlist on the mindlist

Details

ISSN: 1573-7462

Keywords: CancerLit ; concept spaces ; data mining ; Hopfield net ; information retrieval ; Kohonen net ; medical knowledge ; neural networks

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract This paper discusses several data mining algorithms and techniques thatwe have developed at the University of Arizona Artificial Intelligence Lab.We have implemented these algorithms and techniques into severalprototypes, one of which focuses on medical information developed incooperation with the National Cancer Institute (NCI) and the University ofIllinois at Urbana-Champaign. We propose an architecture for medicalknowledge information systems that will permit data mining across severalmedical information sources and discuss a suite of data mining tools that weare developing to assist NCI in improving public access to and use of theirexisting vast cancer information collections.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1006548623067

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

11

Electronic Resource

Mining Text Using Keyword Distributions (1998)

Feldman, Ronen ; Dagan, Ido ; Hirsh, Haym

Springer

Journal of intelligent information systems 10 (1998), S. 281-300

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: data mining ; text mining ; text categorization ; distribution comparison ; trend analysis

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Knowledge Discovery in Text, in which documents are labeled by keywords, and knowledge discovery is performed by analyzing the co-occurrence frequencies of the various keywords labeling the documents. We show how this keyword-frequency approach supports a range of KDD operations, providing a suitable foundation for knowledge discovery and exploration for collections of unstructured text.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008623632443

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

12

Electronic Resource

Feature Subset Selection within a Simulated Annealing Data Mining Algorithm (1997)

Debuse, Justin C.W. ; Rayward-Smith, Victor J.

Springer

Journal of intelligent information systems 9 (1997), S. 57-81

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: Feature subset selection ; data mining ; simulated annealing

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract An overview of the principle feature subset selection methods isgiven. We investigate a number of measures of feature subset quality, usinglarge commercial databases. We develop an entropic measure, based upon theinformation gain approach used within ID3 and C4.5 to build trees, which isshown to give the best performance over our databases. This measure is usedwithin a simple feature subset selection algorithm and the technique is usedto generate subsets of high quality features from the databases. A simulatedannealing based data mining technique is presented and applied to thedatabases. The performance using all features is compared to that achievedusing the subset selected by our algorithm. We show that a substantialreduction in the number of features may be achieved together with animprovement in the performance of our data mining system. We also present amodification of the data mining algorithm, which allows it to simultaneouslysearch for promising feature subsets and high quality rules. The effect ofvarying the generality level of the desired pattern is alsoinvestigated.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008641220268

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

13

Electronic Resource

Discovering Patterns from Large and Dynamic Sequential Data (1997)

Wang, Ke

Springer

Journal of intelligent information systems 9 (1997), S. 33-56

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: combinatorial pattern matching ; data mining ; sequential pattern ; suffix tree ; update

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Most daily and scientific data are sequential in nature. Discoveringimportant patterns from such data can benefit the user and scientist bypredicting coming activities, interpreting recurring phenomena, extractingoutstanding similarities and differences for close attention, compressingdata, and detecting intrusion. We consider the following incrementaldiscovery problem for large and dynamic sequential data. Suppose thatpatterns were previously discovered and materialized. An update is made tothe sequential database. An incremental discovery will take advantage ofdiscovered patterns and compute only the change by accessing the affectedpart of the database and data structures. In addition to patterns, thestatistics and position information of patterns need to be updated to allowfurther analysis and processing on patterns. We present an efficientalgorithm for the incremental discovery problem. The algorithm is applied tosequential data that honors several sequential patterns modeling weatherchanges in Singapore. The algorithm finds what it is supposed to find.Experiments show that for small updates and large databases, the incrementaldiscovery algorithm runs in time independent of the data size.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008689103430

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

14

Electronic Resource

Computationally Efficient Approximation of a Probabilistic Model for Document Representation in the WEBSOM Full-Text Analysis Method (1997)

Kaski, S.

Springer

Neural processing letters 5 (1997), S. 69-81

add to mindlist on the mindlist

Details

ISSN: 1573-773X

Keywords: data mining ; feature extraction ; information retrieval ; Self-Organizing Map (SOM) ; text analysis

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract WEBSOM is a recently developed neural method for exploring full-text document collections, for information retrieval, and for information filtering. In WEBSOM the full-text documents are encoded as vectors in a document space somewhat like in earlier information retrieval methods, but in WEBSOM the document space is formed in an unsupervised manner using the Self-Organizing Map algorithm. In this article the document representations the WEBSOM creates are shown to be computationally efficient approximations of the results of a certain probabilistic model. The probabilistic model incorporates information about the similarity of use of different words to take into account their semantic relations.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009618125967

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

15

Electronic Resource

Borders: An Efficient Algorithm for Association Generation in Dynamic Databases (1999)

Aumann, Yonatan ; Feldman, Ronen ; Lipshtat, Orly ; [et al.]

Springer

Journal of intelligent information systems 12 (1999), S. 61-73

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: association rules ; knowledge discovery ; data mining

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract We consider the problem of finding association rules in a database with binary attributes. Most algorithms for finding such rules assume that all the data is available at the start of the data mining session. In practice, the data in the database may change over time, with records being added and deleted. At any given time, the rules for the current set of data are of interest. The naive, and highly inefficient, solution would be to rerun the association generation algorithm from scratch following the arrival of each new batch of data. This paper describes the Borders algorithm, which provides an efficient method for generating associations incrementally, from dynamically changing databases. Experimental results show an improved performance of the new algorithm when compared with previous solutions to the problem.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1026482903537

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

16

Electronic Resource

Data Mining in Large Databases Using Domain Generalization Graphs (1999)

Hilderman, Robert J. ; Hamilton, Howard J. ; Cercone, Nick

Springer

Journal of intelligent information systems 13 (1999), S. 195-234

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: data mining ; knowledge discovery ; machine learning ; knowledge representation ; attribute-oriented generalization ; domain generalization graphs

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the domain generalization graphs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the domain generalization graphs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy. Our experimental results also show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008769516670

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

17

Electronic Resource

Information mediation in cyberspace: Scalable methods for declarative information networks (1996)

Dao, Son ; Perry, Brad

Springer

Journal of intelligent information systems 6 (1996), S. 131-150

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: information mediation ; data mining ; semantic integration ; ontologies ; declarative interoperability

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract An end-to-end discussion, from logical architecture to implementation, of issues and design decisions in declarative information networks is presented. A declarative information network is defined to be a dynamic and decentralized structure where value-added services are declared and applied as mediators in a scalable and controlled manner. A primary result is the need to adopt dynamically linked ontologies as the semantic basis for knowledge sharing in scalable networks. It is shown that data mining techniques provide a promising basis upon which to explore and develop this result. Our prototype system, entitled Mystique, is described in terms of KQML, distributed object management, and distributed agent execution. An example shows how we map our architecture into the World Wide Web (WWW) and transform the appearance of the WWW into an intelligently integrated and multi-subject distributed information network.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1007/BF00122125

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

18

Electronic Resource

On the Accuracy of Meta-learning for Scalable Data Mining (1997)

Chan, Philip K. ; Stolfo, Salvatore J.

Springer

Journal of intelligent information systems 8 (1997), S. 5-28

add to mindlist on the mindlist

Details

ISSN: 1573-7675

Keywords: machine learning ; meta-learning ; scalability ; data mining ; classifiers

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract In this paper, wedescribe a general approach to scaling data mining applications thatwe have come to call meta-learning. Meta-Learningrefers to a general strategy that seeks to learn how to combine anumber of separate learning processes in an intelligent fashion. Wedesire a meta-learning architecture that exhibits two key behaviors.First, the meta-learning strategy must produce an accurate final classification system. This means that a meta-learning architecturemust produce a final outcome that is at least as accurate as aconventional learning algorithm applied to all available data.Second, it must be fast, relative to an individual sequential learningalgorithm when applied to massive databases of examples, and operatein a reasonable amount of time. This paper focussed primarily onissues related to the accuracy and efficacy of meta-learning as ageneral strategy. A number of empirical results are presenteddemonstrating that meta-learning is technically feasible in wide-area,network computing environments.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008640732416

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

19

Electronic Resource

Multicategory Classification by Support Vector Machines (1999)

Bredensteiner, Erin J. ; Bennett, Kristin P.

Springer

Computational optimization and applications 12 (1999), S. 53-79

add to mindlist on the mindlist

Details

ISSN: 1573-2894

Keywords: support vector machines ; linear programming ; classification ; data mining ; machine learning.

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract We examine the problem of how to discriminate between objects of three or more classes. Specifically, we investigate how two-class discrimination methods can be extended to the multiclass case. We show how the linear programming (LP) approaches based on the work of Mangasarian and quadratic programming (QP) approaches based on Vapnik's Support Vector Machine (SVM) can be combined to yield two new approaches to the multiclass problem. In LP multiclass discrimination, a single linear program is used to construct a piecewise-linear classification function. In our proposed multiclass SVM method, a single quadratic program is used to construct a piecewise-nonlinear classification function. Each piece of this function can take the form of a polynomial, a radial basis function, or even a neural network. For the k 〉 2-class problems, the SVM method as originally proposed required the construction of a two-class SVM to separate each class from the remaining classes. Similarily, k two-class linear programs can be used for the multiclass problem. We performed an empirical study of the original LP method, the proposed k LP method, the proposed single QP method and the original k QP methods. We discuss the advantages and disadvantages of each approach.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008663629662

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

20

Electronic Resource

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals (1997)

Gray, Jim ; Chaudhuri, Surajit ; Bosworth, Adam ; [et al.]

Springer

Data mining and knowledge discovery 1 (1997), S. 29-53

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data cube ; data mining ; aggregation ; summarization ; database ; analysis ; query

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Data analysis applications typically aggregate data across manydimensions looking for anomalies or unusual patterns. The SQL aggregatefunctions and the GROUP BY operator produce zero-dimensional orone-dimensional aggregates. Applications need the N-dimensionalgeneralization of these operators. This paper defines that operator, calledthe data cube or simply cube. The cube operator generalizes the histogram,cross-tabulation, roll-up,drill-down, and sub-total constructs found in most report writers.The novelty is that cubes are relations. Consequently, the cubeoperator can be imbedded in more complex non-procedural dataanalysis programs. The cube operator treats each of the Naggregation attributes as a dimension of N-space. The aggregate ofa particular set of attribute values is a point in this space. Theset of points forms an N-dimensional cube. Super-aggregates arecomputed by aggregating the N-cube to lower dimensional spaces.This paper (1) explains the cube and roll-up operators, (2) showshow they fit in SQL, (3) explains how users can define new aggregatefunctions for cubes, and (4) discusses efficient techniques tocompute the cube. Many of these features are being added to the SQLStandard.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009726021843

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

21

Electronic Resource

Advanced Scout: Data Mining and Knowledge Discovery in NBA Data (1997)

Bhandari, Inderpal ; Colet, Edward ; Parker, Jennifer ; [et al.]

Springer

Data mining and knowledge discovery 1 (1997), S. 121-125

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data mining ; knowledge discovery ; attribute focusing ; basketball ; NBA

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Advanced Scout is a PC-based data mining application used by National Basketball Association (NBA)coaching staffs to discover interesting patterns in basketball game data. We describe Advanced Scout software from the perspective of data mining and knowledge discovery. This paper highlights the pre-processing of raw data that the program performs, describes the data mining aspects of the software and how the interpretation of patterns supports the processof knowledge discovery. The underlying technique of attribute focusing asthe basis of the algorithm is also described. The process of pattern interpretation is facilitated by allowing the user to relate patterns to video tape.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009782106822

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

22

Electronic Resource

A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships (1997)

Cooper, Gregory F.

Springer

Data mining and knowledge discovery 1 (1997), S. 203-224

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: causal discovery ; data mining ; observational data

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract This paper presents a simple, efficient computer-based method for discovering causal relationships from databases that contain observational data. Observational data is passively observed, as contrasted with experimental data. Most of the databases available for data mining are observational. There is great potential for mining such databases to discover causal relationships. We illustrate how observational data can constrain the causal relationships among measured variables, sometimes to the point that we can conclude that one variable is causing another variable. The presentation here is based on a constraint-based approach to causal discovery. A primary purpose of this paper is to present the constraint-based causal discovery method in the simplest possible fashion in order to (1) readily convey the basic ideas that underlie more complex constraint-based causal discovery techniques, and (2) permit interested readers to rapidly program and apply the method to their own databases, as a start toward using more elaborate causal discovery algorithms.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009787925236

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

23

Electronic Resource

Beyond Market Baskets: Generalizing Association Rules to Dependence Rules (1998)

Silverstein, Craig ; Brin, Sergey ; Motwani, Rajeev

Springer

Data mining and knowledge discovery 2 (1998), S. 39-68

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data mining ; market basket ; association rules ; dependence rules ; closure properties ; text mining

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B.” Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chi-squared test for independence from classical statistics. This leads to a measure that is upward-closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm's effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009713703947

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

24

Electronic Resource

Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values (1998)

Huang, Zhexue

Springer

Data mining and knowledge discovery 2 (1998), S. 283-304

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data mining ; cluster analysis ; clustering algorithms ; categorical data

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009769707641

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

25

Electronic Resource

Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules (1998)

Lee, S.D. ; Cheung, David W. ; Kao, Ben

Springer

Data mining and knowledge discovery 2 (1998), S. 233-262

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: sampling ; data mining ; knowledge discovery ; association rules ; update ; maintenance ; confidence interval

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract By nature, sampling is an appealing technique for data mining, because approximate solutions in most cases may already be of great satisfaction to the need of the users. We attempt to use sampling techniques to address the problem of maintaining discovered association rules. Some studies have been done on the problem of maintaining the discovered association rules when updates are made to the database. All proposed methods must examine not only the changed part but also the unchanged part in the original database, which is very large, and hence take much time. Worse yet, if the updates on the rules are performed frequently on the database but the underlying rule set has not changed much, then the effort could be mostly wasted. In this paper, we devise an algorithm which employs sampling techniques to estimate the difference between the association rules in a database before and after the database is updated. The estimated difference can be used to determine whether we should update the mined association rules or not. If the estimated difference is small, then the rules in the original database is still a good approximation to those in the updated database. Hence, we do not have to spend the resources to update the rules. We can accumulate more updates before actually updating the rules, thereby avoiding the overheads of updating the rules too frequently. Experimental results show that our algorithm is very efficient and highly accurate.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009703019684

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

26

Electronic Resource

Mining Pharmacy Data Helps to Make Profits (1998)

Hamuro, Yukinobu ; Katoh, Naoki ; Matsuda, Yasuyuki ; [et al.]

Springer

Data mining and knowledge discovery 2 (1998), S. 391-398

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data mining ; knowledge discovery ; pharmacy ; point of sales

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Pharma, a drugstore chain in Japan, has been remarkably successful in the effective use of data mining. From over one tera bytes of sales data accumulated in databases, it has derived much interesting and useful knowledge that in turn has been applied to produce profits. In this paper, we shall explain several interesting cases of knowledge discovery at Pharma. We then discuss the innovative features of the data mining system developed in Pharma that led to meaningful knowledge discovery.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009748731133

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

27

Electronic Resource

Partitioning Nominal Attributes in Decision Trees (1999)

Coppersmith, Don ; Hong, Se June ; Hosking, Jonathan R.M.

Springer

Data mining and knowledge discovery 3 (1999), S. 197-217

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: binary decision tree ; classification ; data mining ; entropy ; Gini index ; impurity ; optimal splitting

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract To find the optimal branching of a nominal attribute at a node in an L-ary decision tree, one is often forced to search over all possible L-ary partitions for the one that yields the minimum impurity measure. For binary trees (L = 2) when there are just two classes a short-cut search is possible that is linear in n, the number of distinct values of the attribute. For the general case in which the number of classes, k, may be greater than two, Burshtein et al. have shown that the optimal partition satisfies a condition that involves the existence of 2 L hyperplanes in the class probability space. We derive a property of the optimal partition for concave impurity measures (including in particular the Gini and entropy impurity measures) in terms of the existence ofL vectors in the dual of the class probability space, which implies the earlier condition. Unfortunately, these insights still do not offer a practical search method when n and k are large, even for binary trees. We therefore present a new heuristic search algorithm to find a good partition. It is based on ordering the attribute's values according to their principal component scores in the class probability space, and is linear in n. We demonstrate the effectiveness of the new method through Monte Carlo simulation experiments and compare its performance against other heuristic methods.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009869804967

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

28

Electronic Resource

CHAMP: A Prototype for Automated Cellular Churn Prediction (1999)

Masand, Brij ; Datta, Piew ; Mani, D.R. ; [et al.]

Springer

Data mining and knowledge discovery 3 (1999), S. 219-225

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data mining ; knowledge discovery ; churn prediction application ; predictive modeling

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract We describe CHAMP (CHurn Analysis, Modeling, and Prediction), an automated system for modeling cellular customer behavior on a large scale. Using historical data from GTE's data warehouse for cellular phone customers, every month CHAMP identifies churn factors for several geographic regions and updates models to generate churn scores predicting who is likely to churn within the near future. CHAMP is capable of developing customized monthly models and churn scores for over one hundred GTE cellular phone markets totaling over 5 million customers.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009873905876

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

29

Electronic Resource

Parallel Formulations of Decision-Tree Classification Algorithms (1999)

Srivastava, Anurag ; Han, Eui-Hong ; Kumar, Vipin ; [et al.]

Springer

Data mining and knowledge discovery 3 (1999), S. 237-261

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: data mining ; parallel processing ; classification ; scalability ; decision trees

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Classification decision tree algorithms are used extensively for data mining in many domains such as retail target marketing, fraud detection, etc. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of classification decision tree learning algorithm based on induction. We describe two basic parallel formulations. One is based on Synchronous Tree Construction Approach and the other is based on Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of using these methods and propose a hybrid method that employs the good features of these methods. We also provide the analysis of the cost of computation and communication of the proposed hybrid method. Moreover, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009832825273

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

30

Electronic Resource

Effect of Data Distribution in Parallel Mining of Associations (1999)

Cheung, David W. ; Xiao, Yongqiao

Springer

Data mining and knowledge discovery 3 (1999), S. 291-314

add to mindlist on the mindlist

Details

ISSN: 1573-756X

Keywords: association rules ; data mining ; data skewness ; workload balance ; parallel mining ; parallel computing

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract Association rule mining is an important new problem in data mining. It has crucial applications in decision support and marketing strategy. We proposed an efficient parallel algorithm for mining association rules on a distributed share-nothing parallel system. Its efficiency is attributed to the incorporation of two powerful candidate set pruning techniques. The two techniques, distributed and global prunings, are sensitive to two data distribution characteristics: data skewness and workload balance. The prunings are very effective when both the skewness and balance are high. We have implemented FPM on an IBM SP2 parallel system. The performance studies show that FPM outperforms CD consistently, which is a parallel version of the representative Apriori algorithm (Agrawal and Srikant, 1994). Also, the results have validated our observation on the effectiveness of the two pruning techniques with respect to the data distribution characteristics. Furthermore, it shows that FPM has nice scalability and parallelism, which can be tuned for different business applications.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1009836926181

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext

31

Electronic Resource

The Carnot Heterogeneous Database Project: Implemented Applications (1997)

Singh, Munindar P. ; Cannata, Philip E. ; Huhns, Michael N. ; [et al.]

Springer

Distributed and parallel databases 5 (1997), S. 207-225

add to mindlist on the mindlist

Details

ISSN: 1573-7578

Keywords: enterprise integration ; workflow management ; agents interoperation ; heterogeneous databases ; scientific decision support ; data mining

Source: Springer Online Journal Archives 1860-2000

Topics: Computer Science

Notes: Abstract The Carnot project was an ambitious research project inheterogeneous databases. It integrated a variety of techniques toaddress a wide range of problems in achieving interoperation inheterogeneous environments. Here we describe some of the majorimplemented applications of this project. These applications concern(a) accessing a legacy scientific database, (b) automating a workflowinvolving legacy systems, (c) cleaning data, and (d) retrievingsemantically appropriate information from structured databases inresponse to text queries. These applications support scientificdecision support, business process management, data integrityenhancement, and analytical decision support, respectively. Theydemonstrate Carnot‘s capabilities for (a) heterogeneous queryprocessing, (b) relaxed transaction and workflow management, (c)knowledge discovery, and (d) heterogeneous resource modelintegration.

Type of Medium: Electronic Resource

URL: http://dx.doi.org/10.1023/A:1008645509474

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

Paper (German National Licenses)

Fulltext