Data Mining Techniques
What are the most used Data Mining techniques?
Classification is probably the most widely used data mining technique.
Most decision-making models are based on
classification methods. These techniques, also called classifiers, enable the
categorisation of data (or entities) into pre-defined classes.
The use of classification algorithms involves a training set consisting of
pre-classified examples. In the tax audit domain, the two classes could be
compliant filings versus non-compliant filings, and the training set would be
assembled from historical audits. The classifier calibration algorithm uses the
pre-classified examples to determine a set of parameters required for proper
discrimination between the classes. The algorithm then encodes these parameters
into a model called a classifier. Once such a classifier is calibrated, it can
assign new filings to either of the classes.
There are many algorithms that can be used for classification, including decision
trees, neural networks and logistic regression.
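The calibration step described above can be sketched with a deliberately simple classifier: a single-feature threshold rule (a decision stump). This is only an illustration under stated assumptions. The feature (a deduction ratio per filing), the training data and the function names calibrate and classify are hypothetical and do not come from any real audit system.

```python
# Hypothetical training set: (deduction_ratio, label) pairs from past audits.
# All numbers are invented for illustration.
TRAINING_SET = [
    (0.05, "compliant"),
    (0.10, "compliant"),
    (0.15, "compliant"),
    (0.40, "non-compliant"),
    (0.55, "non-compliant"),
    (0.70, "non-compliant"),
]


def calibrate(examples):
    """Find the threshold that misclassifies the fewest training examples.

    The model predicts "non-compliant" when the feature exceeds the threshold.
    """
    values = sorted(value for value, _ in examples)
    # Candidate thresholds: midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum(
            (value > threshold) != (label == "non-compliant")
            for value, label in examples
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def classify(threshold, value):
    """Apply the calibrated model to a new, unlabelled filing."""
    return "non-compliant" if value > threshold else "compliant"


threshold = calibrate(TRAINING_SET)
print(classify(threshold, 0.08))
print(classify(threshold, 0.60))
```

Here the threshold is the single parameter determined from the pre-classified examples, and the pair (threshold, classify) plays the role of the calibrated classifier. Decision trees, neural networks and logistic regression follow the same train-then-apply pattern with richer models.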
With this technique, the
data mining tool learns from examples in the data (data
warehouses, databases, etc.) how to partition or classify objects (an object can
be a thing, an action, or any other information that can be formalised).
As a result, the data mining software formulates classification rules.
- Example - customer database
- Question - Does the customer belong to the loyal ones?
- Typical rule formulated -
if PURCHASED = monthly and PROFIT > 5000$ and INCIDENTS = 0
then CUSTOMER_TYPE = LOYAL
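The typical rule for the customer database example translates almost directly into code. The sketch below assumes a hypothetical function name, customer_type, and assumes that PROFIT is a plain number in dollars. The rule only states when a customer is LOYAL, so the function returns None in every other case.

```python
def customer_type(purchased, profit, incidents):
    """Apply the classification rule from the text.

    Returns "LOYAL" when the rule fires and None otherwise, because the
    rule says nothing about customers who do not match it.
    """
    if purchased == "monthly" and profit > 5000 and incidents == 0:
        return "LOYAL"
    return None


print(customer_type("monthly", 6200, 0))
print(customer_type("monthly", 6200, 2))
```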
Clustering is a data mining technique used to discover and explore groupings
within data or entities. Clustering approaches are mainly used for
segmentation – for example, they can be used to identify polluted soil areas.
A clustering method allows entities to be partitioned into distinct groups,
also called “segments”. The main difference between
classification and clustering is that clustering structures data without
knowing anything about the classes in advance, while a classification method
assigns new items to classes that are known a priori.
Cluster analysis is an exploratory method, often presented visually, that helps to understand the structure of the data.
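To make the idea of partitioning without known classes concrete, here is a minimal sketch of k-means, one common clustering algorithm. The text does not name a specific algorithm, so k-means is an assumption chosen for brevity. The points, the function name k_means and the hand-picked starting centroids are all hypothetical.

```python
import math

# Hypothetical 2-D measurements (for example, sample coordinates).
# The values are invented for illustration.
POINTS = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Recompute centroids; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Starting centroids are chosen by hand here; real implementations usually
# pick them at random or with a seeding heuristic.
clusters, centroids = k_means(POINTS, centroids=[POINTS[0], POINTS[-1]])
for cluster in clusters:
    print(cluster)
```

No labels are supplied anywhere: the two segments emerge only from the distances between points, which is the contrast with classification drawn above.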
Association rules are basic types of patterns or
regularities that are found in transactional-type data. This data mining
technique has its origins in traditional retail marketing where it can
discover affinities between items that occur within a particular shopping
trip (for example, what items typically co-occur as contents of a shopping
basket). Hence, an alternative name for this type of analysis is market basket analysis.
From a set of transaction data (for example tax filings, or insurance
claims), association rules can discover characteristics within a transaction
that imply the presence of other characteristics in the same transaction.
For two sets of characteristics X and Y, an association rule is usually
denoted as X → Y to convey that the presence of the characteristics X in a
transaction frequently implies the presence of the characteristics Y.
With the help of association methods, data mining
software creates rules that associate one attribute of a
relation with another. Set-oriented approaches are very efficient at discovering these rules.
- Example - customer database in a supermarket
- 56% of customers who purchase Article1 also purchase Article2
Here 56% is the confidence factor of the rule.
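The confidence factor in the supermarket example can be computed directly from transaction data. The sketch below uses the standard definition, confidence(X → Y) = support(X and Y together) / support(X). The baskets are invented, so the resulting figure differs from the 56% quoted above, and the function names support and confidence are hypothetical.

```python
# Hypothetical supermarket baskets; the items and counts are invented.
TRANSACTIONS = [
    {"Article1", "Article2"},
    {"Article1", "Article2", "Article3"},
    {"Article1", "Article3"},
    {"Article2"},
    {"Article1"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)


conf = confidence({"Article1"}, {"Article2"}, TRANSACTIONS)
print(f"Article1 -> Article2: confidence {conf:.0%}")
```

Representing each basket as a set keeps the computation set-oriented: the subset test `itemset <= basket` is all that is needed to count matching transactions.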
Sequential patterns involve mining
frequently occurring patterns of activity over a period of time. In many
situations, not only may the coexistence of items within a transaction be
important (which would be discovered by association rules algorithms), but
also the order in which those items appear across ordered transactions, and
the amount of time between transactions (which would be discovered by
sequential pattern detection algorithms). Thus, sequential pattern
detection methods are similar to association rules, except that they look
for patterns across time (as opposed to patterns within transactions).
This could be a pattern that represents a sequence of tax filings over
time, or a sequence of purchases over time, etc.
Sequential patterns differ from other data mining methods in their temporal factor.
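The role of ordering can be illustrated with a very small sketch that counts, for every ordered pair of items, in how many customer histories the first item occurs before the second. This is a simplification of real sequential pattern mining: it ignores the amount of time between transactions and only handles patterns of length two. The purchase histories and the function name ordered_pair_support are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each already ordered by time.
# The data is invented for illustration.
SEQUENCES = {
    "customer_1": ["printer", "ink", "paper"],
    "customer_2": ["printer", "paper", "ink"],
    "customer_3": ["ink", "printer"],
}


def ordered_pair_support(sequences):
    """Count in how many sequences item a occurs some time before item b."""
    counts = Counter()
    for events in sequences.values():
        # combinations() preserves input order, so (a, b) means a came first.
        pairs = {(a, b) for a, b in combinations(events, 2) if a != b}
        counts.update(pairs)  # a set, so each sequence counts a pair once
    return counts


pair_counts = ordered_pair_support(SEQUENCES)
print(pair_counts[("printer", "ink")], pair_counts[("ink", "printer")])
```

An association rule would treat "printer, ink" and "ink, printer" as the same co-occurrence; here the two orderings are counted separately, which is the temporal factor described above.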