STATISTICS-Q

STATISTICS

STUDENT NAME

AFFILIATION

PROFESSOR

SUBMISSION DATE

Discuss the C4.5 method

C4.5 is an algorithm that uses the concept of information entropy to predict the value of a categorical (class) attribute. It generates a decision tree that identifies the class of an object from its descriptive attributes. At each node, the algorithm chooses the attribute with the highest normalized information gain (gain ratio), that is, the attribute that most effectively splits the set of samples into subsets that are each dominated by one class. It is referred to as a statistical classifier because its main aim is to partition a dataset into groups according to the variable to be predicted.
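The entropy-based choice of a splitting attribute can be sketched as follows. This is a minimal illustration and not Quinlan's implementation: it assumes every attribute is categorical, it scores attributes by gain ratio (information gain divided by split information), and the toy rows and the function names entropy and gain_ratio are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by the split information."""
    total = len(rows)
    # Group the target labels by the value of the candidate attribute.
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    base = entropy([row[target] for row in rows])
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = base - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

best = max(["outlook", "windy"], key=lambda a: gain_ratio(rows, a, "play"))
print(best)
```

In this toy data the attribute outlook is selected, because splitting on it leaves two pure subsets while splitting on windy leaves the classes as mixed as before.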

Explain how decision trees are used in classification.

In classification, the leaves of a decision tree correspond to the classes of the response variable. A standard classification tree is built on the homogeneity of the data: each split is chosen so that the resulting subsets are as pure as possible with respect to the class. For example, suppose two variables, size and age, are used to predict whether a person signs up to become a swimsuit model. If the data show that 80% of people younger than 25 sign up, we split the data on age, and age becomes the top node of the tree. This standard CART (classification and regression tree) procedure is used when the target variable is categorical in nature.
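The age split in the example can be checked numerically. The sketch below measures homogeneity with Gini impurity, the measure commonly used by CART; the records, the threshold of 25 and the function names gini and split_impurity are hypothetical and chosen only so that 80% of the under-25 group signs up.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def split_impurity(records, threshold):
    """Weighted Gini impurity of the two groups produced by the test age < threshold."""
    younger = [signed for age, signed in records if age < threshold]
    older = [signed for age, signed in records if age >= threshold]
    total = len(records)
    return len(younger) / total * gini(younger) + len(older) / total * gini(older)


# Hypothetical (age, signed_up) records: 4 of the 5 people under 25 sign up (80%).
records = [
    (19, True), (21, True), (22, True), (24, True), (23, False),
    (30, False), (35, False), (41, False), (28, True), (52, False),
]

before = gini([signed for _, signed in records])
after = split_impurity(records, 25)
print(before, after)
```

The impurity falls from about 0.5 before the split to about 0.32 after it, which is why a tree learner would consider age a useful top node for this hypothetical data.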

Explain decision tree induction.

Decision tree induction is the learning of a decision tree from a labelled data set. J. Ross Quinlan developed the decision tree induction algorithm known as ID3 and its successor, C4.5. A decision tree is a structure made up of a root node, branches and leaf nodes. Each internal node represents a test on a variable, each branch represents an outcome of that test, and each leaf node holds a class label (Berry & Linoff, 2000).
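The node, branch and leaf structure can be made concrete with a small hand-built tree. This is only an illustrative sketch: the nested-dictionary layout, the attribute names and the classify function are assumptions, and the tree is written by hand rather than induced from data.

```python
# Internal nodes test an attribute, branches are the test outcomes,
# and leaves hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                "yes": {"label": "stay in"},
                "no": {"label": "play"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}


def classify(node, example):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while "label" not in node:
        outcome = example[node["attribute"]]
        node = node["branches"][outcome]
    return node["label"]


print(classify(tree, {"outlook": "sunny", "windy": "no"}))  # play
```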

Explain the concept of data mining and the techniques of data mining.

Data mining is the extraction of hidden information from large databases and the analysis and summarization of that data into useful information. It provides tools for better decision support systems by predicting future trends and behaviors, allowing businesses to make proactive decisions (Berry & Linoff, 2000). This is done by analyzing data from different dimensions and relating the summarized information that is identified. Common techniques include the following. Artificial neural networks are models that learn through training and resemble the structure of biological neural networks. Decision trees are tree-shaped structures that represent sets of decisions; methods include classification and regression trees (CART). Rule induction extracts useful if-then rules from data based on statistical significance.

Describe the following

Numeric prediction with examples

Numeric prediction is the task of predicting unknown or missing values of a continuous class rather than a categorical label. The model is a continuous-valued function; schemes that produce one include linear regression, model tree generators, instance-based learners and decision tables. An example is predicting the price of a house from its floor area.
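As a sketch of numeric prediction, the code below fits a straight line by ordinary least squares, which is the simplest form of linear regression. The floor areas, prices and the function name fit_line are hypothetical; the prices were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: floor area in square metres, price in thousands.
areas = [50, 60, 80, 100]
prices = [150, 180, 240, 300]

slope, intercept = fit_line(areas, prices)
# Predict a continuous value for an area that was not in the training data.
predicted = slope * 70 + intercept
print(slope, intercept, predicted)
```

The output is a continuous value (about 210 for an area of 70), not a class label, which is what distinguishes numeric prediction from classification.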

Classification and clustering with example

Classification assigns new samples to known classes, whereas clustering groups samples based on patterns in the data. In clustering there are no class labels; an example is identifying new animal species. In classification, for example with a decision tree, a model is learned from labelled samples drawn from a known set of classes.
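The contrast can be sketched with two very small routines. The first classifies a new sample using labelled examples (a nearest-neighbour rule is used here for brevity instead of a decision tree); the second groups unlabelled values with a simple two-cluster k-means. The measurements, labels and function names are hypothetical.

```python
def nearest_neighbour(labelled, x):
    """Classification: give x the label of the closest labelled sample."""
    return min(labelled, key=lambda pair: abs(pair[0] - x))[1]


def two_means(values, iterations=10):
    """Clustering: split unlabelled 1-D values into two groups (k-means, k = 2)."""
    centres = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


# Hypothetical body lengths in cm, with known species labels.
labelled = [(4.0, "species A"), (5.0, "species A"), (20.0, "species B"), (22.0, "species B")]
print(nearest_neighbour(labelled, 6.0))

# The same kind of measurement without labels: the groups must be discovered.
unlabelled = [4.0, 5.0, 6.0, 20.0, 21.0, 22.0]
print(two_means(unlabelled))
```

The classifier can only answer with a class it has already seen, while the clustering routine proposes groups without being told what they are.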

Difference between divide & conquer and separate & conquer

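In divide and conquer, the subproblems are independent of each other. These algorithms break a problem down into subproblems that can be solved easily. In decision tree learning, this means the data set is split on an attribute and each resulting subset is then handled recursively.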

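A separate-and-conquer rule learning algorithm produces a set of rules by specialising a general rule. In each step, a specialised rule covers a subset of the positive examples while excluding the negative ones; the covered examples are then removed (separated) and further rules are learned for the remaining examples (conquered).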
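A separate-and-conquer learner can be sketched as a simple sequential covering loop. This is an illustrative simplification and not any specific published algorithm: rules are conjunctions of attribute = value tests, conditions are added greedily by precision, and the examples and function names are hypothetical.

```python
def covers(rule, example):
    """A rule covers an example when every attribute = value test holds."""
    return all(example[a] == v for a, v in rule.items())


def learn_rule(positives, negatives, attributes):
    """Specialise the most general (empty) rule until it excludes all negatives."""
    rule = {}
    pos, neg = positives, negatives
    while neg:
        candidates = {(a, ex[a]) for ex in pos for a in attributes if a not in rule}
        if not candidates:
            break  # no attribute left to specialise on

        def precision(cond):
            a, v = cond
            p = sum(1 for ex in pos if ex[a] == v)
            n = sum(1 for ex in neg if ex[a] == v)
            return (p / (p + n), p)

        # sorted() makes tie-breaking deterministic.
        a, v = max(sorted(candidates), key=precision)
        rule[a] = v
        pos = [ex for ex in pos if ex[a] == v]
        neg = [ex for ex in neg if ex[a] == v]
    return rule


def separate_and_conquer(positives, negatives, attributes):
    rules = []
    remaining = list(positives)
    while remaining:
        rule = learn_rule(remaining, negatives, attributes)
        rules.append(rule)
        # Separate: drop the positives this rule covers; conquer the rest.
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules


# Hypothetical examples of the positive class ("play") and the negative class.
positives = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
negatives = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
]

rules = separate_and_conquer(positives, negatives, ["outlook", "windy"])
print(rules)
```

Unlike divide and conquer, which keeps all examples and partitions them among subtrees, each pass here removes the examples that are already explained and learns the next rule only from what is left.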

Association rule mining and issues with association rule mining, explained with an example

Association rule mining is a data mining task that identifies relationships between items in databases. Association rules are if-then statements that help uncover relationships between seemingly unrelated data in a relational database; they can be used, for instance, by educators. This is done by analyzing the data using the criteria of support and confidence to identify the most important relationships. The main issue with association rule mining is overfitting: because limited data is available and the number of candidate rules is high, false discoveries are statistically inevitable. Examples of applications include e-learning systems and market basket transactions (Hastie, Tibshirani, & Friedman, 2009).
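Support and confidence, the two standard measures for ranking association rules, can be computed directly from a list of transactions. The market-basket transactions and the function names below are hypothetical and serve only to show the arithmetic.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule under test: bread -> milk
print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here the rule bread -> milk has support 0.5 and confidence of about 0.67. With only four transactions such figures are unreliable, which illustrates why many candidate rules on little data lead to false discoveries.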

What is mining? Explain pre-mining and post-mining with test.

Data mining is the practice of searching large data stores to identify patterns and trends, often presented as easy-to-interpret visuals. It uses mathematical algorithms to segment the data and evaluate the probability of future events.

Pre-mining is a preliminary review that helps analyze data sets before data mining. It is essential for uncovering patterns that can be mined within a specified time. This is mainly done by detecting problems in the data set using detection techniques and by working with manageable data sets.

Post-mining is the evaluation of problems in the mined model, such as the effects of noisy data, by reducing the decision tree. This is mainly done by post-pruning or truncation. It assists in identifying problems before the model is implemented in a system or used by an end user (Berry & Linoff, 2000).

Difference between supervised and unsupervised learning (at least 5 differences)

In supervised learning, the output (target) values are provided along with the data and are used to train the machine to produce the desired outputs, whereas in unsupervised learning no output values are provided; instead, the data is clustered into various classes (Kuhn & Johnson, 2013).

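Supervised learning is typically done using statistically based machine learning algorithms such as regression and classification trees, whereas unsupervised learning uses methods that discover structure in the data itself, such as clustering; attractor neural networks are one example of a model that can learn without labels.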

In supervised learning, labels are given and one tries to learn how to predict or model them, whereas in unsupervised learning no labels are given and one tries to extract general information from the data.

In supervised learning applications, the training data contains examples of the input vectors together with their corresponding target vectors, whereas in unsupervised learning the training data consists of a set of input vectors x without any corresponding target values.

A supervised learning algorithm involves a dependent (target) variable that is to be predicted from a set of independent variables. Using this set of variables, one generates a function that maps inputs to desired outputs (Kuhn & Johnson, 2013). An unsupervised learning algorithm does not have a target or outcome variable to predict or estimate. It is used for clustering a population into groups, which is widely used for segmenting different groups for specific interventions.

REFERENCES

Berry, M. J., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: Wiley Computer Pub.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.