WEKA: Data Mining Input Concepts Instances And Attributes

1.
Data Mining Input: Concepts, Instances, and Attributes

2.
Input takes the following forms:
Concept: The thing that is to be learned is called the concept. A concept should be:

3.
Intelligible, in that it can be understood

4.
Operational, in that it can be applied to actual examples

5.
Instances: The data consists of various instances of the class. E.g. the table below consists of 2 instances

6.
Attributes: Each instance of the class has various attributes. E.g. the table below consists of two attributes {Name, Age}

Types of learning in data mining
Classification learning:

7.
The learning scheme is presented with a set of classified examples, from which it is expected to learn a way of classifying unseen examples

8.
Also called supervised learning

9.
E.g. Classification rules for the weather forecasting problem:
  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
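The three rules above can be written directly as a small function. This is only a sketch: the function name `play` and the dictionary keys are assumptions made for illustration, and because the three rules do not cover every combination of values, the function returns `None` when no rule fires.

```python
def play(instance):
    """Apply the three weather rules in order; return None if no rule fires."""
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"]:
        return "no"
    if instance["outlook"] == "overcast":
        return "yes"
    return None  # the three rules shown do not cover this instance


# Hypothetical unseen instance, invented for illustration.
example = {"outlook": "sunny", "humidity": "high", "windy": False}
decision = play(example)
```

Here `decision` is the class predicted for an unseen instance, which is what a classification scheme is expected to produce.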

10.
Numeric prediction

11.
Same as classification learning, but the outcome to be predicted is not a discrete class but a numeric quantity
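As a rough illustration of predicting a number rather than a class, the sketch below fits a least-squares line. The slides do not name a specific method, so linear regression is simply one convenient choice here, and the data (`temperature`, `minutes_played`) is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: the target is a number, not a class label.
temperature = [10.0, 20.0, 30.0]
minutes_played = [15.0, 35.0, 55.0]

slope, intercept = fit_line(temperature, minutes_played)
prediction = slope * 25.0 + intercept  # numeric outcome for an unseen input
```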

12.
Clustering

13.
Groups of examples that belong together are sought and grouped into clusters
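A minimal sketch of grouping examples without class labels follows. It uses k-means, which is one clustering method among many and is not named in the slides; the points and the choice of seeding the centers with the first `k` points are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centers."""
    centers = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centers


# Hypothetical unlabeled examples forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5)]
labels, centers = kmeans(points, k=2)
```

`labels` gives the cluster index of each point; no class attribute was supplied at any stage.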

14.
E.g. based on the data held by a bank, the following relation between debt and income was seen:

Association rules

15.
Any association among features is sought, not just ones that predict a particular class value

16.
It predicts any attribute, not just the class

17.
It can predict more than one attribute value at a time
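One common way to judge an association rule is by its support and confidence. These two measures are standard but are not named in the slides, and the baskets below are hypothetical, so treat this as a sketch rather than the method the deck has in mind.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical supermarket baskets, invented for illustration.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

# Rule under test: if milk and bread are bought, butter is bought too.
rule_support = support(baskets, {"milk", "bread", "butter"})
rule_confidence = confidence(baskets, {"milk", "bread"}, {"butter"})
```

Note that the consequent can be any attribute or set of attributes, which is what distinguishes association rules from classification rules.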

18.
E.g. from the following supermarket data it can be concluded: If milk and bread are bought, customers also buy butter

A few important terms…
Concept description: Output produced by a learning scheme

19.
Flat file: Each dataset is represented as a matrix of instances versus attributes, which in database terms is a single relation, or a flat file

20.
Closed world assumption: The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption

Steps to prepare data
  1. Data assembly and aggregation
  2. Data integration
  3. Data cleaning
  4. General preparation

21.
Data assembly and aggregation
Instances in the input should be independent

22.
Independence can be achieved by de-normalization

23.
In database terms, take two relations and join them together to make one, a process of flattening that is technically called de-normalization

24.
This is possible with any finite set of finite relations

Input is a family tree

25.
We are trying to find the ‘Sister of’ relationship
Each row of the tree is mapped to an instance:
We can't make sense of this with respect to our requirement or concept. Therefore ...

26.
We de-normalize these tables to get:
Here we can clearly see the ‘Sister of’ relationship
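The flattening step can be sketched as a self-join of the family relation, producing one row per ordered pair of people with the ‘Sister of’ answer as the final column. The names, the column layout, and the rule "same parents and second person is female" are assumptions for illustration, since the original tables are not reproduced here.

```python
# Hypothetical family relation: name -> (gender, parent1, parent2).
people = {
    "Ann":  ("female", "Mary", "Tom"),
    "Bob":  ("male",   "Mary", "Tom"),
    "Cara": ("female", "Mary", "Tom"),
    "Dan":  ("male",   "Sue",  "Ray"),
}


def sister_of_table(people):
    """Join the relation with itself: one flat row per ordered pair of people."""
    rows = []
    for first, (g1, p1a, p1b) in people.items():
        for second, (g2, p2a, p2b) in people.items():
            if first == second:
                continue
            same_parents = {p1a, p1b} == {p2a, p2b}
            # The second person is a sister of the first.
            is_sister = same_parents and g2 == "female"
            rows.append((first, g1, p1a, p1b, second, g2, p2a, p2b,
                         "yes" if is_sister else "no"))
    return rows


rows = sister_of_table(people)
sisters = [(r[0], r[4]) for r in rows if r[-1] == "yes"]
```

Four people already yield twelve rows, which hints at how quickly a de-normalized table grows.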

27.
Problems with de-normalization:
  If relationships between a large number of items are required, the tables will be huge
  It can produce apparent regularities in the data that are completely spurious
  Relations might not be finite (in that case, use inductive logic programming)

Overlay data: Sometimes data relevant to the problem at hand needs to be collected from outside the organization. This is called overlay data.

28.
Data Integration
Integration of system-wide databases is difficult because different departments will use or have:
  Different styles of record keeping
  Different conventions
  Different degrees of data aggregation, etc.
  Different types of errors
  Different time periods
  Different primary keys
These issues are addressed by company-wide databases, a process called data warehousing

29.
Data Cleaning
Data cleaning is the careful checking of data
It helps in resolving many architectural issues with different databases
Data cleaning usually requires good domain knowledge

30.
Attribute-Relation File Format (ARFF)
Definition: An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes
Conventions used in ARFF:
ARFF header
  Lines beginning with % are comments
  To declare the relation: @relation <name of relation>
  To declare an attribute: @attribute <attribute> <data type>
ARFF data section
  To start the actual data: @data, followed by row-wise comma-separated data
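To make the header and data conventions concrete, here is a deliberately small reader. It is a sketch, not WEKA's loader: the function name `parse_arff` and the sample file are hypothetical, and it assumes there are no quoted names or values, so plain splitting on whitespace and commas is enough.

```python
def parse_arff(text):
    """Very small ARFF reader: comments, @relation, @attribute, @data rows.

    Assumption: no quoted names or values. Real ARFF files may need a
    fuller parser.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            _, name, data_type = line.split(None, 2)
            attributes.append((name, data_type))
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical ARFF content, invented for illustration.
sample = """% Hypothetical weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, 83, yes
"""

relation, attributes, rows = parse_arff(sample)
```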

31.
Data types for ARFF:
  Numeric: can be real or integer numbers
  Nominal: values are defined by providing a <nominal-specification> listing the possible values: {nm-value1, nm-value2, …}, e.g. {yes, no}. Values that contain spaces must be quoted
  String: string attributes allow us to create attributes containing arbitrary textual values
  Date: declared as @attribute <name> date [<date-format>]. The default date format is the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss"
Missing values are represented by ?
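The sketch below shows how values of each type might be written into a data row: `?` for a missing value, quotes around text containing spaces, and the default date format. The helper name `format_value` is hypothetical, quoting the date string is a choice made here rather than a requirement stated in the slides, and quote escaping inside values is not handled.

```python
from datetime import datetime


def format_value(value):
    """Render one Python value as an ARFF data cell (simplified sketch)."""
    if value is None:
        return "?"  # missing value
    if isinstance(value, datetime):
        # Default ARFF date format: yyyy-MM-dd'T'HH:mm:ss
        return '"' + value.strftime("%Y-%m-%dT%H:%M:%S") + '"'
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    # Values containing spaces must be quoted.
    return '"' + text + '"' if " " in text else text


# Hypothetical row mixing nominal, numeric, missing, and date values.
row = ["class A", 42, None, datetime(2001, 4, 3, 12, 12, 12), "yes"]
line = ", ".join(format_value(v) for v in row)
```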

32.
Sparse ARFF files
Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented
Same header as ARFF but a different data section. Instead of representing each value in order, like this:
  @data
  0, X, 0, Y, "class A"
the non-zero attributes are explicitly identified by attribute number (starting from zero) and their value stated, like this:
  @data
  {1 X, 3 Y, 4 "class A"}
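The conversion from a dense row to its sparse form can be sketched in a few lines. The helper name `to_sparse` is hypothetical, and the row is assumed to be already split into string cells.

```python
def to_sparse(row):
    """Convert one dense row (list of string cells) to a sparse ARFF data line."""
    parts = []
    for index, value in enumerate(row):
        if value != "0":  # zero values are omitted in the sparse format
            parts.append(f"{index} {value}")
    return "{" + ", ".join(parts) + "}"


dense = ["0", "X", "0", "Y", '"class A"']
sparse_line = to_sparse(dense)
```

Each surviving entry is an attribute index (counted from zero) followed by a space and the value, which is why the dense row above keeps only indices 1, 3, and 4.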

33.
Visit more self-help tutorials
Pick a tutorial of your choice and browse through it at your own pace.
The tutorials section is free, self-guiding and will not involve any additional support.
Visit us at www.dataminingtools.net
