Data Mining Input: Concepts, Instances, and Attributes
Input takes the following forms:
Concept: The thing that is to be learned is called the concept. A concept should be:
Intelligible in that it can be understood
Operational in that it can be applied to actual examples
Instances: The data consists of various instances of the class. E.g. the table below consists of 2 instances
Attributes: Each instance of the class has various attributes. E.g. the table below consists of two attributes {Name, Age}
Types of learning in data mining
Classification learning:
Learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples
Also called supervised learning
E.g. classification rules for the weather forecasting problem:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
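The three weather rules above can be written directly as code. This is only a minimal sketch: the function name `classify_play` and the choice to return `None` when no rule fires are assumptions made for illustration, since the slide does not say what happens to uncovered cases.

```python
from typing import Optional


def classify_play(outlook: str, humidity: str, windy: bool) -> Optional[str]:
    """Apply the three weather rules in order; return None if no rule fires."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    return None  # assumption: the slide gives no default rule


print(classify_play("sunny", "high", False))
print(classify_play("overcast", "normal", True))
```

An instance such as (sunny, normal, not windy) is covered by none of the three rules, which is why the sketch needs an explicit fallback.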
Numeric prediction
Same as classification learning but the outcome to be predicted is not a discrete class but a numeric quantity
Clustering
Groups of examples that belong together are sought and gathered into clusters
E.g. based on a bank's data, the following relation between debt and income was seen:
Association rules
Any association among features is sought, not just ones that predict a particular class value
It predicts any attribute, not just the class
It can predict more than one attribute value at a time
E.g. from the following supermarket data it can be concluded: If milk and bread are bought, customers also buy butter
A few important terms…
Concept description: Output produced by a learning scheme

WEKA: Data Mining Input Concepts Instances And Attributes

  • 19.
    Flat file: Each dataset is represented as a matrix of instances versus attributes, which in database terms is a single relation, or a flat file
  • 20.
    Closed world assumption: The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption
    Steps to prepare data:
    1. Data assembly and aggregation
    2. Data integration
    3. Data cleaning
    4. General preparation
  • 21.
    Data assembly and aggregation
    Instances in the input should be independent
  • 22.
    Independence can be achieved by de-normalization
  • 23.
    In database terms, take two relations and join them together to make one, a process of flattening that is technically called de-normalization
  • 24.
    Possible with any finite set of finite relations
    Example: the input is a family tree
  • 25.
    We are trying to find the ‘Sister of’ relationship
    Each row of the tree is mapped to an instance:
    We can't make sense of this with respect to our requirement or concept. Therefore …
  • 26.
    We de-normalize these tables to get:
    Here we can clearly see the ‘Sister of’ relationship
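The flattening step can be sketched in Python as a join of a person table with itself. Everything here is an assumption made for illustration: the names, genders, and parents in `people` are invented (they are not the slide's table), and `same_parents` and `denormalize` are hypothetical helper names.

```python
from itertools import permutations

# Hypothetical family table: name -> (gender, parent1, parent2).
# None marks an unknown parent.
people = {
    "Peter":  ("male",   None,    None),
    "Peggy":  ("female", None,    None),
    "Steven": ("male",   "Peter", "Peggy"),
    "Pam":    ("female", "Peter", "Peggy"),
    "Grace":  ("female", None,    None),
}


def same_parents(a, b):
    """True when both people have known, identical parents."""
    _, a_p1, a_p2 = people[a]
    _, b_p1, b_p2 = people[b]
    return a_p1 is not None and a_p2 is not None and (a_p1, a_p2) == (b_p1, b_p2)


def denormalize():
    """Join the table with itself: one flat instance per ordered pair of people."""
    rows = []
    for first, second in permutations(people, 2):
        g1, p1a, p1b = people[first]
        g2, p2a, p2b = people[second]
        # 'second' is a sister of 'first' if she is female and shares both parents.
        sister = "yes" if g2 == "female" and same_parents(first, second) else "no"
        rows.append((first, g1, p1a, p1b, second, g2, p2a, p2b, sister))
    return rows


rows = denormalize()
print(len(rows))
for row in rows:
    if row[-1] == "yes":
        print(row)
```

Each flat row is self-contained, so a learning scheme can treat the rows as independent instances. Note that five people already yield 20 rows, almost all of them negative.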
  • 27.
    Problems with de-normalization:
      If the relationship between a large number of items is required, the tables will be huge
      It produces irregularities in the data that are completely spurious
      Relations might not be finite (use: inductive logic programming)
    Overlay data: Sometimes data relevant to the problem at hand needs to be collected from outside the organization. This is called overlay data.
  • 28.
    Data integration
    Integration of system-wide databases is difficult because different departments will use/have:
      Different styles of record keeping
      Different conventions
      Different degrees of data aggregation, etc.
      Different types of errors
      Different time periods
      Different primary keys
    These issues are addressed by the idea of company-wide databases, a process called data warehousing
  • 29.
    Data cleaning
    Data cleaning is the careful checking of data
    It helps in resolving many architectural issues with different databases
    Data cleaning usually requires good domain knowledge
  • 30.
    Attribute-Relation File Format (ARFF)
    Definition: An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes
    Conventions used in ARFF:
    ARFF header
      Lines beginning with % are comments
      To declare the relation: @relation <name of relation>
      To declare an attribute: @attribute <attribute> <data type>
    ARFF data section
      To start the actual data: @data, followed by row-wise comma-separated data
  • 31.
    Data types for ARFF:
      Numeric can be real or integer numbers
      Nominal values are defined by providing a <nominal-specification> listing the possible values: {nm-value1, nm-value2, …} e.g. {yes, no}
      Values containing spaces must be quoted
      String attributes allow us to create attributes containing arbitrary textual values
      The date type is used as: @attribute <name> date [<date-format>]
      The default date format is the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss"
      Missing values are represented by ?
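To make the header and data conventions concrete, here is a small ARFF text together with a minimal Python reader. This is only a sketch under stated assumptions: the weather values are invented for illustration, `parse_arff` is a hypothetical helper (not part of WEKA), and it only handles dense rows with unquoted values that contain no commas or spaces.

```python
ARFF_TEXT = """\
% Illustrative weather data (values invented for this example)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, false, no
overcast, ?, true, yes
rainy, 65, true, no
"""


def parse_arff(text):
    """Parse a simple dense ARFF string into (relation, attributes, rows).

    Assumes unquoted values that contain no commas or spaces.
    """
    relation = None
    attributes = []  # list of (name, type) pairs
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # '?' marks a missing value; store it as None
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, dtype = line.split(None, 2)
            attributes.append((name, dtype))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)
print(attributes)
print(rows)
```

The reader keeps every value as a string; converting numeric or date attributes according to their declared type is left out to keep the sketch short.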
  • 32.
    Sparse ARFF files
    Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented
    Same header as ARFF but a different data section. Instead of representing each value in order, like this:
      @data
      0, X, 0, Y, "class A"
    the non-zero attributes are explicitly identified by attribute number (starting from zero) and their value stated, like this:
      @data
      {1 X, 3 Y, 4 "class A"}
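The dense-to-sparse conversion described above can be sketched in a few lines of Python. The helper name `dense_to_sparse` is hypothetical, and the sketch assumes each row is already split into a list of strings.

```python
def dense_to_sparse(values):
    """Convert one dense ARFF data row (list of strings) to sparse notation."""
    parts = []
    for index, value in enumerate(values):
        if value == "0":
            continue  # zeros are implied in a sparse row
        # Quote values containing spaces, as ARFF requires.
        text = f'"{value}"' if " " in value else value
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


print(dense_to_sparse(["0", "X", "0", "Y", "class A"]))
```

For the row used in the slide this prints `{1 X, 3 Y, 4 "class A"}`.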