IJSRD - International Journal for Scientific Research & Development | Vol. 2, Issue 07, 2014 | ISSN (online): 2321-0613
All rights reserved by www.ijsrd.com

Data Mining in Multi-Instance and Multi-Represented Objects

Ajay Aggarwal1 Mohammad Danish2
1 Research Scholar 2 Assistant Professor
1,2 Department of Computer Science & Engineering
1,2 Al-Falah School of Engineering & Technology, Dhoj, Faridabad, Haryana

Abstract: In multi-instance learning, the training set comprises labeled bags that are composed of unlabeled instances, and the task is to predict the labels of unseen bags. In this paper, a web mining problem, i.e. web index recommendation, is investigated from a multi-instance view. In detail, each web index page is regarded as a bag, while each of its linked pages is regarded as an instance. A user favoring an index page means that he or she is interested in at least one page linked by the index page.

Keywords: Instances, Labeled, Multi Instance view, index

I. INTRODUCTION

Most drugs are small molecules that work by binding to larger protein molecules such as enzymes and cell-surface receptors. For a molecule qualified to make a drug, at least one of its low-energy shapes can bind tightly to the target area. For a molecule unqualified to make a drug, none of its low-energy shapes can bind tightly to the target area. The main difficulty of drug activity prediction is that each molecule may have many alternative low-energy shapes, but biochemists only know whether a molecule is qualified to make a drug or not; they do not know which of its alternative low-energy shapes is responsible for the qualification.[3]

II. WEB INDEX RECOMMENDATION

There are diverse web pages on the Internet. Some of them contain plentiful information but themselves provide only titles or brief summaries, leaving the detailed presentation to their linked pages. These web pages are called web index pages. For example, the NBA entrance at Yahoo!
(sports.yahoo.com/nba/) is a web index page.

Fig. 1: The web index page is regarded as a bag, while its linked pages are regarded as the instances in the bag

This problem can be viewed as a multi-instance problem. The goal is to label unseen web index pages as positive or negative. A positive web index page is one where the user is interested in at least one of its linked pages. A negative web index page is one where none of its linked pages interests the user. Thus, each index page can be regarded as a bag, while its linked pages can be regarded as the instances in the bag. For illustration, Fig. 1 shows a bag and two of its instances. [3]

To simplify the analysis, this paper focuses on the hypertext information on the pages and neglects other hypermedia such as images, audio, and video. Each instance can then be represented by a term vector $T = [t_1, t_2, \ldots, t_n]$, where $t_i$ ($i = 1, 2, \ldots, n$) is one of the $n$ most frequent terms appearing in the corresponding linked page. $T$ can be obtained by pre-accessing the linked page and counting the occurrences of the different terms. Note that trivial terms such as ‘a’, ‘the’, and ‘is’ are neglected in this process. All pages are described by the same number of frequent terms, i.e. all term vectors have the same length. However, even though term vectors corresponding to different instances have the same length, their components may be quite different. [10]

Moreover, since the web index pages corresponding to different bags may contain different numbers of links, the number of instances in the bags may differ. Thus, a web index page linking to $m$ pages, i.e. a bag containing $m$ instances, can be represented as $\{[t_{11}, t_{12}, \ldots, t_{1n}], [t_{21}, t_{22}, \ldots, t_{2n}], \ldots, [t_{m1}, t_{m2}, \ldots, t_{mn}]\}$. The label of the bag is positive if the web index page interested the user. Otherwise the label is negative.
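The bag representation described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `term_vector`, `make_bag`, and `bag_label` are hypothetical, the stop-word list contains just the three trivial terms mentioned in the text, and the tokenizer is a simple lowercase word split. A page with fewer than n distinct terms would yield a shorter vector here, which the text does not address.

```python
import re
from collections import Counter

# Trivial terms named in the text; a real stop-word list would be longer (assumption).
STOP_WORDS = {"a", "the", "is"}


def term_vector(page_text, n):
    """Return the n most frequent non-trivial terms of one linked page."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # most_common sorts by descending count; ties keep first-seen order.
    return [term for term, _ in counts.most_common(n)]


def make_bag(linked_pages, n):
    """Build one bag: a list of term vectors, one per linked page."""
    return [term_vector(text, n) for text in linked_pages]


def bag_label(instance_interest):
    """A bag is positive if the user is interested in at least one instance."""
    return any(instance_interest)


# Example: an index page linking to two pages, described by n = 2 terms each.
example_bag = make_bag(["the nba score score", "a trade is a trade rumor"], n=2)
```

Note that `bag_label` only restates the definition of a positive bag. In the learning task itself the per-instance interests are unknown, and only the bag label is observed.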
Note that web index pages may contain many links to advertisements or other index pages, which may confuse the analysis. This paper therefore requires that, for a linked page to be considered an instance in a bag, its corresponding link in the index page must contain at least four terms. Surprisingly, this simple strategy helps remove most useless links.[2]

III. COMPARING TXT-KNN AND CIT-KNN WITH FRETCIT-KNN

First, experiments are performed to evaluate the performance of Fretcit-kNN on the web index recommendation problem. Since Fretcit-kNN is an extended kNN algorithm that considers the characteristics of multi-instance problems, two extended kNN algorithms that do not consider those characteristics are also evaluated for comparison.[8,9]

The first compared algorithm is obtained by adapting the standard kNN algorithm to textual objects. Recall that the standard kNN algorithm uses Euclidean distance to measure the distance between examples, which prevents it from being applied to objects described by frequent textual terms. However, if the distance metric is replaced by fret-minH(.), the modified algorithm can easily be applied to textual objects. The modified algorithm is called Txt-kNN.[3,6]

This section also presents a new data mining problem called multi-instance outlier identification. This problem arises in tasks where each sample consists of many alternative feature vectors (instances) that describe it. This part defines multi-instance outliers and analyzes their basic types. Two general identification approaches are proposed based on the state-of-the-art (single-instance) outlier detector LOF (local outlier factor).[8] One approach uses the underlying mechanism of the kernel method and plugs a set distance into LOF to detect multi-instance outliers. The other approach takes each instance’s neighborhood into account.
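The first approach, using a set distance inside LOF, can be illustrated with a small sketch. It is written under several assumptions and is not the detectors proposed here: the instance-level distance is a Jaccard distance over term sets (the text does not define fret-minH(.), so this is a stand-in), the set distance is the minimal Hausdorff distance (the smallest distance between any pair of instances from the two bags), and the names `term_distance`, `min_hausdorff`, and `lof_scores` are hypothetical. The LOF computation follows the standard definition and assumes that no two distinct bags are at distance zero and that k is smaller than the number of bags.

```python
from itertools import product


def term_distance(a, b):
    """Jaccard distance between two term sets (assumed stand-in metric)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def min_hausdorff(bag_a, bag_b):
    """Minimal Hausdorff distance: closest pair of instances across two bags."""
    return min(term_distance(x, y) for x, y in product(bag_a, bag_b))


def lof_scores(bags, k):
    """Local outlier factor of each bag, computed on bag-to-bag set distances."""
    n = len(bags)
    dist = [[min_hausdorff(bags[i], bags[j]) for j in range(n)] for i in range(n)]

    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]  # distance to the k-th nearest bag
        k_dist.append(kd)
        # Ties at the k-distance are included in the neighborhood.
        neighbors.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the bag's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]
```

Scores close to 1 indicate a bag whose neighborhood is about as dense as its neighbors' neighborhoods, while clearly larger scores flag candidate multi-instance outliers. Other set distances could be substituted for `min_hausdorff` without changing `lof_scores`.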
Based on the two approaches, four concrete multi-instance outlier detectors are then introduced.

In clustering schemes, data objects are usually represented as vectors of feature-value pairs. Features represent certain attributes of the objects that are known to be useful for the clustering task. Attributes that are not relevant to forming structures out of the data can lead to inaccurate results. Attributes can be numeric or non-numeric, thus forming a mixed-mode data representation.[2] Conceptual clustering is one of the algorithms that can deal with mixed-mode data. However, conceptual clustering has primarily focused on attributes described by nominal values. The best way to combine numeric, ordinal, and nominal-valued data is still an open question.[1]

If a convention is adopted for the ordering of the attributes in a given problem context, instances of data can be represented as feature vectors consisting of the attribute values only, where the attribute names are implicitly known from their order. Sison and Shimura proposed a relational description model for clustering data, as opposed to the usual propositional attribute-value pair representation. Attributes are usually single-valued, but sometimes they can be multi-valued, as in the document clustering problem at hand. In this case a convention has to be adopted to deal with multi-valued attributes, depending on the problem context. [7]

Our data integration and visualization system is composed of three layers, in which the data constitutes the back-end layer (Fig. 1). Schema mappings, ontology definitions and conceptual learning implementations occupy the middle tier, and the user interface constitutes the front-end layer. The middle tier also comprises sets of algorithms and modules that process and display the results of a query. Most of our local data are represented in XML format. The data are stored using the XML data management system Tamino XML server (Software AG) in a Redhat Linux Advanced Server v2.1 environment. The databases are queried using Tamino XQuery (Fiebig and Schöning, 2004), which is an implementation of the XQuery language. The queries are enabled through the Tamino Java API.
For storing more voluminous data, such as gene-expression data and in-house produced mass spectrometry data, we use an Oracle 10g database server.

IV. INTEGRATED DATABASE SYSTEM

A. Architectural design

The core architecture of our data integration and visualisation system, called megNet, is composed of three layers: back-end, middle tier and front-end (Figure 1). The data, schema maps and ontology definitions constitute the back-end layer. Most of our local data are represented in XML or RDF formats. The data are stored using the XML data management system Tamino XML server (Software AG) in a Redhat Linux Advanced Server v3.0 environment. The databases are queried using Tamino X-Query, which is based on the XPath 1.0 specification. The queries are enabled through the Tamino Java API. For storing more voluminous data, such as gene expression data and in-house produced mass spectrometry data, we use an Oracle 10g database server (Oracle, Inc.). The Oracle queries are performed using Oracle JDBC Thin drivers. The results obtained from queries to Tamino and Oracle are combined at the Java programming level in the middle tier.

The middle tier comprises the business logic of our system. Business logic events, such as graph constructions, distance data projections and topology calculations, are implemented as stateless session beans and processed as web services. The session beans are the end points of the web services. They receive request messages from the client to perform a business logic event, and at the end of their life cycle they send the response to the client.

Fig. 1: Architecture of Tier Transmission

The middle tier resides physically in a JBoss 4.04 Application Server (JBoss, Inc.). The business logic events are processed in the EJB Container of JBoss. The client and server communicate through SOAP messages.
The middle tier converts SOAP messages to Java objects after it has received a request message from the front-end client, and converts Java objects to SOAP messages before they are sent back as a response message. These conversions are implemented using Apache Axis 1.4 (Apache Software Foundation)[11] and are processed in the Apache Tomcat 5.5 Servlet Container.

The front-end comprises the user interface for visualising the data and interacting with the end user. It is implemented in the Java environment.

REFERENCES

[1] D.W. Aha. Lazy learning: special issue editorial. Artificial Intelligence Review, vol.11, no.1-5, pp.7-10, 1997.
[2] R.A. Amar, D.R. Dooly, S.A. Goldman, and Q. Zhang. Multiple-instance learning of real-valued data. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, pp.3-10, 2001.
[3] P. Auer. On learning from multi-instance examples: empirical evaluation of a theoretical approach. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, pp.21-29, 1997.
[4] P. Auer, P.M. Long, and A. Srinivasan. Approximating hyper-rectangles: learning and pseudo-random sets. Journal of Computer and System Sciences, vol.57, no.3, pp.376-388, 1998.
[5] A. Blum and A. Kalai. A note on learning from multiple-instance examples. Machine Learning, vol.30, no.1, pp.23-29, 1998.
[6] Y. Chevaleyre and J.-D. Zucker. Solving multiple-instance and multiple-part learning problems with decision trees and decision rules. Application to the mutagenesis problem. In E. Stroulia and S. Matwin, Eds. Lecture Notes in Artificial Intelligence 2056, Berlin: Springer, pp.204-214, 2001.
[7] B.V. Dasarathy. Nearest Neighbor Norms: NN Pattern Classification Techniques, Los Alamitos, CA: IEEE Computer Society Press, 1991.
[8] T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, vol.89, pp.31-71, 1997.
[9] H.P. Kriegel and M. Schubert. Classification of websites as sets of feature vectors. In Proceedings of the IASTED International Conference on Databases and Applications (DBA 2004), Innsbruck, Austria, 2004.
[10] Z.H. Zhou. Multi-instance learning: a survey. Technical Report, AI Lab, Computer Science & Technology Department, Nanjing University, Nanjing, China, 2004.
[11] G. Ruffo. Learning single and multiple instance decision tree for computer security applications. PhD thesis, Department of Computer Science, University of Turin, Torino, Italy, 2000.
[12] N. Weidmann, E. Frank, and B. Pfahringer. A two-level learning method for generalized multi-instance problems. In Proceedings of ECML 2003, Cavtat-Dubrovnik, Croatia, pp.468-479, 2003.
