International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072 | Volume: 07 Issue: 03 | Mar 2020 | www.irjet.net

A User-Friendly Interface for Data Preprocessing and Visualization Using Machine Learning Models

Mr. S. Yoganand1, Bharathi Kannan R2, Daya Meenakshi B3

1Assistant Professor, Department of Computer Science and Engineering, Agni College of Technology, Chennai-130, Tamil Nadu, India.
2,3UG Student, Department of Computer Science and Engineering, Agni College of Technology, Chennai-130, Tamil Nadu, India.

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract – Machine learning is one of the most effective techniques for prediction and classification problems. Most industries today depend on machine learning models, which has led to the current era of data analytics. However, there is no proper and efficient tool for handling the datasets that feed machine learning models for data prediction and visualization. In this paper, we therefore propose a user-friendly approach to handling machine learning models for data prediction and visualization. We develop a tool that performs data cleaning, a prerequisite for data analysis, and then provides a visual representation of the cleansed data. The tool takes as input a structured dataset containing both textual and numerical data, which is processed using machine learning algorithms to obtain a pre-processed dataset. The data then passes through a series of steps to produce visualized and predicted output according to the algorithm chosen for the task.

Key Words: Machine learning, visualization, pre-processing, tool, user interface.

1. INTRODUCTION

Organizations use datasets for predictive analysis, and an important concern in these cases is data quality. Noisy data can compromise the correctness of the analysis. The most common errors are missing values, duplicates and other inconsistencies, and they must be corrected before reliable decisions and analytics are possible. Users should also understand the effects of working with noisy data before proceeding with the cleaning process. Noise removal improves model performance, because noise can obscure the discovery of important information.

Machine learning is a widely valued application of Artificial Intelligence. It allows systems to learn automatically, without human assistance, from huge datasets with a large number of data fields. With the insights produced after applying machine learning algorithms, organizations can work more effectively and gain an advantage over their competitors. A system that uses machine learning techniques can infer what the structure of the data looks like and adjust the data accordingly. The main challenge for a machine learning model is dealing with large data sources during the cleaning process: huge datasets are taken in and checked for possible errors using data pre-processing techniques. The other challenges include preventing the model from learning from noisy data, avoiding a biased model, and not compromising on data quality. The best practices for data cleaning with machine learning techniques are filling missing values, removing unnecessary rows, reducing the size of the data and implementing a good quality plan. The success of a machine learning application depends on the amount of good-quality data given to it, yet cleaning is often not treated as a central part of data pre-processing. Even a system with powerful algorithms can yield bad results if an irrelevant or incorrect training set is supplied. In the proposed model, ML algorithms find the different patterns in the data and separate it into clean and noisy subsets on their own, which helps reduce execution time.
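The kind of noise audit described above can be illustrated with a short sketch. The snippet below is only a minimal example, assuming the raw data is loaded into a pandas DataFrame; the file name and the idea of printing counts are illustrative assumptions, not part of the proposed tool.

```python
import pandas as pd

# Load a raw, possibly noisy dataset (hypothetical file name).
raw = pd.read_csv("raw_dataset.csv")

# Count missing values per column, the most common error mentioned above.
missing_per_column = raw.isna().sum()

# Count fully duplicated rows, another frequent source of noise.
duplicate_rows = raw.duplicated().sum()

print("Missing values per column:\n", missing_per_column)
print("Duplicate rows:", duplicate_rows)
```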
2. Related Work:

Data pre-processing converts raw data into a pre-processed dataset [1]. In machine learning, pre-processing is used to transform or encode the data so that an algorithm can consume it easily. It consists of the following steps. Data cleaning detects and corrects inaccurate records in a table, and then replaces, modifies or deletes the noisy data. Data integration combines data residing in different sources and provides the user with a unified view of these data [2]. The process of selecting suitable data for a research project affects data integrity, while data transformation converts data from a source format into the required target format [3]. Tools that are available for data processing and visualization include KNIME, Shogun, Oryx 2, TensorFlow, Weka, RapidMiner, Trifacta Wrangler and Python [12] [13]. In this paper, we focus on removing noisy data: identifying the numerical values, predicting and filling in missing values, and detecting outliers that hamper data analysis [11]. We propose a system that simplifies this process for the user and allows for better processing. In summary, machine learning for data cleaning may be the only practical way to provide complete and trustworthy datasets for effective analytics, so we provide a user-friendly interface for pre-processing and model analysis with visualization for the ease of the user.
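Before moving to the system design, the integration and transformation steps summarized above can be sketched briefly. The example below is an assumption-laden illustration rather than part of the proposed tool: the two CSV sources, the join key and the categorical column are hypothetical.

```python
import pandas as pd

# Data integration: combine two hypothetical sources into one unified view.
customers = pd.read_csv("customers.csv")        # assumed source A
transactions = pd.read_csv("transactions.csv")  # assumed source B
merged = customers.merge(transactions, on="customer_id", how="inner")

# Data transformation: encode a categorical column into numeric form
# so that downstream algorithms can consume it.
merged["segment_code"] = merged["segment"].astype("category").cat.codes

print(merged.head())
```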
3. System Design:

Data pre-processing is carried out with three methods: data cleaning, data transformation and data reduction. The data cleaning application processes a raw dataset containing both textual and numerical data and converts it into a cleaned dataset that can be used for data analysis. Initially, users upload the dataset on which they want to perform the analysis. They can then choose the operations they want to perform on their dataset from the modules provided. The application performs a series of operations: removing columns with little or no information, removing unnecessary rows, identifying the numerical values, filling in the missing fields and identifying the outliers. Some columns contain so little information that they cannot be relied upon for analysis; such columns can be removed without causing significant damage to the data. Some rows contain empty fields, which would likewise interfere with proper pre-processing of the dataset, so such rows are identified and removed. The dataset may contain categorical features ranging from numerical to non-numerical values. Because the application requires only numerical data for analysis and prediction, the fields containing numeric values are identified. Simply removing rows with missing values would reduce the amount of data available, so these fields are instead filled with appropriate values.
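A minimal sketch of these cleaning operations is shown below, assuming the uploaded dataset is held in a pandas DataFrame. The function name, the fill-ratio threshold, the mean-imputation strategy and the three-standard-deviation outlier cut are illustrative assumptions, not the tool's actual defaults.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame, min_fill_ratio: float = 0.5) -> pd.DataFrame:
    # Remove columns with little or no information (mostly empty columns).
    df = df.loc[:, df.notna().mean() >= min_fill_ratio]

    # Remove unnecessary rows that are entirely empty.
    df = df.dropna(how="all")

    # Identify the numerical fields used for analysis and prediction.
    numeric = df.select_dtypes(include="number")

    # Fill the remaining missing numeric fields with appropriate values
    # (here: the column mean, an illustrative choice).
    numeric = numeric.fillna(numeric.mean())

    # Identify outliers: drop rows more than three standard deviations
    # from the column mean, as described in the implementation section.
    z = (numeric - numeric.mean()) / numeric.std()
    numeric = numeric[(z.abs() <= 3).all(axis=1)]

    return numeric
```

Calling clean_dataset(raw) on the uploaded dataset would return only the numeric fields, with gaps filled and extreme rows removed, ready for the analysis stage.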
4. Implementation:

Outliers are data points that lie far from the rest of the data. Mathematically, an outlier is usually defined as an observation more than three standard deviations from the mean. Outliers can appear because of errors in data entry or measurement, or simply because of natural variation in the population. Identifying and handling outliers is an important part of data cleaning.

For data analysis we use the following algorithms on the cleansed data: linear regression, SVM (Support Vector Machine), KNN (K-Nearest Neighbours), logistic regression, decision tree, K-means, random forest, Naive Bayes, dimensionality reduction algorithms and gradient boosting algorithms.

The linear regression algorithm uses the data points to find the best-fit line that models the data. A line can be represented by the equation y = m*x + c, where y is the dependent variable and x is the independent variable. Basic calculus is applied to find the values of m and c from the given dataset. The SVM separates the data points with a dividing line. KNN predicts an unknown data point from its k nearest neighbours; the value of k is a critical factor for prediction accuracy, and the nearest neighbours are determined with basic distance functions such as the Euclidean distance. This algorithm requires high computational power, and the data must first be normalized so that every feature lies in the same range. The decision tree algorithm is used to solve classification problems; measures such as Gini impurity, chi-square and entropy are used to split the data. K-means is an unsupervised algorithm that solves the clustering problem: it follows an iterative procedure to form clusters containing homogeneous data. A random forest is a collection of decision trees; every tree estimates a classification, which counts as a vote, and the classification with the maximum votes across all trees is chosen. Naive Bayes can be applied only if the features are independent of each other. Gradient boosting combines multiple weak learners to form an accurate model; instead of relying on a single estimator, it builds a more stable and robust one. Based on the dataset, a suitable algorithm is selected and produces an efficient result for the data analysis process.
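To make the analysis stage concrete, the sketch below shows how the cleansed numeric data might be handed to two of the algorithms listed above, KNN and a decision tree, assuming scikit-learn is available. The 'cleaned' DataFrame is the output of the cleaning sketch, and the 'target' column is a hypothetical categorical label, not something defined by the tool; KNN is wrapped with a scaler to reflect the normalization requirement mentioned above.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 'cleaned' is the numeric DataFrame produced by the cleaning stage;
# 'target' is a hypothetical label column the user wants to predict.
X = cleaned.drop(columns=["target"])
y = cleaned["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# KNN: normalize first so every feature lies in the same range, then
# classify each point from its k nearest neighbours (Euclidean distance).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))

# Decision tree: splits are chosen with the Gini impurity measure.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)
print("Decision tree accuracy:", tree.score(X_test, y_test))
```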
5. Results:

The user clicks the Submit button provided and then selects the operations they wish to perform on their dataset from the list of operations provided. The user then uploads the dataset into the application by clicking the Upload button to start the pre-processing. Initially the original dataset is displayed, and the dataset after the first operation is displayed as the cleansed dataset. The remaining selected operations are performed on the cleansed dataset; finally, the user performs the data analysis with the required algorithm to obtain the result as a visualization, which can be downloaded by the user.

Screenshots of the workflow (figures omitted): Upload Noisy Dataset; Displaying the Noisy Dataset; Preprocessing the Data; Applying the Machine Learning Model; Output.
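As a rough illustration of the final step, the sketch below shows one way the analysis result could be rendered as a downloadable plot, assuming matplotlib. The function name and file name are hypothetical, and the usage comment refers to the model and test split from the earlier sketch; this is not the tool's actual implementation.

```python
import matplotlib.pyplot as plt

def save_result_plot(y_true, y_pred, path="analysis_result.png"):
    # Plot actual vs. predicted values from the chosen model and save the
    # figure so the user can download it from the interface.
    fig, ax = plt.subplots()
    ax.scatter(range(len(y_true)), y_true, label="actual", alpha=0.6)
    ax.scatter(range(len(y_pred)), y_pred, label="predicted", alpha=0.6, marker="x")
    ax.set_xlabel("test sample index")
    ax.set_ylabel("value")
    ax.legend()
    fig.savefig(path)

# Example usage with the models from the earlier sketch:
# save_result_plot(y_test, knn.predict(X_test))
```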
6. Conclusion:

The developed system performs data cleaning, data transformation and data reduction as part of data pre-processing. It takes raw datasets into the application, pre-processes them to clean up the noisy data using pre-processing techniques, and visualizes the cleansed data for the users once pre-processing is complete. The system saves a lot of time, since manual cleaning can be avoided. After cleansing, the user can select the machine learning model, which provides the results as plots. This serves users who want to clean huge datasets and visualize the analysis of the pre-processed data. In future work, accuracy measurement and comparison of the machine learning algorithms can be added within the user-friendly interface.

REFERENCES

[1] C. Felix, A. V. Pandey, and E. Bertini, "TextTile: An Interactive Visualization Tool for Seamless Exploratory Analysis of Structured Data and Unstructured Text," IEEE, 2018.
[2] H. Liu, X. Li, J. Li, and S. Zhang, "Efficient Outlier Detection for High-Dimensional Data," IEEE, 2019.
[3] M. Bostock, V. Ogievetsky, and J. Heer, "D3: Data-Driven Documents," IEEE, 2011.
[4] F. Beck, S. Koch, and D. Weiskopf, "Visual Analysis and Dissemination of Scientific Literature Collections with SurVis," IEEE, 2016.
[5] P. Godfrey, J. Gryz, and P. Lasek, "Interactive Visualization of Large Datasets," IEEE, 2016.
[6] D. Kumar Koshley and R. Halder, "Data Cleaning: An Abstraction-Based Approach," IEEE, 2015.
[7] M. A. Yalçın, N. Elmqvist, and B. B. Bederson, "Keshif: Rapid and Expressive Tabular Data Exploration for Novices," IEEE, 2018.
[8] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral," in Proc. IEEE 19th Int. Conf. Data Eng. (ICDE), Bengaluru, India, 2003, pp. 315–326.
[9] Y. Pang, J. Cao, and X. Li, "Learning Sampling Distributions for Efficient Object Detection," IEEE Trans. Cybern., vol. 47, no. 1, pp. 117–129, Jan. 2017.
[10] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, "Outlier Detection for Temporal Data: A Survey," IEEE Trans. Knowl. Data Eng., vol. 26, no. 9, pp. 2250–2267, Sep. 2014.
[11] S. F. Roth and J. Mattis, "Automating the Presentation of Information," in Proc. Seventh IEEE Conf. Artificial Intelligence Applications, vol. 1, 1991, pp. 90–97.
[12] M. Bostock and J. Heer, "Protovis: A Graphical Toolkit for Visualization," IEEE Trans. Vis. Comput. Graphics, vol. 15, no. 6, pp. 1121–1128, 2009.
[13] A. Dziedzic, J. Duggan, A. J. Elmore, V. Gadepally, and M. Stonebraker, "BigDAWG: A Polystore for Diverse Interactive Applications," in IEEE Viz Data Systems for Interactive Analysis, 2015.
[14] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, "Conditional Functional Dependencies for Data Cleaning," in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp. 746–755.
