GENERALIZED APPROACH FOR DATA ANONYMIZATION USING MAP REDUCE ON CLOUD

K.R. VIGNESH, M.Tech CSE, SRM University, Kattankulathur, Chennai, India.
P. SARANYA, Asst. Professor, Dept. of CSE, SRM University, Kattankulathur, Chennai, India.

ABSTRACT— Data anonymization has been extensively studied and is a widely adopted method for privacy preservation in data publishing and sharing scenarios. Data anonymization hides the sensitive parts of a data owner's records to avoid re-identification risk. The privacy of an individual can be effectively preserved while aggregate information is still shared with data users for data analysis and data mining. The proposed method is a generalized approach to data anonymization using MapReduce on cloud, based on Two-Phase Top-Down Specialization. In the first phase, the original data set is partitioned into a group of smaller data sets, each of which is anonymized to produce an intermediate result. In the second phase, the intermediate results are further anonymized to achieve a consistent data set, and the data is presented in generalized form.

Keywords: Cloud computing, Data Anonymization, Map Reduce, Privacy Preserving.

1. INTRODUCTION: Cloud computing, a disruptive trend at present, affects a significant part of the current IT industry and of research organizations. Cloud computing provides massive storage and computing capacity, enabling users to deploy applications cost-effectively without heavy up-front investment in infrastructure. However, privacy preservation is one of the major concerns in a cloud environment. Some of these privacy issues are not new; for example, personal health records are already shared with research organizations for data analysis through services such as Microsoft HealthVault, an online health cloud service. Data anonymization is a widely used method for privacy preservation in the non-interactive data publishing scenario; it refers to hiding the identity and/or sensitive attributes of a data owner's records.
The privacy of individuals can be effectively preserved while aggregate information is shared for data analysis and mining. A variety of anonymization algorithms with different operations have been proposed [3, 4, 5, 6]; however, data set sizes have increased tremendously in the big data trend [1, 7], and this has become a challenge for data set anonymization. For processing such large data sets we use MapReduce integrated with cloud to provide high computational capability to the application.

2. RELATED WORK: Recently, data privacy preservation has been extensively studied and investigated [2]. LeFevre et al. addressed the scalability of anonymization algorithms by introducing scalable decision trees and sampling techniques, and Iwuchukwu et al. [8] proposed an R-tree based index approach that builds a spatial index over data sets, achieving high efficiency. However, these approaches aim at multidimensional generalization [6], which does not work for Top-Down Specialization (TDS). Fung et al. [2, 9, 10] proposed TDS approaches that produce anonymized data sets without the data exploration problem. A data structure, Taxonomy Indexed PartitionS (TIPS), is exploited to improve the efficiency of TDS, but it fails to handle large data sets because the approach is centralized. Several distributed algorithms have been proposed to preserve the privacy of multiple data sets retained by multiple parties; Jiang et al. [12] proposed a distributed algorithm for the anonymization of vertically partitioned data. However, the above algorithms mainly address secure anonymization and integration, whereas our aim is the scalability of TDS anonymization. Further, Zhang et al. [13] leveraged MapReduce itself to automatically partition a computation job in terms of security levels, anonymizing large-scale data before it is further processed by other MapReduce jobs, arriving at privacy preservation.

3. Top-Down Specialization: Generally, Top-Down Specialization (TDS) is an iterative process starting from the topmost domain values in the taxonomy trees of the attributes. Each round of iteration consists of three main steps: finding the best specialization, performing the specialization, and updating the values of the search metric for the next round [3]. The process is repeated until k-anonymity would be violated, in order to expose the maximum data utility.
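To make the specialization operation concrete, here is a minimal Python sketch of one top-down step over an invented two-level job taxonomy: a specialization p → child(p) replaces the general value p in every record with the child that covers the record's raw value. The taxonomy, attribute names, and records are all illustrative, not from the paper.

```python
# Hypothetical two-level taxonomy tree: general value -> children.
TAXONOMY = {
    "Any-Job": ["Technical", "Management"],
    "Technical": ["Engineer", "Lawyer"],
    "Management": ["Sales", "Executive"],
}

def covers(value, ancestor):
    """True if `ancestor` equals `value` or lies above it in the tree."""
    if ancestor == value:
        return True
    return any(covers(value, c) for c in TAXONOMY.get(ancestor, []))

def specialize(records, parent):
    """Perform spec: parent -> child(parent). Each record is a
    (generalized value, raw value) pair; the generalized value is
    replaced by the child of `parent` that covers the raw value."""
    out = []
    for gen, raw in records:
        if gen == parent:
            gen = next(c for c in TAXONOMY[parent] if covers(raw, c))
        out.append((gen, raw))
    return out

records = [("Any-Job", "Engineer"), ("Any-Job", "Sales"), ("Any-Job", "Lawyer")]
print(specialize(records, "Any-Job"))
# [('Technical', 'Engineer'), ('Management', 'Sales'), ('Technical', 'Lawyer')]
```

Repeating such steps, each time picking the specialization ranked best by the search metric, is exactly the iterative loop described above.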
The goodness of a specialization is measured by a search metric. We adopt the Information Gain per Privacy Loss (IGPL), a trade-off metric that considers both the privacy and information requirements, as the search metric in our approach. The specialization with the highest IGPL value is regarded as the best one and is selected in each round. We briefly describe how to calculate the IGPL value; interested readers can refer to [11] for more details.

Given a specialization spec: p → child(p), the IGPL of the specialization is calculated by

IGPL(spec) = IG(spec) / (PL(spec) + 1).   (1)

The term IG(spec) is the information gain after performing spec, and PL(spec) is the privacy loss; both can be computed from statistical information derived from the data sets. Let Rx denote the set of original records containing attribute values that can be generalized to x, let |Rx| be the number of data records in Rx, and let I(Rx) be the entropy of Rx. Then IG(spec) is calculated by

IG(spec) = I(Rp) − ∑_{c ∈ child(p)} (|Rc| / |Rp|) · I(Rc).   (2)

Let |(Rx, sv)| denote the number of data records with sensitive value sv in Rx. I(Rx) is computed by

I(Rx) = − ∑_{sv ∈ SV} (|(Rx, sv)| / |Rx|) · log2 (|(Rx, sv)| / |Rx|).   (3)

The anonymity of a data set is defined as the minimum group size over all QI-groups, i.e., A = min_{qid ∈ QID} {|QID(qid)|}, where |QID(qid)| is the size of QID(qid). Let Ap(spec) denote the anonymity before performing spec and Ac(spec) the anonymity after performing it. The privacy loss caused by spec is calculated by PL(spec) = Ap(spec) − Ac(spec).

4. Two-Phase Top-Down Specialization: In the Two-Phase Top-Down Specialization (TPTDS) approach, the given data set is first partitioned and anonymized in the first phase, producing intermediate results; in the second phase, the intermediate results are further anonymized and stored in the database. The TPTDS approach has three components: data partition, anonymization level merging, and data specialization.

4.1 Sketch of Two-Phase Top-Down Specialization: We propose the TPTDS approach to conduct the computation required by TDS in a highly scalable and efficient fashion. The two phases of our approach are based on the two levels of parallelization provisioned by MapReduce on cloud: job level and task level. Job-level parallelization means that multiple MapReduce jobs can be executed simultaneously to make full use of cloud infrastructure resources. Combined with cloud, MapReduce becomes more powerful and elastic, as cloud can offer infrastructure resources on demand, e.g., the Amazon Elastic MapReduce service [11]. Task-level parallelization means that multiple mapper/reducer tasks in a MapReduce job are executed simultaneously over data splits. To achieve high scalability, we parallelize multiple jobs over data partitions in the first phase, but the resultant anonymization levels are not identical. To obtain finally consistent anonymous data sets, the second phase is necessary to integrate the intermediate results and further anonymize the entire data set. In the first phase, an original data set D is partitioned into smaller ones.
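As a worked illustration of the search metric in Eqs. (1)-(3), the following minimal Python sketch computes entropy, information gain, and IGPL for an invented four-record example; the sensitive values and anonymity figures are illustrative, not from the paper.

```python
import math
from collections import Counter

def entropy(sensitive_values):
    """Eq. (3): I(Rx) = -sum over sv of p(sv) * log2 p(sv)."""
    n = len(sensitive_values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(sensitive_values).values())

def info_gain(parent_svs, child_partitions):
    """Eq. (2): IG(spec) = I(Rp) - sum_c (|Rc|/|Rp|) * I(Rc)."""
    n = len(parent_svs)
    return entropy(parent_svs) - sum(
        (len(c) / n) * entropy(c) for c in child_partitions)

def igpl(ig, anonymity_before, anonymity_after):
    """Eq. (1): IGPL(spec) = IG(spec) / (PL(spec) + 1)."""
    return ig / ((anonymity_before - anonymity_after) + 1)

# Toy data: four records' sensitive values, split into two child groups.
parent = ["HIV", "Flu", "Flu", "HIV"]
children = [["HIV", "HIV"], ["Flu", "Flu"]]
ig = info_gain(parent, children)      # 1.0: the split removes all entropy
print(round(igpl(ig, 4, 2), 3))       # PL = 4 - 2 = 2, so IGPL = 1/3 = 0.333
```

Here the specialization separates the two sensitive values perfectly, so IG(spec) = 1.0; with anonymity dropping from 4 to 2, PL(spec) = 2 and IGPL(spec) = 1/3.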
Let Di, 1 ≤ i ≤ p, denote the data sets partitioned from D, where p is the number of partitions, D = ∪_{i=1..p} Di, and Di ∩ Dj = ∅ for 1 ≤ i < j ≤ p. Then we run a subroutine over each of the partitioned data sets in parallel to make full use of the job-level parallelization of MapReduce. The subroutine is a MapReduce version of centralized TDS (MRTDS), which concretely conducts the computation required by TPTDS. MRTDS anonymizes data partitions to generate intermediate anonymization levels. An intermediate anonymization level is one from which further specialization can still be performed without violating k-anonymity. MRTDS only leverages the task-level parallelization of MapReduce. Formally, let the function MRTDS(D, k, AL) → AL' represent an MRTDS routine that anonymizes data set D to satisfy k-anonymity, advancing the anonymization level from AL to AL'. AL0 is the initial anonymization level, i.e., AL0 = ({TOP1}, {TOP2}, …, {TOPm}), where TOPj, 1 ≤ j ≤ m, is the topmost domain value in the taxonomy tree TTj. Running the routine on partition Di yields the intermediate anonymization level AL'i. In the second phase, all intermediate anonymization levels are merged into one, denoted AL^I; the merging process is formally represented by the function merge(AL'1, AL'2, …, AL'p) → AL^I. Then the whole data set D is further anonymized based on AL^I, finally achieving k-anonymity, i.e., MRTDS(D, k, AL^I) → AL*, where AL* denotes the final anonymization level. Ultimately, D is concretely anonymized according to AL*. Algorithm 1 depicts the sketch of the two-phase TDS approach.
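The anonymization-level merging step can be pictured with a small sketch. Here a level records one cut value per attribute, depths encode position in the taxonomy tree, and merging keeps the more general (shallower) value per attribute so that the merged level is safe for every partition. The depth table and attribute names are invented, and the shallowest-value rule is our reading of the merge operator rather than the paper's exact procedure.

```python
# Hypothetical taxonomy depths: 0 is the topmost (most general) value.
DEPTH = {"Any-Job": 0, "Technical": 1, "Engineer": 2,
         "Any-Edu": 0, "University": 1, "Masters": 2}

def merge(levels):
    """merge(AL'1, ..., AL'p): for each attribute, keep the most
    general (shallowest) cut value seen across all partitions."""
    merged = {}
    for level in levels:
        for attr, value in level.items():
            cur = merged.get(attr)
            if cur is None or DEPTH[value] < DEPTH[cur]:
                merged[attr] = value
    return merged

# Two intermediate levels from two hypothetical partitions.
al1 = {"Job": "Engineer", "Edu": "University"}
al2 = {"Job": "Technical", "Edu": "Masters"}
print(merge([al1, al2]))  # {'Job': 'Technical', 'Edu': 'University'}
```

Keeping the more general value per attribute guarantees that no partition's k-anonymity is violated by the merged level, at the cost of some utility that the second MRTDS pass then recovers.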
5. Data Partition: When D is partitioned into Di, 1 ≤ i ≤ p, it is required that the distribution of data records in each Di be similar to that in D. A data record can be treated as a point in an m-dimensional space, where m is the number of attributes. With similar distributions, the intermediate anonymization levels derived from the Di are more alike, so a better merged anonymization level can be obtained. A random sampling technique is adopted to partition D, which satisfies the above requirement. Specifically, a random number rand, 1 ≤ rand ≤ p, is generated for each data record, and the record is assigned to the partition D_rand. Algorithm 2 shows the MapReduce program for data partition. Note that the number of reducers should be equal to p so that each reducer handles one value of rand, producing exactly p resultant files, each containing a random sample of D. Once the partitioned data sets Di, 1 ≤ i ≤ p, are obtained, we run MRTDS(Di, k', AL0) on these data sets in parallel to derive the intermediate anonymization levels AL'i, 1 ≤ i ≤ p.

6. Data Specialization: An original data set D is concretely specialized for anonymization in a one-pass MapReduce job. After obtaining the merged intermediate anonymization level AL^I, we run MRTDS(D, k, AL^I) on the entire data set D to get the final anonymization level AL*. Then the data set D is anonymized by replacing the original attribute values in D with the corresponding domain values in AL*. Details of the Map and Reduce functions of the data specialization MapReduce job are described in Algorithm 3. The Map function emits an anonymous record and its count; the Reduce function aggregates these anonymous records and counts their number. An anonymous record together with its count represents a QI-group, and the QI-groups constitute the final anonymous data set.

Algorithm 1: Two-Phase Top-Down Specialization.
Input: Data set D, anonymity parameters k and k', and the number of partitions p.
Output: Anonymous data set D*.
1. Partition D into Di, 1 ≤ i ≤ p.
2. Execute MRTDS(Di, k', AL0) → AL'i, 1 ≤ i ≤ p, in parallel as multiple MapReduce jobs.
3. Merge all intermediate anonymization levels into one: merge(AL'1, AL'2, …, AL'p) → AL^I.
4. Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.
5. Specialize D according to AL* and output the anonymous data set D*.

Algorithm 2: Data Partition Map & Reduce.
Input: Data record (IDr, r), r ∈ D, and parameter p.
Output: Di, 1 ≤ i ≤ p.
Map: Generate a random number rand, where 1 ≤ rand ≤ p; emit (rand, r).
Reduce: For each rand, emit (null, list(r)).

Algorithm 3: Data Specialization Map & Reduce.
Input: Data record (IDr, r), r ∈ D, and anonymization level AL*.
Output: Anonymous record (r*, count).
Map: Construct the anonymous record r* = (p1, p2, …, pm, sv), where pi, 1 ≤ i ≤ m, is the parent in a specialization in the current AL* and is also an ancestor of vi in r; emit (r*, count).
Reduce: For each r*, sum ← ∑ count; emit (r*, sum).
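Algorithm 2 can be simulated in a few lines of Python. A real deployment would configure p reducers in Hadoop so that each reducer writes one partition file; this in-memory version, with a seeded RNG for reproducibility, only illustrates the map and reduce roles, and all names are illustrative.

```python
import random
from collections import defaultdict

def partition(records, p, seed=0):
    """In-memory simulation of Algorithm 2 (random data partition)."""
    rng = random.Random(seed)              # deterministic for the example
    buckets = defaultdict(list)
    for r in records:
        rand = rng.randint(1, p)           # Map: emit (rand, r), 1 <= rand <= p
        buckets[rand].append(r)            # Reduce: group records per rand
    return buckets                         # each bucket plays the role of one Di

data = [f"record-{i}" for i in range(100)]
parts = partition(data, p=4)
assert sum(len(v) for v in parts.values()) == len(data)  # no record lost
assert set(parts) <= set(range(1, 5))                    # keys stay in 1..p
print({k: len(v) for k, v in sorted(parts.items())})
```

Because each record draws its partition independently and uniformly, every bucket is a random sample of D, which is exactly the distribution-similarity requirement stated in Section 5.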
7. MRTDS Driver: Usually, a single MapReduce job is inadequate to accomplish a complex task; thus, a group of MapReduce jobs are orchestrated in a driver program. MRTDS consists of the MRTDS Driver and two types of jobs, IGPL Initialization and IGPL Update; the driver arranges the execution of the jobs. Algorithm 4 frames the MRTDS Driver, in which a data set is anonymized by TDS; it is the algorithmic design of the function MRTDS(D, k, AL) → AL'. Note that we leverage the anonymization level to manage the process of anonymization. Step 1 initializes the values of information gain and privacy loss for all specializations, which is done by the IGPL Initialization job. Step 2 is iterative. First, the best specialization is selected from the valid specializations in the current anonymization level, as described in Step 2.1. A specialization spec is valid if it satisfies two conditions: its parent value is not a leaf, and the anonymity Ac(spec) > k, i.e., the data set remains k-anonymous if spec is performed. Then the current anonymization level is modified by performing the best specialization in Step 2.2, i.e., removing the old specialization and inserting the new ones derived from it. In Step 2.3, the information gain of the newly added specializations and the privacy loss of all specializations are recomputed, which is accomplished by the IGPL Update job. The iteration continues until all specializations become invalid, achieving the maximum data utility. MRTDS produces the same anonymous data as the centralized TDS in [12], because they follow the same steps; MRTDS mainly differs from centralized TDS in how the IGPL values are calculated. Calculating IGPL values dominates the scalability of TDS approaches, as it requires TDS algorithms to count statistical information over the data sets iteratively. MRTDS exploits MapReduce on cloud to make the computation of IGPL parallel and scalable.
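The control flow of the MRTDS driver can be sketched as a plain loop in which stub functions stand in for the IGPL Initialization and IGPL Update MapReduce jobs; the scores, validity rule, and toy specializations below are invented for illustration.

```python
def mrtds_driver(specs, igpl_init, igpl_update, is_valid):
    """Sketch of the MRTDS driver loop (Algorithm 4).
    specs: candidate specializations in the current anonymization level.
    Returns the specializations in the order they were performed."""
    scores = igpl_init(specs)                     # Step 1: one-off IGPL job
    performed = []
    while True:
        valid = [s for s in specs if is_valid(s)]     # Step 2 guard
        if not valid:
            break                                     # all specs invalid: done
        best = max(valid, key=lambda s: scores[s])    # Step 2.1: best IGPL
        performed.append(best)                        # Step 2.2: perform it
        specs = [s for s in specs if s != best]
        scores = igpl_update(specs, scores)           # Step 2.3: refresh IGPL
    return performed

# Toy run: static scores, every remaining spec stays valid.
scores0 = {"a": 0.9, "b": 0.4, "c": 0.7}
out = mrtds_driver(list(scores0),
                   igpl_init=lambda specs: dict(scores0),
                   igpl_update=lambda specs, sc: sc,
                   is_valid=lambda spec: True)
print(out)  # ['a', 'c', 'b'] -- always the highest remaining IGPL first
```

In the real system the two lambdas would each launch a MapReduce job over the data set, which is why their cost dominates and why parallelizing them is the point of MRTDS.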
Algorithm 4: MRTDS Driver.
Input: Data set D, anonymization level AL, and k-anonymity parameter k.
Output: Anonymization level AL'.
1. Initialize the values of the search metric IGPL: for each specialization spec ∈ ∪_{j=1..m} Cutj, the IGPL value of spec is computed by the job IGPL Initialization.
2. While ∃ spec ∈ ∪_{j=1..m} Cutj that is valid:
  2.1 Find the best specialization specBest in ALi.
  2.2 Update ALi to ALi+1 by performing specBest.
  2.3 Update the information gain of the new specializations in ALi+1, and the privacy loss of all specializations, via the job IGPL Update.
End while; AL' ← ALi.

REFERENCES:
[1] S. Chaudhuri, "What Next?: A Half-Dozen Data Management Research Goals for Big Data and the Cloud," Proc. 31st Symp. Principles of Database Systems (PODS '12), pp. 1-4, 2012.
[2] B.C.M. Fung, K. Wang, R. Chen and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Comput. Surv., vol. 42, no. 4, pp. 1-53, 2010.
[3] B.C.M. Fung, K. Wang and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowl. Data Eng., vol. 19, no. 5, pp. 711-725, 2007.
[4] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 139-150, 2006.
[5] K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. 2005 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[6] K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Mondrian Multidimensional K-Anonymity," Proc. 22nd Int'l Conf. Data Engineering (ICDE '06), article 25, 2006.
[7] V. Borkar, M.J. Carey and C. Li, "Inside "Big Data Management": Ogres, Onions, or Parfaits?," Proc. 15th Int'l Conf. Extending Database Technology (EDBT '12), pp. 3-14, 2012.
[8] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 746-757, 2007.
[9] N. Mohammed, B. Fung, P.C.K. Hung and C.K. Lee, "Centralized and Distributed Anonymization for High-Dimensional Healthcare Data," ACM Trans. Knowl. Discov. Data, vol. 4, no. 4, article 18, 2010.
[10] B. Fung, K. Wang, L. Wang and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data Knowl. Eng., vol. 68, no. 6, pp. 552-575, 2009.
[11] Amazon Web Services, "Amazon Elastic MapReduce," http://aws.amazon.com/elasticmapreduce/, accessed Jan. 05, 2013.
[12] W. Jiang and C. Clifton, "A Secure Distributed Framework for Achieving k-Anonymity," VLDB J., vol. 15, no. 4, pp. 316-333, 2006.
[13] K. Zhang, X. Zhou, Y. Chen, X. Wang and Y. Ruan, "Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds," Proc. 18th ACM Conf. Computer and Communications Security (CCS '11), pp. 515-526, 2011.

BIOGRAPHY
K. R. VIGNESH is an M.Tech student in Computer Science and Engineering at SRM University, India. His main area of interest is cloud computing.
P. SARANYA is an Assistant Professor in the Department of Computer Science and Engineering, Kattankulathur Campus, SRM University, India. Her main areas of interest are Data Mining and Web Mining.
