
Commit 5bf5ceb

Additional Readme details and comments (oracle-samples#96)
* Update README.md
* Update README.md
* New demo programs for sql, r and notebooks
* Added examples for oml4r and oml4spark
* Update README.md
* Create OML4R Clustering.json
* Updated Readme: reflect the nature of Python demos coming soon
* Update README.md: added mention of Autonomous Database, renamed gui -> odmr
* Updates to SQL examples and odmr
* Changes to License on SQL Demos: added license comments on SQL demos for the Universal Permissive License 1.0
* OML4R and OML4Spark updates and demos
* Update license information
* Updates to license information
* Updated the releases of OML4SQL Notebooks
* Update OML4R Clustering Notebook
* Adjusting Notebook names
* Adjusting names with _
* Removed obsolete Demo
* Added "Notebooks" to the OML Notebooks naming
* Created Readme for Notebooks in R language
* Readme for Python and OML4Py
* Updated Python components
* Update README.md
* Minor changes to demos
* Create README.md
* Changes to Readme and XGBoost
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update oml4spark_function_build_all_classification_models.r
* Update oml4spark_function_confusion_matrix_in_spark.r
* Update oml4spark_function_create_balanced_input.r
* Update oml4spark_function_run_cross_validation.r
* Update oml4spark_function_variable_selection_via_glm.r
* Update oml4spark_function_zeppelin_visualization_z_show.r
* Update oml4spark_tutorial_getting_started_with_hdfs.r
* Update oml4spark_tutorial_getting_started_with_hive_impala_spark.r
* Update oml4sql-r-extensible-regression-neural-networks.sql
1 parent 8945494 commit 5bf5ceb

21 files changed (+1189, −65 lines)

machine-learning/README.md

Lines changed: 8 additions & 4 deletions
@@ -17,15 +17,19 @@ The following structure represents the available APIs and environments for the di
  * __r__ - Oracle Machine Learning for R Notebooks, and Oracle Machine Learning for Spark Notebooks
  * __sql__ - Notebooks for Oracle Machine Learning for SQL, and the Oracle Machine Learning Notebooks (based on Zeppelin) included with Oracle ADW and ATP

- * __python__ - Oracle Machine Learning for Python examples (coming soon)
+ * __python__ (Python language examples)
+   * __oml4py__ - Oracle Machine Learning for Python examples (coming soon)

- * __r__
+ * __r__ (R language examples)
    * __oml4r__ - Oracle Machine Learning for R (in-Database) R code examples
    * __oml4spark__ - Oracle Machine Learning for Spark (in-Data Lake) R code examples

- * __sql__
+ * __sql__ (SQL and PL/SQL examples)
    * __19c__ - Oracle Machine Learning for SQL examples for Oracle Database 19c
    * __20c__ - Oracle Machine Learning for SQL examples for Oracle Database 20c

- For more information please visit our main page for Oracle Machine Learning at https://oracle.com/machine-learning
+ For more information, please visit our main page: [Oracle Machine Learning](https://oracle.com/machine-learning)

+ #### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
+ ###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Oracle Machine Learning sample Notebooks in R language
The sample notebooks in this folder are intended for use with Oracle Machine Learning for R and Oracle Machine Learning for Spark.

The notebooks, available in JSON format, are compatible with any open-source Apache Zeppelin notebook environment.

That includes an Apache Zeppelin instance running on a server that can connect to an Oracle Database (for the OML4R notebook examples), or an Apache Zeppelin instance running on a server (a Hadoop Edge Node) that can connect to a Hadoop cluster.

The OML4R notebooks can be used with an on-premises Oracle Database and also with a Database Cloud Service (except Autonomous Database, which runs its own Oracle Machine Learning Notebooks).

The OML4Spark notebooks can also be used with Oracle Big Data Manager (available on Oracle Big Data Appliance and Oracle Big Data Cloud Service).

#### Copyright (c) 2020 Oracle Corporation and/or its affiliates.

###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)
Lines changed: 9 additions & 6 deletions
@@ -1,10 +1,9 @@
- # Oracle Machine Learning
- Oracle Machine Learning is a collaborative user interface for data scientists and business and data analysts who perform machine learning in the Autonomous Databases--Autonomous Data Warehouse (ADW) and Autonomous Transactional Database (ATP).
+ # Oracle Machine Learning Notebooks
+ Oracle Machine Learning Notebooks is a collaborative user interface for data scientists and business and data analysts who perform machine learning in Oracle Autonomous Database: Autonomous Data Warehouse (ADW) and Autonomous Transactional Database (ATP).

- Oracle Machine Learning enables data scientists, citizen data scientists, and data analysts to work together to explore their data visually and develop analytical methodologies in the Autonomous Data Warehouse Cloud. Oracle's high performance, parallel and scalable in-Database implementations of machine learning algorithms are exposed via SQL and PL/SQL using notebook technologies. Oracle Machine Learning enables teams to collaborate to build, assess, and deploy machine learning models, while increasing data scientist productivity Oracle Machine Learning focuses on ease of use and simplified machine learning for data science – from preparation through deployment – all in the Autonomous Database.
- Based on Apache Zeppelin notebook technology, Oracle Machine Learning provides a common platform with a single interface that can connect to multiple data sources and access multiple back-end Autonomous Database servers. Multi-user collaboration enables the same notebook document to be opened simultaneously by different users, such that changes made by one user to a notebook are instantaneously reflected to all users viewing that notebook. To support enterprise requirements for security, authentication, and auditing, Oracle Machine Learning supports privilege-based access to data, models, and notebooks, as well as being integrated with Oracle security protocols.
+ Oracle Machine Learning Notebooks enables data scientists, citizen data scientists, and data analysts to work together to explore their data visually and develop analytical methodologies in the Autonomous Data Warehouse Cloud. Oracle's high-performance, parallel, and scalable in-Database implementations of machine learning algorithms are exposed via SQL and PL/SQL using notebook technologies. Oracle Machine Learning enables teams to collaborate to build, assess, and deploy machine learning models while increasing data scientist productivity. Oracle Machine Learning focuses on ease of use and simplified machine learning for data science – from preparation through deployment – all in Oracle Autonomous Database.

+ Based on Apache Zeppelin notebook technology, Oracle Machine Learning Notebooks provides a common platform with a single interface that can connect to multiple data sources and access multiple back-end Autonomous Database servers. Multi-user collaboration enables the same notebook document to be opened simultaneously by different users, such that changes made by one user to a notebook are instantaneously reflected to all users viewing that notebook. To support enterprise requirements for security, authentication, and auditing, Oracle Machine Learning supports privilege-based access to data, models, and notebooks, as well as being integrated with Oracle security protocols.

  Key Features:

@@ -15,4 +14,8 @@ Key Features:
  * Enables and supports deployments of enterprise machine learning methodologies in both Autonomous Data Warehouse (ADW) and Autonomous Transactional Database (ATP)

- See Introducing Oracle Machine Learning SQL Notebooks for the Oracle Autonomous Data Warehouse Cloud! blog post (https://blogs.oracle.com/datamining/introducing-oracle-machine-learning-sql-notebooks-for-the-oracle-autonomous-data-warehouse-cloud) for more information.
+ See the [Introducing Oracle Machine Learning Notebooks for the Oracle Autonomous Data Warehouse](https://blogs.oracle.com/datamining/introducing-oracle-machine-learning-sql-notebooks-for-the-oracle-autonomous-data-warehouse-cloud) blog post for more information.

+ #### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
+ ###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)

machine-learning/python/README.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Oracle Machine Learning components for the Python language (coming soon)
Oracle Machine Learning for Python (OML4Py) is a component of the Oracle Machine Learning family of products.

OML4Py will initially be available inside Oracle Autonomous Database, and we expect it to be released later on all Oracle Database platforms.

Outside of the Autonomous Database, OML4Py will be compatible with Oracle Linux 7.x (or RHEL 7.x) and Python 3.x.

The examples here will run in either environment.

#### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Oracle Machine Learning for Python (coming soon)
Oracle Machine Learning for Python (OML4Py) is a component of the Oracle Machine Learning family of products.

OML4Py will initially be available inside Oracle Autonomous Database, and we expect it to be released later on all Oracle Database platforms.

Outside of the Autonomous Database, OML4Py will be compatible with Oracle Linux 7.x (or RHEL 7.x) and Python 3.x.

The examples here will run in either environment.

#### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# OML4Spark-Tutorials

## Tutorials for OML4Spark (a.k.a. ORAAH) release 2.8.x
**[Oracle Machine Learning for Spark][1] (OML4Spark) is a set of R packages and Java libraries.**
**It provides several features:**
- An R interface for manipulating data stored in a local file system, HDFS, HIVE, Impala, or JDBC sources, and for creating Distributed Model Matrices across a cluster of Hadoop nodes in preparation for machine learning
- A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open-source CRAN packages
- Parallel and distributed machine learning algorithms that take advantage of all the nodes of a Hadoop cluster for scalable, high-performance modeling on big data. Functions use the expressive R formula object, optimized for Spark parallel execution.
  ORAAH's custom LM/GLM/MLP NN algorithms on Spark scale better and run faster than the open-source Spark MLlib functions, but ORAAH provides interfaces to MLlib as well
- Core analytics functionality also available in a standalone Java library that can be used directly, without the need for the R language, from any Java or Scala platform

**The following is a list of demos containing R code for learning about OML4Spark** (a condensed sketch of how the cross-validation functions fit together appears after this list):
- Files in the current folder:
- Introduction to OML4Spark (oml4spark_tutorial_getting_started_with_hdfs.r)
- Working with HIVE, IMPALA and Spark Data Frames (oml4spark_tutorial_getting_started_with_hive_impala_spark.r)
- Function in R to visualize Hadoop data in Apache Zeppelin via z.show (oml4spark_function_zeppelin_visualization_z_show.r)
- AutoML for Classification using Cross-Validation with OML4Spark
  * Sample execution of the Cross-Validation (oml4spark_execute_cross_validation.r)
  * Function to run the Cross-Validation (oml4spark_function_run_cross_validation.r)
  * Function to create a balanced input dataset (oml4spark_function_create_balanced_input.r)
  * Function to run Variable Selection via a logistic GLM (oml4spark_function_variable_selection_via_glm.r)
  * Function to run Variable Selection via Singular Value Decomposition (oml4spark_function_variable_selection_via_pca.r)
  * Function to compute the Confusion Matrix and statistics (oml4spark_function_confusion_matrix_in_spark.r)
  * Function to build all OML4Spark classification models (oml4spark_function_build_all_classification_models.r)

[1]: https://www.oracle.com/database/technologies/datawarehouse-bigdata/oml4spark.html

#### Copyright (c) 2020 Oracle Corporation and its affiliates

##### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)
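As a quick orientation, here is a condensed, illustrative sketch of how the cross-validation demo functions listed above fit together in an OML4Spark session. It simply mirrors the calls made in the full sample script shown in the next file (oml4spark_execute_cross_validation.r); the Impala host, the table name, and the model formula are placeholders, and the two sourced files are assumed to define createBalancedInput() and runCrossValidation() exactly as used in that script.

# Condensed sketch -- mirrors oml4spark_execute_cross_validation.r (shown below).
# Placeholders: Impala host, table name "mytable", and the model formula.
library(ORCH)                                             # OML4Spark / ORAAH R package

if (spark.connected()) spark.disconnect()
spark.connect('yarn', memory = '9g', enableHive = TRUE)   # Spark session on YARN
ore.connect(type = 'IMPALA', host = 'impala-host', user = 'oracle', port = '21050', all = FALSE)
ore.sync(table = 'mytable'); ore.attach()                 # expose the table to R

source('~/oml4spark_function_create_balanced_input.r')    # defines createBalancedInput()
source('~/oml4spark_function_run_cross_validation.r')     # defines runCrossValidation()

f <- target ~ pred1 + pred2                               # placeholder formula
balanced <- createBalancedInput(input_bal = mytable, formula_bal = f,
                                reduceToFormula = TRUE, feedback = TRUE, sampleSize = 10000)
results  <- runCrossValidation(input_xval = balanced, formula_xval = f, numKFolds = 3,
                               selectedStatistic = 'Accuracy', legend = '', feedback = TRUE)
print(as.data.frame(results[[4]]))                        # ranked model statistics

spark.disconnect(); ore.disconnect()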
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
######################################################################
# oml4spark_execute_cross_validation.r
#
# This demo uses the "Ontime" Airline dataset from the Bureau of
# Transportation Statistics. We run several classification models to
# identify the best model for predicting a cancelled flight.
#
# The functions accept R DataFrames, CSV files on HDFS, HIVE/IMPALA
# tables, and Spark DataFrames (generated in the Spark Session that
# OML4Spark is using) as input for processing.
#
# The initial stage balances the sample (50% '0's and 50% '1's) to
# improve the ability of the different models to detect cancellations.
# The balanced output is capped at a requested maximum number of
# records (10,000 in this run).
#
# The second stage uses the balanced data as input to test several
# classification models available in OML4Spark, using k-fold
# Cross-Validation with k=3.
#
# The final output is a list of the models in descending order of the
# requested statistic (in this case Accuracy), and a chart of
# descending Balanced Accuracy for the models.
#
# All processing is done on Spark, using OML4Spark's interfaces to
# several functions as well as SparkSQL.
#
# About the Ontime Airline dataset: the database contains scheduled and
# actual departure and arrival times reported by certified U.S. air
# carriers that account for at least one percent of domestic scheduled
# passenger revenues since 1987. The data is collected by the Office of
# Airline Information, Bureau of Transportation Statistics (BTS), and
# can be downloaded from their site at
# https://www.transtats.bts.gov/tables.asp?DB_ID=120&DB_Name=&DB_Short_Name=#
#
# Copyright (c) 2020 Oracle Corporation
# The Universal Permissive License (UPL), Version 1.0
#
# https://oss.oracle.com/licenses/upl/
#
######################################################################

# Load the OML4Spark (ORCH) library
library(ORCH)

# Create a new Spark Session
if (spark.connected()) spark.disconnect()
spark.connect('yarn', memory='9g', enableHive = TRUE)

# Connect to IMPALA
ore.connect(type='IMPALA', host='xxxxxxxx', user='oracle', port='21050', all=FALSE)
# Synchronize the table ONTIME1M
ore.sync(table='ontime1m')
ore.attach()
# Check that the table is visible
ore.ls()

# Show a sample of the data
head(ontime1m)

# Load functions written for Cross-Validation using the OML4Spark facilities for
# manipulating Spark DataFrames
source('~/oml4spark_function_create_balanced_input.r')
source('~/oml4spark_function_run_cross_validation.r')

## Create a balanced Spark DataFrame by smart sampling based on a specific formula

# Formula for classification of whether a flight was cancelled
formula_class <- cancelled ~ distance + as.factor(month) + as.factor(year) + as.factor(dayofmonth) + as.factor(dayofweek)

# Create a balanced Spark DataFrame with a 50/50 target split, sampling down to at most 10,000 rows in total
# The idea is to balance the target variable CANCELLED so that it is 50% '0's and 50% '1's
# The inputs to the function are the IMPALA table and the formula that will be used for the model build
system.time({
  balancedData <- createBalancedInput(input_bal=ontime1m,
                                      formula_bal=formula_class,
                                      reduceToFormula=TRUE,
                                      feedback = TRUE,
                                      sampleSize = 10000
                                      )
})

# Review the Spark DataFrame called "balancedData" before executing the Cross-Validation
# The global average proportion of cancellations is 0.5 (since we balanced the data)
balancedData$show()

# Execute a 3-fold Cross-Validation using the algorithms provided by OML4Spark and Spark MLlib
finalModelSelection <- runCrossValidation(input_xval=balancedData,
                                          formula_xval=formula_class,
                                          numKFolds=3,
                                          selectedStatistic='Accuracy',
                                          legend='',
                                          feedback = TRUE )

# Detailed explanations of the different statistics printed can be
# found at https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
#
# The statistic requested for ranking the models in this run is Accuracy;
# another common choice is the Matthews Correlation Coefficient (MCC).
# More information about the MCC at https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
print(as.data.frame(finalModelSelection[[4]]))

# Show the different components returned by the function
finalModelSelection

if (spark.connected()) spark.disconnect()
if (ore.is.connected()) ore.disconnect()


#####################################################
### END CROSS-VALIDATION BEST MODEL IDENTIFICATION
#####################################################
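For readers new to the evaluation methodology described in the script's header comment, the following is a minimal, self-contained base-R illustration of k-fold cross-validation with k = 3 on a toy dataset. It is conceptual only: it does not use OML4Spark and does not reproduce runCrossValidation(), which performs the equivalent work distributed on Spark across several candidate algorithms.

# Conceptual k-fold cross-validation (k = 3) in plain R -- illustration only,
# not the distributed OML4Spark implementation used in the demo above.
set.seed(42)
k  <- 3
df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
df$y <- rbinom(300, 1, plogis(0.8 * df$x1 - 0.5 * df$x2))      # toy binary target

fold <- sample(rep(seq_len(k), length.out = nrow(df)))          # random fold assignment

accuracy <- numeric(k)
for (i in seq_len(k)) {
  train <- df[fold != i, ]                                      # k-1 folds for training
  test  <- df[fold == i, ]                                      # 1 fold held out for testing
  model <- glm(y ~ x1 + x2, data = train, family = binomial)    # one candidate model
  pred  <- as.integer(predict(model, newdata = test, type = "response") > 0.5)
  accuracy[i] <- mean(pred == test$y)                           # per-fold Accuracy
}
mean(accuracy)    # the averaged statistic used to rank candidate models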
