
Commit 5bf5ceb

Additional Readme details and comments (oracle-samples#96)
* Update README.md
* Update README.md
* New demo programs for sql, r and notebooks
* Added examples for oml4r and oml4spark
* Update README.md
* Create OML4R Clustering.json
* Updated Readme: reflect the nature of Python demos coming soon
* Update README.md: added mention of Autonomous Database, renamed gui -> odmr
* Updates to SQL examples and odmr
* Changes to License on SQL Demos: added license comments on SQL demos for the Universal Permissive License 1.0
* OML4R and OML4Spark updates and demos
* Update license information
* Updates to license information
* Updated the releases of OML4SQL Notebooks
* Update OML4R Clustering Notebook
* Adjusting Notebook names
* Adjusting names with _
* Removed obsolete Demo
* Added "Notebooks" to the OML Notebooks naming
* Created Readme for Notebooks in R language
* Readme for Python and OML4Py
* Updated Python components
* Update README.md
* Minor changes to demos
* Create README.md
* Changes to Readme and XGBoost
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update oml4spark_function_build_all_classification_models.r
* Update oml4spark_function_confusion_matrix_in_spark.r
* Update oml4spark_function_create_balanced_input.r
* Update oml4spark_function_run_cross_validation.r
* Update oml4spark_function_variable_selection_via_glm.r
* Update oml4spark_function_zeppelin_visualization_z_show.r
* Update oml4spark_tutorial_getting_started_with_hdfs.r
* Update oml4spark_tutorial_getting_started_with_hive_impala_spark.r
* Update oml4sql-r-extensible-regression-neural-networks.sql
1 parent 8945494 commit 5bf5ceb

21 files changed (+1189, −65 lines)

machine-learning/README.md

Lines changed: 8 additions & 4 deletions
@@ -17,15 +17,19 @@ The following structure represents the available APIs and environments for the di
  * __r__ - Oracle Machine Learning for R Notebooks, and Oracle Machine Learning for Spark Notebooks
  * __sql__ - Notebooks for Oracle Machine Learning for SQL, and the Oracle Machine Learning Notebooks (based on Zeppelin) included with Oracle ADW and ATP

- * __python__ - Oracle Machine Learning for Python examples (coming soon)
+ * __python__ (Python language examples)
+   * __oml4py__ - Oracle Machine Learning for Python examples (coming soon)

- * __r__
+ * __r__ (R language examples)
    * __oml4r__ - Oracle Machine Learning for R (in-Database) R code examples
    * __oml4spark__ - Oracle Machine Learning for Spark (in-Data Lake) R code examples

- * __sql__
+ * __sql__ (SQL and PL/SQL examples)
    * __19c__ - Oracle Machine Learning for SQL examples for Oracle Database 19c
    * __20c__ - Oracle Machine Learning for SQL examples for Oracle Database 20c

- For more information please visit our main page for Oracle Machine Learning at https://oracle.com/machine-learning
+ For more information, please visit our main page: [Oracle Machine Learning](https://oracle.com/machine-learning)

+ #### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
+ ###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Oracle Machine Learning sample Notebooks in R language
The sample notebooks in this folder are intended for use with Oracle Machine Learning for R and Oracle Machine Learning for Spark.

The notebooks, available in JSON format, are compatible with any open-source Apache Zeppelin notebook environment.

That includes an Apache Zeppelin instance running on a server that can connect to an Oracle Database (for the OML4R notebook examples), or an Apache Zeppelin instance running on a server (a Hadoop Edge Node) that can connect to a Hadoop cluster.

The OML4R notebooks can be used with an on-premises Oracle Database and also with a Database Cloud Service (except Autonomous Database, which runs its own Oracle Machine Learning Notebooks).

The OML4Spark notebooks can also be used with Oracle Big Data Manager (available on Oracle Big Data Appliance and Oracle Big Data Cloud Service).

#### Copyright (c) 2020 Oracle Corporation and/or its affiliates.

###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)
Lines changed: 9 additions & 6 deletions
@@ -1,10 +1,9 @@
- # Oracle Machine Learning
- Oracle Machine Learning is a collaborative user interface for data scientists and business and data analysts who perform machine learning in the Autonomous Databases--Autonomous Data Warehouse (ADW) and Autonomous Transactional Database (ATP).
+ # Oracle Machine Learning Notebooks
+ Oracle Machine Learning Notebooks is a collaborative user interface for data scientists and business and data analysts who perform machine learning in Oracle Autonomous Database: Autonomous Data Warehouse (ADW) and Autonomous Transactional Database (ATP).

- Oracle Machine Learning enables data scientists, citizen data scientists, and data analysts to work together to explore their data visually and develop analytical methodologies in the Autonomous Data Warehouse Cloud. Oracle's high performance, parallel and scalable in-Database implementations of machine learning algorithms are exposed via SQL and PL/SQL using notebook technologies. Oracle Machine Learning enables teams to collaborate to build, assess, and deploy machine learning models, while increasing data scientist productivity Oracle Machine Learning focuses on ease of use and simplified machine learning for data science – from preparation through deployment – all in the Autonomous Database.
- Based on Apache Zeppelin notebook technology, Oracle Machine Learning provides a common platform with a single interface that can connect to multiple data sources and access multiple back-end Autonomous Database servers. Multi-user collaboration enables the same notebook document to be opened simultaneously by different users, such that changes made by one user to a notebook are instantaneously reflected to all users viewing that notebook. To support enterprise requirements for security, authentication, and auditing, Oracle Machine Learning supports privilege-based access to data, models, and notebooks, as well as being integrated with Oracle security protocols.
+ Oracle Machine Learning Notebooks enables data scientists, citizen data scientists, and data analysts to work together to explore their data visually and develop analytical methodologies in the Autonomous Data Warehouse Cloud. Oracle's high-performance, parallel, and scalable in-Database implementations of machine learning algorithms are exposed via SQL and PL/SQL using notebook technologies. Oracle Machine Learning enables teams to collaborate to build, assess, and deploy machine learning models while increasing data scientist productivity. Oracle Machine Learning focuses on ease of use and simplified machine learning for data science – from preparation through deployment – all in Oracle Autonomous Database.

+ Based on Apache Zeppelin notebook technology, Oracle Machine Learning Notebooks provides a common platform with a single interface that can connect to multiple data sources and access multiple back-end Autonomous Database servers. Multi-user collaboration enables the same notebook document to be opened simultaneously by different users, such that changes made by one user to a notebook are instantaneously reflected to all users viewing that notebook. To support enterprise requirements for security, authentication, and auditing, Oracle Machine Learning supports privilege-based access to data, models, and notebooks, as well as being integrated with Oracle security protocols.

  Key Features:

@@ -15,4 +14,8 @@ Key Features:
  * Enables and supports deployments of enterprise machine learning methodologies in both Autonomous Data Warehouse (ADW) and Autonomous Transactional Database (ATP)

- See Introducing Oracle Machine Learning SQL Notebooks for the Oracle Autonomous Data Warehouse Cloud! blog post (https://blogs.oracle.com/datamining/introducing-oracle-machine-learning-sql-notebooks-for-the-oracle-autonomous-data-warehouse-cloud) for more information.
+ See the [Introducing Oracle Machine Learning Notebooks for the Oracle Autonomous Data Warehouse](https://blogs.oracle.com/datamining/introducing-oracle-machine-learning-sql-notebooks-for-the-oracle-autonomous-data-warehouse-cloud) blog post for more information.

+ #### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
+ ###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)

machine-learning/python/README.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Oracle Machine Learning components for the Python language (coming soon)
Oracle Machine Learning for Python (OML4Py) is a component of the Oracle Machine Learning family of products.

OML4Py will initially be available inside Oracle Autonomous Database, and we expect it to be released later on all Oracle Database platforms.

Outside of the Autonomous Database, OML4Py will be compatible with Oracle Linux 7.x (or RHEL 7.x) and Python 3.x.

The examples here will run in either environment.

#### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Oracle Machine Learning for Python (coming soon)
Oracle Machine Learning for Python (OML4Py) is a component of the Oracle Machine Learning family of products.

OML4Py will initially be available inside Oracle Autonomous Database, and we expect it to be released later on all Oracle Database platforms.

Outside of the Autonomous Database, OML4Py will be compatible with Oracle Linux 7.x (or RHEL 7.x) and Python 3.x.

The examples here will run in either environment.

#### Copyright (c) 2020 Oracle Corporation and/or its affiliates.
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# OML4Spark-Tutorials

## Tutorials for OML4Spark (a.k.a. ORAAH) release 2.8.x
**[Oracle Machine Learning for Spark][1] (OML4Spark) is a set of R packages and Java libraries.**
**It provides several features:**
- An R interface for manipulating data stored in a local file system, HDFS, HIVE, Impala, or JDBC sources, and for creating Distributed Model Matrices across a cluster of Hadoop nodes in preparation for machine learning
- A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open-source CRAN packages
- Parallel and distributed machine learning algorithms that take advantage of all the nodes of a Hadoop cluster for scalable, high-performance modeling on big data. Functions use the expressive R formula object, optimized for Spark parallel execution.
  ORAAH's custom LM/GLM/MLP NN algorithms on Spark scale better and run faster than the open-source Spark MLlib functions, but ORAAH provides interfaces to MLlib as well
- Core analytics functionality also available in a standalone Java library that can be used directly, without the need for the R language, from any Java or Scala platform

**The following is a list of demos containing R code for learning about OML4Spark** (a condensed sketch of how the cross-validation functions fit together appears after this list):
- Files in the current folder:
- Introduction to OML4Spark (oml4spark_tutorial_getting_started_with_hdfs.r)
- Working with HIVE, IMPALA and Spark Data Frames (oml4spark_tutorial_getting_started_with_hive_impala_spark.r)
- Function in R to visualize Hadoop data in Apache Zeppelin via z.show (oml4spark_function_zeppelin_visualization_z_show.r)
- AutoML for Classification using Cross-Validation with OML4Spark
  * Sample execution of the Cross-Validation (oml4spark_execute_cross_validation.r)
  * Function to run the Cross-Validation (oml4spark_function_run_cross_validation.r)
  * Function to create a balanced input dataset (oml4spark_function_create_balanced_input.r)
  * Function to run Variable Selection via a logistic GLM (oml4spark_function_variable_selection_via_glm.r)
  * Function to run Variable Selection via Singular Value Decomposition (oml4spark_function_variable_selection_via_pca.r)
  * Function to compute the Confusion Matrix and statistics (oml4spark_function_confusion_matrix_in_spark.r)
  * Function to build all OML4Spark classification models (oml4spark_function_build_all_classification_models.r)

[1]: https://www.oracle.com/database/technologies/datawarehouse-bigdata/oml4spark.html

#### Copyright (c) 2020 Oracle Corporation and its affiliates

##### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)
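As a quick orientation, here is a condensed, illustrative sketch of how the cross-validation demo functions listed above fit together in an OML4Spark session. It simply mirrors the calls made in the full sample script shown in the next file (oml4spark_execute_cross_validation.r); the Impala host, the table name, and the model formula are placeholders, and the two sourced files are assumed to define createBalancedInput() and runCrossValidation() exactly as used in that script.

# Condensed sketch -- mirrors oml4spark_execute_cross_validation.r (shown below).
# Placeholders: Impala host, table name "mytable", and the model formula.
library(ORCH)                                             # OML4Spark / ORAAH R package

if (spark.connected()) spark.disconnect()
spark.connect('yarn', memory = '9g', enableHive = TRUE)   # Spark session on YARN
ore.connect(type = 'IMPALA', host = 'impala-host', user = 'oracle', port = '21050', all = FALSE)
ore.sync(table = 'mytable'); ore.attach()                 # expose the table to R

source('~/oml4spark_function_create_balanced_input.r')    # defines createBalancedInput()
source('~/oml4spark_function_run_cross_validation.r')     # defines runCrossValidation()

f <- target ~ pred1 + pred2                               # placeholder formula
balanced <- createBalancedInput(input_bal = mytable, formula_bal = f,
                                reduceToFormula = TRUE, feedback = TRUE, sampleSize = 10000)
results  <- runCrossValidation(input_xval = balanced, formula_xval = f, numKFolds = 3,
                               selectedStatistic = 'Accuracy', legend = '', feedback = TRUE)
print(as.data.frame(results[[4]]))                        # ranked model statistics

spark.disconnect(); ore.disconnect()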
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
######################################################################
# oml4spark_execute_cross_validation.r
#
# This demo uses the "Ontime" Airline dataset from the Bureau of
# Transportation Statistics. We run several classification models to
# identify the best model for predicting a cancelled flight.
#
# The functions accept R DataFrames, CSV files on HDFS, HIVE/IMPALA
# tables, and Spark DataFrames (generated in the Spark Session that
# OML4Spark is using) as input for processing.
#
# The initial stage balances the sample (50% '0's and 50% '1's) to
# improve the ability of the different models to detect cancellations.
# The balanced output is capped at a requested maximum number of
# records (10,000 in this run).
#
# The second stage uses the balanced data as input to test several
# classification models available in OML4Spark, using k-fold
# Cross-Validation with k=3.
#
# The final output is a list of the models in descending order of the
# requested statistic (in this case Accuracy), and a chart of
# descending Balanced Accuracy for the models.
#
# All processing is done on Spark, using OML4Spark's interfaces to
# several functions as well as SparkSQL.
#
# About the Ontime Airline dataset: the database contains scheduled and
# actual departure and arrival times reported by certified U.S. air
# carriers that account for at least one percent of domestic scheduled
# passenger revenues since 1987. The data is collected by the Office of
# Airline Information, Bureau of Transportation Statistics (BTS), and
# can be downloaded from their site at
# https://www.transtats.bts.gov/tables.asp?DB_ID=120&DB_Name=&DB_Short_Name=#
#
# Copyright (c) 2020 Oracle Corporation
# The Universal Permissive License (UPL), Version 1.0
#
# https://oss.oracle.com/licenses/upl/
#
######################################################################

# Load the OML4Spark (ORCH) library
library(ORCH)

# Create a new Spark Session
if (spark.connected()) spark.disconnect()
spark.connect('yarn', memory='9g', enableHive = TRUE)

# Connect to IMPALA
ore.connect(type='IMPALA', host='xxxxxxxx', user='oracle', port='21050', all=FALSE)
# Synchronize the table ONTIME1M
ore.sync(table='ontime1m')
ore.attach()
# Check that the table is visible
ore.ls()

# Show a sample of the data
head(ontime1m)

# Load functions written for Cross-Validation using the OML4Spark facilities for
# manipulating Spark DataFrames
source('~/oml4spark_function_create_balanced_input.r')
source('~/oml4spark_function_run_cross_validation.r')

## Create a balanced Spark DataFrame by smart sampling based on a specific formula

# Formula for classification of whether a flight was cancelled
formula_class <- cancelled ~ distance + as.factor(month) + as.factor(year) + as.factor(dayofmonth) + as.factor(dayofweek)

# Create a balanced Spark DataFrame with a 50/50 target split, sampling down to at most 10,000 rows in total
# The idea is to balance the target variable CANCELLED so that it is 50% '0's and 50% '1's
# The inputs to the function are the IMPALA table and the formula that will be used for the model build
system.time({
  balancedData <- createBalancedInput(input_bal=ontime1m,
                                      formula_bal=formula_class,
                                      reduceToFormula=TRUE,
                                      feedback = TRUE,
                                      sampleSize = 10000
                                      )
})

# Review the Spark DataFrame called "balancedData" before executing the Cross-Validation
# The global average proportion of cancellations is 0.5 (since we balanced the data)
balancedData$show()

# Execute a 3-fold Cross-Validation using the algorithms provided by OML4Spark and Spark MLlib
finalModelSelection <- runCrossValidation(input_xval=balancedData,
                                          formula_xval=formula_class,
                                          numKFolds=3,
                                          selectedStatistic='Accuracy',
                                          legend='',
                                          feedback = TRUE )

# Detailed explanations of the different statistics printed can be
# found at https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
#
# The statistic requested for ranking the models in this run is Accuracy;
# another common choice is the Matthews Correlation Coefficient (MCC).
# More information about the MCC at https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
print(as.data.frame(finalModelSelection[[4]]))

# Show the different components returned by the function
finalModelSelection

if (spark.connected()) spark.disconnect()
if (ore.is.connected()) ore.disconnect()


#####################################################
### END CROSS-VALIDATION BEST MODEL IDENTIFICATION
#####################################################
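For readers new to the evaluation methodology described in the script's header comment, the following is a minimal, self-contained base-R illustration of k-fold cross-validation with k = 3 on a toy dataset. It is conceptual only: it does not use OML4Spark and does not reproduce runCrossValidation(), which performs the equivalent work distributed on Spark across several candidate algorithms.

# Conceptual k-fold cross-validation (k = 3) in plain R -- illustration only,
# not the distributed OML4Spark implementation used in the demo above.
set.seed(42)
k  <- 3
df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
df$y <- rbinom(300, 1, plogis(0.8 * df$x1 - 0.5 * df$x2))      # toy binary target

fold <- sample(rep(seq_len(k), length.out = nrow(df)))          # random fold assignment

accuracy <- numeric(k)
for (i in seq_len(k)) {
  train <- df[fold != i, ]                                      # k-1 folds for training
  test  <- df[fold == i, ]                                      # 1 fold held out for testing
  model <- glm(y ~ x1 + x2, data = train, family = binomial)    # one candidate model
  pred  <- as.integer(predict(model, newdata = test, type = "response") > 0.5)
  accuracy[i] <- mean(pred == test$y)                           # per-fold Accuracy
}
mean(accuracy)    # the averaged statistic used to rank candidate models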
