This topic describes how to create, read from, and write to external tables in ORC format that are stored in OSS.
Prerequisites
Ensure your Alibaba Cloud account, RAM user, or RAM role has the required permissions to access OSS external tables. For more information about authorization, see STS authorization for OSS.
(Optional) Prepare an OSS bucket, directories, and data files. For more information, see Create buckets, Manage directories, and Simple upload.
MaxCompute can now automatically create directories in OSS. You can use a single SQL statement to read from or write to an external table that uses a UDF. Manual directory creation is no longer required, but the legacy method is still supported.
A MaxCompute project has been created. For more information, see Create a MaxCompute project.
MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket in the same region as your MaxCompute project.
The Alibaba Cloud account or RAM user has the CreateTable permission on your project. For more information about table operation permissions, see MaxCompute permissions.
Limits
The cluster attribute is not supported for OSS external tables.
A single file cannot exceed 3 GB. If a file exceeds this limit, we recommend that you split it.
Notes
If the schema in the ORC file is inconsistent with the external table schema:
Inconsistent number of columns: If the number of columns in the ORC file is less than the number of columns in the Data Definition Language (DDL) statement of the external table, the system fills the missing columns with NULL values when reading the ORC data. If the number of columns in the ORC file is greater, the extra column data is discarded.
Inconsistent column types: MaxCompute allows you to read INT data from an ORC file as the STRING type, but this is not recommended. When you attempt to read STRING data as the INT type, string values are converted to NULL, while numeric values are processed correctly.
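As an illustration of the column-count rule, the following sketch declares only two columns against a hypothetical ORC file whose schema is (id INT, name STRING, score DOUBLE). The table name, column names, and OSS location are placeholders, not part of the official sample data. On read, the extra score column in the file is discarded; if the file instead contained fewer columns than the DDL, the missing columns would be returned as NULL.

```sql
-- Hypothetical ORC file schema: (id INT, name STRING, score DOUBLE).
-- The external table declares only two columns, so "score" is discarded on read.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_two_cols
(
    id   INT,
    name STRING
)
STORED AS orc
LOCATION 'oss://<endpoint>/<bucket>/orc_demo/';
```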
Data type support
The following table lists each data type and whether it is supported in JNI mode and in native mode.
For more information about MaxCompute data types, see Data Types Version 1.0 and Data Types Version 2.0.
Java Native Interface (JNI) mode (the Native ORC Reader is not used to read tables). Read and write operations are supported. To enable this mode, run the following command:

set odps.ext.oss.orc.native=false;

Native mode (the Native ORC Reader is used to read tables). Only read operations are supported. To enable this mode, run the following command:

set odps.ext.oss.orc.native=true;
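For example, the mode flag is set at the session level before the statement it should apply to. The flag names are as given above; the table name is a placeholder:

```sql
set odps.ext.oss.orc.native=true;   -- Native mode: read-only
SELECT * FROM <mc_oss_extable_name> LIMIT 10;

set odps.ext.oss.orc.native=false;  -- JNI mode: read and write
INSERT INTO <mc_oss_extable_name> VALUES (...);
```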
| Data type | JNI mode support | Native mode support |
| --- | --- | --- |
| TINYINT | | |
| SMALLINT | | |
| INT | | |
| BIGINT | | |
| BINARY | | |
| FLOAT | | |
| DOUBLE | | |
| DECIMAL(precision,scale) | | |
| VARCHAR(n) | | |
| CHAR(n) | | |
| STRING | | |
| DATE | | |
| DATETIME | | |
| TIMESTAMP | | |
| TIMESTAMP_NTZ | | |
| BOOLEAN | | |
| ARRAY | | |
| MAP | | |
| STRUCT | | |
| JSON | | |
Supported compression formats
When you read from or write to compressed OSS files, add the with serdeproperties attribute to the CREATE TABLE statement. For more information, see with serdeproperties parameters.
The supported data file formats for read and write operations are ORC files compressed by using SNAPPY or ZLIB.
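As a sketch of the ZLIB variant, the following hypothetical table writes ZLIB-compressed ORC data by using the mcfed.orc.compress property described later in this topic. The table name, column names, and OSS location are placeholders:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_zlib
(
    id   INT,
    name STRING
)
STORED AS orc
LOCATION 'oss://<endpoint>/<bucket>/orc_zlib/'
tblproperties ('mcfed.orc.compress'='ZLIB');
```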
Create an external table
Syntax
For more information about the syntax structure of external tables for each format, see OSS external tables.
Basic syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
    <col_name> <data_type>,
    ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS orc
LOCATION '<oss_location>';
Full syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
    <col_name> <data_type>,
    ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH serdeproperties (
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
STORED AS orc
LOCATION '<oss_location>'
tblproperties ('<xxx>'='<yyy>');
Common parameters
For information about common parameters, see Basic syntax parameters.
Unique parameters
Properties for WITH SERDEPROPERTIES
| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| mcfed.orc.schema.resolution | Add this property when the data in the same OSS external table has different schemas. | Specifies how ORC files are parsed. The value name specifies parsing by column name. | name | By column position, which is equivalent to setting this property to position. |
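For example, the following sketch parses ORC files by column name instead of the default column position. The table name, column names, and OSS location are placeholders:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_by_name
(
    id   INT,
    name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH serdeproperties ('mcfed.orc.schema.resolution'='name')
STORED AS orc
LOCATION 'oss://<endpoint>/<bucket>/orc_mixed_schema/';
```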
Properties for TBLPROPERTIES
| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| mcfed.orc.compress | Add this property to write ORC data to OSS in a compressed format. | ORC compression property. Specifies the compression method for ORC data. | SNAPPY, ZLIB | None |
| io.compression.codecs | Add this property if the OSS data file is in Raw-Snappy format. | Set this parameter to enable MaxCompute to read Raw-Snappy compressed data. | com.aliyun.odps.io.compress.SnappyRawCodec | None |
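A minimal sketch of a table over Raw-Snappy data, using the codec value from the table above. The table name, column names, and OSS location are placeholders:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_raw_snappy
(
    id   INT,
    name STRING
)
STORED AS orc
LOCATION 'oss://<endpoint>/<bucket>/raw_snappy/'
tblproperties ('io.compression.codecs'='com.aliyun.odps.io.compress.SnappyRawCodec');
```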
Write data
For more information about the write syntax in MaxCompute, see Write syntax.
Query and analyze data
For more information about the SELECT syntax, see Query syntax.
For more information about optimizing query plans, see Query optimization.
Scenarios
This section describes how to create an ORC external table with SNAPPY compression, and then read data from and write data to the table.
When you run the following sample code, replace <uid> with your Alibaba Cloud account ID.
The role used in the following examples is aliyunodpsdefaultrole. If you want to use a different role, replace aliyunodpsdefaultrole with the name of the target role and grant the target role permissions to access OSS.
Prepare an ORC file that is compressed by using SNAPPY.
In the oss-mc-test bucket for the sample data, create the orc_snappy/dt=20250526 folder hierarchy, and store the Snappy file in the dt=20250526 partition folder.
Create an ORC external table with SNAPPY compression.
CREATE EXTERNAL TABLE orc_data_type_snappy
(
    vehicleId INT,
    recordId INT,
    patientId INT,
    calls INT,
    locationLatitute DOUBLE,
    locationLongtitue DOUBLE,
    recordTime STRING,
    direction STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS ORC
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/orc_snappy/'
tblproperties ('mcfed.orc.compress'='SNAPPY');
Import partition data. If an OSS external table is a partitioned table, you must perform an additional operation to import partition data. For more information, see Syntax for adding partition data to an OSS external table.
-- Import partition data.
MSCK REPAIR TABLE orc_data_type_snappy ADD PARTITIONS;
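If you prefer to register a single known partition instead of scanning the OSS directory, ALTER TABLE ... ADD PARTITION achieves the same result for that partition:

```sql
-- Manual alternative for a single known partition.
ALTER TABLE orc_data_type_snappy ADD IF NOT EXISTS PARTITION (dt='20250526');
```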
Read data from the ORC external table.
SELECT * FROM orc_data_type_snappy WHERE dt=20250526 LIMIT 10;
The following result is returned:
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| 1          | 12         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20250526   |
| 1          | 1          | 51         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | S          | 20250526   |
| 1          | 2          | 13         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:01 | NE         | 20250526   |
| 1          | 3          | 48         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:02 | NE         | 20250526   |
| 1          | 4          | 30         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:03 | W          | 20250526   |
| 1          | 5          | 47         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:04 | S          | 20250526   |
| 1          | 6          | 9          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:05 | S          | 20250526   |
| 1          | 7          | 53         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:06 | N          | 20250526   |
| 1          | 8          | 63         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:07 | SW         | 20250526   |
| 1          | 9          | 4          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:08 | NE         | 20250526   |
| 1          | 10         | 31         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:09 | N          | 20250526   |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
Write data to the ORC external table and query the data.
INSERT INTO orc_data_type_snappy PARTITION (dt='20250526') VALUES (1,16,76,1,46.81006,-92.08174,'9/14/2014 0:10','SW');

-- Query the newly written data.
SELECT * FROM orc_data_type_snappy WHERE dt='20250526' AND recordid=16;
The following result is returned:
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| 1          | 16         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20250526   |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+