MaxCompute: Solutions for exporting large amounts of data

Last Updated: Jul 26, 2025

This topic describes several methods for exporting the results of MaxCompute SQL calculations.

Note

The SDK examples in this topic are provided in Java.

Overview

You can use the following methods to export the results of SQL statements:

  • For small amounts of data, you can use an SQL task to retrieve all query results.

  • To export an entire table or partition, you can use Tunnel to export the query results.

  • If the SQL statement is complex, you can use a combination of Tunnel and SQL to export the query results.

  • You can use DataWorks to run SQL statements, synchronize data, and use features such as timed scheduling and dependency configuration.

  • The open-source tool DataX can help you easily export data from MaxCompute to a destination data source.

Export data using SQLTask

You can use the SQLTask class provided by the SDK to run SQL statements and then call the SQLTask.getResult(i) interface to retrieve their results, as shown in the sketch after the following notes. For more information, see SQLTask.

When you use SQLTask, note the following:

  • SQLTask.getResult(i) is used to export the results of SELECT queries. It cannot export the results of other MaxCompute commands, such as SHOW TABLES.

  • The maximum number of data records that a SELECT statement returns to a client is controlled by the READ_TABLE_MAX_ROW project property. For more information, see Project operations.

  • A SELECT statement can return at most 10,000 data records to a client. Running a SELECT statement directly on a client, including through SQLTask, is equivalent to appending LIMIT N to the end of the statement.
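The following minimal sketch shows this approach end to end: it runs a query with SQLTask and reads the result through SQLTask.getResult. The endpoint, credentials, project, and the wc_in table are placeholders that you must replace with your own values.

import java.util.List;

import com.aliyun.odps.Instance;
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.task.SQLTask;

public class SQLTaskExport {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials, endpoint, and project: replace with your own values.
        Account account = new AliyunAccount("<your_access_id>", "<your_access_key>");
        Odps odps = new Odps(account);
        odps.setEndpoint("<your_endpoint>");
        odps.setDefaultProject("<your_project>");

        // Run the SELECT statement and wait for the instance to complete.
        Instance i = SQLTask.run(odps, "select * from wc_in;");
        i.waitForSuccess();

        // getResult is subject to the row limit described above (at most 10,000 records).
        List<Record> records = SQLTask.getResult(i);
        for (Record record : records) {
            System.out.println(record.get(0));
        }
    }
}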

Export data using Tunnel

If the query result that you need to export is the entire content of a table or a specific partition, you can use Tunnel. For more information, see the command line interface and the Tunnel SDK.

The following code provides a simple example of exporting data by using the Tunnel command line. For scenarios that the Tunnel command line does not support, you can write your own export logic by using the Tunnel SDK, as sketched after the sample output below. For more information, see Batch data channel overview.

tunnel d wc_out c:\wc_out.dat;
2016-12-16 19:32:08 - new session: 201612161932082d3c9b0a012f68e7 total lines: 3
2016-12-16 19:32:08 - file [0]: [0, 3), c:\wc_out.dat
downloading 3 records into 1 file
2016-12-16 19:32:08 - file [0] start
2016-12-16 19:32:08 - file [0] OK. total: 21 bytes
download OK
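As a rough illustration of the SDK route, the following minimal sketch downloads all records of a table through TableTunnel. It assumes an Odps object named odps has already been initialized with your credentials and project (as in the SQLTask sketch above), and that the example table wc_out is not partitioned.

import com.aliyun.odps.data.Record;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.io.TunnelRecordReader;

// Assumes an initialized Odps object named odps.
TableTunnel tunnel = new TableTunnel(odps);
// Open a download session on the whole (non-partitioned) wc_out table.
TableTunnel.DownloadSession session =
        tunnel.createDownloadSession(odps.getDefaultProject(), "wc_out");
long count = session.getRecordCount();
TunnelRecordReader reader = session.openRecordReader(0, count);
Record record;
while ((record = reader.read()) != null) {
    // Write each record to your destination, for example a local file.
    System.out.println(record.get(0));
}
reader.close();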

Export data using SQLTask and Tunnel

SQLTask cannot return more than 10,000 data records, but Tunnel has no such limit. Because the two complement each other, you can combine them to export more than 10,000 data records.

The following code provides an example.

import com.aliyun.odps.Instance;
import com.aliyun.odps.Odps;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.task.SQLTask;
import com.aliyun.odps.tunnel.InstanceTunnel;
import com.aliyun.odps.tunnel.io.TunnelRecordReader;

Odps odps = OdpsUtils.newDefaultOdps(); // Initialize an Odps object.
Instance i = SQLTask.run(odps, "select * from wc_in;");
i.waitForSuccess();
// Create an InstanceTunnel.
InstanceTunnel tunnel = new InstanceTunnel(odps);
// Create a DownloadSession based on the instance ID.
InstanceTunnel.DownloadSession session =
        tunnel.createDownloadSession(odps.getDefaultProject(), i.getId());
long count = session.getRecordCount();
// Output the number of results.
System.out.println(count);
// The method for obtaining data is the same as that for TableTunnel.
TunnelRecordReader reader = session.openRecordReader(0, count);
Record record;
while ((record = reader.read()) != null) {
    for (int col = 0; col < session.getSchema().getColumns().size(); ++col) {
        // The fields in the wc_in table are all of type STRING. You can directly
        // print the output or write it to a local file.
        System.out.println(record.get(col));
    }
}
reader.close();
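In this example, openRecordReader(0, count) reads the entire result in a single pass. Because openRecordReader takes a start offset and a record count, you can also read a large result set in several smaller batches if it does not fit in memory at once.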

Export data using DataWorks data synchronization

Note

This example uses the basic mode of DataWorks. When you create a workspace, leave the Join The Public Preview Of DataStudio option deselected. This example does not apply to workspaces in public preview.

DataWorks lets you run SQL statements and configure data synchronization tasks to meet your data generation and exporting requirements.

  1. Log on to the DataWorks console.

  2. In the left navigation pane, click Workspace Management.

  3. In the Actions column for the desired workspace, choose Quick Access > Data Development.

  4. Create a business process.

    1. Right-click Business Process and select New Business Process.

    2. Enter a Business Name.

    3. Click Create.

  5. Create an SQL node.

    1. Right-click the business process and select New Node > MaxCompute > ODPS SQL.

    2. Set the Node Name to runsql and click Confirm.

    3. Configure the ODPS SQL node and click Save.

  6. Create a data synchronization node.

    1. Right-click the business process and choose New Node > Data Integration > Offline Synchronization.

    2. Set the Node Name to sync2mysql and click Confirm.

    3. Select a data source and a destination.

    4. Configure field mappings.

    5. Configure channel control.

    6. Click Save.

  7. Connect the data synchronization node to the ODPS SQL node to configure a dependency. In this workflow, the ODPS SQL node generates the data, and the data synchronization node exports the data.

  8. After configuring the workflow scheduling (you can use the default configurations), click Run. The following shows the operation log for the data synchronization.

    2016-12-17 23:43:46.394 [job-15598025] INFO JobContainer -
    Task start time            : 2016-12-17 23:43:34
    Task end time              : 2016-12-17 23:43:46
    Total time consumed        : 11s
    Average traffic            : 31.36KB/s
    Record write speed         : 1668rec/s
    Total records read         : 16689
    Total read/write failures  : 0
  9. Run the following SQL statement to view the result of the data synchronization.

    select count(*) from result_in_db;