Posted on Sep 23

StoreView SQL: Unlock Geographical Limits for Data Analysis

Introduction
Simple Log Service (SLS) is a cloud-native observability and analysis platform. It provides large-scale, low-cost, and real-time platform-based services for such data types as logs, metrics, and traces. SLS provides multi-region support, enabling you to connect to SLS services in nearby regions based on your data sources, thus reducing unnecessary network latency and public network costs. However, what if data from different regions requires joint and integrated analysis?

StoreView, a newly launched feature of SLS, helps you perform cross-project and cross-region queries and statistical analysis. StoreView allows Logstore data from multiple projects (with cross-region access support) to be combined into a single virtual Logstore. You can perform various query and analysis operations using StoreView in the same manner as you would with a single Logstore.

Data Preparation
To illustrate several common scenarios, four demo projects from different regions will be used. Details of these projects are as follows:

Each of these projects includes a Logstore named user-queries, with the following core fields:

Perform Global Analysis
Previous Approach
Before StoreView was supported, to analyze Logstore data across multiple regions, ETL tasks are required to synchronize data from regional Logstores to a centralized Logstore. For example, for the Logstores in the four demo projects mentioned, we can perform the following steps to create ETL tasks. For more information, see Data Processing Overview.

Once all ETL tasks are successfully created, statistical analysis can be performed using the centralized Logstore, centralized-user-queries.

As can be seen, the above operation is quite cumbersome, as a separate processing task needs to be created for each Logstore. In addition, this method increases the data storage volumes: The storage volume of each Logstore is doubled. Moreover, it incurs public network traffic, which results in additional costs. Therefore, synchronizing data from multiple regions to a single project through ETL tasks is time-consuming, labor-intensive, and cost-prohibitive.

Current Approach
Given the drawbacks of synchronizing multi-region data to a single project through ETL tasks, SLS has launched the StoreView feature, which allows combining multiple Logstores from different regions and projects into a virtual Logstore. When you use StoreView to analyze the data of multiple Logstores across regions and projects, it is as simple as analyzing a single Logstore. Go to any project and create the following StoreView definition by referring to the document Dataset (StoreView) Overview.

After creating the above StoreView definition, you can go to the corresponding query and analysis page to perform operations. With the following SQL statement, you can achieve the same query and analysis effect under StoreView as in the centralized Logstore through ETL tasks.

As can be seen, creating a StoreView is much simpler compared to creating a processing task for each Logstore. In addition, when using StoreView for data analysis, there is no issue of data synchronization delay caused by ETL tasks, as it reads data from each underlying Logstore in real time.

StoreView SPL Features
In addition to effectively addressing the inconvenience in query and analysis caused by geographical data isolation, StoreView also integrates the capability to process data using SPL syntax. For more information, see SPL syntax. StoreView only supports the following commands: extend, project, and where. Based on the rich functions and operators of SPL, StoreView can implement many capabilities that Logstores themselves do not have. The following topics describe data visibility control, query-based ETL processing, and heterogeneous data schema alignment.

Data Visibility Control
For data in a Logstore, we sometimes need to control the row-level data visibility. For example, we may want operation and maintenance (O&M) personnel to have only the permission to view information about error queries (to facilitate their monitoring of system abnormalities), but not normal queries. For ordinary Logstores, row-level visibility control is currently not possible (as the existing authorization only supports Logstore-level granularity). However, this can be easily achieved with StoreView. For instance, we can create a StoreView as shown in the figure below.

The StoreView above additionally defines a query filter that restricts the return to data where the status does not match 200. From the following SQL results, it can be seen that only the failed query statistics can be seen from the StoreView failed_user_queries, while the succeeded queries (with a status of 200) are all excluded.

Therefore, if you have a need for data visibility management at the row level, StoreView can come in handy.

Query-based ETL Processing
Sometimes, the Logstore may contain some sensitive information, which we do not want ordinary team members to see (but allow them to view other non-sensitive field information). For example, regarding the Logstore "user-queries", if it is necessary to prevent all personnel from viewing the two pieces of user-sensitive information, namely "sourceIp" and "userId", how should this be handled?

Following the previous approach, one can desensitize the data in the original logstore through a ETL task, save the desensitized data to another logstore, and then make this new logstore accessible to ordinary team members. Although this operation can also meet the demand, it is not only complicated to perform, but also incurs additional storage costs. Based on the SPL capability of StoreView, this requirement can be easily met. We can create a StoreView as follows.

The following result shows that the sourceIp and userId values are not found in the query or SQL analysis.

It can be seen that with the SPL capability integrated into StoreView, various processing operations can be easily performed on the original data. Compared with traditional ETL processing tasks, the update of StoreView SPL definitions is more flexible. After modification, the changes are immediately visible in query and analysis. In contrast, after a processing task is updated, it only takes effect on the new data written into the source Logstore.

Heterogeneous Data Schema Alignment
In SQL scenarios, the fields supported by the virtual table StoreView are a superset of all fields with statistical analysis enabled in the underlying Logstores. Analysis fields with the same name but different types across multiple Logstores will be normalized to the varchar type. But how to conduct unified analysis on fields that have different names but the same meaning? For example, in the following scenario, when calculating the average latency of queries under each project, it is found that the result corresponding to sls-cn-guangzhou-queries is null. What might account for this phenomenon?

Analysis reveals that the query data in sls-cn-guangzhou-queries does not contain the field latencyMs. Instead, the corresponding field in its data is named latency. In this case, we can still resolve the issue using the extend operator in SPL, which involves adding an alias field to the latency field (SPL processes all fields as varchar by default, and type conversion is performed in the following SPL).

After the preceding operations, the execution result of SQL will meet the expectations, as shown below.

StoreView Meta Field
StoreView SQL provides two meta fields: project and _logstore. They respectively represent the original project and logstore names corresponding to the data. Users can identify the source of results in StoreView based on these two meta fields. For example, we can use the following SQL to conduct a comparative analysis of the daily data processing volume under each project.

Summary
As can be seen from the above case analysis, the SLS StoreView feature provides users with extremely convenient cross-project query and analysis capabilities. Users no longer need to create processing tasks to aggregate data, thus reducing their usage costs. StoreView is supported for direct SQL analysis and dashboard visualization in the console (alerts are not supported, but may be released in the future). That said, since querying and analyzing across projects involves cross-domain data reading, the entire processing link may be significantly affected by network conditions. We will continue to improve the ease of use, stability, and performance of StoreView, so that users can conduct easy and smooth queries and analysis of data across all regions based on StoreView and truly ensure that data analysis is not limited by regional boundaries.

DEV Community

StoreView SQL: Unlock Geographical Limits for Data Analysis

Top comments (0)