How to Create a Stream System with Tens of Millions of Feeds?

A Technical Interpretation of Alibaba's Databases.

China’s biggest shopping day, known as “Singles Day”, has broken yet more records this year with a whopping 325,000 transactions and 256,000 payments per second. Such high volumes however generate vast amounts of data. At Alibaba, this data is analyzed in real-time to ensure any system failures are detected instantly and reported as quickly as possible so continual improvements can be made to customer experience.

Naturally this is a big challenge for the Ali Tech team. So, how do they tackle such a huge task?

First, a bit of background. An ecommerce business usually involves many different elements such as merchants, stores, and logistics. The activities of each of these elements generate massive amounts of business data, e.g. orders, complaints, and refunds, and any single transaction can involve multiple business systems.

In order to identify issues, meet operational needs, and analyze requirements, event tracking is used to record logs generated in the systems by user behaviors (known as feed logs). From order placement, payment, and shipment, to the receipt and a possible refund, an order typically produces several interlinked records. Each link may produce different attributes, some may be new, and some may be changes to existing attribute values.

In order to facilitate analyses, it is normal to consolidate multiple records generated by an order, throughout the process, into a single record (or order table). Based on the aggregated data, active services can perform analyses, display public opinions, and identify objects that require active services in real time, delivering intelligent service operation.

In order to accomplish this, the Ali Tech team use a real-time analysis scheme through RDS PostgreSQL and HybridDB for PGSQL with the following features:

- Throughput at hundreds of thousands of writes per second (several times the transactions) are withstood, and data cleansing is performed

- Real-time analyses with minute-level latency is achieved, and five billion-level tables with second-level response are available

- Abnormal transactions are identified in real time to improve Taobao's user experience

In addition to the real-time requirements, actions such as data consolidation and cleaning are required for the writing process. DDBA or data analysis professionals understand that it is not difficult to achieve hundreds of thousands of writing, switching, consolidation, and cleaning actions per second, or to perform millisecond-level analyses on terabytes of data. However, where the real difficulty lies is in scheduling properly while simultaneously achieving real-time writing and millisecond-level analyses, with minute-level latency.

The Alibaba Scheme

Alibaba’s scheme to support these needs is as follows:

- RDS PostgreSQL, developed by Alibaba Cloud, is the cloud version of PostgreSQL, an open-source relational database

- HybridDB for PostgreSQL is a distributed analytical database with MPP architecture. It offers excellent performance in areas such as multi-table association, complex queries, and real-time statistics, and supports unique features including JSON, GIS and HLL valuation

- OSS (Open Storage Service) is a massive, secure, low-cost, and reliable cloud storage service operated by Alibaba Cloud. It is used, here, for offline data storage

- The ability of RDS PostgreSQL and HybridDB for PostgreSQL to transparently access offline OSS storage is key

This scheme accepts business writes to multiple PostgreSQL databases. Data is cleansed in each RDS PostgreSQL database, and then written to resilient OSS by manipulating external tables (similar to heap tables). After writing is completed, the data is loaded from the OSS to HybridDB, for PostgreSQL, by just one SQL. This is also done by manipulating external tables. Millisecond-level queries of numerous terabytes of data are realized in HybridDB.

Advantages of the scheme

So, what are the advantage s of using this scheme? Well, although it may not seem complicated at first, the scheme solves the following problems:

1. Performance

PostgreSQL's powerful concurrent write performance combines with HybridDB's analytics performance.

A single RDS PostgreSQL can support millions of writes, and batches loaded into HybridDB, after writes to PostgreSQL, seamlessly integrate PostgreSQL and HybridDB. This enables real-time millisecond-level queries through MPP's superior analytical performance.

2. Data handling and cleaning

Data handling is, traditionally, a cumbersome task with poor performance, resulting in a long distance between TP and AP, leading to the use of different methods. In addition, the exertion when handling heterogeneous databases increases significantly.

Would it be easier if everything could be operated through SQL, with data cleansing and handling as the definition and execution of SQL statements?

Both RDS PostgreSQL and HybridDB for PostgreSQL can directly read from and write to the OSS, and can be easily linked. Through scheduling and encapsulation, functions that previously required higher workloads can be realized at a lower cost.

3. Unification access of hot and cold data

By taking advantage of offline storage capabilities, cold data can be stored in the OSS and hot data in PostgreSQL or HybridDB for PostgreSQL. Hot and cold data can be accessed in the same way through SQL.

4. Dynamic online adjustment of resources

Key benefits of the cloud ecology are its dynamics and resilience. RDS PostgreSQL resources can be dynamically adjusted at any time without affecting availability, while capacity scaling is possible for HybridDB with second-level switching. The resilience of the OSS itself allows the user to store any amount of data.

As a result, the scheme offers the following benefits:

1. Compared to a traditional data analysis scheme, data is managed in a unified manner through SQL, reducing the heterogeneity

2. Dynamically scheduled resources reduce costs

3. The boundary between hot and cold data is made fuzzy, enabling direct access between them

4. Integrated TP and AP

5. Limitless number of RDS PostgreSQL databases and HybridDB clusters

PostgreSQL, Alibaba Cloud's cloud database

Alibaba Cloud's cloud database PostgreSQL is based on what has been called the "most advanced" open-source relational database. During the 2017 Stack Overflow Developers survey, PostgreSQL was voted, by developers, as one of the most loved and desired relational databases of the year, largely due to the following features:

1. Stability

PostgreSQL's code quality is recognized by numerous users. A major version of PostgreSQL can essentially be used for production after two or three minor versions. This can be demonstrated by the "beautiful" commit log of PostgreSQL.

With PostgreSQL's multi-process architecture, the abnormality of one connection will not affect the master process or other connections, delivering excellent stability.

2. Performance

After collecting internal performance data: TPCC performance tests show that the performance of PostgreSQL can match or outperform commercial databases in certain scenarios.

3. Richness

The richness of PostgreSQL is its main advantage. We cannot cover all of its extensive benefits, so we will focus on the more interesting points – rich queries, functions, and types

Rich queries

PostgreSQL features various functions such as HASH\Merge\NestLoop JOIN, materialized views, and full-text searches. Most notable, PostgreSQL has powerful cost-based optimizers, which, along with parallel execution (parallel scan, parallel JOIN, etc.) and multiple cost factors, supports a wide variety of flexible and efficient queries. In addition, there are numerous indexes, such as btree, hash, and gist, and you can create indexes and index scans for your own defined types.

Rich functions

PostgreSQL has an unparalleled plug-in feature. With this feature, you can use the Hook in the kernel code to add functions without modifying the database kernel code. For example, the plug-in feature can be used to create PostGIS, JSON, and genes to meet requirements without affecting the kernel code. Moreover, PostgreSQL has numerous plug-ins.

The FDW mechanism enables you to access other data sources such as Hadoop, MySQL, Oracle, and Mongo within the same PostgreSQL without consuming too many PostgreSQL resources. For example, the OSS_FDW developed by our team is used to read from and write to the OSS.

Rich types

Supported types include high-precision numerics, floating points, auto-incrementing sequences and many more.

PS: The array enables the user to manipulate data in the database like manipulating an array in JAVA. For example, item [0][1] represents an element in a two-dimensional array, and the item can be a field in a table.

If insufficient, you can customize your own types and reload operators for those types. For example, adding, subtracting, multiplying and dividing the IP type (the operation definition depends on the specific implementation, so you can make any IP additions).

Other benefits

Other benefits include PostgreSQL's DDL (such as addition and subtraction) which can be done in transactions (PostgreSQL is catalog-driven, so DDL modification can be understood as the modification of a record). Operational business personnel will understand the unique advantage of this.

In regard to the open-source version, Alibaba Cloud's cloud database PostgreSQL adds features such as HA, seamless capacity scaling, and automatic backup to enable carefree usage.

Alibaba Cloud's HybridDB for PostgreSQL

HybridDB for PostgreSQL is a distributed analytical database with MPP architecture based on open-source Greenplum. It offers excellent performance in many areas such as multi-table associations, complex queries, and real-time. With this foundation, Alibaba Cloud HybridDB for PostgreSQL provides unique features including JSON, GIS and HLL valuation, backup and recovery, and automatic abnormality repair. It also optimizes many aspects to qualitatively upgrade the open-source version.

Alibaba Cloud's HybridDB for PostgreSQL has the following features:

1. Real-time analysis

Supports SQL syntax to analyze distributed data from a geographical information system (GIS) in real time, assist with data processing for location-based services on the IoT and Internet, and analyze distributed JSON, XML and fuzzy string data in real-time. This helps clients in finance, government and commerce industries to perform message data processing and fuzzy text data collection.

2. Stability and reliability

Supports distributed ACID data consistency to achieve transactional consistency across nodes. All data is synchronized across two nodes for redundancy and the SLA guarantees 99.9% availability. Deployment is distributed with three layers of security protection, which are computing units, servers, and cabinets, improving the security of critical data infrastructure.

3. Ease of use

Supports extensive OLAP SQL syntax and functions and multiple Oracle functions. Popular industry BI software tools can be directly used online. Data can be communicated between the cloud database RDS (PostgreSQL/PPAS) to offer a hybrid OLTP+OLAP (HTAP) transaction analytics solution.

Supports distributed SQL OLAP statistics and window functions, as well as distributed PL/pgSQL stored procedures and triggers, to enable database-side distributed computing process development.

Capable of hybrid geographical data analyses in line with international OpenGIS standards. This allows geographical information analyses to be executed from massive data using one SQL statement, such as visitors, area calculation or whereabouts analysis.

4. Excellent performance

Supports row-column hybrid storage, with column storage 100 times the performance of row storage during OLAP analysis. Also supports high-performance OSS parallel data imports, eliminating performance bottlenecks of single-channel imports.

In addition, it can scale up storage capacity and computing capability in proportion to the increase in computing units based on distributed massive parallel processing. Makes the most of each computing unit for OLAP computation performance.

5. Flexible scalability

Computing units, CPUs, memory, and storage capacity can be scaled synchronously and on-demand, with scalable OLAP up to hundreds of terabytes. Transparent OSS data manipulation is supported, and cold data not used for online analysis can be transferred to OSS object-based storage. Data storage capacity expansion is limitless.

High-performance data imports, through mysql2pgsql, can be made to MySQL databases. Most popular industry ETL tools support ETL data imports with HybridDB as the target.

Formatted files stored in the OSS can be used as data sources and can be manipulated in real time through external table schema, while data queries can be completed using standard SQL syntax.

Supports transparent data flow from PostgreSQL/PPAS. Continuous incrementation can be enabled without programming, simplifying maintenance work. After data is stored, high-performance internal data modeling and data cleansing can be performed.

6. Security

Supports a configurable IP whitelist of up to 1,000 permitted server IP addresses that can connect to RDS instances, enabling direct risk control straight from the access source.

The built-in DDoS protection capability monitors the network portal in real time, cleanses source IPs if ultra-high traffic attacks are detected, and sends them directly into the black hole if cleansing fails.

To sum up…

With the aid of cloud services such as RDS PostgreSQL and HybridDB for PostgreSQL, Alibaba Cloud helps clients build smart corporate data BI platforms. HybridDB for PostgreSQL can also serve as the core engine for real-time analytic computing and storage of corporate big data. This way, business data from online businesses can transition from closed-loop to a real-time data analysis in the cloud.

. . .

Alibaba Tech
First hand, detailed, and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook