Some additional information in one line

User profiling is closely related to user operations, which has undergone the transformation from extensive operations, to refined operations, and then to targeted operations. User profiling is widely used in a variety of scenarios:

  • Precision marketing
  • Group analysis
  • Risk warning
  • Effects analysis
  • Channel analysis

However, customers are confronted with the following difficulties during user profiling:

  • A flexible retrieval method is required to search for data among huge data volume.
  • Combined tag calculation requires complex and sophisticated development.
  • Accurate distinct count consumes a large amount of resources.
  • Aggregate tag collection causes high query concurrency.

In most cases, marketing personnel select attributes, combined relation conditions (such as and/or), push channels, and push methods based on business requirements. The data platform constructs queries based on the specified combined relation condition, and calculates and returns data of the target users.

Solution Offered by Elasticsearch and Apache HBase®

When data volume is small, customers usually use Elasticsearch as the storage and query engine for user profiling. When data volume becomes large, customers use Apache HBase to store detailed data and use Elasticsearch to accelerate queries. 

Application:

Complex and time-consuming operations. Customers must perform searches several times by using multiple indexes and aggregate data before they can obtain data of target users.

Technical implementation:

  • Heavy structure and complex maintenance
  • Insufficient SQL capabilities because join and aggregate operations are not supported
  • High development costs
  • Poor extendability and requirements for customized development

Solution Offered by StarRocks

StarRocks is a next-generation massively parallel processing (MPP) database system designed for all scenarios. It offers superior performance in various OLAP analytics scenarios by leveraging its vectorized engine and a cost-based optimizer (CBO). StarRocks supports a variety of data models.

Advantages of StarRocks 

  • Support for standard SQL
  • Powerful multi-table join queries and aggregation
  • Support for the Bitmap data structure, which provides the following features:
    • Set calculation
    • Accurate distinct count
    • Converting one column into multiple rows by using the BITMAP_TO_ARRAY function and the UNNEST operator
  • High-concurrency query on detailed data
  • Simple architecture, easy O&M

User Retention

StarRocks implements the Bitmap data structure based on Roaring Bitmap. In traditional distinct count, an SQL statement similar to "select distinct count...from...groupby join..." must be executed, which causes heavy load on clusters because this process involves the scheduling of a large number of SQL jobs and join operations on large tables.

Roaring Bitmap enables StarRocks to easily implement UV, retention, and funnel chart analysis that use a lot of intersection calculations. This facilitates user behavior analysis and user profiling.

Example 1: Check user retention by comparing data on the current day and the previous day based on different tags.

select tag, bitmap_intersect(user_id) 
from (
   select tag, date, bitmap_union(user_id) user_id 
   from table 
   where date in ('2020-05-18', '2020-05-19') 
   group by tag, date) a 
group by tag;

The bitmap_intersect function can be used with the bitmap_to_string function to obtain specific data in an intersection.

Example 2: Find specific users that are retained.

select tag, bitmap_to_string(bitmap_intersect(user_id)) 
from (
  select tag, date, bitmap_union(user_id) user_id 
  from table 
  where date in ('2020-05-18', '2020-05-19') 
  group by tag, date) a 
group by tag;

User Selection

StarRocks uses the bitmap_union function together with materialized views to accelerate user selection based on duplicate tables. The following example demonstrates how to obtain users who have performed the "click" action on the shopping cart and the "view" action on favorited pages from a duplicate table.

Step 1: Create a table.

create table t1(
  event_day date,
  event_time datetime,
  uid int,
  action string,
  page string,
  product_code string
  )
DUPLICATE KEY(`event_day`,`event_time`, `uid`)
PARTITION BY RANGE(event_day)
(
  PARTITION p20210401 VALUES LESS THAN ('2021-04-01'),
  PARTITION p20210501 VALUES LESS THAN ('2021-05-01')
)   
DISTRIBUTED BY HASH(`uid`) BUCKETS 3;

Step 2: Insert data into the table.

insert into t1 values('2021-04-03','2021-04-03 10:01:30',274649163,'click','shopping_cart','MDS');
insert into t1 values('2021-04-03','2021-04-03 10:04:30',274649163,'view','favorite_page','MDS');
insert into t1 values('2021-04-03','2021-04-03 10:03:30',274649164,'click','shopping_cart','MDS');
insert into t1 values('2021-04-03','2021-04-03 10:06:30',274649165,'click','shopping_cart','MMS');
insert into t1 values('2021-04-03','2021-04-03 10:08:30',274649164,'view','favorite_page','MDS');
insert into t1 values('2021-04-03','2021-04-03 10:09:30',274649165,'view','shopping_cart','MDS');
mysql> select * from t1;
+-------------+----------------------------+--------------+--------+-----------+---------------+
| event_day |     event_time      |    uid    | action | page   | product_code |
+-------------+----------------------------+--------------+--------+-----------+---------------+
| 2021-04-03 | 2021-04-03 10:01:30 | 274649163 | click | shopping_cart  | MDS      |
| 2021-04-03 | 2021-04-03 10:04:30 | 274649163 | view  | favorite_page  | MDS      |
| 2021-04-03 | 2021-04-03 10:03:30 | 274649164 | click | shopping_cart  | MDS      |
| 2021-04-03 | 2021-04-03 10:06:30 | 274649165 | click | shopping_cart  | MMS      |
| 2021-04-03 | 2021-04-03 10:08:30 | 274649164 | view  | favorite page  | MDS      |
| 2021-04-03 | 2021-04-03 10:09:30 | 274649165 | view  | shopping_cart  | MDS      |
+------------+---------------------------+---------------+-------+------------+--------------+

Step 3: Create a materialized view.

CREATE MATERIALIZED VIEW user_profile_view AS
SELECT event_day ,action, page , bitmap_union(to_bitmap(uid)) b_uid   
  FROM t1 
  GROUP BY event_day , action, page;

Step 4: Select target users.

WITH tbl_c AS (
  SELECT bitmap_union(to_bitmap(uid)) b_uid   
  FROM t1 
  WHERE event_day = '2021-04-03'
  AND action='click' and page= 'shopping_cart'
  GROUP BY action, page) , 
  tbl_v AS(
  SELECT bitmap_union(to_bitmap(uid)) b_uid   
  FROM t1 
  WHERE event_day = '2021-04-03'
  AND action='view' and page='favorite page'
  GROUP BY action, page) ,
  tbl_u AS(
  SELECT b_uid from tbl_c
  UNION ALL
  SELECT b_uid from tbl_v) 
SELECT 
  bitmap_count(bitmap_intersect(b_uid)) uid_ct,
  bitmap_to_string(bitmap_intersect(b_uid)) uid_list
FROM tbl_u
+--------+---------------------+
| uid_ct | uid_list            |
+--------+---------------------+
|   2    | 274649163,274649164 |
+--------+---------------------+

The bitmap_and and bitmap_andnot functions can be used to obtain the intersection of users with the "male" tag and users with the "high spenders" tag.

Step 1: Create a table.

create table user_tag(
 uid int,
 create_time date,
 Tag string)
DUPLICATE KEY(`uid`)
PARTITION BY RANGE(create_time)
(
  PARTITION p20210401 VALUES LESS THAN ('2021-04-01'),
  PARTITION p20210501 VALUES LESS THAN ('2021-05-01')
)   
DISTRIBUTED BY HASH(`uid`) BUCKETS 3;

Step 2: Insert data into the table.

insert into user_tag values(274649163, '2021-04-03','male');
insert into user_tag values(274649163, '2021-04-03','high spenders');
insert into user_tag values(274649164, '2021-04-03','female');
insert into user_tag values(274649165, '2021-04-03','male');
insert into user_tag values(274649164, '2021-04-03','high spenders');
insert into user_tag values(274649165, '2021-04-03','low spenders');
mysql> select * from user_tag;

+-----------+-------------+---------------+
| uid    | create_time | Tag      |
+-----------+-------------+---------------+
| 274649163 | 2021-04-03 | male     |
| 274649163 | 2021-04-03 | high spenders |
| 274649164 | 2021-04-03 | female    |
| 274649165 | 2021-04-03 | male     |
| 274649164 | 2021-04-03 | high spenders |
| 274649165 | 2021-04-03 | low spenders |
+-----------+-------------+---------------+

Step 3: Select target users.

with t1 as (
 SELECT bitmap_union(to_bitmap(uid)) b_uid , create_time
 FROM user_tag
 WHERE create_time = '2021-04-03' AND tag = 'high spenders'
 group by create_time
 ), t2 as (
 SELECT bitmap_union(to_bitmap(uid)) b_uid , create_time
 FROM user_tag
 WHERE create_time = '2021-04-03' AND tag = 'male'
 group by create_time 
) 
select  bitmap_to_string(bitmap_and(t1.b_uid, t2.b_uid)) as uid_select
from t1 join t2
on t1. create_time = t2. create_time;
+------------+
| uid_select |
+------------+
| 274649163 |
+------------+

StarRocks has been deployed and tested in the production environments of some enterprises. Practices show that StarRocks accelerates user profiling by more than 30% for an enterprise engaged in microblogging and social networking services.


Apache®, Apache HBase®,and its logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.