Data Ingestion of Advertiser for Tencent Ads and Retail

2023｜Type: Platform｜Tag: Design for Efficiency｜Role: Product, UX Design

Background

Tencent Marketing runs multiple data ingestion channels across two major areas, Tencent Ads and Tencent Smart Retail. This has cost a great deal of work and time in communicating requirements, governing data, and guiding advertisers to re-ingest data.

View the project on Slides. (These slides come from an impromptu talk I gave when problems were discovered during the delivery phase. It is a little cringeworthy, but we cannot throw confetti for ourselves every time. Face it, and record it for sure.)

What is data ingestion?

Over 90% of ad spend goes to oCPA ads, which depend heavily on the data posted back by advertisers. In the 30 days up to 17 March 2023, Zhishu (formerly known as DMP) ingested nearly 200 billion pageviews. So what exactly is first-party data ingestion?

Advertisers, based on their needs, implement event tracking on the front end and/or data acquisition on the back end. The data is reported to the server under an agreed protocol, and this transmission process is called data ingestion. Common ingestion methods include JS and SDK data acquisition, API endpoints, and file uploads, all of which the server then parses.

A complete data ingestion can be as short as two to three months or as long as six months.

What kind of data is ingested?

The data ingested mainly includes user-generated data from various scenarios, such as apps and WeChat Mini-Programs. In both ads and retail, we uniformly call the objects described by the data entities. Users, orders, goods, and so on are all entities. Once collected, entities constitute the data assets.

Ads

User Action Data: User Action Data refers to the behaviours of a user that occur in apps, the WeChat ecosystem (WeChat Mini-Programs, WeChat Official Account, WeChat Mini-Games), web pages, offline and other scenarios, and consists of a user identification + action type + timestamp of the action + action parameters (optional).
User Attribute Data: User attribute data describes a person's attributes, such as the user's gender, age, membership level, et cetera.
Audience Files: An asset formed by uploading the files used for ad delivery. It is identified by ID in CSV format, encrypted and uploaded to the data management platform.
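
As a minimal sketch, the four-part structure of a user action record described above (user identification, action type, timestamp, optional parameters) could be modelled as follows. The class name `UserAction`, its field names, and the example values are assumptions for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class UserAction:
    """Hypothetical shape of one user action record (all names are illustrative)."""
    user_id: str        # user identification
    action_type: str    # action type, e.g. "PURCHASE" (value is assumed)
    action_time: int    # timestamp of the action, as Unix seconds
    action_params: Dict[str, Any] = field(default_factory=dict)  # optional parameters


def is_well_formed(action: UserAction) -> bool:
    """Check that the three mandatory parts are present and plausible."""
    return bool(action.user_id) and bool(action.action_type) and action.action_time > 0


example = UserAction("hashed-id-001", "PURCHASE", 1679011200, {"value": 9900})
```

Keeping the parameters in an optional dictionary mirrors the text: the first three parts are always required, while the fourth varies by action.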

Retail

User Action Data: Refers to the behaviours of a user that occur in the WeChat ecosystem, and consists of user identification + action type + timestamp of the action + action object.
User Attribute Data: Describes the attributes of a WeChat user in retail, including basic personal information, member information, sales associate labels, consumer status and so on.
Order Data: Includes sales order and refund order.
Product Data: Includes SKU, SPU, sales information, category, grouped products, product information, and product sentiment.
Card and Coupon Data: Includes card and coupon information, redeem information, and verification information.
Store Data: Includes seller information and warehouse information.
WeChat Official Account: Includes follower count, follower summary table, article table, and follower profile.
WeChat Work Group Data: Includes WeChat Work member information, group information, group chat statistics, and customer reach statistics.

What is the First-Party Data Ingestion

Discover Problems

Feedback came in from both ads and retail: the Vice President of 37GAMES said that data ingestion was ineffective, and our retail clients were overwhelmed by the communication involved. What is the problem with data ingestion? The designers guided the product managers and operators to sort out the problems in ads and retail before, during, and after ingestion.

Data Ingestion in Ads

In ads, during the cost-per-click era, the platform did not rely heavily on data ingestion because it already held the click data. With the evolution of oCPX, optimisation goals have become deeper and deeper, and they rely more and more on the action data posted back by advertisers. After processing, the conversion data is used in downstream applications, including audience targeting, scoring, ranking, oCPX, attribution, ad delivery reports, and so on.

For advertisers, there are four touchpoints for data ingestion, namely, Audience Files and Data Sources in Zhishu, Attribution in the Ad Delivery Platform, and API endpoints.

Audience Files: Uploaded to Zhishu in CSV format and used by advertisers to bid on specified audiences, either for price adjustment or for targeting in ad delivery. They include single-column audience files, which contain only an ID column, and multi-column audience files, which carry additional columns besides the ID column for downstream reading and use.

Data Sources: A data source uses JS, the SDK, or API endpoints to post conversion data back to the ads side. The statistics below cover 27 September 2022 to 26 October 2022.

Nearly 90% of the data is posted back through the API endpoints, and more than 10% through the SDK.
With the API endpoints, more than 80% of the data is from the website.
With the SDK, more than 60% of the data comes from Android Apps, and nearly 40% of the data comes from iOS Apps.

Ingestion Steps:

Register as a developer.
Create a data source.
Select the ingestion method, and load the data.
- With JS: full code (write code for both the base part and the action part), low code (write code for the base part, visually configure the action part), no code (visually configure everything).
- With SDK: ingest using the data source ID and the data ingestion key.
- With API: create an application, complete authorisation and authentication, and use the API endpoints.
View the posted-back data.

The data within a data source has many uses, one of which is attribution. The attributed data is then used across various machine learning tasks. We can use attribution as an example to analyse the data flow.

The operational path for the old version of attribution was:

Advertisers register as developers through the Marketing API and obtain the keys used for building the integration.
Advertisers create click detection through the ad delivery platform, and Tencent Ads transmits the click logs back to the advertisers.
Advertisers complete self-attribution through first-click attribution or multi-channel attribution (only available for App promotion) and check the conversion data via Tencent Ads.
Advertisers create a data source through Zhishu, post the filtered conversion data back to Tencent Ads, and finish the job.

As the advertising platform moved towards more refined operations, demands diverged across industries, and the data gradually came to meet the requirements the platform suggested. Cooperation with KA (key account) customers has gradually developed into posting back all-channel, all-path data. Against this background, the old version of attribution has the following problems:

Broken experience. Data ingestion for attribution requires collaboration across multiple platforms.
Restricted fields. There is no flexibility to post back valid data as required by various models.
Error-prone. Strings may be omitted, surplus, or wrong, et cetera.

Solving these problems requires more information, such as optimisation goals and deep optimisation goals. It also pushes the oCPX ad delivery platform from "campaign + optimisation goal + whether to turn on deep optimisation goal" to "campaign + industry path + optimisation goal".

The new version of attribution was born. It is a flexible one-stop solution in which advertisers add a click detection link when they create attribution rules (selecting a combination of optimisation goals under an industry path) in the ad delivery platform. This works by agreeing on a string with a specified format that marks the position of a field to be replaced later. This format is collectively known as a "macro" (click to see the introduction).
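
To illustrate the macro idea, here is a minimal sketch of placeholder substitution in a click detection link. The `__NAME__` placeholder syntax, the macro names, the example URL, and the `fill_macros` helper are all assumptions made for illustration. The platform's real macro list lives in its own documentation.

```python
import re
from urllib.parse import quote

# Hypothetical placeholder syntax: capital letters wrapped in double underscores.
MACRO_PATTERN = re.compile(r"__([A-Z]+(?:_[A-Z]+)*)__")


def fill_macros(template: str, values: dict) -> str:
    """Replace each known macro with its URL-encoded value; leave unknown macros untouched."""
    def substitute(match):
        name = match.group(1)
        if name in values:
            return quote(str(values[name]), safe="")
        return match.group(0)  # an unknown macro stays visible so it can be spotted

    return MACRO_PATTERN.sub(substitute, template)


# Illustrative template and values, not real platform data.
template = "https://advertiser.example/track?click_id=__CLICK_ID__&ts=__CLICK_TIME__"
link = fill_macros(template, {"CLICK_ID": "abc 123", "CLICK_TIME": 1667433600})
```

Because the advertiser only agrees on positions rather than on fixed fields, new fields can be posted back by adding a placeholder instead of changing the integration.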

At the same time, as of 03 November 2022, 67.31% of customers were still using the old version of attribution, accounting for 77.52% of the cost. The old version therefore still needs to be supported. Attribution is complex enough already, and designing the new version alongside the old one is a further challenge.

Advertisers in various industries post back conversion data to the platform as oCPX ads require, and what matters most here is the accuracy of the data.

Data ingestion in Ads

Data Ingestion in Retail

This refers to smart retail within the WeChat ecosystem. Retail is positioned around business analysis and many of its applications run on the cloud, so clients need to ingest data themselves to see their live business in real time. Clients open accounts or purchase cloud applications according to their own needs. This gives retail data ingestion its biggest advantage: one-time ingestion for multiple uses. To satisfy all applications at once, almost all retail applications follow the requirements of Youshu, except for a few scenarios such as Cloud Alliance and Preferred Alliance. However, Youshu has several very strict requirements that other, simpler applications do not need, resulting in labour-intensive ingestion whose output often cannot be used. In addition, retail data ingestion offers some helpful features, such as a log query that lets clients see what each pageview looks like, which is very friendly to them.

Ingestion Steps:

Apply for data ingestion service.
Get the account information and a test token, then send test data and production data into the test environment.
Obtain the production token, then send production data into the production environment.
Activate the application and validate.

Data ingestion in Retail

Define Problems

For the External Clients

Advertisers and retailers have the following issues with the ingestion process:

Data ingestion is not uniform.

Differences between the two domains in the ingestion process and its standards have led to web-like communication, repeated postbacks, and non-standard postbacks by clients and agencies when working with Tencent. The direct results are heavy communication overhead, poor efficiency, and poor data quality in data ingestion.

For example, when a customer builds an application, a demand arises for data ingestion at Tencent. According to incomplete statistics, we have close to 200 documents. When a customer using capability A wants to use capability B, they may have to work from the A documents over to the B documents. These documents, and their fields, may partially overlap, which can confuse the customer's development team. If the customer has multiple teams, this creates web-like communication and a lot of repeated, inefficient data ingestion.

Data management is inadequate.

Data ingestion is indirect, which makes it hard to work efficiently across multiple channels. Data is managed in scattered places and field standards differ. As a result, clients and agencies have a poor experience managing data.

For example, take the ROAS lift-based strategy. Before 2021, there was a gap of nearly 70% between two statistical results. The causes included differences between WeChat and non-WeChat traffic in parent-child ad IDs, differences in statistical standards, differences in the meaning of the time field in the intermediate table, differences in the statistical objects, missing data, and other problems.

For the Internal Data Governance Teams

In data governance, both ads and retail struggled to meet the current demands of the business in terms of validation:

Validation in ads is weak.

There is only a simple engineering validation, with no record of the reason for failure and no traceability. There is no clear logic for handling repeated postbacks, and what logic exists is not reflected in the documents. The validation is also incomplete and unclear.

For example, the data ingested by the platform at the beginning fell mainly into three types: attribution data provided for optimisation goals, audience file data collected for targeting, and action data loaded for data insights. Historically, these data were governed and managed by different teams, which built many silos. To reach agreement on the effects, industry operations negotiated with advertisers. At first everyone posted back action-type data, and when that was not enough, they added more inside the attributes. This resulted in poor-quality data being posted back by advertisers.

Validation in retail is strong.

Because products, platforms, and services in retail each hold their own data, the maturity and automation of each branch differ widely. To improve the efficiency of data governance, the most stringent standard was taken as the baseline, but in practice this resulted in high-input, low-output customer service.

For example, apart from a few scenarios such as Cloud Alliance and Preferred Alliance, nearly all retail applications post back data according to the requirements of Youshu.

For the Internal Data Application Teams

Data in ads is used deeply, while data in retail is used widely.

There are a lot of barriers to conversion paths.

Each conversion path processes data differently and passes through many processing stages, which gives rise to four major problems: less, more, wrong, and slower.

For example, problems in some stages lead to less data, retries lead to more data, format conversions that tamper with the original data lead to wrong data, and more stages lead to slower distribution.

There is a lot of waste of data resources.

Inefficient consumption. Each data-consuming side takes the data and makes a copy and cold backup, resulting in a waste of resources.

For example, in the past, federated modelling, attribution, and deep cooperation were three separate data flows, so three copies had to be stored. These were then merged into two, and are now being consolidated into one.

Downstream applications struggle to fully understand the data.

Downstream application parties have difficulty understanding the data being ingested. The requirements for data are different for each application side. Especially in machine learning, much of the data for training and prediction is lost, which limits the value gained from applying conversion data.

For example, take inconsistent understanding. In the advertising system, attribution, models, strategies, et cetera each have their own understanding and usage of the data on the consuming side. With inconsistent goals, it is easy to fall into a Simpson's paradox: "Good individually, bad together."

These types of problems can be further abstracted into:

Data ingestion quality issues
Data ingestion efficiency issues

Objectives

Based on these three types of users, we set out to explore and build a new full-domain experience for data ingestion and application distribution. It meets the ads requirements for deep application and more accurate, real-time data, and it also meets the retail requirements for rich and multi-purpose applications. Using ads and retail data as a bridge, it creates an all-domain marketing data assistant that connects public and private traffic for clients.

Client Objectives

Efficiently complete high-quality data ingestion on a single Tencent platform, following the data standard of a specific industry, and serve multiple applications with a one-time data ingestion. Clients can serve up to 22 applications with a one-time data ingestion.

Business Objectives

By unifying ads and retail, integrating the requirements and standards of multi-channel, multi-application and multi-industry data ingestion, and unifying the ETL services and application distribution, it provides customers with a one-stop data ingestion and data management platform. It truly closes the positive cycle from data ingestion to data marketing.

Design Objectives

Build a unified data ingestion and data management platform.
Improve data ingestion efficiency and quality.

Solution

First Objective: Unified Data Ingestion and Asset Management

First, in the early days of the team merger, when everyone was unfamiliar with each other, we ran a design sprint to guide the ads side and the retail side to quickly reach a consensus and come up with an information architecture.

What are assets: raw assets are the data as originally ingested, and usable assets are the data after cleaning.
What entities are described by assets: users, products, orders, cards and coupons, stores, WeChat Official Account, WeChat Work Group, et cetera.
The forms of assets: data sources and audience files. Data sources need to be developed separately, while audience files only need to be uploaded.
What are applications: specific capabilities, features, and services within a Tencent product.

Information Architecture

Second, we worked out a plan to promote ingestion through application hooks. After collecting the applications of the two domains, we categorised them from the public domain to the private domain, by marketing purpose and by the brands and touchpoints that users perceive.

How to organise information

We then designed application selection. Bringing it into the process means that the tasks are based on whichever applications you are going to use.

Application Selecting

There is also the concept of "one ingestion, multiple applications". For instance, if data has already been ingested and distributed for two applications and it happens to satisfy a third, there is no need to re-ingest for that one.

One-Time Ingestion for Multiple Uses

Third, in data management, we first have to help customers work out where to look at their data assets, that is, the relationship among all of the platforms. Then we solve the management problem. Previously, only assets were managed, with no notion of applications or distribution, because there was one and only one application. Now there is a one-on-one relationship between assets and applications, to achieve precise authorisation and distribution.

We provide users with first-time guidance for jumping between platforms, convey the value proposition of each platform, switch all the platforms during the dark launch, clarify the differences and value of each platform, and support future jumps to Youshu.

Jumps between Platforms

Establish one-on-one relationships between data sources or audience files and applications to improve efficiency through filtering and batch operations. The object of distribution and authorisation is no longer the data source or audience file itself, but the precise authorisation and precise distribution for <one data source, one application> and <one file, one application>.

Authorisation and Distribution
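
One way to picture the <one asset, one application> pairing is a registry keyed by pairs rather than by assets alone. This is only a sketch under assumed names: `DistributionRegistry`, its methods, and the identifiers in the example are hypothetical and do not describe the platform's implementation.

```python
from typing import Set, Tuple

# A grant is the pair <asset, application>, not the asset on its own.
Grant = Tuple[str, str]


class DistributionRegistry:
    """Hypothetical registry of which asset is authorised for which application."""

    def __init__(self) -> None:
        self._grants: Set[Grant] = set()

    def authorise(self, asset_id: str, application: str) -> None:
        self._grants.add((asset_id, application))

    def revoke(self, asset_id: str, application: str) -> None:
        self._grants.discard((asset_id, application))

    def is_authorised(self, asset_id: str, application: str) -> bool:
        return (asset_id, application) in self._grants

    def applications_for(self, asset_id: str) -> Set[str]:
        """Filter grants by asset, for example to list one asset's applications."""
        return {app for asset, app in self._grants if asset == asset_id}


registry = DistributionRegistry()
registry.authorise("data_source_1", "attribution")
registry.authorise("data_source_1", "insights")
registry.revoke("data_source_1", "insights")
```

Revoking one pair leaves the asset's other grants untouched, which is the practical benefit of authorising per pair.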

Second Objective: Ingestion Efficiency

Fewer steps, fewer actions, shorter duration, and fewer repeated fields are achieved by quantifying ingestion fields and ingestion steps.

First of all, we take the intersection of the ingestion methods and the union of the fields. In this way, the fewest ingestion methods can satisfy the most required fields.

Intersection of Ingestion Methods, Union of Fields
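
The rule above can be sketched with plain set operations: intersect the ingestion methods that every selected application accepts, and combine the fields they need into one set union. The `AppRequirement` type, the `plan_ingestion` helper, and the application and field names are assumptions for illustration only.

```python
from typing import Dict, FrozenSet, NamedTuple, Set, Tuple


class AppRequirement(NamedTuple):
    methods: FrozenSet[str]  # ingestion methods the application accepts
    fields: FrozenSet[str]   # fields the application needs


def plan_ingestion(selected: Dict[str, AppRequirement]) -> Tuple[Set[str], Set[str]]:
    """Return (methods usable by every selected application, all fields to ingest once)."""
    if not selected:
        return set(), set()
    requirements = list(selected.values())
    common_methods: Set[str] = set(requirements[0].methods)
    all_fields: Set[str] = set()
    for req in requirements:
        common_methods &= req.methods  # intersection of ingestion methods
        all_fields |= req.fields       # union of required fields
    return common_methods, all_fields


# Hypothetical applications and fields.
selected = {
    "app_a": AppRequirement(frozenset({"API", "SDK"}), frozenset({"user_id", "action_type"})),
    "app_b": AppRequirement(frozenset({"API", "JS"}), frozenset({"user_id", "order_id"})),
}
methods, fields = plan_ingestion(selected)
```

In this example the client only needs the API method, and the shared `user_id` field is ingested once rather than once per application.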

Secondly, to avoid carrying one large package through testing and spending 1,000 CNY on validation, we split the test into matrix-like steps. No test data goes into the production environment, which avoids data contamination.

Definition: Ingestion efficiency = Ingestion Data Volume / Ingestion Duration
Ingestion Duration can be broken down by ingestion step to locate the problem in each stage.
Ingestion Data Volume is the average daily ingestion data volume after the completion of data ingestion.
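
The definition above can be worked through in a few lines. The step names, the durations in days, and the volume figure are hypothetical, since the source does not give units or sample values.

```python
from typing import Dict, Tuple


def ingestion_efficiency(avg_daily_volume: float, step_durations: Dict[str, float]) -> float:
    """Ingestion efficiency = ingestion data volume / ingestion duration."""
    total_duration = sum(step_durations.values())
    if total_duration <= 0:
        raise ValueError("total ingestion duration must be positive")
    return avg_daily_volume / total_duration


def slowest_step(step_durations: Dict[str, float]) -> Tuple[str, float]:
    """Breaking the duration down by step shows where the time goes."""
    return max(step_durations.items(), key=lambda item: item[1])


# Hypothetical durations, in days, for one client's ingestion.
durations = {"apply": 2.0, "test": 10.0, "production": 3.0}
efficiency = ingestion_efficiency(3_000_000, durations)  # 3,000,000 / 15 days
bottleneck = slowest_step(durations)
```

Here the test step dominates the total duration, so shortening it raises the ratio more than improving any other step.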

Matrix-like Steps

Third Objective: Ingested Data Quality

The difficulty lies in optimising the core, which comprises the data model and the validation rules.

Through the practice of data governance on the DataCube Platform, the data team built and validated a four-part data model (UserInfo, ItemInfo, ActionInfo, and QualityInfo) based on the shuttle mechanism of internal experiments. QualityInfo is measured along the following dimensions:

Uniqueness: whether there are repeated records
Real-time: the length of time between an event occurring and its being written
Accuracy: whether the data is consistent with the standards of its entities
Completeness: measured by both row and column. Row completeness means no records are missing compared with the validation set, and column completeness means no fields are missing within the records
Consistency: whether an entity is consistent across different datasets
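
Three of these dimensions lend themselves to simple measurements. The sketch below is an assumption about how they might be computed, not the DataCube Platform's actual rules: the function names, the record fields `action_time` and `write_time`, and the sample records are all illustrative.

```python
from typing import Any, Dict, List, Sequence

Record = Dict[str, Any]


def uniqueness(records: List[Record], key_fields: Sequence[str]) -> float:
    """Share of records that are distinct on the given key (1.0 means no repeats)."""
    if not records:
        return 1.0
    distinct = {tuple(r.get(f) for f in key_fields) for r in records}
    return len(distinct) / len(records)


def mean_latency_seconds(records: List[Record]) -> float:
    """Real-time: average gap between the occurrence and the write."""
    if not records:
        return 0.0
    return sum(r["write_time"] - r["action_time"] for r in records) / len(records)


def column_completeness(records: List[Record], required: Sequence[str]) -> float:
    """Share of required field slots that are actually filled."""
    slots = len(records) * len(required)
    if slots == 0:
        return 1.0
    filled = sum(1 for r in records for f in required if r.get(f) not in (None, ""))
    return filled / slots


# Hypothetical sample: the second record is a retry of the first, the third lacks a type.
sample = [
    {"user_id": "u1", "action_type": "VIEW", "action_time": 100, "write_time": 105},
    {"user_id": "u1", "action_type": "VIEW", "action_time": 100, "write_time": 110},
    {"user_id": "u2", "action_type": "", "action_time": 200, "write_time": 230},
]
key = ("user_id", "action_type", "action_time")
```

Accuracy and consistency are left out because they depend on entity standards and cross-dataset comparisons that the text does not spell out.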

Through industry research, we have constructed brand-new sets of validation rules, including entity base rules, application rules, and industry rules. So far 184 sets have been sorted out, and more are being added in order of priority:

User Entity: (5 ads applications ✕ 27 Ops industries) + (13 retail applications ✕ all industries) = 148 sets
Merchandise Entity: (1 ad application + 10 retail applications) ✕ all industries = 11 sets
Card and Voucher Entity: 3 retail applications ✕ all industries = 13 sets
Store Entity: 9 retail applications ✕ all industries = 9 sets

What the designer can do is keep optimising and improving the documents where the field information lives. There are two main touchpoints:

First, there is a persistent button inside the project that opens an industry-customised document.
Second, in the document centre, the designer guides the product manager to deliver a new information architecture that centralises all the documents. Even now, two sets of documents still exist, one before and one after the login interface.

Documents by hand and Customised Fields by Industries

Effect

By the end of 2022, DataNexus covered 8,000+ clients, and data ingestion time in both ads and retail dropped from days to hours. On ingested data quality, a significant increase in GMV was also achieved. In addition, bringing the team together at the beginning as a designer also won the team's trust. The downside is that the full release, and the research that should follow it, have not yet been conducted.

Overall Effect

Last but not least

Whether it's data governance internally or data ingestion externally, whether it's human-centred or machine-centred, what we need to do is to create order out of chaos. I aim to be a designer who can guard traditional design values while also exploring their boundaries.