
How Does a Deep and Cross Network Enable Us to Replace Over 100 Individual Models 

Welcome to part two of our article on deep and cross networks.

In this section, I’ll share how we developed a single model that replaced over 100 individual models—enhancing accuracy, speed, and coverage while significantly reducing maintenance costs.

Our Model's Architecture

At adjoe, we process millions of predictions daily across 100+ countries. Managing separate models fine-tuned for different data slices eventually became unsustainable due to growing complexity. We needed a scalable, generalized solution—and that’s where deep and cross networks came into the picture.

Let's look in detail at how our user engagement prediction system previously worked before jumping into our new approach.

We used to have over 100 models based on gradient-boosted trees, each responsible for a specific slice of the data. This was necessary because each slice has its own unique properties: each ad targets a different audience, for example, and a single model built on a traditional architecture couldn't effectively capture the differences between these audiences. The tree-based models were chosen for their speed and efficiency.
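
To make this concrete, here is a minimal sketch of what such a per-slice setup can look like. The slice key, helper names, and the choice of LightGBM are hypothetical stand-ins, not our actual production code:

    from lightgbm import LGBMClassifier  # stand-in for any gradient-boosted tree library

    # One model per data slice, e.g. keyed by (country, ad_campaign).
    models = {}

    def train_slice_model(slice_key, X, y):
        model = LGBMClassifier()
        model.fit(X, y)
        models[slice_key] = model

    def predict_install_probability(slice_key, features):
        if slice_key not in models:
            # Cold start: a new slice has no model until enough data is gathered.
            raise KeyError(f"no model trained yet for slice {slice_key}")
        return models[slice_key].predict_proba([features])[0, 1]

Every entry in models is a separate artifact to train, deploy, and monitor, which is where the maintenance cost comes from.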

However, one challenge with this approach was that we needed sufficient data from each slice to train its model effectively. If a new data slice appeared, we couldn't make any predictions until enough data had been gathered. But at the same time, we couldn't create a single model to handle everything either.

Looking at Users and App Install Data

Let's analyze one example of our model's input features.

In this case, we aimed to estimate the value on the Y-axis, which represents the probability of installation (as a percentage). The X-axis shows the number of ads a user has been exposed to. As we can see, the likelihood of installation decreases as the number of ads viewed increases. This is intuitive: excessive ad exposure typically leads to diminishing interest, with users becoming less engaged and less likely to install the advertised app.

The blue line represents users who have never installed an app from an ad, while the red line indicates users who have previously installed one app by clicking on an ad. As shown, users with prior installs have nearly double the likelihood of installation at every point on the graph.

Furthermore, the likelihood of installation increases with each additional app installed, and the data reveals a clear pattern: after two installs the likelihood doubles, and after a third install it doubles once more. This is exactly the core issue discussed in my previous article (predicting x²).
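
Written out, the pattern the data suggests is roughly multiplicative (the doubling factor here is illustrative rather than a measured constant):

    p(install | k previous installs) ≈ 2^k × p(install | 0 previous installs)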

We can see that previous installs have a multiplicative effect on the installation likelihood, so users with three previous installs should have the highest chance of installing among all the groups shown. In the data, however, even at their peak, their likelihood is still lower than that of users with two previous installs.

We expect the purple line (three previous installs) to follow the dotted purple trajectory, but what the model takes from the data is that the green line (two previous installs) has the highest value.

This discrepancy arises because we don’t have enough data for users who have installed three apps. It’s clear that you can’t install three apps after only seeing two ads, even if you install every app shown. The data just doesn’t align. Additionally, it’s uncommon for a user to see three ads and install all of them. As a result, we need more time to collect enough data for this scenario.

If we had more data for users with three previous installs, their line would likely sit higher on the graph. With limited information, traditional models, including plain neural networks, fall back on simplifying assumptions, such as treating two installs as more valuable than three. This is a common pitfall of standard models, and it highlights why we needed to refine our approach. Deep and cross networks, with their automatic feature crossing, are designed to avoid exactly this mistake.
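
To make the mechanism concrete, here is a minimal sketch of a single cross layer in PyTorch, following the DCNv2 formulation (an illustrative reimplementation, not our production code):

    import torch
    import torch.nn as nn

    class CrossLayer(nn.Module):
        """One DCNv2-style cross layer: x_{l+1} = x0 * (W @ xl + b) + xl."""

        def __init__(self, dim: int):
            super().__init__()
            self.linear = nn.Linear(dim, dim)

        def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
            # The element-wise product with the original input x0 is what lets
            # the network represent multiplicative interactions explicitly,
            # such as previous installs scaling the install probability.
            return x0 * self.linear(xl) + xl

Stacking L of these layers gives the network access to feature interactions up to degree L + 1, which covers exactly the kind of x² pattern discussed in part one.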

Training Our Deep-and-Cross-Based Model

Now equipped with the power of deep and cross networks, we were no longer forced to maintain many different models, so we chose to train a single deep-and-cross-based model on all the data we had. By feeding the data slice into the model as an additional input feature, we created a model that could make predictions across all possible scenarios.
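
As an illustration, here is a hedged sketch of how the slice identifier can be folded into a single deep-and-cross model. Dimensions, layer sizes, and feature names are made up for the example; only the overall structure (an embedded slice feature feeding a cross tower and a deep tower in parallel) reflects the idea described above. It reuses the CrossLayer from the previous sketch:

    import torch
    import torch.nn as nn

    class DeepAndCrossModel(nn.Module):
        def __init__(self, num_slices: int, num_numeric: int,
                     slice_emb_dim: int = 8, num_cross_layers: int = 3):
            super().__init__()
            # The data slice (e.g. country x ad campaign) becomes a learned
            # embedding instead of a reason to train a separate model.
            self.slice_embedding = nn.Embedding(num_slices, slice_emb_dim)
            dim = slice_emb_dim + num_numeric
            self.cross_layers = nn.ModuleList(
                [CrossLayer(dim) for _ in range(num_cross_layers)]
            )
            self.deep = nn.Sequential(
                nn.Linear(dim, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
            )
            self.head = nn.Linear(dim + 64, 1)

        def forward(self, slice_ids: torch.Tensor, numeric: torch.Tensor):
            x0 = torch.cat([self.slice_embedding(slice_ids), numeric], dim=1)
            x = x0
            for layer in self.cross_layers:
                x = layer(x0, x)
            deep_out = self.deep(x0)
            logit = self.head(torch.cat([x, deep_out], dim=1))
            return torch.sigmoid(logit).squeeze(-1)  # install probability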

What benefits did this switch bring us?

  • This model considers all of the available data, leading to improved accuracy and better overall metrics. We saw an increase both in aggregated metrics and in the metrics for each individual slice.
  • Having a single model to maintain significantly reduced costs, both in terms of infrastructure and the time developers spent on maintenance.
  • With just one model, we could batch all our ads and requests together and process them in a single operation. This batching led to faster inference times (see the sketch after this list).
  • Previously, when we onboarded a new app advertiser, we couldn’t start making predictions right away due to the lack of data. However, because this model was trained on all available data and was designed to handle a wide range of scenarios, we could quickly start using it for any new case. The new model’s versatility allowed us to significantly increase coverage without any delays.
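
To illustrate the batching point, scoring every candidate ad for a request with the hypothetical model sketched earlier becomes a single forward pass over one stacked batch:

    # Assumes the DeepAndCrossModel sketch from above.
    model = DeepAndCrossModel(num_slices=200, num_numeric=16)
    model.eval()

    with torch.no_grad():
        slice_ids = torch.tensor([3, 3, 7, 12])   # one row per candidate ad
        numeric = torch.randn(4, 16)              # 16 numeric features per row
        probs = model(slice_ids, numeric)         # shape: (4,), one probability per ad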

Final Thoughts: What to Consider When Using DCNs

When you want to use DCNs, there are some key points to keep in mind.

  1. Don’t overdo it. While adding layers to deep and cross networks might seem tempting, more isn’t always better. The vanishing gradient problem, common in neural networks, is even more pronounced here. Based on both my experience and the literature, keeping the depth to three or four layers tends to give the best results. Adding more layers can actually hurt performance.
  2. These models are fast at making predictions, but the training process can be much slower. Expect potential delays during training.
  3. There are variations of DCN worth considering. What we’ve covered is based on DCNv2, but for better results, check out GDCN (gated deep and cross networks). A new version of the DCN paper (v3) has also been released, which I haven’t covered here—be sure to check it out.

If you are working with structured data, and especially if you have the feeling that your models could learn more from your data, deep and cross networks might be the solution for you.

Sources:

  1. Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang.
    Deep & Cross Network for Ad Click Predictions.
    Stanford University and Google Inc.
  2. Fangye Wang, Tun Lu, Hansu Gu, Dongsheng Li, Peng Zhang, and Ning Gu.
    Towards Deeper, Lighter and Interpretable Cross Network for CTR Prediction.
    Fudan University and Microsoft Research Asia.
  3. Honghao Li, Hanwei Li, Yiwen Zhang, Lei Sang, Yi Zhang, and Jieming Zhu.
    DCNv3: Towards Next Generation Deep Cross Network for Click-Through Rate Prediction.
    Anhui University and Huawei Noah's Ark Lab.
