
Overcoming Neural Network Limitations: DCNs in High-Dimensional Ad Prediction Systems

There are two sides to our business at adjoe. We help

  1. applications that are trying to acquire new users
  2. applications that already have users and want to generate revenue by showing ads to those users.

A core challenge our Data Science Team tackles is optimizing ad relevance by predicting the most effective ad for each user. This involves using our 600 terabytes of user data and advanced modeling techniques to ensure accurate, real-time ad delivery while keeping the models fast enough to make more than 20 million predictions each day.

In this two-part series, I’ll explore the limitations of traditional neural networks and feature crossing and explain why we turned to deep and cross networks (DCNs) to get one generalized model that answers all of our needs accurately. The first article primarily explains the theoretical details of deep and cross networks, whereas the second focuses on how we leveraged their capabilities in the adtech industry.

Working with Neural Networks

Artificial neural networks dominate machine learning and AI, powering over 80% of applications, and for good reason. Built from simple, structured neurons, they can be stacked horizontally and vertically to form highly complex architectures capable of solving a wide range of problems. This is possible because the universal approximation theorem mathematically proves that even a network with a single hidden layer can, given enough neurons, approximate any continuous function to arbitrary accuracy. Pretty amazing, right?

Of course, there are some conditions to this, which I haven’t mentioned yet, but we’ll get to that soon.

[Figure: a ReLU neural network approximating a curve with many small linear segments. Source: Medium]

Looking at the illustration above, you can roughly see how neural networks estimate a function. They create many small linear approximations (if you’re using ReLU as an activation function). These small lines approximate any function you want, and as you increase the number of neurons, the accuracy improves.

This illustration shows the case in which you have only one input feature, but neural networks work similarly for higher-dimensional inputs. In that scenario, the lines become planes (or hyperplanes).

Training Simple Neural Networks 

Now, let’s see how neural networks work on a simple problem: x². 

Here I trained a simple neural network with fewer than ten neurons, and you can see that it performs quite well – especially for such a small number of neurons compared to what’s typically used in the industry, where networks often have millions of neurons to solve complex problems.

This is pretty impressive. However, when we move outside the boundaries of the training data, things change.

Within the range of zero to ten, our training data bounds, the network performs perfectly. However, once we move beyond these boundaries, you can see that the prediction starts to diverge from reality. 

This is one of the conditions of the universal approximation theorem I hinted at earlier: it only guarantees this level of performance within the bounds of the training data and says nothing about what happens beyond them.
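
If you want to reproduce this behavior yourself, here is a minimal sketch using tf.keras (the number of neurons, the training range, and the epoch count are arbitrary choices for illustration; exact outputs will vary with initialization and training):

    import numpy as np
    import tensorflow as tf

    # Training data: x in [0, 10], target is x^2
    x_train = np.linspace(0, 10, 1000).reshape(-1, 1)
    y_train = x_train ** 2

    # A tiny feed-forward network with fewer than ten hidden neurons
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(1,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x_train, y_train, epochs=500, verbose=0)

    # Inside the training range the fit is good ...
    print(model.predict(np.array([[5.0]])))   # roughly 25
    # ... but outside it the prediction diverges from x^2
    print(model.predict(np.array([[20.0]])))  # far from 400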

If you’re familiar with neural networks, you might be thinking: “Isn’t this why we use feature crossing?” That’s true. 

Why Is Automatic Feature Crossing Not Always the Answer?

For those who may not know, feature crossing is quite simple. You take your raw input features and create more complex ones. For example, you can square a feature, multiply two features together, or divide one by the other. You then use these derived features as inputs instead of the raw ones. So, here, you could take x² as a feature.
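
As a toy illustration (plain NumPy, with made-up feature names), manual feature crossing is nothing more than deriving new columns before training:

    import numpy as np

    # Two hypothetical raw features
    x = np.random.rand(1000, 1)
    y = np.random.rand(1000, 1)

    # Hand-crafted crosses used as additional model inputs
    features = np.hstack([
        x,               # raw feature
        y,               # raw feature
        x ** 2,          # squared feature
        x * y,           # product of two features
        x / (y + 1e-8),  # ratio, with a small constant to avoid division by zero
    ])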

The problem would be solved easily, right? Yes, but there are some issues with feature crossing. 

  1. You need to know that this relationship exists. If you don’t know that the target you’re trying to predict is a function of x², then you have no way of creating that feature beforehand.
  2. The relationship could involve a combination of multiple features. For instance, there could be five different features that need to be multiplied together, or it might involve an even more complex function. Only then can you have a meaningful feature that correlates with your target.
  3. There could also be higher-degree functions involved. For example, you might have x² * y³ / z or any other complex combination.

This becomes practically impossible to manage manually. And this is where deep and cross networks come into play.

What Are Deep and Cross Networks?

Let’s start by looking at a simple neural network architecture, which I’m sure you’re familiar with.

We have our input features, followed by a matrix of trainable weights that is multiplied by the input features, plus a bias term. After this, we apply an activation function to introduce nonlinearity. This is the typical feed-forward neural-network architecture.
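
Written out, each layer computes h = activation(W · x + b), where W is the weight matrix, x is the layer’s input, and b is the bias.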

[Figure: a standard feed-forward layer compared with a cross layer. Source: arXiv]


Now, when we move to deep and cross networks, you’ll notice that in the middle, the architecture looks similar to a normal neural network. However, instead of just applying the activation function at this stage, we multiply everything by the input layer, and then we add the previous layer to it before producing the output.
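
In the same notation (this is the DCN-V2 formulation from the papers listed in the sources), a cross layer computes x_{l+1} = x_0 ⊙ (W_l · x_l + b_l) + x_l, where x_0 is the original input vector, x_l is the output of the previous layer, and ⊙ denotes element-wise multiplication.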

Let’s break this down. 

Imagine the weight matrix here is the identity matrix, so we have ones along the diagonal and zeros elsewhere. For simplicity, let’s also ignore the bias and the added input. 

What we’d have is x (our input) multiplied by W · x. Since the bias is zero and the weight matrix is the identity, x is multiplied by itself, and we end up with x², meaning each feature is squared.

This is exactly what we were looking for in the example above. But this structure isn’t limited to just this – you can create any linear combination of features using the weight matrix, multiply it by the inputs, and generate any feature crossing you need. Plus, you can stack multiple layers of these on top of each other to create even higher dimensions.
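
You can check the identity-matrix intuition in a few lines of NumPy (a toy sketch; in a real model, W and b are of course learned):

    import numpy as np

    x0 = np.array([2.0, 3.0, 5.0])  # input features
    W = np.eye(3)                   # identity weight matrix
    b = np.zeros(3)                 # bias set to zero, as in the example above

    # Cross term without the residual connection: x0 * (W @ x0 + b)
    cross = x0 * (W @ x0 + b)
    print(cross)  # [ 4.  9. 25.]  (each feature multiplied by itself)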

How Would Deep and Cross Layers Look in a Model?

Cross layers create the feature crosses for you, but you can’t use them by themselves; you need a full structure around them. Let’s take a look.

[Figure: the stacked and parallel deep and cross network architectures. Source: arXiv]

These two are the most common structures used for deep and cross networks, and here’s where the “deep” comes into play.

Let’s start by looking at the stacked version. First, we have the embedding layer. 

We need embedding because, in many tasks, a lot of our features are sparse. We often use one-hot encoding, which creates sparse features that are mostly zeros. When those zeros are multiplied by other values, the result is still zero, so the crosses carry no information.

On top of that, without embedding the one-hot encoded features, the amount of computation required would increase significantly.

So, the first thing we do is create embedding layers to turn those sparse features into dense ones. If you already have dense features, you can concatenate them here.
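
In tf.keras, this step could look roughly like the following (the feature names, vocabulary size, and embedding dimension are made-up values for illustration):

    import tensorflow as tf

    # Hypothetical sparse categorical feature: an app category ID with 1,000 possible values
    category_id = tf.keras.Input(shape=(1,), dtype="int32")
    embedded = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)(category_id)
    embedded = tf.keras.layers.Flatten()(embedded)  # now a dense 16-dimensional vector

    # Features that are already dense are simply concatenated alongside
    dense_features = tf.keras.Input(shape=(8,))
    model_input = tf.keras.layers.Concatenate()([embedded, dense_features])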

Next, we move on to the cross layers. We stack several of them on top of each other, and their output gives us the feature crossings we need. After that, we proceed into a typical feed-forward neural network.

Now, for the parallel version. The embedding layer remains the same, but the output goes simultaneously into both the cross network and the deep network. Once we get outputs from both networks, we concatenate them and pass them through one final layer to calculate the result we want.
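
To make the two wirings concrete, here is a minimal tf.keras sketch. The cross layer simply re-implements the formula from earlier; the layer sizes, the sigmoid output, and the omission of the embedding step are my own simplifications for illustration, not a description of our production setup:

    import tensorflow as tf

    class CrossLayer(tf.keras.layers.Layer):
        # One cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l
        def build(self, input_shape):
            dim = input_shape[0][-1]
            self.w = self.add_weight(shape=(dim, dim), initializer="glorot_uniform", name="w")
            self.b = self.add_weight(shape=(dim,), initializer="zeros", name="b")

        def call(self, inputs):
            x0, xl = inputs
            return x0 * (tf.matmul(xl, self.w) + self.b) + xl

    def stacked_dcn(num_features, num_cross_layers=2, deep_units=(64, 32)):
        x0 = tf.keras.Input(shape=(num_features,))  # dense (already embedded) features
        x = x0
        for _ in range(num_cross_layers):           # cross network first ...
            x = CrossLayer()([x0, x])
        for units in deep_units:                    # ... then the deep network on top
            x = tf.keras.layers.Dense(units, activation="relu")(x)
        out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        return tf.keras.Model(x0, out)

    def parallel_dcn(num_features, num_cross_layers=2, deep_units=(64, 32)):
        x0 = tf.keras.Input(shape=(num_features,))
        cross = x0
        for _ in range(num_cross_layers):           # cross network ...
            cross = CrossLayer()([x0, cross])
        deep = x0
        for units in deep_units:                    # ... and deep network side by side
            deep = tf.keras.layers.Dense(units, activation="relu")(deep)
        merged = tf.keras.layers.Concatenate()([cross, deep])
        out = tf.keras.layers.Dense(1, activation="sigmoid")(merged)
        return tf.keras.Model(x0, out)

Training either variant is then the usual model.compile(...) and model.fit(...) flow.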

Which Architecture Should I Use?

When it comes to comparing the stacked and parallel architectures – as far as I know from reading the papers and experimenting myself – there is no way to know in advance which one will work best for your specific case. So, if you want to use them, just make sure to try both and see which one performs better for you.

The only notable difference is that in the stacked version, after you’ve trained your network, you can take the weights from those layers and visualize them. This is because everything passes through the cross layers. 

This allows you to see what your model is doing and which features are being combined. Thus, this structure offers slightly better interpretability. Other than that, I recommend trying both approaches.

How Does This Work in Practice? 

Let’s see how a deep and cross network (DCN) performs with the problem we discussed earlier.

What you see here is the result of training a model with a single cross layer that has only two trainable weights. The predictions were identical to the true values – so much so that I had to separate them into different plots. The training data was also the same as for the previous model: I used only the range from zero to ten for training.

As you can see, the prediction is accurate throughout this range and also outside this range. 
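
To see why two weights are enough: with a one-dimensional input, a single cross layer computes x_1 = x_0 · (w · x_0 + b) + x_0 = w · x_0² + (b + 1) · x_0. With w = 1 and b = -1, that is exactly x², and because the square is built into the formula itself rather than pieced together from linear segments, the fit also holds outside the training range.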

Of course, we designed this example to demonstrate the power of these networks; to truly understand their capabilities, we need to tackle a real-world problem. In the next article, I’ll discuss how we implemented DCNs at adjoe to solve real-world challenges, improve ad relevance, and boost performance metrics – read it here.

Sources:

  1. Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & Cross Network for Ad Click Predictions. Stanford University and Google Inc.
  2. Ruoxi Wang, Rakesh Shivanna, Derek Z. Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed H. Chi. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. Google Inc.
  3. Fangye Wang, Tun Lu, Hansu Gu, Dongsheng Li, Peng Zhang, and Ning Gu. Towards Deeper, Lighter and Interpretable Cross Network for CTR Prediction. Fudan University, Microsoft Research Asia, and Seattle.
  4. Honghao Li, Hanwei Li, Yiwen Zhang, Lei Sang, Yi Zhang, and Jieming Zhu. DCNv3: Towards Next Generation Deep Cross Network for Click-Through Rate Prediction. Anhui University and Huawei Noah’s Ark Lab.
