Topic Modeling App Genres with NLP & Latent Dirichlet Allocation

I’m Björn-Elmar Macek and I’ve been working as a data scientist at adjoe for seven years. 

Alongside data engineering and carrying out analyses, my main focus is on creating models that help us better understand our users. One of our projects involved topic modeling with NLP – specifically, Latent Dirichlet Allocation. I’ll demonstrate how we did this by:

  • giving you a brief introduction to our adtech product
  • explaining why we need certain app usage information from our users
  • taking you through the topic modeling process – from preparation to our findings

What Is Playtime and Its App Usage Data?

Let’s start with an introduction to how our Playtime product works. It’s a rewarded ad unit that mobile users choose to engage with. It serves (mostly) gaming ads that we consider to be of interest to these users. 

Since we reward users for their engagement (based on time spent in the app or levels reached), our users need to accept permissions to benefit from our rewarding mechanism. These permissions allow us to gain deeper insights into their likes and dislikes, which can in turn help us to serve them ads for games they would enjoy.

Why Is This App Usage Information Important?

Think of app usage information as a list of IDs that Google uses to identify apps. This list is enriched with additional information, such as a timestamp indicating when each app was last used. WhatsApp Messenger, for example, has the ID com.whatsapp; YouTube’s ID is com.google.android.youtube.
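
To make this concrete, a single entry of such a list might look like the following. This is a purely illustrative shape – the field names here are assumptions, not our actual schema:

app_usage = [
    {"app_id": "com.whatsapp", "last_used": "2024-03-01T18:22:00Z"},
    {"app_id": "com.google.android.youtube", "last_used": "2024-03-02T09:10:00Z"},
]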

These app lists work as a kind of fingerprint; they can help you understand your users’ preferences. To do this, it’s important for us to be able to group a user’s existing applications into categories – such as app genre. You can scrape this and other information, such as the description from the app store, using libraries like Google-Play-Scraper. But even though the app store provides this genre information, its genres often contain a wide range of apps that are quite different from one another.
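
For illustration, pulling an app’s store metadata with the google-play-scraper library could look like this (the exact result keys may vary between library versions):

from google_play_scraper import app

# fetch store metadata for a single app by its package ID
meta = app("com.whatsapp", lang="en", country="us")
print(meta["genre"])        # the store's genre label
print(meta["description"])  # the description text we later feed into topic modeling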

Take the lifestyle genre, for instance. It contains the following three entries, which have little in common:

  • Kasa Smart allows you to configure and control your smart home devices
  • Pinterest allows users to post, browse, and pin images to their own boards
  • H&M is a shopping app for clothes and accessories

To improve how we categorize apps, we decided to use app descriptions from the Google Play Store for more granular classification. This is when we decided to use natural language processing with Python.

Preparing Data before Topic Modeling

Before our team started topic modeling, we first had to clean the data. We wanted to focus on English, so we filtered out all rows of our scraped data that contained app descriptions written in another language. We used polyglot for language detection and applied it to the description column (descr).

import pandas as pd
import regex
from polyglot.detect import Detector

# remove control and surrogate characters that can break language detection
def remove_bad_chars(text):
    return regex.compile(r"\p{Cc}|\p{Cs}").sub("", text)

# detect the (most likely) language of a text
def detectLang(x):
    languages = Detector(remove_bad_chars(x), quiet=True).languages
    if len(languages) == 0:
        return "__"
    max_conf = max([lan.confidence for lan in languages])
    lang = [lan.code for lan in languages if lan.confidence == max_conf][0]
    return lang

data = pd.read_json("playstore_raw.json", lines=True)
data["lang"] = data.descr.apply(lambda x: detectLang(x))
data_en = data[data.lang == "en"]
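
A quick sanity check of the detector on a single string (the output is model-dependent, but a sentence like this should come back as English):

# illustrative check: should return the ISO code "en"
print(detectLang("This game lets you build and manage your own city"))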

At this point, we had plain English text and had eliminated special characters, such as smiley faces. Moving forward, we decided to keep only nouns, proper nouns, and verbs – and to get rid of all other word types, such as adjectives, adverbs, auxiliary words, and numbers.

Keeping other word types might, of course, have been beneficial (depending on the use case), but we decided to go with these. The next step involved using the pre-trained language model en_core_web_sm in the spaCy library.

import spacy


def getWordType(words, allowed, nlp):
    # keep only tokens whose part-of-speech tag is in the allowed set
    res = []
    doc = nlp(words)
    for token in doc:
        if token.pos_ in allowed:
            res.append(token.text)
    return ' '.join(res)

nlp = spacy.load("en_core_web_sm")

data_en["text_nouns_propns_verbs"] = data_en.descr.apply(lambda x: getWordType(str(x), ["NOUN", "PROPN", "VERB"], nlp))

At this stage, we were nearly done – we just had to complete a final stemming step. This normalized different forms of the same word: “play” and “plays,” for example. Although the two strings are not identical, they reference the same word. We used the NLTK library’s Snowball stemmer to do this.

import nltk


def stemWords(x, sno):
    # stem each token so that e.g. "play" and "plays" map to the same string
    return ' '.join([sno.stem(i) for i in x.split()])

sno = nltk.stem.SnowballStemmer('english')
data_en["text_nouns_propns_verbs_stemmed"] = data_en.text_nouns_propns_verbs.apply(lambda x: stemWords(str(x), sno))

The data was then ready to use for training.

Topic Modeling with LDA

There is a wide range of algorithms you can employ for topic modeling. The approaches we consider here expect a set of documents (also known as a “corpus”) as input.

In our example, each app description (descr column) is considered a document. A document is interpreted in its bag-of-words representation: just think of it as a simple vector in which every word is represented in exactly one dimension, and the value in that dimension equals the number of times the word occurs in the respective document.

These bag-of-words models are naturally very sparse: since only a limited number of words appears in any one document, the vector contains many zeros. Depending on how these vectors are used, this can be an unpleasant property – one we automatically overcome when applying topic modeling.
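
As a minimal sketch of the bag-of-words idea using Gensim (the library we use for LDA later), note how the representation simply omits the zero dimensions:

from gensim.corpora.dictionary import Dictionary

# toy, pre-tokenized "descriptions"
docs = [["play", "game", "play"], ["game", "race", "track"]]
dictionary = Dictionary(docs)

# each document becomes a sparse list of (word_id, count) pairs;
# words that do not occur in the document are simply left out
print(dictionary.doc2bow(docs[0]))  # e.g. [(0, 1), (1, 2)] -> "game" once, "play" twice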

Nonnegative matrix factorization interprets the corpus as a matrix in which each row contains one document’s bag-of-words vector. This matrix is decomposed into two matrices, which provides

  • insights into the topics/genres that were assigned to a document/app (“MAPPING APP to GENRE” in the figure below).
  • the indication of the extent to which a word insinuates that an app has a genre (“MAPPING WORD to GENRE”).

The diagram below illustrates the process.

[Diagram: nonnegative matrix factorization]
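
As a rough sketch of this decomposition (using scikit-learn here purely for illustration; it was not part of our pipeline):

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: each string stands in for one app description
docs = ["race car track speed race", "pin board image craft board", "car speed race track"]
X = CountVectorizer().fit_transform(docs)  # documents x words bag-of-words matrix

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(X)  # documents x topics ("MAPPING APP to GENRE")
H = model.components_       # topics x words     ("MAPPING WORD to GENRE")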

Latent Dirichlet Allocation (LDA) achieves something very similar to what you can see in the diagram above. Its basic assumption is that each document consists of a set of topics (in our case, a topic corresponds to a genre), while a topic is an assignment of probabilities to words. Given a number of final topics (often denoted as “k”) and a corpus, it estimates the probability of each word being associated with each topic in a way that makes the existence of the documents in the corpus most likely.

In our example, we use Latent Dirichlet Allocation to try to identify subgenres within the adventure game genre. We used the Gensim implementation.

from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import pandas as pd

def takeSecond(elem):
    return elem[1]

GENRE = "GAME_ADVENTURE"
data = data_en[data_en.genreId == GENRE].copy()

texts = data.text_nouns_propns_verbs_stemmed.apply(lambda x: x.split(" "))
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=6, passes=50)

# topic assignments per app, sorted by probability (most likely topic first)
data["clusters"] = texts.apply(lambda x: sorted(lda[common_dictionary.doc2bow(x)], key=takeSecond, reverse=True))
data["cluster"] = data.clusters.apply(lambda x: x[0][0])       # most likely topic
data["cluster_prob"] = data.clusters.apply(lambda x: x[0][1])  # its probability

What Happened Next?

It’s not easy to give you quick insight into the quality of the results, so instead I can show you some screenshots of the apps that were grouped together.

[Screenshot: cluster of games under the adventure subcategory, found via topic modeling]
[Screenshot: clusters of racing games and of hidden object and riddle games]

One of the resulting subgenres hardly contained any relevant apps, and the racing game subgenre also contained a few outliers, such as a rollercoaster game that did not belong there. I should mention that several apps were assigned to multiple genres (with the probability of belonging to each being comparably low). In the end, we only considered apps that could clearly be assigned to one subgenre.

For the Google Play Store’s lifestyle category mentioned toward the beginning of this article, we could identify the following subgenres:

  • Spirituality
  • Wallpaper & Themes (for smartphones)
  • DIY and inspirational apps (knitting, crafting, interior and garden design)
  • Fashion, hairstyles, tattoos, make-up, and self-care
  • Horoscope
  • Hinduism and Islam

We found overall that Latent Dirichlet Allocation did a nice job of creating a topic model for apps based on their descriptions. We were able to refine the Google Play Store categories and gain a deeper understanding of what a user is interested in.

What’s Next in Topic Modeling?

Going forward, the team still has things to do. We need to further improve the coverage of apps for which we can provide a proper subgenre. This also means we need to detect clusters that we know exist but have not yet been identified. 

Our next step is to keep different kinds of words in the descriptions and try out other topic modeling techniques like LDA2Vec, which combines Word2Vec’s strength in understanding plain text with LDA’s effective topic modeling.
