1 Introduction

According to a study by Adobe Analytics, global online sales grew year over year by 78% in June 2020, and experts projected that total online sales would surpass the 2019 sales by October 5 (Crets 2020). The current coronavirus pandemic reinforces this trend across countries and has expanded the scope of e-commerce. In light of the convenience of the new purchasing habits and the incentive for firms to capitalize on investments in new sales channels, some of these changes in the e-commerce landscape will likely be of a long-term nature (OECD 2020). The increasing importance of online distribution channels is paralleled by a rising interest in gaining insights into the customer journey and browsing behavior.

We focus on measuring dependences between visits to different websites by means of machine learning methods. We derive measures of cross-site dependences from topic models and the replicated softmax model (RSM). As topic models and the RSM are comprehensive probabilistic models, these measures are not confounded by visits to the remaining considered sites. This property constitutes an advantage over conventional bivariate measures based on 2 \(\times \) 2 cross-tabulations. From a managerial point of view, high cross-site dependences may suggest that one of the sites should join an affiliate program of the other or that one site should invite the other site to its own affiliate program. High cross-site dependences are also arguments in favor of higher price bids for advertising slots or corresponding contracts. We can derive such implications only by looking at websites of several firms. Nonetheless, the machine learning methods we investigate could also be used to analyze browsing behavior dependences across subdomains within a website (e.g., presenting jackets, coats, trousers etc. on an apparel retailer's website, or dependences between different genres on a book retailer's website).

Schröder et al. (2019) show that many publications analyze Internet browsing behavior on the website of one firm or across types of websites (types may, e.g., consist of all book, travel or music sites), but very few investigate browsing behavior across websites of different individual firms. Applying this comprehensive approach, researchers and decision makers can get a better understanding of the journey of each customer. To fill this research gap, Schröder et al. (2019) determine topics underlying online users' browsing behavior by means of a popular topic model, latent Dirichlet allocation (LDA). They conceive the websites that a user visits during a calendar week as a browsing basket, in analogy to the shopping baskets that are well known in retailing.

The study by Schröder et al. (2019) is a typical application of topic models, which have gained a lot of attention in recent years. Topic models have been frequently applied in the marketing literature (see, e.g., Büschken and Allenby 2016; Jacobs et al. 2016; Tirunillai and Tellis 2014; Trusov et al. 2016). Moreover, Reisenbichler and Reutterer (2018) provide a comprehensive overview of this topic. Like most topic models, LDA is a mixed membership model, i.e., each basket is related to multiple topics in proportions that vary across baskets (Blei 2012). Mixture models determine convex combinations of distributions and do not renormalize. For high-dimensional data, mixture models may run into problems, as the final distribution cannot be sharper than the distributions of the individual hidden variables, each of which is adapted to all observed variables (Hinton 2002).

We consider two further topic models which can be seen as extensions of LDA, the correlated topic model (CTM) and the structural topic model (STM). The CTM allows for correlation between topics. The STM in addition includes effects of covariates.

The four machine learning models (LDA, CTM, STM, and RSM) that we investigate constitute recent approaches to two-mode factor analysis. Two-mode factor analysis starts from a rectangular matrix with different entities on the rows and columns (in our case websites and browsing baskets). Two-mode factor analysis compresses such a matrix to fewer latent variables (Deerwester et al. 1990).

Our paper fits well with the current high interest of marketing academics in machine learning methods, to which both topic models and the RSM belong (Bradlow et al. 2017; Chintagunta et al. 2016; Dzyabura and Yoganarasimhan 2018; Hagen et al. 2020; Wedel and Kannan 2016). Our paper also complies with the call to investigate alternative machine learning methods, especially for the analysis of clickstream data, which Ma and Sun (2020) raise in their recent overview on machine learning in marketing.

As an alternative to topic models, we introduce the RSM, a data analytic method that is new to the academic marketing community. The RSM extends the restricted Boltzmann machine (RBM), which deals with binary data, to count data (e.g., the number of visits of a user to a website). Like the RBM, the RSM provides a distributed representation because probability functions, each specific to a hidden variable, are multiplied in the first step and renormalized in the second step. This way, sharp distributions may be detected. In their review of topic models with emphasis on marketing applications, Reisenbichler and Reutterer (2018) also mention this property of the RSM, referring to Salakhutdinov and Hinton (2009). In the empirical part of Salakhutdinov and Hinton (2009), RSMs with 50 hidden variables outperform LDAs with 50 topics for three different text datasets. We investigate whether such performance differences also apply to browsing baskets which, besides being nontextual, are also much smaller than the documents analyzed by Salakhutdinov and Hinton (2009).

Hruschka (2021) applies several machine learning methods to analyze retail basket data. In his study, the RBM is clearly superior to topic models. Because of the results obtained by Salakhutdinov and Hinton (2009) as well as by Hruschka (2021), we think it is justified to investigate the performance of both the RSM and LDA on browsing data. Please also note that, to the best of our knowledge, our paper constitutes the first application of the RSM to marketing.

In accordance with the suggestions of two anonymous reviewers, we also perform singular value decomposition (SVD). SVD is a classical two-mode factor analysis technique, which Eckart and Young (1936) introduced in psychometrics. SVD serves as a straightforward benchmark to evaluate the topic models and the RSM.

In the next section, we present both the four investigated machine learning methods and SVD. We discuss estimation of these models and explain how we evaluate their statistical performance. To improve readers' comprehension of the investigated methods, we illustrate how to apply the CTM, the RSM, and SVD for a small number of websites. Then we explain the preparation of the analyzed data, present descriptive statistics, and give estimation and evaluation results for varying numbers of topics or latent variables. The RSM attains a substantially better model fit than LDA, but the CTM attains the overall best performance. The STM does not improve performance over the CTM even though the former includes covariates. The CTM and the RSM lead to a more efficient compression of web browsing data than SVD, whereas LDA turns out to be inferior to SVD.

We continue by interpreting the CTM and RSM using combinations of topics or hidden variables that differ with respect to websites with high visiting probabilities. In the final section, we show that conclusions inferred from both the CTM and the RSM are in clear contrast to bivariate conditional probabilities, which can be computed simply from pairwise joint frequencies. We show that the uncertainty in predicting household expenditures with topics or hidden variables as independent variables is lowest for the CTM, followed by the RSM. On the other hand, topics determined by LDA are as a rule not appropriate to predict household expenditures. In addition, we indicate how the RSM can be used to support online marketing decisions of websites.

2 Investigated models

We now explain the main differences between the investigated topic models, the RSM, and SVD. Each of these models includes latent variables, i.e., topics, hidden variables, and components for the topic models, the RSM, and SVD, respectively. In the following sections we give more details on these models.

Topics are multinomial variables. Topic models relate the visiting probability of a website in a browsing basket to two types of proportions, the proportion of each topic for the website and the proportion of each topic for the browsing basket.

For LDA the two types of topic proportions are Dirichlet distributed. LDA leads to slight negative correlations between topics (Blei and Lafferty 2007). The CTM is more general by allowing correlations that are not restricted, e.g., correlations may be positive or negative. To achieve this flexibility the CTM replaces the Dirichlet distribution by the logistic normal distribution. The STM extends the CTM by adding effects of covariates.

The RSM includes binary hidden variables that are sampled from binary logistic functions. Linear combinations of the number of visits to each website contained in a browsing basket serve as argument of these functions. The RSM computes visiting probabilities of websites by a multinomial logistic function that depends on site-specific linear combinations of the hidden variables for the respective browsing basket.

SVD considers the number of visits to each website, which it compresses to a lower number of metric latent variables, called components. SVD has been known as a data reduction technique in psychometrics for more than 80 years. Low-dimensional plots of rows and columns of a data matrix based on SVD results are quite popular in marketing research, just as in other application areas (Gabriel 1971; Gower and Hand 1995; Kuhfeld 2010). More than 30 years ago, SVD was introduced to the text mining literature and relabeled latent semantic analysis (Deerwester et al. 1990).

SVD shows several weaknesses compared to the investigated topic models and the RSM. Contrary to these machine learning methods, SVD is not based on a probabilistic model. It approximates the number of visits, which is a count variable, under an \(L^2\) norm (i.e., the square root of the sum of the squared vector values), which appears to be rather ad hoc (Hofmann 2001). Another problem of SVD is the fact that components determined by SVD may be negative, which makes interpretation difficult.

Let us introduce the basic notation used for the investigated models. I and J denote the number of browsing baskets (i.e., the number of calendar weeks in which at least one website is visited) and the number of websites. K is the number of latent variables (topics, hidden variables, components). \(\mathbf {V_i}\) is a \((J,S_i)\) binary indicator matrix with an element \(v_{ijs}\) equaling one if the s-th visit contained in basket i takes place at website j. \(S_i\) denotes the size of the browsing basket, i.e., the number of visits to all websites.
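To make this notation concrete, the following minimal Python sketch (with assumed toy data, not taken from our sample) constructs the indicator matrix \(\mathbf {V_i}\) and the visit counts that the RSM and SVD use later:

```python
import numpy as np

# Hypothetical example: J = 5 websites, one basket i with S_i = 4 visits.
# visits[s] is the index j of the website receiving the s-th visit.
J = 5
visits = np.array([0, 2, 2, 4])    # assumed toy data
S_i = len(visits)

# Binary indicator matrix V_i of shape (J, S_i): v_ijs = 1 iff visit s hits site j.
V_i = np.zeros((J, S_i), dtype=int)
V_i[visits, np.arange(S_i)] = 1

# Visit counts per site, y_ij = sum_s v_ijs (used later by the RSM and SVD).
y_i = V_i.sum(axis=1)              # here: array([1, 0, 2, 0, 1])
```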

2.1 Latent Dirichlet allocation

LDA is based on the assumption that a mixture of latent variables called topics generates the websites visited by an online user. These topics explain why an online user visits certain websites. All browsing baskets share the same topics, but the topic proportions are specific to each basket and randomly drawn from a Dirichlet distribution.

As an example, consider a situation where a person is browsing through online stores with two possible topics, groceries and party preparation. One week (= one browsing basket), he purchases his normal groceries but also some beverages for unexpected visitors. Therefore, he visits only a few sites (i.e., \(S_i\) is small), and his latent topic combination would be 90% groceries and 10% party preparation. In the following week (= a different browsing basket), the person is host of a large gathering of people, so he visits many different sites and the topics are more inclined toward party preparation (98%) than groceries (2%).

LDA forms topics in such a way that websites with higher conditional probabilities for a topic frequently co-occur with each other in weekly visits (Crain et al. 2012). For each topic assigned to a visit, a website is chosen randomly from its corresponding distribution.

Parameters in a \((J,K)\) matrix \({\varvec{\phi }}\) and a \((K,I)\) matrix \({\varvec{\theta }}\) indicate the importance of websites for topics and the importance of topics for browsing baskets, respectively. Note that the \(k\)-th column of \({\varvec{\phi }}\) represents the probabilities of websites conditional on topic k and therefore sums up to one. The number of parameters equals the number of topics plus the number of sites multiplied by the number of topics, i.e., \(K + J K\) (Blei et al. 2003).

The probability \(P(v_{ijs}=1)\) that browsing basket i contains website j is related to the importance of this website for topics and the importance of topics for this browsing basket in the following manner (Griffiths and Steyvers 2004):

$$\begin{aligned} P(v_{ijs}=1) = \sum _{k=1}^K \phi _{jk} \theta _{ki}. \end{aligned}$$
(1)

Like Schröder et al. (2019), we estimate LDA models by blocked Gibbs sampling implemented in the R package topicmodels (Grün and Hornik 2011). For each browsing basket, the Gibbs sampling procedure considers each visited website and determines the probability of assigning the current website to each topic, conditional on the topic assignments of the other websites. From this conditional distribution, a topic is sampled and stored as new topic assignment for this website (see Griffiths and Steyvers 2004 for more details).
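To give a flavor of this procedure, a compact collapsed Gibbs sampler for LDA might look as follows in Python. This is an illustration only, not the topicmodels implementation we actually use; the priors alpha and beta and all inputs are assumptions:

```python
import numpy as np

# Sketch of a collapsed Gibbs sampler for LDA on browsing baskets.
# baskets: list of integer arrays of visited site indices (one entry per visit);
# J sites, K topics; alpha, beta are assumed symmetric Dirichlet priors.
def gibbs_lda(baskets, J, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(b)) for b in baskets]   # initial topic assignments
    n_dk = np.zeros((len(baskets), K))                    # topic counts per basket
    n_jk = np.zeros((J, K))                               # topic counts per site
    n_k = np.zeros(K)                                     # total counts per topic
    for d, b in enumerate(baskets):
        for s, j in enumerate(b):
            k = z[d][s]
            n_dk[d, k] += 1; n_jk[j, k] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for d, b in enumerate(baskets):
            for s, j in enumerate(b):
                k = z[d][s]                               # remove current assignment
                n_dk[d, k] -= 1; n_jk[j, k] -= 1; n_k[k] -= 1
                # probability of each topic conditional on all other assignments
                p = (n_jk[j] + beta) / (n_k + J * beta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())          # sample and store new topic
                z[d][s] = k
                n_dk[d, k] += 1; n_jk[j, k] += 1; n_k[k] += 1
    phi = (n_jk + beta) / (n_k + J * beta)                # (J, K) site-topic matrix
    theta = (n_dk + alpha).T / (n_dk.sum(axis=1) + K * alpha)   # (K, I) matrix
    return phi, theta
```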

2.2 Correlated topic model and structural topic model

The correlated topic model (CTM) extends LDA by allowing for flexible dependences between topics based on a \((K-1,K-1)\) covariance matrix \(\Sigma \) of a multivariate Gaussian distribution with zero mean vector (Blei and Lafferty 2007; Roberts et al. 2019). The structural topic model (STM) specifies this mean vector of the multivariate Gaussian as a linear function \(X_i^{'} \gamma \) with a \((p, K-1)\) coefficient matrix \(\gamma \) and a vector \(X_i\) consisting of p covariates (Roberts et al. 2019). Therefore, in contrast to the CTM, topic proportions of a browsing basket may vary with covariate values under the STM.

Both the CTM and the STM replace the Dirichlet distribution of LDA by the more flexible logistic normal distribution in the following way. Vectors \(\eta _{i}\) with \(K-1\) elements \(\eta _{ki}\) are drawn from the appropriate multivariate Gaussian distribution to obtain the importances of topics \(\theta _{ki}\) for a browsing basket i:

$$\begin{aligned} \theta _{ki}= & {} \frac{\exp {(\eta _{ki})}}{1 + \sum _{k^{'} = 1}^{K-1} \exp {(\eta _{k^{'}i})}} \quad \text{ for }\quad k=1,\ldots ,K-1 \nonumber \\ \theta _{Ki}= & {} \frac{1}{1 + \sum _{k^{'} = 1}^{K-1} \exp {(\eta _{k^{'}i})}} \end{aligned}$$
(2)
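A short numerical sketch of this transformation (with assumed toy values for the dimension, mean vector, and covariance matrix) may clarify how a Gaussian draw becomes a vector of topic proportions:

```python
import numpy as np

# Logistic normal draw of expression (2): eta has K-1 elements, and the
# K-th topic serves as reference category with its eta fixed at zero.
def draw_theta(mu, Sigma, rng):
    eta = rng.multivariate_normal(mu, Sigma)     # (K-1,) Gaussian draw
    eta = np.append(eta, 0.0)                    # reference topic K
    return np.exp(eta) / np.exp(eta).sum()       # K proportions summing to one

rng = np.random.default_rng(0)
K = 4                                            # assumed toy dimension
theta_i = draw_theta(np.zeros(K - 1), 0.5 * np.eye(K - 1), rng)  # CTM: zero mean
# For the STM the mean would instead be X_i' gamma with covariates X_i.
```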

The number of parameters of the CTM and the STM amount to \(K + J K + (K-1)(K-2) / 2\) and \(K + J K + (K-1)(K-2) / 2 + p (K-1)\), respectively.

We estimate the CTM and the STM by the variational expectation-maximization algorithm implemented in the R package stm (Roberts et al. 2019). Each iteration of this algorithm consists of two steps. The expectation step updates the topic proportions \(\theta _{ki}\) of each basket and the topic assignments to visited sites. The maximization step serves to estimate the parameters \(\phi \), \(\Sigma \), and, in the case of the STM, \(\gamma \).

2.3 Replicated Softmax model

The RSM associates observed browsing behavior with a combination of binary hidden variables. Consider the example of a person looking for both tablets and mobile phones, which the RSM reflects by a hidden variable A. However, for several weeks she looks only for mobile phones (tablets). The RSM reproduces this focus by another hidden variable B, which is negatively related to visits of sites that offer tablets (mobile phones). Finally, for a few weeks this person visits only sites with products other than tablets and mobile phones. The RSM generates such browsing behavior by combining hidden variable A with both hidden variable B and a further hidden variable C.

Our following description is based on Salakhutdinov and Hinton (2009). For the RSM the probability of the i-th browsing basket \(P({\mathbf {V}}_i) \) can be written using energy function \(F({\mathbf {V}}_i,{\mathbf {h}}_i)\) and partition function \(Z_i\):

$$\begin{aligned} P({\mathbf {V}}_i)= & {} \frac{1}{Z_i} \sum _{{\mathbf {h}}_i} \exp (-F({\mathbf {V}}_i,{\mathbf {h}}_i)) \quad \text{ with } \quad Z_i = \sum _{{\mathbf {V}}_i} \sum _{{\mathbf {h}}_i} \exp (-F({\mathbf {V}}_i,{\mathbf {h}}_i)) \nonumber \\ F({\mathbf {V}}_i,{\mathbf {h}}_i)= & {} - S_i \sum _{k=1}^{K} a_k h_{ik} - \sum _{j=1}^{J} b_j y_{ij} -\sum _{k=1}^K \sum _{j=1}^{J} W_{kj} h_{ik} y_{ij} \end{aligned}$$
(3)

\({\mathbf {h}}_i\) is a vector of K binary hidden variables. \(b_j\) and \(a_k\) are constants for website j and hidden variable k, respectively. \(y_{ij}\) is defined as the count of visits to website j, i.e., \(y_{ij} =\sum _{s=1}^{S_i} v_{ijs}\). \(W_{kj}\) links hidden variable \(h_{ik}\) to the count of visits \(y_{ij}\) to website j in browsing basket i.

Please note that the weights \(W_{kj}\) that relate a hidden variable to website visits can be positive or negative. This property distinguishes the RSM from LDA, for which the importances of websites for topics are restricted to be positive. Allowing negative weights makes the RSM more flexible than LDA, even for a similar number of parameters.

The conditional distributions of visits and hidden variables have the form of softmax (synonymous with multinomial logistic) and binary logistic functions, respectively:

$$\begin{aligned} P(v_{ijs}=1|{\mathbf {h}}_i)= & {} \frac{\exp \left(b_j + \sum _{k=1}^{K} W_{kj} h_{ik} \right)}{\sum _{j^{'}=1}^{J} \exp \left(b_{j^{'}} + \sum _{k=1}^{K} W_{kj^{'}} h_{ik} \right) } \end{aligned}$$
(4)
$$\begin{aligned} P(h_{ik}=1|{\mathbf {V}}_i)= & {} \frac{1}{1 + \exp \left(-\left(S_i a_k + \sum _{s=1}^{S_i} \sum _{j=1}^{J} W_{kj} v_{ijs}\right) \right) } \end{aligned}$$
(5)

The model is called replicated softmax because softmax units have the same weights for each of the \(S_i\) visits. The number of parameters equals \(J + K + J K\) as parameters consist of constants for websites and hidden variables as well as weights \(W_{kj}\).
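The two conditional distributions translate directly into code. The following sketch assumes parameter arrays W of shape (K, J), a of shape (K,), and b of shape (J,), as well as the visit counts y of one basket:

```python
import numpy as np

# Conditionals of the RSM, expressions (4) and (5).
def p_visit_given_h(b, W, h):
    logits = b + W.T @ h                   # site-specific linear combinations
    e = np.exp(logits - logits.max())      # numerically stabilized softmax
    return e / e.sum()                     # expression (4)

def p_hidden_given_v(a, W, y, S_i):
    # expression (5); the hidden bias a_k is scaled by the basket size S_i
    # (cf. the energy in (3)), and sum_s sum_j W_kj v_ijs reduces to W @ y
    return 1.0 / (1.0 + np.exp(-(S_i * a + W @ y)))
```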

As direct maximum likelihood estimation of the RSM turns out to be intractable, we use the contrastive divergence algorithm developed by Salakhutdinov and Hinton (2009), slightly modifying the implementation of Mochihashi (2013). Interested readers may download the respective Python code together with an example dataset of browsing baskets from the GitHub repository https://github.com/HHruschka/RMS_Estimation/.

Contrastive divergence changes parameters in each iteration by adding:

$$\begin{aligned} \varDelta W_{kj}= & \, \alpha \, (E_{P_{data}} [y_{ij} h_{ik}] - E_{P_{L}} [y_{ij} h_{ik}]) \nonumber \\ \varDelta a_k= & \, \alpha \, (E_{P_{data}} [h_{ik}] - E_{P_{L}}[h_{ik}]) \nonumber \\ \varDelta b_j= & \, \alpha \, (E_{P_{data}}[y_{ij} ] - E_{P_{L}} [y_{ij}]) \end{aligned}$$
(6)

\(\alpha\) is a learning constant with \(0< \alpha < 1\). \(E_ {P_{data}}\) denotes the expectation with respect to the data distribution, \(E_{P_{L}}\) the expectation obtained by running L Gibbs sampling steps starting from the observed data. Gibbs sampling is efficient because visits depend on hidden variables only (expression (4)) and hidden variables depend on visit counts only (expression (5)). In addition, we sample \(S_i\) times from a single softmax unit. In line with usual practice, we make just one sampling step by setting \(L=1\).
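A single contrastive divergence update with \(L=1\) can be sketched as follows. All shapes are assumptions (Y is an (I, J) matrix of visit counts), and the sketch follows expression (6) rather than the exact implementation of Mochihashi (2013):

```python
import numpy as np

# One CD-1 parameter update; W: (K, J), a: (K,), b: (J,), alpha: learning rate.
def cd1_step(Y, W, a, b, alpha, rng=np.random.default_rng(0)):
    S = Y.sum(axis=1)                                   # basket sizes S_i
    # positive phase: hidden probabilities given the observed data
    ph_data = 1 / (1 + np.exp(-(S[:, None] * a + Y @ W.T)))
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    # negative phase: reconstruct counts by sampling S_i times from the softmax
    logits = b + h @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    Y_neg = np.stack([rng.multinomial(int(s), pi) for s, pi in zip(S, p)])
    ph_neg = 1 / (1 + np.exp(-(S[:, None] * a + Y_neg @ W.T)))
    # expression (6): difference of data and model expectations (means over baskets)
    I = Y.shape[0]
    W += alpha * (ph_data.T @ Y - ph_neg.T @ Y_neg) / I
    a += alpha * (ph_data - ph_neg).mean(axis=0)
    b += alpha * (Y - Y_neg).mean(axis=0)
    return W, a, b
```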

2.4 Singular value decomposition

SVD processes the numbers of visits contained in a \((J,I)\) matrix \({\varvec{N}}\). Element \(N_{ji} \equiv \sum _{s=1}^{S_i} v_{ijs} \) indicates the number of visits to site j in browsing basket i. SVD compresses these data into a lower dimensional space with \(K<J\) dimensions:

$$\begin{aligned} {\varvec{N}} \approx {\varvec{T}} {\varvec{E}} {\varvec{B}}^{'} \end{aligned}$$
(7)

\({\varvec{T}}\) is a reduced \((J,K)\) matrix of sites, \({\varvec{E}}\) a diagonal \((K,K)\) matrix of singular values, and \({\varvec{B}}\) a reduced \((I,K)\) matrix of browsing baskets. The number of parameters equals \(K + J K\). We apply the truncated SVD routine of the Python library Scikit-learn (Pedregosa et al. 2011).
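As an illustration, the truncated SVD of the count matrix can be obtained as follows. The Poisson toy data are an assumption, and the matrix is transposed because Scikit-learn expects baskets on the rows:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
N_T = rng.poisson(0.15, size=(1000, 60))   # assumed (I, J) counts, baskets x sites

svd = TruncatedSVD(n_components=17, random_state=0)
scores = svd.fit_transform(N_T)            # baskets in the reduced space (B times E)
T = svd.components_.T                      # (J, K) reduced site matrix
```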

Referring to the reduced matrix \({\varvec{T}}\), we compute the probability \(P(N_{ji})\) that basket i contains \(N_{ji}\) visits to site j by means of the method of Coccaro and Jurafsky (1998):

$$\begin{aligned} P(N_{ji}) = \frac{\cos (\mathbf {T}_j,\mathbf {m}_i ) - { mincos}}{\sum _{j^{'}=1}^{J} \cos \left(\mathbf {T}_{j^{'}},\mathbf {m}_i \right) - { J\, mincos}} \end{aligned}$$
(8)

Vector \(\mathbf {T}_j\) consists of the elements \((t_{j1},\ldots , t_{jK})\) of matrix \({\varvec{T}}\).

The centroid \(\varvec{{m}_i}\) of the sites contained in basket i (\(n_i\) denotes the number of these sites) is:

$$\begin{aligned} \quad \mathbf{m_{i}} = { 1/n_i} \sum _{N_{ji} > 0} \varvec{T_j}. \end{aligned}$$
(9)

The cosine similarity between a site j and the centroid \(\varvec{{m}_i}\) is:

$$\begin{aligned} cos(\mathbf {T}_j,\mathbf {m}_i ) = (\mathbf {T}_j^{'} \mathbf {m}_i)/(||\mathbf {T}_j|| \, ||\mathbf {m}_i ||) \end{aligned}$$
(10)

\(||\mathbf {x}||\) denotes the \(L^2 \) norm of vector \(\mathbf {x}\) defined as \(\sqrt{x_1^2+x_2^2+\cdots }\).

The minimum cosine similarity mincos of the centroid \(\varvec{{m}_i}\) across all sites is:

$$\begin{aligned} mincos = \min _{j=1}^J cos(\mathbf {T}_{j},\mathbf {m}_i ) \end{aligned}$$
(11)
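Expressions (8)-(11) combine into a few lines of code. The following sketch assumes the reduced site matrix T of shape (J, K) and the visit counts y of one basket:

```python
import numpy as np

# Visit probabilities from SVD results, expressions (8)-(11).
def svd_visit_probs(T, y):
    m = T[y > 0].mean(axis=0)                # centroid of visited sites, (9)
    cos = (T @ m) / (np.linalg.norm(T, axis=1) * np.linalg.norm(m))  # (10)
    mincos = cos.min()                       # minimum cosine similarity, (11)
    return (cos - mincos) / (cos.sum() - len(cos) * mincos)          # (8)
```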

2.5 Model evaluation

Like Salakhutdinov and Hinton (2009), we evaluate the investigated models by their perplexity on validation data. Perplexity is defined as the geometric mean of the inverse predicted probabilities (Murphy 2012). For the investigated topic models and the RSM the perplexity can be computed as:

$$\begin{aligned} \exp \left( -\frac{1}{I} \sum _{i=1}^I \frac{1}{S_i} \sum _{s=1}^{S_i} \sum _{j=1}^{J} v_{ijs} \log P(v_{ijs}=1)\right) . \end{aligned}$$
(12)

In the case of SVD, the perplexity can be written as:

$$\begin{aligned} \exp \left( -\frac{1}{I} \sum _{i=1}^I \frac{1}{n_i} \sum _{N_{ji} > 0} \log P(N_{ji})\right) . \end{aligned}$$
(13)

The lower its perplexity, the better a model performs. The worst (i.e., highest) possible value of perplexity equals the number of websites J. This value results if, according to a model, each website has the same visiting probability.
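For the topic models and the RSM, expression (12) amounts to averaging log probabilities per visit. A minimal sketch, assuming probs[i] holds the predicted visit probabilities of basket i:

```python
import numpy as np

# Perplexity of expression (12); baskets[i] is an array of visited site
# indices with one entry per visit, probs[i] the model's (J,) visit probabilities.
def perplexity(baskets, probs):
    mean_logs = [np.log(probs[i][b]).mean() for i, b in enumerate(baskets)]
    return np.exp(-np.mean(mean_logs))      # lower values indicate better fit
```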

3 Illustrative application example

To illustrate the application of the RSM, the CTM, and SVD described in Sect. 2, we construct a small-scale example for which we only consider browsing baskets containing the five most frequently visited sites, i.e., msn, aol, ebay, go, and apple (for a description of the complete data set please see Sect. 4). Based on these reduced data, we estimate an RSM with three hidden variables, a CTM with three topics, and an SVD model with three components. We explain the workings of the models using three selected browsing baskets. Table 1 gives the visiting frequencies for these baskets. This table also shows what the data input for the models looks like.

Table 1 Visit frequencies of selected browsing baskets

Table 2 provides information on the latent variables of the browsing baskets, i.e., the transposed \(\theta \) matrix for the CTM, the hidden variables for the RSM and the matrix B of SVD. This table shows that for the selected baskets either one or two hidden variables are active, i.e., equal to 1. We also see that the hidden variables of the RSM are not normalized (their row sums may be greater than 1.0) in contrast to the values of the transposed \(\theta \) matrix of the CTM. For the SVD model entries of matrix B are real valued, some are even negative. Similarly, matrix T shown in Table 3 also contains negative values, which makes interpretation of SVD results difficult.

Table 2 Latent variables

Using the relevant latent variables of Table 2 and the estimated parameters of Table 3 we can compute visiting probabilities of each basket \( i=1,\ldots ,3\) and each site \(j=1,\ldots ,5\). To this end we refer to expressions (1), (4), and (8) for the CTM, the RSM and the SVD model, respectively. Based on these computations Table 4 lists the three sites with the highest visiting probabilities in each selected basket.

For this illustrative example, the RSM and the CTM outperform the SVD model in reproducing the most frequently visited site in each browsing basket. Comparing the input data of Table 1 with the probabilities underlying Table 4 shows that the highest probability site equals the most frequently visited site three times for the RSM, two times for the CTM, and never for the SVD model.

Table 3 Estimated parameters
Table 4 Three highest probability sites for the selected baskets

4 Data

Like Schröder et al. (2019), we aggregate clickstream data of the 2009 calendar year acquired from the ComScore Web Behavior Panel to weekly browsing baskets. This way, 222,800 browsing baskets result that contain visits to 524 websites. In contrast to Schröder et al. (2019), we do not exclude websites with very high visit frequencies, but restrict our investigation to the 60 most frequently visited websites. We delete browsing baskets that do not contain any of these 60 websites. From the remaining data, we take two random samples, each with 20,000 baskets. We use one sample for estimation, the other one for validation. Browsing baskets of both samples consist on average of 8.705 sites with a standard deviation of 11.698. In the estimation (validation) sample, each website is visited on average 0.144 (0.146) times per panelist with a standard deviation of 0.430 (0.440). Both browsing basket size and website visit frequencies follow very skewed distributions.

Table 5 Relative visit frequencies
Table 6 Relative pairwise visit frequencies

To demonstrate the importance of counting the number of visits instead of only considering whether a website is contained in a browsing basket or not, we compute the average ratio of the relative frequency for one visit divided by the relative frequency of two or more visits. The ratio of 0.504 together with a standard deviation of 0.414 shows that the number of visits is quite diverse and should not be treated as a mere binary value. For several websites (e.g., singlesnet, msn, aol, cox), the frequency of two or more visits even turns out to be higher than the frequency for just one visit (see Table 5 for more details).

Table 6 lists the 60 highest of the total \(1,770=0.5 \times 60 \times 59\) relative pairwise frequencies. We obtain 0.0830 as highest relative pairwise frequency for aol and msn which means that \(8.30 \%\) of the browsing baskets contain both aol and msn.

5 Estimation and evaluation results

Table 7 gives the perplexities for LDA, the RSM, the CTM, and SVD, each with an increasing number of topics, hidden variables, or components, respectively. This table also contains perplexities for several variants of the STM. Figure 1 plots perplexities versus the number of topics or hidden variables for LDA, the RSM, and the CTM.

We note that the perplexities for the same model turn out to be very similar in both the estimation and the validation sample. The perplexities of LDA improve with a higher number of topics. However, even the RSM with only five hidden variables outperforms the LDA with 40 topics. From the different RSMs, we choose the model with 17 hidden variables, which has the lowest perplexities for the estimation and the validation data and is clearly superior to the LDA. These results are therefore in line with those obtained by Salakhutdinov and Hinton (2009) in their analysis of text data.

The CTM attains better (i.e., lower) perplexities than the RSM, especially if the former has ten or more topics. We choose the CTM with 37 topics because the perplexity increases for 38 or more topics.

We estimate several variants of the STM with 37 topics that differ with regard to the included covariates. These covariates comprise household attributes (highest education, household size, age of the oldest member, household income, children) and the weekly time index of a visit or its logarithm. Inclusion of covariates results in perplexities that are almost indistinguishable from the perplexities of the CTM with 37 topics. We obtain analogous results if we include more than one covariate or if we investigate STMs with a different number of topics. We therefore conclude that site visits do not depend on these covariates and that it is sufficient to consider the CTM instead of the STM.

Finally, we demonstrate how LDA, the CTM, and the RSM perform relative to the benchmark method SVD. LDA turns out to be clearly inferior to SVD. For example, the perplexity of SVD with 30 components is about 50 % of the perplexity of LDA with 30 topics. On the other hand, both the RSM and the CTM compress the browsing data more efficiently than SVD, as the perplexities of models with the same number of latent variables show. The RSM with 17 hidden variables reduces perplexity by about 47 % compared to SVD with 17 components. In a similar manner, the CTM with 37 topics reduces perplexity by about 26 % compared to SVD with 37 components.

Table 7 Model perplexities
Fig. 1 Perplexity plots (solid line: estimation data, dashed line: validation data)

6 Model interpretation

In this section, we interpret the RSM with 17 hidden variables and the CTM with 37 topics. We ignore LDA because it is clearly outperformed by the other models. We also do not interpret any of the STMs, as they do not attain better perplexity values than the related, less complex CTMs.

In the following, we number hidden variables and topics consecutively according to their importances, with hidden variable or topic 1 symbolizing the most important one. We measure the importance of a hidden variable or a topic by its mean probability or its mean topic proportion across all baskets. For the RSM, the mean probabilities of the five most important hidden variables are 0.999, 0.997, 0.960, 0.683, and 0.513, respectively. The mean probability of the next most important variable amounts to only 0.006. For the CTM, the mean proportions of the ten most important topics amount to 0.250, 0.204, 0.058, 0.051, 0.043, 0.043, 0.043, 0.039, 0.023, and 0.022. The mean proportion of the eleventh topic is 0.018.

Next, we want to show how hidden variables of the RSM and topic proportions of the CTM translate into observable browsing behavior. Each combination of hidden variables or topic proportions is associated with certain sites that users visit frequently.

For the RSM, we generate 200 combinations of the 17 hidden variables in the following way using estimated coefficients. We determine the activations \(S_i a_k + \sum _{s=1}^{S_i} \sum _{j=1}^{J} W_{kj} v_{ijs}\) of each hidden variable for each browsing basket. Then we compute the K mean values and the \((K,K)\) covariance matrix of these activations. Subsequently, we draw 200 samples from the multivariate normal distribution with these K mean values and this \((K,K)\) covariance matrix. We put each of these samples for each hidden variable into the binary logistic function in accordance with expression (5) to obtain the respective hidden variable probability. Visit probabilities for sites result by inserting these hidden variable probabilities into expression (4).

For the CTM, we determine 200 combinations of the 37 topic proportions in a similar manner. We draw 200 samples from the multivariate normal distribution with \(K-1\) zero mean values and the estimated \((K-1,K-1)\) covariance matrix \(\Sigma \). Using expression (2), we obtain sampled importances of topics for each combination. Visit probabilities for sites result by inserting these importances into expression (1) using the estimated \(\phi \) coefficients.
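The following sketch mirrors this sampling scheme for the RSM; act denotes the assumed (I, K) matrix of activations computed on the estimation data:

```python
import numpy as np

# Generating heterogeneous hidden variable combinations for the RSM (Sect. 6).
def sample_combinations(act, W, b, n=200, rng=np.random.default_rng(0)):
    mu, cov = act.mean(axis=0), np.cov(act, rowvar=False)
    draws = rng.multivariate_normal(mu, cov, size=n)  # (n, K) sampled activations
    h = 1.0 / (1.0 + np.exp(-draws))                  # expression (5): probabilities
    logits = b + h @ W                                # expression (4)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)           # (n, J) visit probabilities
```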

Finally, we search for the five most heterogeneous combinations among these 200 combinations. We measure heterogeneity by the average distance between combination pairs. The distance between two combinations is one (zero) if they have completely different (equal) ten highest probability sites. For these five heterogeneous combinations, Table 8 shows the indices of hidden variables (topics) with probabilities (proportions) greater than 0.5 (0.1). By looking at these indices, one can immediately see that the five combinations are diverse.

In addition, Table 8 lists one or several specific high probability sites, i.e., sites with high surfing probabilities for the respective combination and low browsing probabilities for the other four combinations. Each combination of hidden variables or topics can be characterized by frequent visits of focal sites. For instance, adobe and mlb are the focal sites of the first combination for the RSM, while eharmony, earthlink, qvc, priceline, and microsoft are the focal sites for the CTM. We see that the RSM and the CTM provide completely different focal sites. Moreover, we note that, in contrast to the RSM, for the CTM all five high probability sites are specific to all combinations.

Table 8 Heterogeneous latent variable combinations

7 Discussion and managerial implications

Despite the better statistical performance of the RSM and the CTM, one may ask whether managers could not get the same information by looking at bivariate measures of site visits, which can be easily computed from frequency counts across browsing baskets. To answer this question, we consider conditional probabilities. The probability p(j|l) of a visit to site j conditional on a visit to site l is defined as:

$$\begin{aligned} p(j|l) = n(l,j)/(n(l,j) + n(l,-j)) \end{aligned}$$
(14)

\(n(l,j)\) denotes the number of joint visits to site l and site j, \(n(l,-j)\) the number of joint visits to site l and all sites different from j.

Expression (14) makes it clear that a conditional probability does not eliminate the effect of visits to other sites \(-j\) on visits to site j. That is why we compare conditional probabilities to both marginal cross effects inferred from the RSM with 17 hidden variables and similarities between two sites inferred from the CTM with 37 topics.

Hruschka (2021) uses marginal cross effects (simply called cross effects from now on) to interpret an RBM estimated on retail basket data of category purchases. In our study, cross effects refer to visits of site pairs. They are computed from the estimated RSM. The cross effect cr(j|l) of visits to site j conditional on visits to a site l is defined by the first derivative of the visit share of site j with respect to the visit share of site l. It can be written as:

$$\begin{aligned} cr(j|l) = \langle v_j \rangle \, (1- \langle v_j \rangle ) \sum _{k=1}^{K} W_{kl} W_{kj} \, \langle h_k \rangle \, (1-\langle h_k \rangle ) \end{aligned}$$
(15)

\(\langle h_k \rangle \) denotes the average of hidden variable k across all baskets, \( \langle v_j \rangle \) the corresponding visit share for site j.
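Under these definitions, all \(J \times J\) cross effects can be computed at once. A sketch assuming the estimated W of shape (K, J), the mean hidden probabilities h_bar, and the visit shares v_bar:

```python
import numpy as np

# Cross effects of expression (15); h_bar: (K,) mean hidden probabilities,
# v_bar: (J,) visit shares -- all assumed to come from an estimated RSM.
def cross_effects(W, h_bar, v_bar):
    core = (W.T * (h_bar * (1 - h_bar))) @ W   # sum_k W_kl W_kj <h_k>(1-<h_k>)
    cr = core * (v_bar * (1 - v_bar))          # scale by <v_j>(1-<v_j>)
    np.fill_diagonal(cr, 0.0)                  # own effects are not used
    return cr                                  # cr[l, j] = cr(j|l)
```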

We consider all variations consisting of two sites selected from the investigated 60 sites without repetition (i.e., selecting the same site two times is not allowed). Because both conditional probabilities and cross effects are asymmetric, order matters. The number of variations equals 3,540 (\(=60!/(60-2)!\)).

We see large differences between the ranks of these two measures for a remarkable number of variations. Table 9 lists the variations with the 20 highest conditional probabilities and their ranks. For each of these variations, we juxtapose the cross effect and its rank as well. For 18 of these 20 variations, cross effects have a rank greater than 400, which means that, contrary to the conditional probabilities, the corresponding cross effects are not high.

For the CTM, we compute the similarity of site pairs. This measure increases the more two sites agree on the importances of topics. We compute similarities \(s(j,l)\) between two sites j and l, which are based on their Euclidean distance, in the following way:

$$\begin{aligned} s(j,l) &= 1 - \frac{d(j,l)}{\max _{j1, j2>j1} d(j1,j2)}\quad \text{ with } \nonumber \\ \quad d(j,l) &= \sqrt{\sum _{k=1}^{K} (\phi _{jk}-\phi _{lk})^2}. \end{aligned}$$
(16)
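A compact sketch of this computation, assuming the estimated (J, K) site-topic matrix phi:

```python
import numpy as np

# CTM similarities of expression (16) from the site-topic matrix phi.
def ctm_similarities(phi):
    d = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=2)  # d(j, l)
    return 1.0 - d / d.max()                                       # s(j, l)
```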

We obtain large rank differences for the majority of the site pairs with the 20 highest conditional probabilities (see Table 9). Only for two site pairs are the rankings of similarities comparable to the rankings of conditional probabilities. For 13 of these 20 site pairs, we obtain ranks of similarities greater than 1,000. This result shows that the similarities of these site pairs inferred from the CTM are very low, which contradicts their high conditional probabilities.

Table 9 Conditional probabilities, cross effects and similarities of selected sites

Following suggestions of an anonymous reviewer, we also investigate whether the better statistical performance of both the RSM and the CTM is related to a managerially relevant outcome variable, namely the yearly expenditure of each household on each of the considered 60 websites. We regress this outcome variable on sums of probabilities of topics for LDA and the CTM, respectively. For the RSM, we use sums of probabilities of hidden variables as independent variables. We compute probability sums of topics and hidden variables across all browsing baskets of the respective household. As we have 60 websites and three methods (LDA, RSM, CTM), we estimate a total of 180 regression models.

Decision makers should prefer the model whose predictions are less uncertain. We measure the uncertainty for LDA, the RSM, and the CTM by the interquartile range of the prediction intervals of yearly expenditures across households for each website. A lower interquartile range reflects a lower prediction uncertainty. The CTM attains the lowest interquartile range for 35 sites, the RSM for 14 sites, and LDA only for two sites. In other words, the ranking of models with respect to the prediction of yearly household expenditures turns out to be the same as the ranking with respect to statistical performance.

Table 10 shows the quartiles and the interquartile range of the predictive intervals of each model for three selected websites. For amazon, the RSM attains the lowest interquartile range. For the other two sites, the CTM leads to the lowest predictive uncertainty. For each of these three sites the worst predictive performance results if the topics determined by LDA serve as independent variables.

Table 10 Prediction interval statistics of household expenditures for selected websites

In comparison to both the LDA and the CTM, the RSM allows for asymmetric cross effects. This property of the RSM leads to more managerially useful implications. Table 11 shows the highest seven cross effects for selected conditioned sites inferred from the estimated RSM with 17 hidden variables. We demonstrate how managers can benefit from applying the RSM by explaining two possible application examples based on these seven cross effects:

  • Managers may use the RSM to design appropriate affiliate programs to further increase a website's revenues. Viable partners may be websites with a high cross effect if the compensation in affiliate programs depends on the number of leads or sales sent to the merchant's website (Gatautis and Vitkauskaite 2020). For example, travelocity may join an affiliate marketing program of aol, dell, apple, classmates, mlb, or ticketmaster. On the other hand, affiliate program managers may proactively invite website operators to join their program based on the RSM results. E.g., cross effects suggest that the affiliate program management of dell should invite kohls, priceline, or travelocity.

  • The literature on online advertising is full of studies that show the importance of behavioral targeting such as retargeting (Lambrecht and Tucker 2013). Managers could use the cross effects to find websites that could be a reasonable part of the targeting strategy. If, for instance, comcast plans to run a banner ad campaign, it could integrate the suggested websites into the campaign, placing higher price bids for the advertising slots on websites that show high cross effects with comcast (kohls, mate1, nascar, priceline, travelocity). Alternatively, managers could directly negotiate contracts with these sites to buy advertising space exclusively for the banner ads from comcast.

Table 11 Highest seven cross effects of selected conditioned sites inferred from the RSM with 17 hidden variables

To summarize, we find that both the RSM and the CTM are clearly superior to simple bivariate cross-tabulation and LDA in all important aspects. The CTM is better at reproducing browsing behavior though it needs more parameters to do so. We obtain the same ranking of methods when the focus lies on predicting household expenditure.

The browsing baskets, which we analyze in our study, refer to individual households. Please note that the investigated methods can also be applied to aggregate browsing baskets containing the numbers of visits to websites summed across all members of a cluster. That is why the methods can deal with more restrictive privacy laws as long as data are available at the cluster level. In the same manner, these methods are capable of analyzing the number of visits of all members of a cohort determined according to Google's Federated Learning of Cohorts (FLoC) proposal (Bindra 2021).

Given their excellent performance, researchers could investigate other applications of both the CTM and the RSM. One task related to the one studied here is to analyze browsing across subpages of one or several websites. In addition, the RSM could be applied to other types of marketing data for which conventional topic models have been used, such as unstructured text like websites and online advertisements, social media postings, online product reviews, and mobile app usage records (see the review of Reisenbichler and Reutterer (2018) for more details).

An alternative avenue of research consists in extending the models themselves. One extension of the RSM consists in adding independent variables in a manner analogous to the conditional RBM (Mnih et al. 2011). Another possibility is a deep RSM encompassing two or more layers of hidden variables in place of just one hidden layer (Salakhutdinov et al. 2013). This extension will entail an increase in the computing time needed for estimation and inference, but might lead to a further improvement of model performance.