
Unsupervised domain adaptation with non-stochastic missing data


Abstract

We consider unsupervised domain adaptation (UDA) for classification problems in the presence of missing data in the unlabelled target domain. More precisely, motivated by practical applications, we analyze situations where distribution shift exists between domains and where some components are systematically absent on the target domain, without available supervision for imputing the missing target components. We propose a generative approach for imputation. Imputation is performed in a domain-invariant latent space and leverages indirect supervision from a complete source domain. We introduce a single model performing joint adaptation, imputation and classification which, under our assumptions, minimizes an upper bound of its target generalization error and performs well under various representative divergence families (\(\mathscr {H}\)-divergence, Optimal Transport). Moreover, we compare the target error of our adaptation-imputation framework and the “ideal” target error of a UDA classifier without missing target components. Our model is further improved with self-training, to bring the learned source and target class posterior distributions closer. We perform experiments on three families of datasets of different modalities: a classical digit classification benchmark and the Amazon product reviews dataset, both commonly used in UDA, and real-world digital advertising datasets. We show the benefits of jointly performing adaptation, classification and imputation on these datasets.


Notes

  1. https://github.com/mkirchmeyer/adaptation-imputation.

  2. \(h \in {\mathscr {F}} {\Delta } {\mathscr {F}} \iff h({\mathbf {x}})=f_1({\mathbf {x}})\oplus f_2({\mathbf {x}})\) for some \(f_1, f_2 \in \mathscr {F}\) where \(\oplus \) is the XOR function.

  3. CTR is the number of clicks made on ads divided by the number of shown ads. CR replaces clicks with purchases.

  4. http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/.

References

  • Aggarwal K, Yadav P, Selvaraj KS (2019) Domain adaptation in display advertising: An application for partner cold-start. In: Proceedings of the 13th ACM conference on recommender systems, pp. 178–186

  • Amini MR, Gallinari P (2005) Semi-supervised learning with an imperfect supervisor. Knowl Inf Syst 8:385–413


  • Arjovsky M, Bottou L, Gulrajani I, Lopez-Paz D (2019) Invariant risk minimization. arXiv:1907.02893

  • Barjasteh I, Forsati R, Masrour F, Esfahanian A, Radha H (2015) Cold-start item and user recommendation with decoupled completion and transduction. In: Proceedings of the 9th ACM conference on recommender systems, pp. 91–98

  • Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW (2010) A theory of learning from different domains. Mach Learn 79(1):151–175


  • Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing, pp. 120–128

  • Bora A, Price E, Dimakis AG (2018) AmbientGAN: Generative models from lossy measurements. In: International conference on learning representations

  • Cai L, Wang Z, Gao H, Shen D, Ji S (2018) Deep adversarial learning for multi-modality missing data completion. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1158–1166

  • Chen C, Dou Q, Chen H, Qin J, Heng P (2019) Synergistic image and feature adaptation: Towards cross-modality domain adaptation for medical image segmentation. In: Proceedings of the 33rd conference on artificial intelligence (AAAI), pp. 865–872

  • Chen M, Xu Z, Weinberger KQ, Sha F (2012) Marginalized denoising autoencoders for domain adaptation. In: Proceedings of the 29th international conference on machine learning, pp. 1627–1634

  • Cortes C, Mohri M (2014) Domain adaptation and sample bias correction theory and algorithm for regression. Theor Comput Sci 519:103–126


  • Courty N, Flamary R, Habrard A, Rakotomamonjy A (2017) Joint distribution optimal transportation for domain adaptation. In: Advances in neural information processing systems

  • Crammer K, Kearns M, Wortman J (2008) Learning from multiple sources. J Mach Learn Res 9:1757–1774


  • Damodaran BB, Kellenberger B (2018) DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In: European conference on computer vision, pp. 467–483

  • Ding Z, Shao M, Fu Y (2014) Latent low-rank transfer subspace learning for missing modality recognition. In: Proceedings of the 28th AAAI conference on artificial intelligence, pp. 1192–1198

  • Doinychko A, Amini MR (2020) Biconditional gans for multiview learning with missing views. In: Advances in information retrieval, pp. 807–820

  • Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):1–37


  • Ganin Y, Lempitsky V (2015) Unsupervised domain adaptation by backpropagation. In: Proceedings of the 32nd international conference on machine learning, pp. 1180–1189

  • Grandvalet Y, Bengio Y (2005) Semi-supervised learning by entropy minimization. In: Proceedings of the 17th international conference on neural information processing systems, pp. 529–536

  • Hull JJ (1994) A database for handwritten text recognition research. IEEE Trans Pattern Anal Mach Intell 16(5):550–554


  • Isola P, Zhu JY, Zhou T, Efros A (2017) Image-to-image translation with conditional adversarial networks. In: IEEE Conference on computer vision and pattern recognition, pp. 5967–5976

  • Johansson FD, Sontag D, Ranganath R (2019) Support and invertibility in domain-invariant representations. In: Proceedings of the 22nd international conference on artificial intelligence and statistics, pp. 527–536

  • LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324


  • Leek JT et al (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11(10):733–739


  • Li SCX, Jiang B, Marlin B (2019) MisGAN: Learning from incomplete data with generative adversarial networks. In: International conference on learning representations

  • Lipton Z, Wang YX, Smola A (2018) Detecting and correcting for label shift with black box predictors. In: Proceedings of the 35th international conference on machine learning, pp. 3122–3130

  • Little R, Rubin D (1986) Statistical analysis with missing data. John Wiley, Hoboken


  • Long M, Cao Y, Wang J, Jordan M (2015) Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd international conference on machine learning, vol 37, pp. 97–105

  • Long M, Cao Z, Wang J, Jordan MI (2018) Conditional adversarial domain adaptation. In: Advances in neural information processing systems, vol 31

  • Mattei PA, Frellsen J (2019) MIWAE: Deep generative modelling and imputation of incomplete data. In: Proceedings of the 36th international conference on machine learning, vol 97, pp. 4413–4423

  • Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning

  • Pajot A, de Bezenac E, Gallinari P (2019) Unsupervised adversarial image reconstruction. In: International conference on learning representations

  • Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22:1345–1359


  • Pathak D, Krähenbühl P, Donahue J, Darrell T, Efros A (2016) Context encoders: feature learning by inpainting. In: IEEE Conference on computer vision and pattern recognition, pp. 2536–2544

  • Peyré G, Cuturi M (2019) Computational optimal transport. Found Trends Mach Learn 11(5–6):355–607


  • Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592


  • Sahebi S, Brusilovsky P (2013) Cross-domain collaborative recommendation in a cold-start context: the impact of user profile size on the quality of recommendation. User modeling, adaptation, and personalization. Springer, Berlin, pp 289–295


  • Shen J, Qu Y, Zhang W, Yu Y (2018) Wasserstein distance guided representation learning for domain adaptation. In: 32nd AAAI Conference on artificial intelligence

  • Tran L, Liu X, Zhou J, Jin R (2017) Missing modalities imputation via cascaded residual autoencoder. In: IEEE Conference on computer vision and pattern recognition, pp. 4971–4980

  • Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. In: IEEE Conference on computer vision and pattern recognition, pp. 2962–2971

  • Van Buuren S (2018) Flexible imputation of missing data, 2nd edn. Chapman and Hall/CRC, London


  • Wang C, Niepert M, Li H (2018) LRMM: Learning to recommend with missing modalities. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 3360–3370

  • Wang R, Fu B, Fu G, Wang M (2017) Deep cross network for ad click predictions. In: Proceedings of the ADKDD’17

  • Wei P, Ke Y, Goh CK (2017) Domain specific feature transfer for hybrid domain adaptation. In: 2017 IEEE International conference on data mining, pp. 1027–1032

  • Wei P, Ke Y, Goh CK (2019) A general domain specific feature transfer framework for hybrid domain adaptation. IEEE Trans Knowl Data Eng 31(8):1440–1451


  • Yoon J, Jordon J, Van Der Schaar M (2018) GAIN: Missing data imputation using generative adversarial nets. In: Proceedings of the 35th international conference on machine learning, pp. 5689–5698

  • You K, Wang X, Long M, Jordan M (2019) Towards accurate model selection in deep unsupervised domain adaptation. In: Proceedings of the 36th international conference on machine learning, pp. 7124–7133

  • Zablocki E, Bordes P, Soulier L, Piwowarski B, Gallinari P (2019) Context-aware zero-shot learning for object recognition. In: Proceedings of the 36th international conference on machine learning, vol 97, pp. 7292–7303

  • Zhao H, des Combes RT, Zhang K, Gordon GJ (2019) On learning invariant representation for domain adaptation. In: Proceedings of the 36th international conference on machine learning, vol 97, pp. 7523–7532

  • Zhu JY, Park T, Isola P, Efros A (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International conference on computer vision, pp. 2242–2251


Funding

Alain Rakotomamonjy is funded by RAIMO ANR-20-CHIA-0021-01 and OATMIL ANR-17-CE23-0012 Projects of the French National Research Agency (ANR).

Author information

Corresponding author

Correspondence to Matthieu Kirchmeyer.

Additional information

Responsible editor: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A: Additional related work

See Tables 5, 6, 7 and Figs. 6, 7, 8, 9, 10.

In this section we present some additional topics related to our problem, in complement to Sect. 2.

Concept drift in data streams Adapting to non-i.i.d. data is also considered in evolving data streams where concept drift may occur (Gama et al. 2014). The hypotheses differ from those of our setting, where adaptation is performed between static domains.

Table 5 Statistics on digits datasets

Batch effect and multiple environments Data may come from different environments with different distributions. Classical learning frameworks like ERM consider shuffled data without distinguishing between environments, which may lead to erroneous conclusions. In biology, this is known as the batch effect (Leek et al. 2010). In ML, recent papers learn domain-invariant representations from different environments (Arjovsky et al. 2019). This differs from the situation considered here, where one explicitly adapts from a source to a target environment.

Table 6 Statistics on ads datasets
Table 7 Feature mean and standard deviation on \(\texttt {ads-kaggle}\). We set features 1, 5, 6, 7, 11, 12 to zero on T

B: OT Adaptation-imputation formulation

We present here in more detail our model using Optimal Transport as the divergence metric. The formulation is slightly different from the \(\texttt {ADV}\) models. We replace the \(\mathscr {H}\)-divergence approximation given by the discriminators \(D_1\) and \(D_2\) by the Wasserstein distance between source and target instances (\(D_1\)) and between true and imputed feature representations (\(D_2\)), following the original ideas in Shen et al. (2018), Damodaran and Kellenberger (2018). In practice, we compute the Wasserstein distance in its primal form by finding a joint coupling matrix \(\gamma \) with a linear programming approach (Peyré and Cuturi 2019). In Damodaran and Kellenberger (2018), Courty et al. (2017), the optimal transport problem is formulated on the joint distributions p(X, Y). Similarly to Shen et al. (2018), we focus on a plan that acts only on the feature space without considering the labels. This leads to the following optimization problem:

$$\begin{aligned} L_{1} = \sum _{ij} \left( ||\mathbf {z_{S_1}^{(i)}}-\mathbf {z_{T_1}^{(j)}}||^2 + ||{\mathbf {\widehat{z}_{S_2}}}^{(i)}-{\mathbf {\widehat{z}_{T_2}}}^{(j)}||^{2} \right) {{\varvec{\gamma }}_{1}}_{ij} \end{aligned}$$
(14)

where \({{\varvec{\gamma }}_{1}}_{ij}\) is the alignment value between source instance i and target instance j.

For the imputation part, we keep the reconstruction MSE component in Eq. 6 and derive the distribution matching loss as:

$$\begin{aligned} L_{OT} = \sum _{ij} ||\mathbf {z_{S_2}^{(i)}}- \mathbf {\widehat{z}_{S_2}}^{(j)}||^2 {{\varvec{\gamma }}_{2}}_{ij} \end{aligned}$$
(15)

where \({{\varvec{\gamma }}_{2}}_{ij}\) is the alignment value between source instances i and j. The final imputation loss is:

$$\begin{aligned} L_2 = \lambda _{OT} \times L_{OT} + \lambda _{MSE} \times L_{MSE} \end{aligned}$$
(16)

The classification term in Eq. 7 is unchanged.
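For concreteness, the couplings \({{\varvec{\gamma }}_{1}}\), \({{\varvec{\gamma }}_{2}}\) and the losses of Eqs. (14)–(16) could be computed with the POT library as in the minimal sketch below; this is an illustration rather than the authors' code, and the tensor names (z_s1, z_hat_s2, …) as well as the uniform marginals are assumptions.

```python
import ot            # POT: Python Optimal Transport
import torch


def squared_cost(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise squared Euclidean costs between two batches of latent codes."""
    return torch.cdist(a, b, p=2) ** 2


def ot_losses(z_s1, z_hat_s2, z_t1, z_hat_t2, z_s2):
    n_s, n_t = z_s1.size(0), z_t1.size(0)

    # L1 (Eq. 14): align source and target on the (observed, imputed) latent pairs
    c1 = squared_cost(z_s1, z_t1) + squared_cost(z_hat_s2, z_hat_t2)
    gamma1 = ot.emd(ot.unif(n_s), ot.unif(n_t), c1.detach().cpu().numpy())
    gamma1 = torch.as_tensor(gamma1, dtype=c1.dtype, device=c1.device)
    l1 = torch.sum(gamma1 * c1)

    # L_OT (Eq. 15): match imputed source codes to the true source codes
    c2 = squared_cost(z_s2, z_hat_s2)
    gamma2 = ot.emd(ot.unif(n_s), ot.unif(n_s), c2.detach().cpu().numpy())
    gamma2 = torch.as_tensor(gamma2, dtype=c2.dtype, device=c2.device)
    l_ot = torch.sum(gamma2 * c2)

    # L2 (Eq. 16) is then lambda_OT * l_ot + lambda_MSE * l_mse
    l_mse = torch.mean((z_s2 - z_hat_s2) ** 2)
    return l1, l_ot, l_mse
```

The couplings are solved on detached cost matrices, so the subsequent gradient step treats them as constants, as in the alternating strategy described below.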

Fig. 6

Source (blue) and Target (red) distributions on ads-kaggle for each feature (1–13)

The optimization problem in Eq. 9 is solved in two stages following an alternating optimization strategy:

  • We fix all parameters but \({{\varvec{\gamma }}_{1}}\) and \({{\varvec{\gamma }}_{2}}\) and find the joint coupling matrices \({{\varvec{\gamma }}_{1}}\) and \({{\varvec{\gamma }}_{2}}\) using EMD

  • We fix \({{\varvec{\gamma }}_{1}}\) and \({{\varvec{\gamma }}_{2}}\) and solve \(\min _{g_1, g_2, r, f} L\)

In practice, we first minimize \(L_3\) for a few epochs (10 for digits), then minimize \(\lambda _1 L_1 + \lambda _2 L_2 + \lambda _3 L_3\) in the remaining epochs. Learning rates and parameters are detailed further in Section E.
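For illustration, one training step under this schedule could look like the sketch below; the loss callables l1_fn, l2_fn, l3_fn are assumed names, and each is assumed to solve its EMD coupling internally on detached costs so that the gradient step treats \({{\varvec{\gamma }}_{1}}\), \({{\varvec{\gamma }}_{2}}\) as fixed.

```python
from typing import Callable

import torch


def training_step(optimizer: torch.optim.Optimizer,
                  l1_fn: Callable[[], torch.Tensor],   # adaptation loss L1 (Eq. 14)
                  l2_fn: Callable[[], torch.Tensor],   # imputation loss L2 (Eq. 16)
                  l3_fn: Callable[[], torch.Tensor],   # classification loss L3
                  lambdas=(0.1, 1.0, 1.0),
                  warmup: bool = False) -> torch.Tensor:
    """One optimization step of the alternating scheme (couplings held fixed)."""
    optimizer.zero_grad()
    if warmup:
        # first epochs: minimize the classification loss L3 only
        loss = l3_fn()
    else:
        # remaining epochs: minimize lambda1*L1 + lambda2*L2 + lambda3*L3
        loss = lambdas[0] * l1_fn() + lambdas[1] * l2_fn() + lambdas[2] * l3_fn()
    loss.backward()
    optimizer.step()
    return loss.detach()
```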

Fig. 7

Base architecture for the ADV DANN model

C: Proofs

Theorem 1

Given \(f \in \mathscr {F}\), \({\widehat{g}}\) in (2) and \(p_S(\widehat{Z})\), \(p_T(\widehat{Z})\) the latent marginal distributions obtained with \({\widehat{g}}\),

$$\begin{aligned} \begin{aligned} \epsilon _T(f\circ {\widehat{g}}) \le \underbrace{\left[ \epsilon _S(f\circ {\widehat{g}})+d_{\mathscr {F} \Delta \mathscr {F}}(p_S(\widehat{Z}), p_T(\widehat{Z}))+\lambda _{\mathscr {H}_{{\widehat{g}}}}\right] }_{\mathrm {Domain~Adaptation~} (DA)} \end{aligned} \end{aligned}$$

with \(\epsilon _{S}(\cdot ), \epsilon _{T}(\cdot )\) the expected errors w.r.t. the labelling functions \(f_S, f_T\) on S, T respectively; \(\mathscr {F}\Delta \mathscr {F}\) the symmetric difference hypothesis space; \(d_{\mathscr {H}}\) the \(\mathscr {H}\)-divergence for \(\mathscr {H}=\mathscr {F}\Delta \mathscr {F}\); and \(\lambda _{\mathscr {H}_{{\widehat{g}}}}=\min _{f' \in \mathscr {F}}\left[ \epsilon _{S}(f' \circ {\widehat{g}})+\epsilon _{T}(f'\circ {\widehat{g}})\right] \) the joint risk of the optimal hypothesis.

Proof

We apply the bound of Ben-David et al. (2010) in the latent space \(\mathscr {Z}\) to the representations produced by \({\widehat{g}}\). \(\square \)

Lemma 1

For any continuous densities p, q defined on an input space \(\mathscr {X}\) such that \(\forall {\mathbf {x}}\in \mathscr {X}, q({\mathbf {x}}) > 0\), the inequality \(\sup _{{\mathbf {x}}\in \mathscr {X}}[p({\mathbf {x}})/q({\mathbf {x}})] \ge 1\) holds. Moreover, the minimal value 1 is reached when \(p=q\).

Proof

Suppose that there is no \({\mathbf {x}}\in \mathscr {X}\) such that \(p({\mathbf {x}})/q({\mathbf {x}}) \ge 1\). Then \(\forall {\mathbf {x}}\in \mathscr {X}, p({\mathbf {x}}) < q({\mathbf {x}})\); integrating these positive, continuous functions over \(\mathscr {X}\) gives \(1 = \int _{\mathscr {X}} p < \int _{\mathscr {X}} q = 1\), a contradiction. Thus \(\exists {\mathbf {x}}\in \mathscr {X}\) such that \(p({\mathbf {x}})/q({\mathbf {x}}) \ge 1\), hence \(\sup _{{\mathbf {x}}\in \mathscr {X}}[p({\mathbf {x}})/q({\mathbf {x}})] \ge 1\), with equality trivially when \(p=q\). \(\square \)

Proposition 1

Under Assumption 3, given \(f \in \mathscr {F}, {\widehat{g}}\) in (2) and g in (1),

$$\begin{aligned} \epsilon _T(f\circ g) \le \underbrace{\underbrace{\sup _{{\mathbf {z}}\sim p(Z)}[\dfrac{p_S(Z_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_S(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}]}_{\mathrm {Imputation~error~on~ S~} {(I_S)}} \times \underbrace{\sup _{{\mathbf {z}}\sim p(Z)}[\dfrac{p_S(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_T(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}]}_{\mathrm {Transfer~error~of~Imputation~} {(T_I)}}}_{\mathrm {Imputation~error~on~T~} {(I_T)}} \times \epsilon _T(f\circ {\widehat{g}}) \end{aligned}$$
(17)

By Lemma 1, (\(I_T\)) reaches its minimal value 1 when \(p_S(Z_2|Z_1)=p_S(\widehat{Z}_2|Z_1)\) and \(p_S(\widehat{Z}_2|Z_1)=p_T(\widehat{Z}_2|Z_1)\). In this case, \(\epsilon _T(f\circ g) = \epsilon _T(f\circ {\widehat{g}})\).

Proof

We denote by \(f_T^z\) the latent target labelling function. Moreover, for simplicity, we write \(h_{{\widehat{g}}}=f \circ {\widehat{g}}\), \(h_g=f \circ g\) and, \(\forall {\mathbf {z}}\sim p(Z)\), \(S_D({\mathbf {z}})=\dfrac{p_D(Z_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_D(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}\) for \(D \in \{S, T\}\).

$$\begin{aligned} \epsilon _T(h_{g})&= {\mathbb {E}}_{{\mathbf {x}}_T \sim p_T(X)}[{\mathbb {I}}(h_{g}({\mathbf {x}}_T) \ne f_T({\mathbf {x}}_T))] \\&= {\mathbb {E}}_{\mathbf {z_{T_1}}\sim p_T(Z_1), \mathbf {z_{{T}_{2}}}\sim p_T(Z_2|Z_1) }[{\mathbb {I}}(f(\mathbf {z_{T_1}}, \mathbf {z_{{T}_{2}}}) \ne f_T^z(\mathbf {z_{T_1}}, \mathbf {z_{{T}_{2}}}))] \\&= {\mathbb {E}}_{\mathbf {z_{T_1}}\sim p_T(Z_1), \mathbf {\widehat{z}_{T_2}}\sim p_T(\widehat{Z}_2|Z_1) }\left[ \dfrac{p_T(Z_2=\mathbf {\widehat{z}_{T_2}}|\mathbf {z_{T_1}})}{p_T(\widehat{Z}_2=\mathbf {\widehat{z}_{T_2}}|\mathbf {z_{T_1}})} {\mathbb {I}}(f(\mathbf {z_{T_1}}, \mathbf {\widehat{z}_{T_2}}) \ne f_T^z(\mathbf {z_{T_1}}, \mathbf {\widehat{z}_{T_2}}))\right] \\&\le \sup _{{\mathbf {z}}\sim p(Z)}[S_T({\mathbf {z}})] {\mathbb {E}}_{{\mathbf {x}}_T \sim p_T(X)}[{\mathbb {I}}(h_{{\widehat{g}}}({\mathbf {x}}_T) \ne f_T({\mathbf {x}}_T))] \\&= \sup _{{\mathbf {z}}\sim p(Z)}[S_T({\mathbf {z}})] \epsilon _T(h_{{\widehat{g}}}) \end{aligned}$$

However, \(S_T({\mathbf {z}})\) cannot be computed for \({\mathbf {z}}\in \mathscr {Z}\) as no supervision is available on T. We instead apply Assumption 3 and use source data, for which we can compute \(S_S({\mathbf {z}})\).

$$\begin{aligned} \forall {\mathbf {z}}\in \mathscr {Z}~S_T({\mathbf {z}})&=\dfrac{p_T(Z_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_T(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)} \\&=\dfrac{p_S(Z_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_T(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}&\text {Assumption~} 3\\&=\dfrac{p_S(Z_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_S(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)} \times \dfrac{p_S(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_T(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)} \\&=S_S({\mathbf {z}}) \times \dfrac{p_S(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_T(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)} \\ \end{aligned}$$

Taking the supremum over \({\mathbf {z}}\) and using \(\sup [fg] \le \sup [f]\,\sup [g]\) for non-negative functions,

$$\begin{aligned} \sup _{{\mathbf {z}}\sim p(Z)}[S_T({\mathbf {z}})] \le \sup _{{\mathbf {z}}\sim p(Z)}[S_S({\mathbf {z}})] \times \sup _{{\mathbf {z}}\sim p(Z)}\left[ \dfrac{p_S(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}{p_T(\widehat{Z}_2={\mathbf {z}}_2|{\mathbf {z}}_1)}\right] \end{aligned}$$

This yields (11).

By Lemma 1, (\(I_T\))=1 when \(p_S(Z_2|Z_1)=p_S(\widehat{Z}_2|Z_1)\) and \(p_S(\widehat{Z}_2|Z_1)=p_T(\widehat{Z}_2|Z_1)\); combined with Assumption 3, this gives \(S_T({\mathbf {z}})=1\) for all \({\mathbf {z}}\), and thus \(\epsilon _T(f\circ g) = \epsilon _T(f\circ {\widehat{g}})\). \(\square \)

Proposition 2

Assume a joint distribution \(p_{\widetilde{T}}(X,Y)\) where \(p_{\widetilde{T}}(X) = p_T(X)\) and \(Y = h_{\widehat{g}}(X)\) where \(h_{\widehat{g}}= f \circ {\widehat{g}}\in \mathscr {H}_{\widehat{g}}\) is a candidate hypothesis. Then,

$$\begin{aligned} \lambda _{\mathscr {H}_{{\widehat{g}}}} \le \min _{h_{\widehat{g}}\in \mathscr {H}_{\widehat{g}}} \left[ \epsilon _S(h_{\widehat{g}}) + \epsilon _{\widetilde{T}}(h_{\widehat{g}}) + \epsilon _{T}(f_{\widetilde{T}})\right] \end{aligned}$$

with \(\epsilon _{T}(f_{\widetilde{T}})={\text {Pr}}_{{\mathbf {x}}\sim p_T(X)}(f_{\widetilde{T}}({\mathbf {x}}) \ne f_T({\mathbf {x}}))\) the error of the pseudo-labelling function \(f_{\widetilde{T}}\) on T.

Proof

We know that \(p_{\widetilde{T}}(X)=p_T(X)\) as instances are not changed by applying the pseudo-labelling function. Thus, given \(h_{\widehat{g}}\in \mathscr {H}_{\widehat{g}}\)

$$\begin{aligned} \epsilon _{T}(h_{\widehat{g}})=\epsilon _{T}(h_{\widehat{g}},f_T)=\epsilon _{\widetilde{T}}(h_{\widehat{g}},f_T) \end{aligned}$$

Applying the triangle inequality for classification error (Crammer et al. 2008),

$$\begin{aligned} \epsilon _{\widetilde{T}}(h_{\widehat{g}}, f_T) \le \epsilon _{\widetilde{T}}(h_{\widehat{g}}, f_{\widetilde{T}}) + \epsilon _{\widetilde{T}}(f_{\widetilde{T}}, f_T) \end{aligned}$$

Finally, we can rewrite \(\epsilon _{\widetilde{T}}(h_{\widehat{g}}, f_{\widetilde{T}}) = \epsilon _{\widetilde{T}}(h_{\widehat{g}})\) and \(\epsilon _{\widetilde{T}}(f_{\widetilde{T}}, f_T)=\epsilon _{T}(f_{\widetilde{T}}, f_T)=\epsilon _{T}(f_{\widetilde{T}})\). \(\square \)

D: Dataset description

D.1: Digits

We scale all images to \(32 \times 32\) and normalize the input in \([-1, 1]\). When adaptation involves a domain with three channels (SVHN or MNIST-M) and a domain with a single channel, we simply triplicate the channel of the latter domain. As in Damodaran and Kellenberger (2018), we use balanced source batches, which improves performance especially when the source dataset is imbalanced (e.g. SVHN and USPS) while the target dataset (usually MNIST-derived) is balanced. Scaling the input images enables us to use the same architecture across datasets. In practice, the embedding size after preprocessing is 2048. For missing versions, we set pixel values to zero in a given patch as shown in Fig. 3. The digits datasets are provided with a predefined train / test split. We report accuracy results on the target test set and use the source test set as validation set (Sect. 1). The number of instances in each dataset is reported in Table 5. We run each model five times.
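A minimal sketch of this preprocessing with torchvision is given below; the patch coordinates in zero_patch are illustrative (the actual patches are those shown in Fig. 3).

```python
import torch
from torchvision import transforms


def to_three_channels(img: torch.Tensor) -> torch.Tensor:
    """Triplicate the channel when the domain has a single channel."""
    return img.repeat(3, 1, 1) if img.size(0) == 1 else img


preprocess = transforms.Compose([
    transforms.Resize((32, 32)),                            # scale all images to 32x32
    transforms.ToTensor(),                                  # pixels in [0, 1]
    transforms.Lambda(to_three_channels),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),    # normalize to [-1, 1]
])


def zero_patch(img: torch.Tensor, top=16, left=0, height=16, width=32) -> torch.Tensor:
    """Simulate the non-stochastic missing component by zeroing a fixed patch."""
    out = img.clone()
    out[:, top:top + height, left:left + width] = 0.0
    return out
```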

D.2: Amazon

Each domain has around 2000 samples and we use features freely available at https://github.com/jindongwang/transferlearning/tree/master/data#amazon-review, which follow the data processing pipeline in Chen et al. (2012). Each review is preprocessed as a feature vector of unigrams and bigrams, keeping only the 5000 most frequent features. In practice, we consider the dense version of these features after projection onto a low-dimensional subspace of dimension 400 with PCA, as in Chen et al. (2012). Datasets with missing features are built by setting the first half of the features to 0.
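A sketch of this pipeline with scikit-learn follows; it is illustrative only, and fitting the PCA on both domains as well as zeroing the dense features after projection are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer


def build_amazon_features(reviews_s, reviews_t, vocab_size=5000, n_components=400):
    """Unigram/bigram counts -> top-`vocab_size` vocabulary -> PCA projection."""
    vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=vocab_size)
    x_s = vectorizer.fit_transform(reviews_s).toarray()   # source reviews
    x_t = vectorizer.transform(reviews_t).toarray()       # target reviews

    pca = PCA(n_components=n_components).fit(np.vstack([x_s, x_t]))
    z_s, z_t = pca.transform(x_s), pca.transform(x_t)

    z_t_missing = z_t.copy()
    z_t_missing[:, : n_components // 2] = 0.0             # first half of the features set to 0
    return z_s, z_t, z_t_missing
```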

D.3: Ads

Table 6 lists statistics on the traffic for the two ads datasets; we now describe how they are preprocessed. On both datasets the train and test sets are fixed. We run each model five times.

\(\texttt {ads-kaggle}\) The Criteo Kaggle dataset is a reference dataset for CTR prediction and groups one week of log data. The objective is to model the probability that a user will click on a given ad given their browsing behaviour. Positives refer to clicked ads and negatives to non-clicked ads. For each datum, there are 13 continuous and 26 categorical features. We divide the traffic into two domains using feature number 30, corresponding to an engagement feature; for a given value of this categorical feature, all instances have a single missing numeric feature (feature number 5). We then construct an artificial dataset simulating transfer between known and new users. We process the original Criteo Kaggle dataset to have an equal number of source and target instances. We then perform a train / test split on this dataset, keeping 20\(\%\) of the data for testing. We use only continuous features in our experiments; to show the benefit of modelling additional missing features, we extend the list of missing features to features 1, 5, 6, 7, 11 and 12 by setting them to zero on the target domain. After these operations, 6 features are missing and 7 are non-missing. Preprocessing consists of normalizing continuous features using a log transform.
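An illustrative sketch of this preprocessing is given below (pandas-based; the column names f1–f13 and the clipping before the log transform are assumptions, not the authors' exact pipeline).

```python
import numpy as np
import pandas as pd

MISSING = [1, 5, 6, 7, 11, 12]                 # continuous features absent on T


def preprocess_continuous(df: pd.DataFrame, is_target: bool) -> pd.DataFrame:
    """Log-normalize the 13 continuous features (columns f1..f13) and
    zero out the missing ones on the target domain."""
    out = df.copy()
    cont_cols = [f"f{i}" for i in range(1, 14)]
    out[cont_cols] = np.log1p(out[cont_cols].clip(lower=0))   # log transform
    if is_target:
        out[[f"f{i}" for i in MISSING]] = 0.0                 # simulate missing components
    return out
```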

\(\texttt {ads-real}\) This private dataset is similar to \(\texttt {ads-kaggle}\). We filter out non-clicks and the final task is to model the sale probability for a clicked ad given a user's browsing history. Positives refer to clicked ads which lead to a sale; negatives to clicked ads which did not lead to a sale. We use one week of sampled logs as the training set and the following day's data as the test set. This train/test definition is used so as to better correlate with the performance of a production model. Features are aggregated across user timelines and describe the clicking and purchase behaviour of a user. Compared to \(\texttt {ads-kaggle}\), more continuous features are used. The count features can be user-centric, i.e. they describe the global activity of the user (number of clicks, displays, sales made globally across partners), or user-partner features, i.e. they describe the history of the user on the given partner (number of clicks, sales, etc. on the partner). The latter are missing for prospecting users. Counts are aggregated across varying windows of time and categories of partner catalog. We bucketize these count features using log transforms and project the features into an embedding space of size 596 with 29 features. 12 features are missing and 17 are non-missing.

E: Implementation details

E.1: Neural net architecture

\(\texttt {digits}\) We use the ADV and OT versions of our imputation model. For ADV models, we use the DANN model description in Ganin and Lempitsky (2015); for OT we use the DeepJDOT model description in Damodaran and Kellenberger (2018). Both models can be considered as simplified instances of our corresponding ADV and OT imputation models when no imputation is performed. Performance of the adaptation models is highly dependent on the NN architectures used for adaptation and classification. In order to perform fair comparisons, and since our goal is to evaluate the potential of joint adaptation-imputation-classification, we selected these architectures through preliminary tests and use them for both the ADV and OT models. The two models are described below and illustrated in Fig. 7.

  • Feature extractors \(g_1\) and \(g_2\) consist of three convolutional layers with \(5 \times 5\) kernels and 64 filters, interleaved with max-pooling layers with stride 2 and \(2 \times 2\) kernels. The final layer has 128 filters. We use batch norm on convolutional layers and ReLU as the activation after max pooling. As in Damodaran and Kellenberger (2018), we find that adding a sigmoid as the final activation is helpful.

  • Classifier f consists of two fully connected layers with 100 neurons, batch norm and ReLU activation, followed by a final softmax layer. We apply dropout to the first layer of the classifier.

  • Discriminators \(D_1\) and \(D_2\) are single-layer NNs with 100 neurons, batch norm and ReLU, followed by a final softmax layer. On USPS \(\rightarrow \) MNIST and MNIST \(\rightarrow \) USPS, we use a stronger discriminator network consisting of two fully connected layers with 512 neurons.

  • Generator r consists of two fully connected layers with 512 neurons, batch norm and ReLU activation. This architecture is used for the ADV and OT imputation models. In practice, using wider and deeper networks increases classification performance on the more complicated classification tasks (SVHN \(\rightarrow \) MNIST, MNIST \(\rightarrow \) MNIST-M); in these cases we add an additional fully connected layer with 512 neurons. The final activation function is a sigmoid.

We use the same architecture described above for all our models to guarantee fair comparison; a sketch is given below. As a side note, the input to the imputation model's classifier is twice as large as in the standard adaptation models.
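The sketch below is one PyTorch interpretation of these components, not the authors' released code; the padding, the per-block channel counts, the discriminator input size and the reconstructor's output projection are assumptions, chosen so that the latent size matches the 2048 reported in Appendix D.1.

```python
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """5x5 conv + batch norm + 2x2 max pooling (stride 2) + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.ReLU(),
    )


# g1 / g2: three conv blocks, 32x32x3 -> 4x4x128, i.e. a 2048-d embedding
feature_extractor = nn.Sequential(
    conv_block(3, 64),
    conv_block(64, 64),
    conv_block(64, 128),
    nn.Flatten(),
    nn.Sigmoid(),        # final sigmoid activation on the embedding
)

# f: two 100-neuron FC layers; the 2 * 2048 input corresponds to the
# concatenated (observed, imputed) embeddings of the imputation model
classifier = nn.Sequential(
    nn.Linear(2 * 2048, 100), nn.BatchNorm1d(100), nn.ReLU(), nn.Dropout(),
    nn.Linear(100, 100), nn.BatchNorm1d(100), nn.ReLU(),
    nn.Linear(100, 10), nn.Softmax(dim=1),
)

# D1 / D2: single 100-neuron layer (the input size is an assumption)
discriminator = nn.Sequential(
    nn.Linear(2 * 2048, 100), nn.BatchNorm1d(100), nn.ReLU(),
    nn.Linear(100, 2), nn.Softmax(dim=1),
)

# r: two 512-neuron FC layers; the output projection back to the 2048-d
# latent space is an assumption
reconstructor = nn.Sequential(
    nn.Linear(2048, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 2048), nn.Sigmoid(),
)
```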

\(\texttt {ads-kaggle}\) and \(\texttt {amazon}\) We experiment with ADV models only. As the input data is numeric and low-dimensional, architectures are simpler than for digits. Our feature extractor is a three-layer NN with 128 neurons per layer and a final sigmoid activation. The classifier is a single-layer NN with 128 neurons and a final softmax layer. Activations are ReLUs. The domain discriminator is a two-layer NN with 128 neurons and a final softmax layer. Finally, the reconstructor is a two-layer NN with 256 neurons and a final sigmoid activation.
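A compact sketch of these components follows; it is an interpretation rather than the released code, and input_dim, the \(2 \times 128\) classifier input and the reconstructor's output size are assumptions.

```python
import torch.nn as nn


def mlp(sizes, final: nn.Module = None) -> nn.Sequential:
    """Fully connected stack with ReLU activations and an optional final layer."""
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    layers = layers[:-1]          # drop the trailing ReLU
    if final is not None:
        layers.append(final)
    return nn.Sequential(*layers)


input_dim = 7                      # e.g. the 7 non-missing ads-kaggle features
latent = 128                       # latent size (assumption)

feature_extractor = mlp([input_dim, 128, 128, latent], final=nn.Sigmoid())   # g1 / g2
classifier = mlp([2 * latent, 128, 2], final=nn.Softmax(dim=1))              # f
discriminator = mlp([latent, 128, 128, 2], final=nn.Softmax(dim=1))          # D1 / D2
reconstructor = mlp([latent, 256, 256, latent], final=nn.Sigmoid())          # r
```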

\(\texttt {ads-real}\) We experiment with ADV models only. Input features after processing are fed directly into the feature extractors \(g_1\), \(g_2\), which consist of two fully connected layers with 128 neurons. The classifier and the discriminator are single-layer NNs with 25 neurons. The reconstructor is a two-layer NN with 128 neurons. Inner activations are ReLUs and the final activation of the feature extractor is a sigmoid.

E.2: Network parameters

E.2.1: Hyperparameter tuning

Tuning hyperparameters for UDA is difficult as we do not have access to target labels and thus cannot choose parameters minimizing the target risk on a validation set. Several papers set hyperparameters through reverse cross-validation (Ganin and Lempitsky 2015). Other model selection approaches rely on surrogates of the target risk, estimated from the source risk and the similarity between the (unlabelled) source and target distributions. In our experiments, we use a recent estimator, Deep Embedded Validation (DEV) (You et al. 2019), to tune the initial learning rate and, for the OT imputation model, \(\lambda _{1}\) and \(\lambda _{OT}\). For other parameters, we use heuristics and typical hyperparameter values from UDA papers (such as batch size) without further tuning. We use a cross-entropy loss on the source validation set; this value provides a proxy for the target test risk. With the parameters from the original paper, this estimator helps select parameter ranges that perform reasonably well. We keep the estimator unchanged for our baseline models. In the imputation case, the discriminator used for computing the importance sampling weights discriminates between \(\mathbf {\widehat{z}_{S}}\) and \(\mathbf {\widehat{z}_{T}}\), i.e. \(D_1\) (Fig. 2).
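As an illustration of how such a surrogate can be computed, the sketch below implements an importance-weighted source-validation risk with a variance-reducing control variate, in the spirit of DEV and based on our reading of You et al. (2019); the per-example losses and importance weights (here derived from \(D_1\)) are assumed to be precomputed, and the exact estimator is the one described in You et al. (2019).

```python
import numpy as np


def dev_score(losses: np.ndarray, weights: np.ndarray) -> float:
    """Importance-weighted validation risk with a control variate (DEV-style)."""
    weighted = weights * losses
    cov = np.cov(weighted, weights)            # 2x2 sample covariance matrix
    eta = -cov[0, 1] / cov[1, 1]               # control-variate coefficient
    return float(np.mean(weighted) + eta * np.mean(weights) - eta)
```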

E.2.2: Digits

We find that the results are highly dependent on the NN architecture and the training parameters. In order to evaluate the gain obtained with Adaptation-Imputation, we use the same NN architecture for all models (ADV and OT) but fine-tune the learning rates for each model using the DEV estimator (other parameters do not have a significant impact on classification performance).

\(\texttt {ADV}\) We use an adaptive approach as in Ganin and Lempitsky (2015) for decaying the learning rate lr and updating the gradient scale s between 0 and 1 for the domain discriminators. We use the decay values of Ganin and Lempitsky (2015), i.e. \(s = \dfrac{2}{1 + \exp (-10 \times p)} - 1\) and \(lr = \dfrac{lr_i}{(1 + 10 \times p)^{0.75}}\), where p is the ratio of batches processed so far over the total number of batches, without further tuning. We tune the initial learning rate \(lr_i\), chosen in the range \(\{10^{-2}, 10^{-2.5}, 10^{-3}, 10^{-3.5}, 10^{-4}\}\), following Section E.2.1. In practice we take \(lr_i = 10^{-2}\) for ADV Adaptation-Imputation, Adaptation-Full and Adaptation-IgnoreComponent, and \(lr_i = 10^{-2.5}\) for ADV Adaptation-ZeroImputation. We use Adam as the optimizer with momentum parameters \(\beta _1=0.8\) and \(\beta _2=0.999\), and the same decay strategy and initial learning rate for all components (feature extractor, classifier, reconstructor). Batch size is 128; in practice, initializing the adaptation models with a source model trained with a smaller batch size (such as 32) can be beneficial.
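In code, these two schedules read as follows (a direct transcription of the formulas above; p denotes the fraction of batches processed so far).

```python
import math


def grad_reversal_scale(p: float) -> float:
    """Scale s of the reversed gradient, annealed from 0 to 1."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0


def decayed_lr(lr_i: float, p: float) -> float:
    """Learning-rate decay lr = lr_i / (1 + 10 p)^0.75."""
    return lr_i / (1.0 + 10.0 * p) ** 0.75
```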

\(\texttt {OT}\) We choose \(\lambda _{OT} = 0.1\) in Eq. 16 after tuning in the range \(\{10^{-1}, 10^{-2}, 10^{-3}\}\) with DEV. We weight \(L_1\) in Eq. 8 by \(\lambda _{1} = 0.1\). Following Damodaran and Kellenberger (2018), the batch size is 500 and we use EMD, a.k.a. the Wasserstein-2 distance. We initialize the adaptation models with a source model during the first 10 epochs and divide the initial learning rate by two when adaptation starts for non-imputation models. For Adaptation-Imputation we follow a decaying strategy on the learning rate and on the adaptation weight, as explained in the next item. We choose \(lr_i\) in the range \(\{10^{-2}, 10^{-2.5}, 10^{-3}, 10^{-3.5}, 10^{-4}\}\). In practice we fix \(lr_i = 10^{-2}\) for all models.

Imputation parameters Ablation studies on the weights in Eq. 4 are conducted in Sect. 6.6; in the digits experiments we choose \(L_2 = L_{MSE} + L_{ADV}\) for ADV and OT to reduce the burden of additional tuning. For the ADV model, we fix \(\lambda _1=\lambda _2=\lambda _3=1\) in Eq. 8. For the OT model, we vary \(\lambda _1\) between 0 and 0.1 and \(\lambda _2\) between 0 and 1, following the same schedule as the gradient scale update for ADV models, to reduce variance.

E.2.3: Ads

We use an adaptive strategy for updating the gradient scale and the learning rate with the same parameters as for the \(\texttt {digits}\) dataset. The optimizer is Adam. The batch size is taken to be large so that target batches include sufficiently many positive instances.

\(\texttt {ads-kaggle}\) The initial learning rate is chosen in the range \(\{10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}\}\) using DEV and fixed to \(10^{-6}\) for all models. The batch size is 500 and we initialize models with a simple classification loss for five epochs. We run models for 50 epochs, after which they reach a plateau. We find that adding a weighted MSE term yields higher stability (as measured by variance), as further studied in Sect. 6.6. In a similar fashion to Pathak et al. (2016), we tune this weight in the range \(\{1, 10^{-1}, 10^{-2}, 7.5 \times 10^{-3}, 5 \times 10^{-3}, 10^{-3}\}\); we find that 0.005 offers the best compromise between mean loss and variance. Moreover, on this dataset we use a faster decay for the learning rates of the discriminator \(D_2\) and the reconstructor r, \(lr = \dfrac{lr_i}{(1 + 30 \times p)^{0.75}}\), to obtain more stable training curves, while the learning rates of the feature extractors \(g_1\), \(g_2\) and of \(D_1\) are unchanged.

\(\texttt {ads-real}\) The initial learning rate is chosen in the range \(\{10^{-4}, 10^{-5}, 10^{-6}\}\) and fixed to \(10^{-6}\) for all models. The learning rate is decayed with the same parameters as for digits for all models. We run models for ten epochs, which provides a good trade-off between training time and classification performance. The batch size is 500. We choose \(L_2 = L_{MSE} + L_{ADV}\) without further tuning; this already achieves good results.

E.3: Amazon

We use the same hyperparameters as ads-kaggle. \(\lambda _{MSE}\) is set to 1 without further tuning.

F: Latent space visualization on digits

In this section we visualize the embeddings \(\mathbf {\widehat{z}}={\widehat{g}}({\mathbf {z}})\) learned by the various models on digits, projecting them into a 2D space with t-SNE (the original embedding size being 2048). Figure 8 shows the embeddings learned by ADV models on MNIST \(\rightarrow \) MNIST-M. Figures 9 and 10 show these embeddings for OT models on MNIST \(\rightarrow \) MNIST-M and MNIST \(\rightarrow \) USPS respectively. On these figures, we see that Adaptation-Imputation generates feature representations whose source and target examples overlap better per class than those of its adaptation counterparts (although Adaptation-IgnoreComponent also overlaps feature representations well). This correlates with the accuracy on the target test set. Moreover, as expected, Adaptation-IgnoreComponent and Adaptation-ZeroImputation perform poorly compared to Adaptation-Full, which justifies the use of Adaptation-Imputation when confronted with non-stochastic missing data.
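A minimal sketch of this visualization protocol is given below; z_source and z_target are placeholders for the \(\mathbf {\widehat{z}}\) embeddings of a batch, and the colour convention follows Figs. 8–10 (source in red, target in blue).

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_domains(z_source: np.ndarray, z_target: np.ndarray) -> None:
    """Project 2048-d embeddings of a batch to 2D with t-SNE and colour by domain."""
    z_2d = TSNE(n_components=2).fit_transform(np.vstack([z_source, z_target]))
    n_s = len(z_source)
    plt.scatter(z_2d[:n_s, 0], z_2d[:n_s, 1], c="red", s=5, label="source")
    plt.scatter(z_2d[n_s:, 0], z_2d[n_s:, 1], c="blue", s=5, label="target")
    plt.legend()
    plt.show()
```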

Fig. 8

Embeddings for MNIST \(\rightarrow \) MNIST-M dataset for ADV models on a batch. Figures on the left represent the source (red) and target (blue) clusters; Figures on the right represent the classes on source and target

Fig. 9

Embeddings for MNIST \(\rightarrow \) MNIST-M dataset for OT models on a batch. Figures on the left represent the source (red) and target (blue) clusters; Figures on the right represent the classes on source and target

Fig. 10

Embeddings for MNIST \(\rightarrow \) USPS dataset for OT models on a batch. Figures on the left represent the source (red) and target (blue) clusters; Figures on the right represent the classes on source and target


About this article


Cite this article

Kirchmeyer, M., Gallinari, P., Rakotomamonjy, A. et al. Unsupervised domain adaptation with non-stochastic missing data. Data Min Knowl Disc 35, 2714–2755 (2021). https://doi.org/10.1007/s10618-021-00775-3

