\documentclass[sigconf,review,anonymous]{acmart}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{multirow}
\usepackage{subfigure}
\usepackage{array}
\usepackage{algorithm}
\usepackage{algorithmic}
%% \BibTeX command to typeset BibTeX logo in the docs
\AtBeginDocument{
\providecommand\BibTeX{{
Bib\TeX}}}
%% Rights management information. This information is sent to you
%% when you complete the rights form. These commands have SAMPLE
%% values in them; it is your responsibility as an author to replace
%% the commands and values with those provided to you when you
%% complete the rights form.
\setcopyright{acmcopyright}
\copyrightyear{2023}
\acmYear{2023}
\acmDOI{XXXXXXX.XXXXXXX}
%% These commands are for a PROCEEDINGS abstract or paper.
\acmConference[CIKM]{32nd ACM International Conference on Information and Knowledge Management}{October 21--25,
2023}{Birmingham, UK}
%% Uncomment \acmBooktitle if the title of the proceedings is different
%% from ``Proceedings of ...''!
%% \acmBooktitle{Woodstock '18: ACM Symposium on Neural Gaze Detection,
%%  June 03--05, 2018, Woodstock, NY}
\acmPrice{15.00}
\acmISBN{978-1-4503-XXXX-X/18/06}
%% Submission ID.
%% Use this when submitting an article to a sponsored event. You'll
%% receive a unique submission ID from the organizers
%% of the event, and this ID should be used as the parameter to this command.
\acmSubmissionID{123-A56-BU3}
%% For managing citations, it is recommended to use bibliography
%% files in BibTeX format.
%% You can then either use BibTeX with the ACM-Reference-Format style,
%% or BibLaTeX with the acmnumeric or acmauthoryear styles, that include
%% support for advanced citation of software artifacts from the
%% biblatex-software package, also separately available on CTAN.
%% Look at the sample-*-biblatex.tex files for templates showcasing
%% the biblatex styles.
%%
%% The majority of ACM publications use numbered citations and
%% references. The command \citestyle{authoryear} switches to the
%% "author year" style.
%%
%% If you are preparing content for an event
%% sponsored by ACM SIGGRAPH, you must use the "author year" style of
%% citations and references.
%% Uncommenting
%% the next command will enable that style.
%% \citestyle{acmauthoryear}
%% end of the preamble, start of the body of the document source.
\begin{document}
%% The "title" command has an optional parameter,
%% allowing the author to define a "short title" to be used in page headers.
\title{Integrating Priors into Domain Adaptation Based on Evidence Theory}
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
%% Of note is the shared affiliation of the first two authors, and the
%% "authornote" and "authornotemark" commands
%% used to denote shared contribution to the research.
\author{Ying Lv}
\affiliation{
\institution{Shanghai University}
\streetaddress{99 Shangda Rd}
\city{Baoshan Qu}
\state{Shanghai Shi}
\country{China}}
\email{lvying@pjlab.org.cn}
\author{Jianpeng Ma}
\affiliation{
\institution{Shanghai University}
\streetaddress{99 Shangda Rd}
\city{Baoshan Qu}
\state{Shanghai Shi}
\country{China}}
\email{yswanty@shu.edu.cn}
%% By default, the full list of authors will be used in the page
%% headers. Often, this list is too long, and will overlap
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
\renewcommand{\shortauthors}{Ying Lv et al.}
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
Domain adaptation aims to build a learning model for a target domain by leveraging transferable knowledge from different but related source domains. Existing domain adaptation methods generally transfer knowledge from the source domain to the target domain by measuring the consistency between the two domains. Under this strategy, if the source-domain data are not sufficient to guarantee the consistency, the transferable knowledge will be very limited. On the other hand, we often have priors about the target domain that facilitate knowledge transfer but are neglected in extant domain adaptation methods. To tackle these problems, we integrate the priors of the target domain into the transfer process and propose a domain adaptation method based on evidence theory. We represent the priors with an evidential belief function and reformulate the domain adaptation objective based on the likelihood principle, in which the priors are used to adjust the transferred knowledge to suit the target domain. Based on this, we propose an improved coordinate ascent algorithm to optimize the likelihood objective of domain adaptation. Experimental results on both text and image datasets validate that the proposed method effectively improves knowledge transferability in domain adaptation, especially when the source domain is limited.
\end{abstract}
\section{Introduction}\label{sec:Intro}
In the field of machine learning, supervised learning methods have witnessed outstanding performance in many applications. The key to supervised learning is collecting sufficient labeled data for model training, which limits its usage in scenarios lacking training data. Furthermore, data annotation is usually a time-consuming, labor-expensive, or even unrealistic task.
To address this situation, Domain Adaptation (DA) is a promising methodology, which aims to build an efficient model for the target domain by making use of labeled instances from other related source domains \cite{pan2009survey}, \cite{zhuang2020comprehensive}, \cite{zhang2019recent}. Existing DA methods can be divided into four types, namely instance-based \cite{dai2007boosting} \cite{chen2011co}, feature-based \cite{courty2016optimal} \cite{fernando2013unsupervised}, model-based \cite{duan2012domain} \cite{karbalayghareh2018optimal}, and deep learning-based \cite{bengio2012deep} \cite{venkateswara2017deep}. Their fundamental idea is to discover transferable knowledge by maximizing the consistency between the source domain and the target domain, and to transfer that knowledge to the model of the target domain.
\begin{figure}
\centering
\includegraphics[scale=0.55]{IntroPriors.eps}
\caption{Integrating priors into domain adaptation to improve the performance of the target learner}
\label{fig:Example}
\end{figure}
However, the transferable knowledge is very limited when the number of labeled instances from the source domain is small or when the source and target domains are far apart. This may limit the performance of the target learner and make negative transfer more likely. As shown in Figure \ref{fig:Example}, the source domain consists of real images of backpacks, while the target domain consists of cartoon images of backpacks. If we only utilize transferable knowledge from real images (source domain) to train a target learner for recognizing cartoon images, the performance of the target learner will be limited because the source domain lacks cartoon information. However, cartoon information is easily obtained from general information and can serve as priors of the target domain. If we integrate cartoon information into domain adaptation to adjust the transferred knowledge, the problem can be alleviated.
To tackle these problems, we propose a novel method that incorporates priors of the target domain into domain adaptation based on evidence theory. Specifically, we first extract priors of the target domain from general information by minimizing the discrepancy between that information and the target domain in a reproducing kernel Hilbert space, and we design an evidential belief function for representing the priors based on Dempster's rule. Then, we integrate the priors of the target domain into domain adaptation based on the likelihood principle, in which the priors adjust the transferred knowledge to suit the target domain. Finally, we propose an improved coordinate ascent algorithm for optimizing the objective function with priors. The contributions of our work are summarized as follows.
\begin{itemize}
\item We propose a method that integrates priors into domain adaptation based on evidence theory, which adjusts the transferred knowledge to suit the target domain.
\item We propose an improved coordinate ascent algorithm for solving the objective function with target-domain priors.
\end{itemize}
The remainder of the paper is organized as follows. We start by reviewing related work in Section~\ref{sec:RelWork}. Section~\ref{sec:OurMethod} describes our proposed method, which includes obtaining and representing priors, integrating priors into domain adaptation, and the optimization algorithm. In Section~\ref{sec:Analysis}, we analyze how our method adjusts the transferred knowledge to suit the target domain through the priors. Section~\ref{sec:Exp} presents the experimental results that validate the effectiveness of the proposed method. The conclusion of our exploratory work is given in Section~\ref{sec:Conclusion}.
\section{Related Work}\label{sec:RelWork}
Evidence theory can be considered a generalization of probability theory \cite{dempster1968upper} \cite{shafer2016mathematical}. It uses Dempster's rule to perform evidential reasoning \cite{denoeux1999reasoning}. Let $\Omega=\{z_1,z_2,\ldots,z_n\}$ be a finite set that includes all possible answers to a decision problem. In classification problems, $\Omega$ can be regarded as the label space. We denote the power set as ${{2}^{\Omega }}$; its cardinality is ${{2}^{|\Omega|}}$.
The mass function $m(\cdot)$ is the Basic Probability Assignment (BPA) that represents the support degree of evidence, and $m(\cdot)$ is a mapping from ${{2}^{\Omega }}$ to the interval $[0,1]$. It satisfies the following condition:
\begin{equation}
\sum\limits_{A \in 2^{\Omega}} m(A)=1, \qquad m(\emptyset)=0
\end{equation}
Dempster's rule reflects the combined effect of evidence. Let $m_1$ and $m_2$ be two mass functions induced by independent items of evidence. They can be combined using Dempster's rule to form a new mass function defined as:
\begin{equation}
\left(m_{1} \oplus m_{2}\right)(A)=\frac{1}{1-\kappa} \sum_{B \cap C=A} m_{1}(B) m_{2}(C)
\end{equation}
where $A \subseteq \Omega$, $A \neq \emptyset$ and $\left(m_{1} \oplus m_{2}\right)(\emptyset)=0$. $\oplus$ is the
combination operator of Dempster's rule. $\kappa$ is the degree of conflict between $m_1$ and $m_2$:
\begin{equation}
\kappa=\sum_{B \cap C=\emptyset} m_{1}(B) m_{2}(C)
\end{equation}
For each normalized mass function $m$, the belief function $Bel(\cdot)$ and plausibility function $Pl(\cdot)$ are defined as follows:
\begin{equation}
\label{defineBel}
Bel(A)=\sum_{B \subseteq A} m(B)
\end{equation}
\begin{equation}
\label{definePl}
Pl(A)=\sum_{B \cap A \neq \emptyset} m(B),
\end{equation}
where $Bel(\cdot)$ and $Pl(\cdot)$ map ${{2}^{\Omega }}$ to $[0,1]$. These two functions are linked by the relation $Pl(A)=1-Bel(\overline{A})$.
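As a concrete illustration, the operations above can be sketched in Python. This is a toy implementation for illustration only; the dictionary-based representation of mass functions and the function names are our own choices, not part of the paper.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions by Dempster's rule.
    A mass function is a dict mapping frozenset (focal element) -> mass."""
    conflict = 0.0          # kappa: total mass assigned to the empty set
    combined = {}
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if not A:
                conflict += mB * mC
            else:
                combined[A] = combined.get(A, 0.0) + mB * mC
    # Normalize by 1 - kappa (assumes the evidence is not fully conflicting).
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

def bel(m, A):
    """Bel(A): sum of masses of all focal elements contained in A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Pl(A): sum of masses of all focal elements intersecting A."""
    return sum(v for B, v in m.items() if B & A)
```

For example, combining $m_1$ with focal elements $\{a\}$ and $\{a,b\}$ and $m_2$ with focal elements $\{b\}$ and $\{a,b\}$ produces a normalized mass function satisfying $Pl(A)=1-Bel(\overline{A})$.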
\section{The Proposed Method}\label{sec:OurMethod}
In this section, we present a novel method that integrates the priors of target domain into domain adaptation based on evidence theory for target learner.
In this paper, the target learner can be written as
\begin{equation}
\label{eq:targetLearner}
f({\mathcal{D}^{t}};\theta)\sim P(\mathcal{Z}|{\mathcal{D}^{t}};\mathcal{D}^{s},\Phi^t).
\end{equation}
where ${\mathcal{D}^{s}}$, ${\mathcal{D}^{t}}$ and $\Phi^t$ denote the labeled source domain, the unlabeled target domain, and the priors of the target domain, respectively.
To achieve this, we first extract priors from general information, and utilize the evidential belief function to represent priors. Then, we integrate the priors of target domain into domain adaptation. Finally, we design an improved coordinate ascent algorithm to solve the parameters.
\subsection{Extracting Priors from General Information}
The general information can be a set of instances, features, or rules. The priors of the target domain are extracted from the general information by measuring the discrepancy of information in a reproducing kernel Hilbert space. For the sake of interpretation, we assume that the general information $\mathcal{G}=\{(x_1,z_1),\ldots,(x_n,z_n)\}$ is a fully labeled set of instances with $x_k \in \mathbb{R}^{p}$. When the elements of a prior $\Phi^t$ are close to the instances $x^t$ of the target domain, we consider the discrepancy between the prior and the target domain to be small. To this end, the objective function for obtaining priors is defined as
\begin{equation}
\Phi^t = \underset{\Phi}{\operatorname{argmin}} f\left(x^{t}, \Phi \subset \mathcal{G} \right),
\end{equation}
where $f(\cdot)$ measures the discrepancy between an instance $x^t$ of the target domain and a candidate prior $\Phi$ from the general information in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$:
\begin{equation}
f \left(x^{t}, \Phi\right)=\left\| \varphi\left(x^{t}\right)-\frac{1}{|\Phi|} \sum_{x \in \Phi} \varphi(x)\right\|_\mathcal{H}^{2},
\end{equation}
where $\varphi: \mathcal{X} \mapsto \mathcal{H}$ is the feature mapping and $|\Phi|$ is the number of elements in the prior. In this paper, we adopt the radial basis function kernel, which corresponds to an infinite-dimensional feature space:
\begin{equation}
\begin{array}{l}
K\left(x, x^{t}\right)=\varphi(x)^{T} \varphi\left(x^{t}\right)
=\exp \left(-\gamma\left\|x-x^{t}\right\|^{2}\right)
\end{array}
\end{equation}
where $\left\|{x}-{x}^{t}\right\|$ is the Euclidean distance between two points and $\gamma$ is a scaling parameter. The function $f\left(x^{t}, \Phi\right)$ can be rewritten as follows:
\begin{equation}
f\left(x^{t}, \Phi\right)=\left\| \frac{1}{|\Phi|^{2}} \sum_{{x_{j}^{1},x_{j}^{2}} \in \Phi} K\left(x_{j}^{1}, x_{j}^{2}\right)-\frac{2}{|\Phi|} \sum_{x \in \Phi} K\left(x^{t}, x\right)\right\|_{H}^{2}
\end{equation}
Thus, the objective function can be rewritten as follows:
\begin{equation}
\label{eq:ObtainPrior}
\Phi^t = \underset{\Phi}{\operatorname{argmin}} \left\| \frac{1}{|\Phi|^{2}} \sum_{{x_{j}^{1},x_{j}^{2}} \in \Phi} K\left(x_{j}^{1}, x_{j}^{2}\right)-\frac{2}{|\Phi|} \sum_{x \in \Phi} K\left(x^{t}, x\right)\right\|_{H}^{2}
\end{equation}
The optimal prior $\Phi^t$ in Equation \eqref{eq:ObtainPrior} can be found by a greedy search algorithm over the general information.
In addition, the strategy for obtaining priors needs to be adapted to the data type of the general information. If the elements of the general information are rules, we need to design a new strategy that measures the discrepancy between instances of the target domain and the rules.
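For instance-type general information, the greedy search can be sketched as follows. This is a minimal illustration, not the authors' implementation: we evaluate the squared RKHS distance $\left\|\varphi(x^{t})-\frac{1}{|\Phi|}\sum_{x\in\Phi}\varphi(x)\right\|_{\mathcal{H}}^{2}$ directly via the kernel trick (including the constant self-term $K(x^t,x^t)$), and the helper names `discrepancy` and `greedy_prior` as well as the fixed prior size are our own assumptions.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def discrepancy(x_t, Phi, gamma=1.0):
    """Squared RKHS distance between phi(x_t) and the mean embedding of Phi."""
    n = len(Phi)
    self_term = rbf_kernel(x_t, x_t, gamma)                       # = 1 for the RBF kernel
    cross = 2.0 / n * sum(rbf_kernel(x_t, a, gamma) for a in Phi)
    within = sum(rbf_kernel(a, b, gamma) for a in Phi for b in Phi) / n ** 2
    return self_term - cross + within

def greedy_prior(x_t, G, size, gamma=1.0):
    """Greedy forward search: repeatedly add the instance of G that
    yields the smallest discrepancy to the target instance."""
    Phi, remaining = [], list(range(len(G)))
    for _ in range(size):
        best = min(remaining, key=lambda i: discrepancy(x_t, Phi + [G[i]], gamma))
        Phi.append(G[best])
        remaining.remove(best)
    return Phi
```

With well-separated candidates, the greedy search selects the instances closest to the target instance in the RKHS, as intended.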
\subsection{Representing Priors with Evidential Belief Function}
In evidence theory, the priors $\Phi^t$ of the target domain can be viewed as a set of evidence at different granularities:
\begin{equation}
\Phi^t=\{\Phi_1^t,\ldots,\Phi_n^t\},
\end{equation}
where $\Phi_k^t=\left\{\left({x}_{1}, z_{1}=k\right), \ldots,\left({x}_{n}, z_{n}=k\right)\right\}$ is the set of instances whose labels are equal to $k$.
Then, we adopt the belief function to represent the priors $\Phi^t$; the belief functions are defined as
\begin{equation}
\begin{split}
\label{PLfunction}
{Bel({z=k}|x^t;\Phi^t)=m(z=k|x^t;\Phi^t)},\\
{Pl({z=k}|x^t;\Phi^t)=m(z=k|x^t;\Phi^t)+m(\Omega|x^t;\Phi^t)},
\end{split}
\end{equation}
where $Bel(\cdot)$ can be interpreted as the degree to which the prior $\Phi^t$ supports that $x^t$ belongs to class $k$, and $Pl(\cdot)$ can be interpreted as an upper bound on $Bel(\cdot)$. To improve the robustness of the target learner, we use $Pl(\cdot)$ as the belief function. The mass functions $m(z=k|x^t;\Phi^t)$ and $m(\Omega|x^t;\Phi^t)$ can be calculated by evidence fusion over the different granular evidence.
\begin{equation}
\begin{aligned}
\operatorname{m}\left(z={k} | x^{t} ; \Phi^t\right)=\bigoplus_{\Phi_k^t \subseteq \Phi^t} m\left(z=k | x^t ; \Phi_k^t \right)\\
=\frac{1}{K}{\left(1-\prod_{x \in \Phi_{k}^t} m\left(\Omega | x^t ; x\right)\right) \prod_{j \neq k}\prod_{x \in \Phi_{j}^t} m\left(\Omega | x^t ; x\right)}
\end{aligned}
\end{equation}
\begin{equation}
\operatorname{m}\left(\Omega | x^{t} ; \Phi^t\right)=\bigoplus_{\Phi_k^t \subseteq \Phi^t} m\left(\Omega | x^t ; \Phi_k^t \right)
= \frac{1}{K}{\prod_{k=1}^{n} \prod_{x \in \Phi_{k}^t} m\left(\Omega | x^t ; x\right)}
\end{equation}
\begin{equation}
\sum_{k \in \Omega} m\left({z}={k} \mid x^{t} ; \Phi^t\right)+m\left(\Omega \mid x^{t} ; \Phi^t\right)=1
\end{equation}
where $x^t$ is an instance of the target domain, $x$ is an element of the priors, and $\Omega$ can be considered the label space in a classification task. The orthogonal sum $\bigoplus$ represents the combination operator of Dempster's rule. $K$ is the degree of conflict between the items of evidence and can be interpreted as a normalizing factor:
\begin{equation}
\begin{aligned}
K=\sum_{k=1}^{n} \left(m\left(z={k} | x^{t} ; \Phi_{k}^t\right) \prod_{j \neq k} m\left(\Omega | x^{t} ; \Phi_{j}^t\right)\right)
+\prod_{k=1}^{n} m\left(\Omega | x^{t} ; \Phi_{k}^t\right),
\end{aligned}
\end{equation}
where the mass function $m(\cdot)$ is the Basic Probability Assignment (BPA) on the priors, mapping ${{2}^{\Omega }}$ to the interval $[0,1]$. Specifically, we adopt a distance measure in a reproducing kernel Hilbert space to design $m(\cdot|x^t;x)$. The mass functions $m\left(z={k} | x^{t} ; x_{kj}\right)$ and ${m}\left(\Omega | x^{t} ; x_{kj}\right)$ can be defined as
\begin{equation}
\mathrm{m}\left(z={k} | x^{t} ; x_{kj}\right)= \exp \left(-d^{2}\left(\varphi\left(x^{t}\right), \varphi\left(x_{k j}\right)\right)\right),
\end{equation}
\begin{equation}
\mathrm{m}\left(\Omega | x^{t} ; x_{kj}\right)=1-\exp \left(-d^{2}\left(\varphi\left(x^{t}\right), \varphi\left(x_{k j}\right)\right)\right),
\end{equation}
where $d^2(\cdot)$ is a distance metric, defined as follows:
\begin{equation}
d^{2}\left(\varphi\left(x^{t}\right), \varphi\left(x_{k j}\right)\right)=\mathrm{K}\left(x^{t}, x^{t}\right)-2 \mathrm{K}\left(x^{t}, x_{k j}\right)+\mathrm{K}\left(x_{k j}, x_{k j}\right)
\end{equation}
in which $\varphi\left(\cdot\right)$ is a mapping function that maps the original space to a high-dimensional space, and $\mathrm{K}\left(\cdot\right)$ is a kernel function.
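The construction and fusion of the evidence above can be sketched numerically as follows. This is an illustrative toy, not the authors' code: we assume the RBF kernel, masses of the form $\exp(-d^{2})$ so that they lie in $[0,1]$, and hypothetical helper names (`singleton_mass`, `fuse_priors`).

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def singleton_mass(x_t, x, gamma=1.0):
    """m(z=k | x_t; x) = exp(-d^2) with d^2 the kernel-space distance."""
    d2 = rbf(x_t, x_t, gamma) - 2 * rbf(x_t, x, gamma) + rbf(x, x, gamma)
    return np.exp(-d2)

def fuse_priors(x_t, prior, gamma=1.0):
    """Fuse per-class granular evidence into m(z=k), m(Omega), and Pl.
    prior: dict mapping class k -> list of prior instances labeled k."""
    # Product of ignorance masses m(Omega | x_t; x) = 1 - exp(-d^2) per class.
    omega_prod = {k: np.prod([1 - singleton_mass(x_t, x, gamma) for x in xs])
                  for k, xs in prior.items()}
    # Unnormalized singleton masses: class-k support times other classes' ignorance.
    unnorm = {k: (1 - omega_prod[k]) *
                 np.prod([v for j, v in omega_prod.items() if j != k])
              for k in prior}
    m_omega = np.prod(list(omega_prod.values()))
    K = sum(unnorm.values()) + m_omega            # normalizing factor
    m = {k: v / K for k, v in unnorm.items()}
    m_omega /= K
    pl = {k: m[k] + m_omega for k in m}           # Pl(z=k) = m(z=k) + m(Omega)
    return m, m_omega, pl
```

A target instance close to the class-$0$ prior instances receives a higher plausibility for class $0$, and the fused masses sum to one, as the normalization requires.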
\subsection{Integrating Priors into Domain Adaptation}
According to Equation \eqref{eq:targetLearner}, we aim to integrate priors into domain adaptation to adjust the transferred knowledge to suit the target domain. To this end, we propose a domain adaptation method that incorporates $Pl({z=k}|x^t;\Phi^t)$ into the objective function. The objective function is defined as
\begin{equation}
\label{eq:Likelihood}
\mathcal{L}({{\theta }})={{\alpha }}*\mathcal{L}({\mathcal{D}^s};{{\theta }})+{\left(1-{\alpha }\right)}*{\mathcal{L}_{e}}({\mathcal{D}^t},\Phi^t;{{\theta }}),
\end{equation}
where $\alpha$ is a trade-off parameter and ${\mathcal{L}({\mathcal{D}^s};{{\theta }})}$ is a log-likelihood function, formulated as
\begin{equation}
\label{eq_likelihood}
\mathcal{L}({\mathcal{D}^s};{{\theta }})=\sum\limits_{{{x}^{s}}\in {{\mathcal{D}^{s}}}}{\ln \left\{ \sum\limits_{z\in \mathcal{Z}}{p({{x}^{s}},z;\theta )} \right\}},
\end{equation}
where ${x^{s}}$ denotes a labeled instance in the source domain and $\mathcal{Z}$ is the label set. The function ${\mathcal{L}_{e}}({\mathcal{D}^t},\Phi^t;{{\theta}})$ is the evidential likelihood function \cite{denoeux2011maximum}, formulated as
\begin{equation}
\label{EvidenctLikelihoodFunction}
\mathcal{L}_{e}({\mathcal{D}^t},\Phi^t;{{\theta }})=\sum\limits_{{{x}^{t}}\in {{\mathcal{D}^{t}}}}{\ln \left\{ \sum\limits_{z\in \mathcal{Z}}{p({{x}^{t}},z;\theta )Pl({z}|x^t;\Phi^t)} \right\}},
\end{equation}
where ${x^{t}}$ denotes an unlabeled instance in the target domain, and $Pl({z}|x^t;\Phi^t)$ is used to adjust the transferred knowledge to suit the target domain.
\subsection{Optimizing Algorithm}\label{sec:Optimaze}
To estimate the parameters $\theta$ of the target learner, we maximize $\mathcal{L}(\theta)$:
\begin{equation}
\theta^*=\arg \max _{\theta}{\mathcal{L}(\theta)}.
\end{equation}
Maximizing $\mathcal{L}(\theta)$ with respect to $\theta$ amounts to minimizing the conflict between the transferred knowledge and the priors.
Then, we assume that the class $z$ is a hidden variable. Consider an arbitrary distribution $q(z)$ over the hidden variable $z$, also known as the responsibilities. $\mathcal{L}(\theta)$ can be written as
\begin{equation}
\label{eq:LikelihoodFunctionWithq}
\begin{split}
\mathcal{L}(\theta) \triangleq {{\alpha }}*\sum\limits_{{{x}^{s}}\in {{{\mathcal{D}}^{s}}}}{\ln \left\{ \sum\limits_{z\in \mathcal{Z}}{q(z)\frac{p({{x}^{s}},z;\theta )}{q(z)}} \right\}}\\+{\left(1-{\alpha }\right)}*\sum\limits_{{{x}^{t}}\in {{{\mathcal{D}}^{t}}}}{\ln \left\{ \sum\limits_{z\in \mathcal{Z}}{q(z)\frac{p({{x}^{t}},z;\theta )Pl(z|{x}^{t};\Phi^t)}{q(z)}} \right\}}.
\end{split}
\end{equation}
Direct maximization of $\mathcal{L}(\theta)$ by the traditional coordinate ascent algorithm is quite difficult because of the sum of terms with $Pl({z}|x^t;\Phi^t)$ inside the logarithm in Equation \eqref{eq:LikelihoodFunctionWithq}. Therefore, we propose an improved coordinate ascent algorithm to seek the optimal $\theta$, which can be considered a generalized expectation-maximization algorithm.
Applying Jensen's inequality to Equation \eqref{eq:LikelihoodFunctionWithq}, we obtain the lower bound of $\mathcal{L}(\theta)$:
\begin{equation}
\label{LikelihoodfunctionLowerBound}
\begin{split}
\mathcal{L}(\theta )\ge {{\alpha }}*\sum\limits_{{{x}^{s}}\in {{{\mathcal{D}}^{s}}}}{\sum\limits_{z\in \mathcal{Z}}{\left\{ q(z)\ln \left\{ \frac{p({{x}^{s}},z;\theta )}{q(z)} \right\} \right\}}}\\+{\left(1-{\alpha }\right)}*\sum\limits_{{{x}^{t}}\in
{{{\mathcal{D}}^{t}}}}{\sum\limits_{z\in \mathcal{Z}}{\left\{ q(z)\ln \left\{ \frac{p({{x}^{t}},z;\theta )Pl(z|{x}^{t};\Phi^t)}{q(z)} \right\} \right\}}} \\
\end{split}
\end{equation}
Let us denote this lower bound as follows:
\begin{equation}
\begin{split}
\mathcal{Q}(\theta,q )\triangleq {{\alpha }}*\sum\limits_{{{x}^{s}}\in {{{\mathcal{D}}^{s}}}}{\sum\limits_{z\in \mathcal{Z}}{\left\{ q(z)\ln \left\{ \frac{p({{x}^{s}},z;\theta )}{q(z)} \right\} \right\}}}\\+{\left(1-{\alpha }\right)}*\sum\limits_{{{x}^{t}}\in {{{\mathcal{D}}^{t}}}}{\sum\limits_{z\in \mathcal{Z}}{\left\{ q(z)\ln \left\{ \frac{p({{x}^{t}},z;\theta )Pl(z|{x}^{t};\Phi^t)}{q(z)} \right\} \right\}}} \\
\end{split}
\end{equation}
The above argument holds for any positive distribution $q$, but we should pick the $q$ that yields the tightest lower bound. The tightest lower bound is obtained when
\begin{equation}
\frac{p({{x}^{t}},z;\theta )Pl(z|{x}^{t};\Phi^t)}{q(z)}= constant
\end{equation}
and
\begin{equation}
\label{eq:q(z)ofSource}
\frac{p({{x}^{s}},z;\theta)}{q(z)}= constant
\end{equation}
Because the instances $x^s$ of the source domain are labeled, their class membership is certain, so we set $Pl(z|{x}^{s};\Phi^t)=1$. Equation \eqref{eq:q(z)ofSource} can then be written as follows:
\begin{equation}
\frac{p({{x}^{s}},z;\theta)}{q(z)}=\frac{p({{x}^{s}},z;\theta )Pl(z|{x}^{s};\Phi^t)}{q(z)} = constant
\end{equation}
Since $\sum\limits_{z}{q(z)}=1$, $q(z)$ can be written as follows:
\begin{equation}
\label{eq:q(z)}
q(z)=\frac{p(x,z;\theta )Pl(z|{x};\Phi^t)}{\sum\limits_{z}{p(x,z;\theta )Pl(z|{x};\Phi^t)}}
\end{equation}
Thus, $q(z)$ can be adjusted through $Pl(z|{x};\Phi^t)$.
For the convenience of analysis, we assume that the data fit a Gaussian mixture model $\sum_{k=1}^{K} \pi_{k} {N}\left({x} | \mu_{k}, \Sigma_{k}\right)$, where $\pi_k$ is a mixing coefficient, $\mu_k$ is a mean, and $\Sigma_k$ is a covariance matrix. Then $q(z)$ can be rewritten as:
\begin{equation}
q(z)=\frac{{{\pi }}N(x|{{\mu }},{{\Sigma }})Pl(z|{x};\Phi^t)}{\sum\limits_{z}{{{\pi }}N(x|{{\mu }},{{\Sigma }})Pl(z|{x};\Phi^t)}}
\end{equation}
The lower bound $\mathcal{Q}(\theta,q )$ can be rewritten as
\begin{equation}
\begin{split}
\mathcal{Q}(\theta,q)= {{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{\sum\limits_{k=1}^{K}{{{q}_{i}}({{z}_{i}}=k)} \ln\left\{ \frac{{{\pi }_{k}}{N}\left({x} | \mu_{k}, \Sigma_{k}\right)}{{{q}_{i}}({{z}_{i}}=k)} \right\}} + \\
{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{\sum\limits_{k=1}^{K}{{{q}_{j}}({{z}_{j}}=k)}}\ln \left\{\frac{{{\pi}_{k}}{N}\left({x} | \mu_{k}, \Sigma_{k}\right){Pl(z|{x}^{t};\Phi^t)}}{{{q}_{j}}({{z}_{j}}=k)} \right\} \\
\end{split}
\end{equation}
By maximizing the lower bound $\mathcal{Q}(\theta,q )$ of the objective function $\mathcal{L}(\theta)$, we obtain the optimal solution in the following form:
\begin{equation}
\begin{split}
{{\mu }_{k}}=\frac{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}x_{i}^{s}+{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)x_{j}^{t}}}{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}+{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}
\end{split}
\end{equation}
\begin{equation}
\begin{split}
{{\Sigma }_{k}}=\frac{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)(x_{i}^{s}-{{\mu }_{k}}){{(x_{i}^{s}-{{\mu }_{k}})}^{T}}}}{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}+\left(1-{\alpha }\right)\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}\\+\frac{\left(1-{\alpha }\right)\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k){(}x_{j}^{t}-{{\mu }_{k}}){{(x_{j}^{t}-{{\mu }_{k}})}^{T}}}}{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}+\left(1-{\alpha }\right)\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}
\end{split}
\end{equation}
\begin{equation}
\begin{split}
{{\pi }_{k}}=\frac{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)+}{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}{{{{\alpha }}{{n}^{s}}+{\left(1-{\alpha }\right)}{{n}^{t}}}}
\end{split}
\end{equation}
where $n^s=|\mathcal{D}^s|$ is the number of samples in the source domain and $n^t=|\mathcal{D}^t|$ is the number of samples in the target domain.
The improved coordinate ascent algorithm includes an E-step and an M-step, where the E-step maximizes $\mathcal{Q}(\theta,q )$ with respect to $q$ and the M-step maximizes $\mathcal{Q}(\theta,q )$ with respect to $\theta$. The details are as follows:\\
(1) Initialize the parameters ${{\mu }_{k}}$, ${{\Sigma }_{k}}$ and ${{\pi }_{k}}$.\\
(2) E-step. Evaluate the belief function $Pl(\cdot)$ according to the prior $\Phi^t$, and evaluate the responsibilities $q(\cdot)$ using the current parameters.
\begin{equation}
\label{PLfunctionA}
{Pl({z=k}|x^t;\Phi^t)=m(z=k|x^t;\Phi^t)+m(\Omega|x^t;\Phi^t)}
\end{equation}
\begin{equation}
\begin{split}
{{q}}(z=k;\theta,\Phi^t)
=\frac{{{\pi }_{k}^{l}}N({{x}}|{{\mu }_{k}^{l}},{{\Sigma }_{k}^{l}})Pl(z=k|{x}^{t};\Phi^t)}{\sum\limits_{k=1}^{K}{\left\{ {{\pi }_{k}^{l}}N({{x}}|{{\mu }_{k}^{l}},{\Sigma _{k}^{l}})Pl(z=k|{x}^{t};\Phi^t) \right\}}}
\end{split}
\end{equation}
(3) M-step. Re-estimate the parameters using the current responsibilities.
\begin{equation}
\begin{split}
{{\mu }_{k}^{l+1}}=\frac{{{\alpha}}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}x_{i}^{s}+{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)x_{j}^{t}}}{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}+{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}
\end{split}
\end{equation}
\begin{equation}
\begin{split}
{{\Sigma }_{k}^{l+1}}= \frac{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)(x_{i}^{s}-{{\mu }_{k}^{l+1}}){{(x_{i}^{s}-{{\mu }_{k}^{l+1}})}^{T}}}}{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}+{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}} \\
+\frac{{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)(x_{j}^{t}-{{\mu }_{k}^{l+1}}){{(x_{j}^{t}-{{\mu }_{k}^{l+1}})}^{T}}}}{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)}+{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}
\end{split}
\end{equation}
\begin{equation}
\begin{split}
{{\pi }_{k}^{l+1}}=\frac{{{\alpha }}\sum\limits_{i=1}^{{{n}^{s}}}{{{q}_{i}}({{z}_{i}}=k)+}{\left(1-{\alpha }\right)}\sum\limits_{j=1}^{{{n}^{t}}}{{{q}_{j}}({{z}_{j}}=k)}}{{{{\alpha }}{{n}^{s}}+{\left(1-{\alpha }\right)}{{n}^{t}}}}
\end{split}
\end{equation}
(4) Evaluate the log likelihood $\mathcal{L}(\theta )$ and check it for convergence. If the convergence criterion is not satisfied, return to step (2) after letting
\begin{equation}
{\pi _{k}^{l}}\leftarrow {\pi _{k}^{l+1}},{\mu _{k}^{l}}\leftarrow {\mu _{k}^{l+1}},{\Sigma _{k}^{l}}\leftarrow {\Sigma _{k}^{l+1}}
\end{equation}
Finally, we obtain the locally optimal parameters $\theta^*$. The complete procedure for integrating priors into domain adaptation is summarized in Algorithm~\ref{alg:1}.
\begin{algorithm}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\caption{Integrating Priors into Domain Adaptation}
\label{alg:1}
\begin{algorithmic}[1]
\REQUIRE Labeled source domain ${\mathcal{D}^s}$, general information ${\mathcal{G}}$, unlabeled target domain ${\mathcal{D}^t}$.
\ENSURE Posterior probability $p(z|x^t;\theta)$, where ${x^t \in \mathcal{D}^t}$.
\STATE Extracting priors $\Phi^t$ of target domain from general information by Equation \eqref{eq:ObtainPrior};
\STATE Representing priors $\Phi^t$ by evidential belief function $Pl(\cdot)$;
\STATE Integrating $Pl(\cdot)$ into domain adaptation through the objective function $\mathcal{L}(\theta)$;
\STATE Seeking the optimal parameters $\theta$ by the improved coordinate ascent algorithm;
\STATE \textbf{return} $p(z|x^t;\theta)$
\end{algorithmic}
\end{algorithm}
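Under the Gaussian mixture assumption, the E-step and M-step above can be sketched in Python. This is a simplified illustration under our own assumptions, not the authors' implementation: $Pl(z=k|x^t;\Phi^t)$ is supplied as a precomputed matrix, the source labels enter only through the marginal likelihood as in the objective, the initialization is a naive deterministic split, and a small ridge term keeps the covariances invertible.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    p = mu.size
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def fit_prior_gmm(Xs, Xt, Pl, K, alpha=0.5, n_iter=50, reg=1e-6):
    """Improved coordinate ascent (EM-style) for the prior-weighted objective.
    Xs: source instances, Xt: target instances, Pl: (n_t, K) plausibility matrix."""
    p = Xs.shape[1]
    ns, nt = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    # Naive deterministic init: split pooled data along the first coordinate.
    idx = np.array_split(np.argsort(X[:, 0]), K)
    mu = np.array([X[i].mean(axis=0) for i in idx])
    Sigma = np.array([np.eye(p) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities; target ones are re-weighted by Pl.
        dens_s = np.stack([pi[k] * gauss_pdf(Xs, mu[k], Sigma[k])
                           for k in range(K)], axis=1)
        dens_t = np.stack([pi[k] * gauss_pdf(Xt, mu[k], Sigma[k])
                           for k in range(K)], axis=1) * Pl
        qs = dens_s / dens_s.sum(axis=1, keepdims=True)
        qt = dens_t / dens_t.sum(axis=1, keepdims=True)
        # M-step: alpha-weighted updates of mu, Sigma, pi.
        ws, wt = alpha * qs, (1 - alpha) * qt
        denom = ws.sum(axis=0) + wt.sum(axis=0)
        mu = (ws.T @ Xs + wt.T @ Xt) / denom[:, None]
        for k in range(K):
            ds, dt = Xs - mu[k], Xt - mu[k]
            Sigma[k] = (ds.T @ (ws[:, k:k+1] * ds)
                        + dt.T @ (wt[:, k:k+1] * dt)) / denom[k] + reg * np.eye(p)
        pi = denom / (alpha * ns + (1 - alpha) * nt)
    return pi, mu, Sigma, qt
```

Note that with $Pl \equiv 1$ the procedure reduces to an $\alpha$-weighted EM for an ordinary Gaussian mixture over both domains; informative plausibilities tilt the target responsibilities toward the prior-supported classes.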
\section{Analyzing Effectiveness of Priors}\label{sec:Analysis}
\begin{figure*}
\centering
\includegraphics[scale=0.45]{transformspace.eps}
\caption{The first line shows the traditional method for domain adaptation. The second line shows that the priors adjust the transferred knowledge to suit the target domain by changing the projection. Specifically, the belief function $Pl(z|x)$ adjusts the coordinate system $\{y_i\} \rightarrow \{\widehat{y}_i\}$ to suit the target domain. In the new coordinate system $\{\widehat{y}_i\}$, the $\widehat{u}_i$ adjust the directions of the ellipse to suit the target domain, and the $\widehat{\lambda}_{i}^{1/2}$ are scaling factors that adjust the size of the ellipse to cover the target domain.}
\label{fig:AnalyzePriors}
\end{figure*}
In this section, we analyze how our method adjusts the transferred knowledge to suit the target domain through the priors. In Equation \eqref{eq:Likelihood}, the product $p(x,z;\theta)Pl(z|x;\Phi^t)$ can be viewed as a transformation of the space based on $Pl(z|x;\Phi^t)$. Specifically, multiplying $p(x,z;\theta )$ by $Pl(z|x;\Phi^t)$ changes the projection space of $x$.
For the sake of interpretation, we assume a Gaussian distribution $x\sim N(\mu,\Sigma )$ for class $k$, with $x\in \mathbb{R}^{p}$. Then $p(x,z;\theta)$ can be written as
\begin{equation}
p(x,z;\theta) =\frac{1}{{A}}{\exp \left(-\frac{1}{2}{{(x-\mu )}^{T}}{{\Sigma}^{-1}}(x-\mu )\right)}
\end{equation}
where $A={(2\pi)^{p/2}}{|\Sigma|^{1/2}}$ and $\Sigma^{-1}$ is defined as follows:
\begin{equation}
\Sigma^{-1} =\left( \begin{matrix}
{{\sigma }_{11}} & {{\sigma }_{12}} & \cdots & {{\sigma }_{1p}} \\
{{\sigma }_{21}} & {{\sigma }_{22}} & \cdots & {{\sigma }_{2p}} \\
\vdots & \vdots & \ddots & \vdots \\
{{\sigma }_{p1}} & {{\sigma }_{p2}} & \cdots & {{\sigma }_{pp}} \\
\end{matrix} \right).
\end{equation}
Thus, the $p(x,z;\theta)Pl(z|x;\Phi^t)$ can be written as
\begin{equation}
\begin{split}
p(x,z;\theta )Pl(z|x;\Phi^t) \\
=\frac{1}{{A}}{\exp \left(-\frac{1}{2}{{(x-\mu )}^{T}}{{\Sigma}^{-1}}(x-\mu )\right)*Pl(z|x;\Phi^t)}\\
=\frac{1}{{A}}{\exp \left(-\frac{1}{2}{{(x-\mu )}^{T}}{{\Sigma}^{-1}}(x-\mu )+\log Pl(z|x;\Phi^t)\right)}.
\end{split}
\end{equation}
First, according to the rules of matrix multiplication, $\log Pl(z|x;\Phi^t)$ can be expressed as the following quadratic form
\begin{equation}
\log Pl(z|x;\Phi^t)=-\frac{1}{2}(x-\mu)^{T} \mathrm{M}(x-\mu)
\end{equation}
where
\begin{equation}
\mathrm{M}=\left(\begin{array}{ccc}
\frac{-2\log Pl(z|x;\Phi^t)}{p\left(x_{1}-\mu_{1}\right)^{2}} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{-2\log Pl(z|x;\Phi^t)}{p\left(x_{p}-\mu_{p}\right)^{2}}
\end{array}\right).
\end{equation}
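As a numerical sanity check, the construction of the diagonal matrix $\mathrm{M}$ can be verified with hypothetical toy values (a 2-D point, mean, and plausibility value chosen only for illustration, not taken from the paper's experiments):

```python
import numpy as np

# Hypothetical toy values for illustration only
p = 2
x = np.array([1.3, -0.7])
mu = np.array([0.5, 0.2])
Pl = 0.6  # a plausibility value in (0, 1]

# Diagonal M with entries -2*log(Pl) / (p * (x_i - mu_i)^2)
M = np.diag(-2.0 * np.log(Pl) / (p * (x - mu) ** 2))

# The quadratic form -1/2 (x-mu)^T M (x-mu) recovers log Pl exactly:
# each of the p diagonal terms contributes -2*log(Pl)/p, and the
# factor -1/2 turns their sum into log(Pl).
log_pl = -0.5 * (x - mu) @ M @ (x - mu)
assert np.isclose(log_pl, np.log(Pl))
```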
Then, leveraging the distributive law of matrix multiplication to combine the two quadratic terms inside the exponent, $p(x,z;\theta)Pl(z|x;\Phi^t)$ can be transformed into the following form
\begin{equation}
\begin{split}
&p(x, z ; \theta)Pl(z|x;\Phi^t) \\
&=\frac{1}{A} \exp \left(-\frac{1}{2}(x-\mu)^{T}\left(\Sigma^{-1}+M\right)(x-\mu)\right),
\end{split}
\end{equation}
where $\Sigma^{-1}+M$ is equal to
\begin{equation}
\left(\begin{array}{ccc}
\sigma_{11}+\frac{-2\log Pl(z|x;\Phi^t)}{p\left(x_{1}-\mu_{1}\right)^{2}} & \cdots & \sigma_{1 p} \\
\vdots & \ddots & \vdots \\
\sigma_{p 1} & \cdots & \sigma_{p p}+\frac{-2\log Pl(z|x;\Phi^t)}{p\left(x_{p}-\mu_{p}\right)^{2}}
\end{array}\right),
\end{equation}
and we denote
\begin{equation}
\widehat{\Sigma}^{-1}=\Sigma^{-1}+M.
\end{equation}
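The central claim here, that the product density equals a single Gaussian-form expression with the combined precision $\widehat{\Sigma}^{-1}=\Sigma^{-1}+\mathrm{M}$, can be checked numerically. The covariance, point, and plausibility below are hypothetical values chosen for illustration, not from the paper:

```python
import numpy as np

# Hypothetical values for illustration only
p = 2
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.3, -0.7])
mu = np.array([0.5, 0.2])
Pl = 0.6

# Gaussian density p(x,z;theta) with normalizer A = (2*pi)^(p/2) |Sigma|^(1/2)
A = (2.0 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
d = x - mu
gauss = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / A

# Combined precision Sigma^{-1} + M
M = np.diag(-2.0 * np.log(Pl) / (p * d ** 2))
Sigma_hat_inv = np.linalg.inv(Sigma) + M

# Product density equals the single exponential with the combined
# precision, confirming that Pl is absorbed into the exponent
lhs = gauss * Pl
rhs = np.exp(-0.5 * d @ Sigma_hat_inv @ d) / A
assert np.isclose(lhs, rhs)
```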
In order to analyze the influence of $Pl(z|x;\Phi^t)$, we compare $p(x, z ; \theta)Pl(z|x;\Phi^t)$ with $p(x, z ; \theta)$ with respect to the expression inside the exponent of the Gaussian function. These expressions are denoted as follows
\begin{equation}
\label{eq:delta}
\begin{split}
\Delta ={{(x-\mu )}^{T}}{{\Sigma }^{-1}}(x-\mu ),\\
\widehat{\Delta} ={{(x-\mu )}^{T}}\widehat{{\Sigma }}^{-1}(x-\mu ).
\end{split}
\end{equation}
We can see that the matrices $\Sigma^{-1}$ and $M$ are symmetric. By the properties of symmetric matrices, the matrix $\widehat{\Sigma}^{-1}$ is also symmetric. Moreover, the matrices $\Sigma^{-1}$ and $\widehat{\Sigma}^{-1}$ can be expressed as expansions in terms of their eigenvectors via eigendecomposition. That is, we write
\begin{equation}
\Sigma=\mathbf{U} \Lambda \mathbf{U}^{T}, \quad \widehat{\Sigma}=\mathbf{\widehat{U}} \widehat{\Lambda} \mathbf{\widehat{U}}^{T},
\end{equation}
where $\mathbf{U}$ and $\mathbf{\widehat{U}}$ are the orthonormal matrices of eigenvectors satisfying $\mathbf{U}^{T} \mathbf{U}=\mathbf{I}$ and $\mathbf{\widehat{U}}^{T} \mathbf{\widehat{U}}=\mathbf{I}$, and $\Lambda$ and $\widehat{\Lambda}$ are the diagonal matrices of eigenvalues.
Using the eigendecomposition, we have that
\begin{equation}
\label{eq:Exeigenvectors}
\begin{split}
{{\Sigma }^{-1}}=\mathbf{U} \Lambda^{-1} \mathbf{U}^{T} =\sum\limits_{i=1}^{p}{{{u}_{i}}\frac{1}{{{\lambda }_{i}}}}u_{i}^{T},\\
\widehat{{\Sigma }}^{-1}=\mathbf{\widehat{U}} \widehat{\Lambda}^{-1} \mathbf{\widehat{U}}^{T}=\sum\limits_{i=1}^{p}{\widehat{{u}}_{i}\frac{1}{{\widehat{\lambda }_{i}}}}\widehat{u}_{i}^{T},
\end{split}
\end{equation}
where ${{u}_{i}}$ and ${\widehat{u}_{i}}$ are the $i^{th}$ eigenvectors of ${{\Sigma }^{-1}}$ and ${\widehat{\Sigma }^{-1}}$, and
$\frac{1}{{{\lambda }_{i}}}$ and $\frac{1}{{\widehat{\lambda }_{i}}}$ are the corresponding $i^{th}$ eigenvalues of ${{\Sigma }^{-1}}$ and ${\widehat{\Sigma }^{-1}}$.
Then, we define
\begin{equation}
\label{eq:CoordinateSystem}
{{y}_{i}}={{(x-\mu )}^{T}}{{u}_{i}},\quad {\widehat{y}_{i}}={{(x-\mu )}^{T}}{\widehat{u}_{i}},
\end{equation}
where $\{y_i\}$ and $\{\widehat{y}_i\}$ can be interpreted as the coordinate systems defined by the orthonormal vectors $\{u_i\}$ and $\{\widehat{u}_i\}$.
Substituting Equation \eqref{eq:Exeigenvectors} into Equation \eqref{eq:delta}, we have
\begin{equation}
\label{eq:projectionM}
\Delta =\sum\limits_{i=1}^{p}{\frac{y_{i}^{2}}{{{\lambda }_{i}}}}, \quad \widehat{\Delta} =\sum\limits_{i=1}^{p}{\frac{\widehat{y}_{i}^{2}}{{\widehat{\lambda }_{i}}}}.
\end{equation}
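The equivalence between the quadratic form $\Delta$ and its eigenbasis expansion can be checked numerically. The covariance and point below are hypothetical illustration values, not from the paper:

```python
import numpy as np

# Hypothetical covariance and point for illustration only
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.3, -0.7])
mu = np.array([0.5, 0.2])

# Eigendecomposition Sigma = U diag(lam) U^T (eigh handles the
# symmetric case and returns orthonormal eigenvectors)
lam, U = np.linalg.eigh(Sigma)

# Coordinates y_i = (x - mu)^T u_i in the eigenvector basis
y = U.T @ (x - mu)

# Delta from the quadratic form vs. the sum of y_i^2 / lambda_i
delta_quad = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
delta_eig = np.sum(y ** 2 / lam)
assert np.isclose(delta_quad, delta_eig)
```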
According to the properties of real symmetric matrices, the sum of the eigenvalues of a matrix equals the sum of its main diagonal elements (its trace). Thus,
\begin{equation}
\begin{split}
\sum_{i=1}^{p} \frac{1}{\lambda_{i}}=\sum_{i=1}^{p} \sigma_{i i},\\
\sum_{i=1}^{p} \frac{1}{\widehat{\lambda}_{i}}=\sum_{i=1}^{p} \left(\sigma_{i i}+\frac{-2 \log Pl(z|x;\Phi^t)}{p\left(x_{i}-\mu_{i}\right)^{2}}\right).
\end{split}
\end{equation}
We find that the transformation $\frac{1}{\lambda_{i}} \rightarrow \frac{1}{\widehat{\lambda}_{i}}$ is adjusted based on $Pl(z|x;\Phi^t)$. Because the eigenvectors are determined together with the eigenvalues by the matrix $\widehat{\Sigma}^{-1}$, the transformation $u_{i} \rightarrow \widehat{u}_{i}$ is also adjusted based on $Pl(z|x;\Phi^t)$. Moreover, according to Equation \eqref{eq:CoordinateSystem}, since the orthonormal vectors $\{\widehat{u}_i\}$ define the coordinate system $\{\widehat{y}_i\}$, the $\{\widehat{y}_i\}$ form a new coordinate system based on $Pl(z|x;\Phi^t)$, shifted and rotated with respect to the original $\{y_i\}$ coordinates according to $\{\widehat{u}_{i}\}$. Thus, $Pl(z|x;\Phi^t)$ adjusts the coordinate system $\{y_i\}$ to the new coordinate system $\{\widehat{y}_i\}$ to suit the target domain.
For further interpretation, we set $p=2$, and Equation \eqref{eq:projectionM} can be rewritten as
\begin{equation}
\label{eq:elliptic}
\frac{{{y}_{1}}^{2}}{{{\lambda }_{1}}}+\frac{{{y}_{2}}^{2}}{{{\lambda }_{2}}}=\Delta,\quad \frac{{\widehat{y}_{1}}^{2}}{{\widehat{\lambda }_{1}}}+\frac{{\widehat{y}_{2}}^{2}}{{\widehat{\lambda }_{2}}}=\widehat{\Delta}.
\end{equation}
It is obvious that Equation \eqref{eq:elliptic} is the equation of an ellipse. Thus, the contours of equal probability density lie along ellipses determined by $\Delta$ and $\widehat{\Delta}$. Figure \ref{fig:AnalyzePriors} shows the transformation of the coordinate system based on $Pl(z|x;\Phi^t)$ in domain adaptation. We can see that the traditional method cannot distinguish class 1 and class 2 in the target domain. In contrast, our method, which integrates the prior into domain adaptation, can adjust the transferred knowledge to suit the target domain.
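The ellipse interpretation for $p=2$ can be illustrated numerically: points on the contour $y_1^2/\lambda_1 + y_2^2/\lambda_2 = \Delta$ have semi-axes $\sqrt{\lambda_i \Delta}$ along the eigenvectors $u_i$, and every such point yields the same quadratic-form value. The covariance and $\Delta$ below are hypothetical illustration values, not from the paper:

```python
import numpy as np

# Hypothetical 2-D covariance and contour level for illustration only
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mu = np.array([0.5, 0.2])
lam, U = np.linalg.eigh(Sigma)

# Parametrize the ellipse in the eigenbasis: semi-axes sqrt(lam_i * Delta)
Delta = 1.5
t = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.stack([np.sqrt(lam[0] * Delta) * np.cos(t),
              np.sqrt(lam[1] * Delta) * np.sin(t)])
pts = mu[:, None] + U @ y  # map back to the original x-coordinates

# Every point on the contour yields the same quadratic form Delta
d = pts - mu[:, None]
vals = np.einsum('ij,ij->j', d, np.linalg.inv(Sigma) @ d)
assert np.allclose(vals, Delta)
```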