\section{Related Work}
\paragraph{Markov Decision Processes:}
The study of reinforcement learning is fundamentally rooted in the theory of Markov decision processes (MDPs). A concise account of stochastic approximation algorithms for reinforcement learning in MDPs is provided by \cite{1512.07669}. The work in \cite{1511.02377} offers a full characterization of the set of value functions of MDPs, while \cite{1512.09075} specifies a notation for MDPs. The concept of decisiveness in denumerable Markov chains has been extended to MDPs in \cite{2008.10426}, which explores the implications of resolving the non-determinism in adversarial or cooperative ways. Additionally, \cite{0711.2185} introduces an embedding technique that produces a finite-state MDP from a countable-state MDP, which can serve as an approximation for computational purposes.
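As a brief, standard recap (not specific to any of the cited works), the optimal value function of a discounted MDP with states $s$, actions $a$, transition kernel $P$, reward $r$, and discount factor $\gamma \in [0,1)$ satisfies the Bellman optimality equation
\begin{equation}
V^{*}(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s') \Big],
\end{equation}
and the algorithms surveyed below can be viewed as stochastic approximation schemes for solving this fixed-point equation from sampled transitions.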
\paragraph{Q-Learning and Variants:}
Q-learning is a widely used reinforcement learning algorithm that converges to the optimal action-value function \cite{2303.08631}. However, it is known to overestimate values and to spend excessive time exploring unhelpful states. Double Q-learning, a convergent alternative, mitigates some of these overestimation issues but may converge more slowly \cite{2303.08631}. To address the maximization bias of Q-learning, \cite{2012.01100} introduces a self-correcting algorithm that balances the overestimation of conventional Q-learning against the underestimation of Double Q-learning. This self-correcting Q-learning algorithm is shown to be more accurate and to achieve faster convergence in certain domains.
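For reference, the tabular Q-learning update and the Double Q-learning target take the following textbook forms (a standard sketch with step size $\alpha$ and discount factor $\gamma$, not the specific variants analyzed in the cited works):
\begin{align}
Q(s_t,a_t) &\leftarrow Q(s_t,a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \Big],\\
Q^{A}(s_t,a_t) &\leftarrow Q^{A}(s_t,a_t) + \alpha \Big[ r_{t+1} + \gamma \, Q^{B}\big(s_{t+1}, \operatorname*{arg\,max}_{a'} Q^{A}(s_{t+1},a')\big) - Q^{A}(s_t,a_t) \Big],
\end{align}
where Double Q-learning maintains two estimators $Q^{A}$ and $Q^{B}$ whose roles are swapped at random. The maximization bias of standard Q-learning stems from $\mathbb{E}\big[\max_{a'} Q(s_{t+1},a')\big] \ge \max_{a'} \mathbb{E}\big[Q(s_{t+1},a')\big]$, which decoupling action selection from action evaluation is intended to reduce.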
\paragraph{Expert Q-Learning:}
Expert Q-learning is a novel deep reinforcement learning algorithm proposed in \cite{2106.14642}. Inspired by Dueling Q-learning, it incorporates semi-supervised learning into reinforcement learning by splitting Q-values into state values and action advantages. An expert network is introduced alongside the Q-network and is updated after each regular offline minibatch update of the Q-network. The algorithm is shown to be more resistant to overestimation bias and to achieve more robust performance than the baseline Q-learning algorithm.
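The dueling-style decomposition that this line of work builds on is commonly written as the split of the action value into a state value and a mean-subtracted advantage (a generic formulation; the exact aggregation used in \cite{2106.14642} may differ):
\begin{equation}
Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a'),
\end{equation}
where the subtraction of the mean advantage makes the decomposition identifiable, so that $V$ and $A$ cannot drift by an arbitrary constant in opposite directions.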
\paragraph{Policy Gradient Methods:}
Policy gradient methods are widely used for control in reinforcement learning, particularly in continuous action settings. Natural gradients have been extensively studied within the context of natural gradient actor-critic algorithms and deterministic policy gradients \cite{2209.01820}. The work in \cite{1811.09013} presents the first off-policy policy gradient theorem using emphatic weightings and develops a new actor-critic algorithm called Actor Critic with Emphatic weightings (ACE) that approximates the simplified gradients provided by the theorem. This algorithm is shown to outperform previous off-policy policy gradient methods, such as OffPAC and DPG, in finding the optimal solution.
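To situate this result, recall the on-policy policy gradient theorem; the contribution of \cite{1811.09013} can be read as replacing the on-policy state distribution $d^{\pi}$ with an emphatic weighting $m(s)$ in the off-policy setting (our paraphrase of the theorem's structure, not its exact statement):
\begin{equation}
\nabla_{\theta} J(\theta) \;=\; \sum_{s} d^{\pi}(s) \sum_{a} \nabla_{\theta}\, \pi(a \mid s; \theta)\, Q^{\pi}(s,a)
\quad\longrightarrow\quad
\sum_{s} m(s) \sum_{a} \nabla_{\theta}\, \pi(a \mid s; \theta)\, Q^{\pi}(s,a).
\end{equation}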
\paragraph{Deep Reinforcement Learning:}
Deep reinforcement learning (DRL) combines the power of deep learning with reinforcement learning, achieving remarkable success in various domains, such as finance, medicine, healthcare, video games, robotics, and computer vision \cite{2108.11510}. The field has seen significant advancements in recent years, with central algorithms such as the deep Q-network, trust region policy optimization, and asynchronous advantage actor-critic being developed \cite{1708.05866}. A detailed review of DRL algorithms and their theoretical justifications, practical limitations, and empirical properties can be found in \cite{1906.10025}.
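As one concrete example of these central algorithms, the deep Q-network is trained by minimizing a temporal-difference loss against a periodically frozen target network with parameters $\theta^{-}$, over transitions sampled from a replay buffer $\mathcal{D}$ (a standard formulation, not tied to any single cited survey):
\begin{equation}
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \Big[ \big( r + \gamma \max_{a'} Q(s',a'; \theta^{-}) - Q(s,a; \theta) \big)^{2} \Big].
\end{equation}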
\paragraph{Temporal Networks:}
Temporal networks, where links change over time, are essential in understanding the ordering and causality of interactions between nodes in various applications. The work in \cite{2111.01334} proposes a temporal dissimilarity measure for temporal network comparison based on the fastest arrival distance distribution and spectral entropy-based Jensen-Shannon divergence. This measure is shown to effectively discriminate diverse temporal networks with different structures and functional distinctions.
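For completeness, the Jensen-Shannon divergence underlying such spectral-entropy comparisons is the symmetrized, smoothed Kullback-Leibler divergence (stated here in its generic form; the specific distributions compared in \cite{2111.01334} are the fastest-arrival-distance and spectral quantities described there):
\begin{equation}
\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\, \mathrm{KL}\big(P \,\|\, M\big) + \tfrac{1}{2}\, \mathrm{KL}\big(Q \,\|\, M\big), \qquad M = \tfrac{1}{2}(P + Q).
\end{equation}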
In conclusion, reinforcement learning has seen significant advancements in recent years, with various algorithms and techniques being developed to address the challenges in the field. From understanding the fundamentals of MDPs to developing advanced DRL algorithms, researchers continue to push the boundaries of what is possible in reinforcement learning and its applications.