\section{Background}
The central problem in decentralized reinforcement learning (RL) is to design efficient algorithms that learn optimal policies in multi-agent environments while addressing the challenges of scalability, privacy, and convergence. The problem is of practical importance in industrial applications such as autonomous vehicles \citep{duan2022autonomous}, traffic signal control \citep{yang2021an}, and edge-computing-empowered Internet of Things (IoT) networks \citep{lei2022adaptive}. Theoretical challenges include designing algorithms that cope with high-dimensional state and action spaces, non-stationarity, and the exponential growth of the joint state-action space with the number of agents \citep{adams2020resolving}.

\subsection{Foundational Concepts and Notation}
Reinforcement learning is a framework for learning optimal policies through interaction with an environment \citep{sutton2005reinforcement}. An agent takes actions in the environment to achieve a goal, and the environment provides feedback in the form of rewards; the agent's objective is to learn a policy that maximizes the expected cumulative reward over time.

A standard RL problem is modeled as a Markov Decision Process (MDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the state transition probability function, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. The agent seeks a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ that maximizes the value function $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi\right]$.

In decentralized RL, multiple agents interact with the environment and with one another to learn optimal policies. The problem can be modeled as a Decentralized Markov Decision Process (D-MDP) \citep{lu2021decentralized}, which extends the MDP to multiple agents, each with its own action space, reward function, and policy. The D-MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}_1, \dots, \mathcal{A}_n, \mathcal{P}, \mathcal{R}_1, \dots, \mathcal{R}_n, \gamma)$, where $n$ is the number of agents, $\mathcal{A}_i$ is the action space of agent $i$, and $\mathcal{R}_i$ is the reward function of agent $i$. Each agent aims to learn a local policy $\pi_i: \mathcal{S} \rightarrow \mathcal{A}_i$ that maximizes its own expected cumulative reward.

\subsection{Decentralized Reinforcement Learning Algorithms}
Decentralized RL algorithms can be broadly categorized into value-based and policy-based methods. Value-based methods, such as decentralized Q-learning \citep{hasselt2015deep}, learn an action-value function $Q^\pi(s, a)$, the expected cumulative reward of taking action $a$ in state $s$ and following policy $\pi$ thereafter. The optimal policy is derived from the optimal action-value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ as $\pi^*(s) = \arg\max_a Q^*(s, a)$. Deep Q-Networks (DQNs) \citep{mnih2013playing} extend Q-learning to high-dimensional state spaces by approximating the action-value function with deep neural networks.
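To ground the value-based formulation, the following is a minimal tabular Q-learning sketch in Python for a single agent with small discrete state and action spaces; the sizes and hyperparameters are illustrative assumptions, not settings used in this paper.

\begin{verbatim}
import numpy as np

# Illustrative sizes and hyperparameters (assumptions, not paper settings).
N_STATES, N_ACTIONS = 16, 4
GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))   # tabular action-value estimates
rng = np.random.default_rng(0)

def epsilon_greedy(state):
    """Behaviour policy: explore with probability EPSILON, else act greedily."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state, done):
    """One-step Q-learning: move Q(s, a) toward the bootstrapped TD target."""
    target = reward + (0.0 if done else GAMMA * np.max(Q[next_state]))
    Q[state, action] += ALPHA * (target - Q[state, action])
\end{verbatim}

The greedy policy recovered from the learned table corresponds to $\pi^*(s) = \arg\max_a Q^*(s, a)$ above; DQN replaces the table with a neural network trained against the same bootstrapped target.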
Policy-based methods, such as decentralized policy gradient (Dec-PG) \citep{lu2021decentralized}, optimize the policy directly by following the gradient of the expected cumulative reward with respect to the policy parameters. Actor-critic algorithms \citep{lillicrap2015continuous} combine the two approaches: a critic estimates the action-value function, and an actor updates the policy using the critic's estimates. Decentralized actor-critic algorithms have been proposed for continuous control \citep{mnih2016asynchronous} and multi-agent collision avoidance \citep{thumiger2022a}.

In this paper, we focus on applying decentralized RL to Atari games. Building on the concepts and algorithms introduced above, we develop a novel decentralized RL algorithm that addresses scalability, privacy, and convergence in multi-agent Atari environments.

\subsection{Decentralized Learning in Atari Environments}
Atari games are a challenging testbed for RL algorithms because of their high-dimensional state spaces, diverse game dynamics, and complex scoring systems \citep{mnih2013playing}. Advances in deep RL have produced algorithms that learn to play Atari games directly from raw pixel inputs, in some cases outperforming human experts \citep{mnih2013playing}. However, most of these algorithms are centralized and scale poorly to large multi-agent environments.

Our proposed decentralized RL algorithm for playing Atari games leverages the advantages of both value-based and policy-based methods. It builds on the decentralized Q-learning and Dec-PG frameworks and incorporates deep RL techniques such as experience replay \citep{mnih2013playing} and target networks \citep{hasselt2015deep} to improve stability and convergence. We also introduce a novel communication mechanism that lets agents share information and coordinate their actions while preserving privacy and limiting communication overhead. Our experimental results show that the algorithm achieves performance competitive with centralized methods and outperforms existing decentralized RL algorithms in the Atari domain.
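For concreteness, the sketch below illustrates the two stabilization techniques named above, experience replay and target-network updates, in generic Python form; the class name, buffer capacity, and Polyak coefficient are illustrative assumptions and do not describe the specific design proposed in this paper.

\begin{verbatim}
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity experience replay: store transitions, sample uniformly."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

def soft_update(online_params, target_params, tau=0.005):
    """Polyak averaging for a target network:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]
\end{verbatim}

Uniform sampling decorrelates consecutive transitions, and the slowly moving target network keeps the bootstrapped TD target stable; both are standard ingredients of DQN-style training.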