**Topic Title: **Multi-Agent Reinforcement Learning for Combinatorial Ranking Policy in Dynamic Environment

**Technical Area: **Recommender systems, Deep learning, Multi-Agent Reinforcement Learning,Combinatorial Ranking

**Background**

We are proposing to study the following two problems which are closely related to Alibaba’s business.

**Combinatorial Ranking: **In e-commerce, optimizing ranking policies for commodity recommendation is important for improving the volume of trading. Nowadays, e-commerce platforms are encouraged to recommend other items (e.g., brands and articles) to users besides commodities at the same page. The ranking policies of different items are independently optimized based on their individual objective functions, which leads to a sub-optimal joint ranking policy. In this project, we aim to develop new algorithms for jointly optimizing the ranking policies of different items, including commodities, brands and articles.

**Policy Convergence in Dynamic Environment:** The variation of environment can be divided into two aspects. On one hand, the characteristics of users are changing based on their behaviors such as clicking or buying commodities. On the other hand, the features of information will be effected by clicking rate or price. These uncertainties and dynamics of environment will disturb the convergence of the ranking policies. To this end, we aim to develop a new reinforcement learning paradigm that converges well in highly dynamic and uncertain environment.

**Target**

In this project, we aim to propose a novel multi-agent reinforcement learning approach to optimize combinatorial ranking and analyze the convergence of our method in dynamic environment. We will elaborate as follows.

**Combinatorial Ranking:** Intuitively, this problem can be alleviated by setting rules to allocate the positions in one page. For example, when we search a commodity in Taobao, the first item will always be an advertisement. However, this method does not take into account the features of information and users. Thus this naive method cannot achieve good performance. To take advantage of all the information from the environment, we resort to multi-agent reinforcement learning to find an optimal ranking policy.

**Policy Convergence in Dynamic Environment: **In a static environment, reinforcement learning algorithms usually can converge to an optimal or sub-optimal policy. However, this property will not hold when the environment is changing. To ensure the performance of our proposed method, we need to analyze or measure the influence caused by the dynamic of environment, and furthermore, improve the adaption of our method including convergence rate, stability and effectiveness.

**Combinatorial Ranking:** By treating a ranking policy as an independent agent, multi-agent reinforcement learning approaches facilitate the cooperation between different ranking policies (agents) and improve the total reward of the system. Since these ranking policies are essentially to maximize the profit of the company, the competition can be avoided by designing a uniform metric to guide these agents to cooperate.

There are two kinds of possible approaches that can be applied in our problem: centralized training framework and distributed training framework.

For the centralized framework, all the agents can communicate with each other and share an identical reward function R(S,a),where s is the state containing the user’s properties (gender, age, etc.), the features of searching word (type, brand, etc.) and other useful information. Action would be updating the parameters of different ranking policies. The reward function R(S,a) will be designed to contain important factors such as clicking rate, staying time and the probability of dealing. These factors can be combined by some method like weighting to obtain the uniform reward function. After that, we can use a multi-agent reinforcement learning method such as actor-critic to train these agents. Since the reward function is consistent, the items of different types can be compared directly through the reward and then be ranked.

For the distributed framework, the agents can only obtain a local observation and cannot communicate with other agents. Their reward functions independently depend on the features of their own tasks. The objective of one agent is outputting a ranking list containing items that can optimize their reward. Since one agent can only determine the order of one type of information, a reinforcement learning algorithm will be utilized to integrate the ranking lists generated by different agents and finally output a combinatorial ranking.

**Policy Convergence in Dynamic Environment: **To adapt to the dynamic environment, two training methods will be helpful according to the details of the environment. On one hand, the online training can update the ranking policy in time, and thus the performance will be optimal in the ideal situation. However, when the environment changes dramatically, the convergence of online training would be difficult leading to poor performance. On the other hand, the offline training can improve the convergence stability by updating the ranking policy periodically. In one period, the policy used to determine the action is consistent to reduce the variance. Many prevalent reinforcement learning algorithms adopt the offline training method to guarantee the convergence such as DQN. The disadvantage of this method is also obvious. The policy cannot be modified to obtain maximum reward if the environment changes drastically in some time slot. Considering the advantages and disadvantages of these two training methods, we will analyze and evaluate these approaches on the convergence rate, stability and effectiveness in the dynamic environment.

**Related Research Topics**

Related research may arise in several aspects, following lists several examples:

- Innovation of agent modeling method: effective, efficient and practical formulation based on our system and business demands.
- Deep learning based agent modeling: deep neural network structures and their implementations.
- Exploitation & exploration, parameter optimization algorithms: parameters learning and hyper-parameters tuning paradigm research.