This abbreviation is also used for mixed-observability MDPs [Ong et al., 2010], which we do not consider here; we use the abbreviation MOMDP solely for multi-objective MDPs.

      CHAPTER 2

       Multi-Objective Decision Problems

      In this chapter, we introduce the concept of a multi-objective decision problem. Then we describe two concrete classes of multi-objective decision problems that we use throughout the book: multi-objective coordination graphs and multi-objective Markov decision processes. However, before introducing the concrete multi-objective decision problems, we first introduce their single-objective counterparts.

      In this book, we focus on different (cooperative) multi-objective decision problems. Multi-objective decision problems contrast with single-objective decision problems, which are more common in the literature. In their most general form, single-objective decision problems (SODPs) can be defined in terms of a set of policies and a value function that we wish to maximize:

      • a set of allowed (joint) policies Π,

      • a value function that assigns a real-valued scalar, V^π ∈ ℝ, to each joint policy π ∈ Π, corresponding to the desirability, i.e., the utility, of the policy.

      A multi-objective decision problem (MODP) consists of the same elements as an SODP, but:

      • there are d ≥ 1 objectives, and

      • the value function assigns a value vector, V^π ∈ ℝ^d, to each joint policy π ∈ Π, corresponding to the desirability of the policy with respect to each objective.

      We denote the value of policy π in the i-th objective as V^π_i.
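      To make these definitions concrete, the following minimal Python sketch (illustrative only; the class and function names are not from the book) represents an SODP and an MODP as a set of allowed policies together with a value function, which is scalar-valued in the former and vector-valued in the latter.

from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

Policy = Hashable  # any identifier for a (joint) policy in Π

@dataclass
class SODP:
    policies: Sequence[Policy]        # the set of allowed (joint) policies Π
    value: Callable[[Policy], float]  # V^π ∈ ℝ: the scalar utility of each policy

@dataclass
class MODP:
    policies: Sequence[Policy]                   # the set of allowed (joint) policies Π
    num_objectives: int                          # d ≥ 1
    value: Callable[[Policy], Sequence[float]]   # V^π ∈ ℝ^d: one entry per objective

def value_in_objective(modp: MODP, pi: Policy, i: int) -> float:
    """V^π_i: the value of policy π in the i-th objective."""
    return modp.value(pi)[i]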

      Both V^π and Π may have particular forms that affect the way they should be computed, as we discuss in Chapter 3. For example, there may be multiple agents, each with its own action space. In such settings, a solution consists of a joint policy containing a local policy for each agent. We assume that Π is known and that it is in principle possible to compute the value of each (joint) policy. Furthermore, we assume that, if there are multiple agents, these agents are cooperative.

      This definition of cooperative, in which all agents share a single team value function rather than acting on individual utilities, is common in the field of decision theory, e.g., in multi-agent MDPs [Boutilier, 1996, Scharpff et al., 2016] and Dec-POMDPs [Oliehoek and Amato, 2016]. However, the term “cooperative” is used differently in cooperative game theory [Chalkiadakis et al., 2011, Igarashi and Roijers, 2017], in which agents form coalitions on the basis of their individual utilities. In this book, we consider only decision problems that are cooperative according to Definition 2.3.

      In an SODP, the value function provides a complete ordering on the joint policies, i.e., for each pair of policies π and π′, V^π must be greater than, equal to, or less than V^π′. By contrast, in an MODP, the presence of multiple objectives means that the value function V^π is a vector rather than a scalar. Such value functions supply only a partial ordering. For example, it is possible that V^π_i > V^π′_i for one objective i but V^π_j < V^π′_j for another objective j. Consequently, unlike in an SODP, we can no longer determine which values are optimal without additional information about how to prioritize the objectives, i.e., about what the utility of the user is for different trade-offs between the objectives.
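      The partial ordering can be made explicit with a small sketch (again illustrative, not the book's code): Pareto dominance, a standard way to compare value vectors, holds only when one vector is at least as good in every objective and strictly better in at least one, so some pairs of policies are simply incomparable.

from typing import Optional, Sequence

def pareto_dominates(v: Sequence[float], w: Sequence[float]) -> bool:
    """True iff v is at least as good as w everywhere and strictly better somewhere."""
    return all(vi >= wi for vi, wi in zip(v, w)) and any(vi > wi for vi, wi in zip(v, w))

def compare(v: Sequence[float], w: Sequence[float]) -> Optional[int]:
    """1 if v dominates w, -1 if w dominates v, 0 if equal, None if incomparable."""
    if list(v) == list(w):
        return 0
    if pareto_dominates(v, w):
        return 1
    if pareto_dominates(w, v):
        return -1
    return None

# Neither of these value vectors dominates the other: they are incomparable.
assert compare([3.0, 1.0], [1.0, 3.0]) is None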

      In the unknown weights and decision support scenarios (Figure 1.1), the parameters w of the scalarization function, or even f itself, are unknown during the planning or learning phase. Therefore, an algorithm for solving an MODP should return a set of policies that contains an optimal policy for each possible w. Given such a solution set, the user can pick the policy that maximizes her utility in the selection phase. We want the solution set to contain at least one optimal policy for every possible scalarization (in order to guarantee optimality), but we also want it to be as small as possible, in order to make the selection phase as efficient as possible. In Chapter 3, we discuss which solution sets are optimal, and how this can be derived from different assumptions about the scalarization function f (Definition 1.1) and the set of permitted policies Π in the MODP. In the rest of this section, we introduce two different MODP problem classes.
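      As a sketch of the selection phase in these scenarios, the code below assumes, purely for illustration, a linear scalarization f(V^π, w) = w · V^π and a solution set stored as a mapping from hypothetical policy identifiers to value vectors; once w is revealed, the user simply picks the stored policy with the maximal scalarized value.

from typing import Dict, Hashable, Sequence, Tuple

Policy = Hashable

def scalarize(value: Sequence[float], w: Sequence[float]) -> float:
    """Linear scalarization: the inner product w · V^π."""
    return sum(wi * vi for wi, vi in zip(w, value))

def select_policy(solution_set: Dict[Policy, Sequence[float]],
                  w: Sequence[float]) -> Tuple[Policy, float]:
    """Return the policy in the solution set with maximal scalarized value for w."""
    return max(((pi, scalarize(v, w)) for pi, v in solution_set.items()),
               key=lambda item: item[1])

# Hypothetical solution set for the mining example: value vectors are (gold, silver).
solution_set = {"all_vans_to_mine_1": (10.0, 2.0), "split_vans": (6.0, 6.0)}
print(select_policy(solution_set, w=(0.3, 0.7)))  # ('split_vans', 6.0)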

      The first class of MODPs that we treat is the multi-objective coordination graph (MO-CoG).1 In a MO-CoG, multiple agents need to coordinate their actions in order to be effective. For example, in the mining problem of Figure 1.2, each agent represents a van with workers from a single village. Each of these vans can go to different mines within reasonable traveling distance, leading to a set of different possible actions for each agent. Each mine yields a different expected amount of gold (the first objective) and silver (the second objective). Because mining can be done more efficiently when more workers are present at a mine, it is vitally important that the different agents (i.e., vans) coordinate which mines they go to.

      Other examples of problems that can be modeled as a MO-CoG are: risk-sensitive combinatorial auctions, in which we want to maximize the total revenue, while minimizing the risk for the auctioneer [Marinescu, 2011], and maintenance scheduling for offices in which the energy consumption, costs, and overtime for the maintenance staff must all be minimized [Marinescu, 2011].

      Before we formally define MO-CoGs, we first define the corresponding single-objective problem, i.e., coordination graphs (CoGs) [Guestrin et al., 2002, Kok and Vlassis, 2004]. In the context of coordination graphs, the notion of reward is typically referred to as payoff in the literature. Payoff is usually denoted u (for utility). We adopt this terminology and notation.

      A coordination graph (CoG) is a tuple 〈D, A, U〉, where:

      • D = {1,…, n} is the set of n agents,

      • A = A_1 × … × A_n is the joint action space: the Cartesian product of the finite action spaces of all agents. A joint action is thus a tuple containing an action for each agent, a = 〈a_1,…, a_n〉, and

      • U = {u^1,…, u^ρ} is the set of ρ scalar local payoff functions, each of which has limited scope, i.e., it depends on only a subset of the agents. The total team payoff is the sum of the local payoffs: u(a) = Σ_{e=1}^{ρ} u^e(a_e), where a_e is the local joint action of the agents in the scope of u^e.
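      As an illustration of this definition (with hypothetical names, not code from the book), the sketch below stores each local payoff function together with its scope and computes the team payoff of a joint action as the sum of the local payoffs. The brute-force maximization is included only to make the coordination aspect concrete; it enumerates the joint action space A, which grows exponentially with the number of agents.

from itertools import product
from typing import Callable, List, Tuple

Action = int
JointAction = Tuple[Action, ...]
LocalPayoff = Tuple[Tuple[int, ...], Callable[[Tuple[Action, ...]], float]]  # (scope, u^e)

class CoG:
    def __init__(self, action_spaces: List[List[Action]], local_payoffs: List[LocalPayoff]):
        self.action_spaces = action_spaces  # A_1, ..., A_n
        self.local_payoffs = local_payoffs  # each u^e depends only on the agents in its scope

    def team_payoff(self, a: JointAction) -> float:
        """u(a) = sum over e of u^e(a_e), where a_e restricts a to the scope of u^e."""
        return sum(u(tuple(a[i] for i in scope)) for scope, u in self.local_payoffs)

    def best_joint_action(self) -> Tuple[JointAction, float]:
        """Brute-force maximization over A = A_1 × … × A_n."""
        return max(((a, self.team_payoff(a)) for a in product(*self.action_spaces)),
                   key=lambda item: item[1])

# Toy instance: two agents (vans) with two mines each; the single local payoff
# function depends on both agents and rewards choosing the same mine.
cog = CoG(action_spaces=[[0, 1], [0, 1]],
          local_payoffs=[((0, 1), lambda ae: 3.0 if ae[0] == ae[1] else 1.0)])
print(cog.best_joint_action())  # ((0, 0), 3.0)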

      In