IV. SYNTHESIS OF THE AUTONOMY PROTOCOL

A. Strategy blending

Given the human strategy ?h ? StrMr and the autonomy

strategy ?a ? StrMr , a blending function computes a weighted

composition of the two strategies by favoring one or the other

strategy in each state of the MDP 16, 6, 7.

7 argues that the weight of blending shows the confidence

in how well the autonomy protocol can assist to perform the

human’s task. Put differently, the blending function should

assign a low confidence to the actions of the human, if the

actions may lead to a violation of the specifications. Recall

Fig. 1 and the example in the introduction. In the cells of the

gridworld where some actions may result in a collusion with

the vacuum cleaner with high probability, it makes sense to

assign a high weight in the autonomy strategy.

We pick the blending function as a state-dependent function

that weighs the confidence in both the human strategy and the

autonomy strategy at each state of an MDP Mr 16, 6, 7.

Definition 5 (Linear blending). Given an MDP Mr =

(S, sI ,A,P), two strategies ?h, ?a ? StrMr , and a blending

function b : S ? 0, 1, the blended strategy ?ha ? StrMr for

all states s ? S, and actions ? ? A is

?ha(s, ?) = b(s) · ?h(s, ?) + (1? b(s)) · ?a(s, ?) .

For each s ? S, the value of b(s) represents the “weight” of

?h at s, meaning how much emphasis the blending function

puts on the human strategy at state s. For example, referring

back to Fig. 1, the critical cells of the gridworld correspond

to certain states of the MDP Mr. At these states, we assign

a very low confidence in the human strategy. For instance at

such a state s ? S, we might have b(s) = 0.1, meaning the

blended strategy in state s puts more emphasis on the autonomy

strategy ?a.

B. Strategy repair

In this section, we describe our approach to the shared control

synthesis problem. We formulate the specific problem of repair-

ing the human strategy based on quasiconvex programming,

which can be solved by checking feasibility of a number of

convex (here even linear) programming problems. We show

that the repaired strategy satisfies the task specifications while

deviating minimally from the human strategy. Finally, we

compute the autonomy strategy ?a that may then be subject

to blending.

We start by formalizing the concept of strategy perturbation.

Then, we introduce the dual linear programming (LP) formula-

tion to synthesize a strategy in a MDP. Finally, we present an

LP formulation of the strategy repair problem and show that

the solution indeed satisfies the specifications while deviating

from the human’s strategy minimally.

1) Perturbation of strategies: As mentioned in the introduc-

tion, the blended strategy should deviate minimally from the

human strategy. To measure the quantity of such a deviation,

we introduce the concept of perturbation, which was used

in 5. To modify a (randomized) strategy, we employ additive

perturbation by increasing or decreasing the probabilities of

action choices in each state. We also ensure that for each state,

the resulting strategy is a well-defined distribution over the

actions.

Definition 6 (Strategy perturbation). Given an MDP Mr and

a strategy ? ? StrMr , an (additive) perturbation ? is a function

? : S ×A? ?1, 1 with?

??A

?(s, ?) = 0 ?s ? S.

The perturbation value at state s for action ? is ?(s, ?).

Overloading the notation, the perturbed strategy ?(?) is given

by

?(?)(s, ?) = ?(s, ?) + ?(s, ?) ?s ? S.? ? A. (2)

2) Dual linear programming formulation for MDPs: In this

section, we refer to the LP formulation to compute a policy

that maximizes the probability of satisfying a reachability

specification P??(?T ) in a MDP 24, 10.

The LP formulation builds on the set Var of variables

including the following:

• x(s, ?) ? 0,?) for each s ? S \ T and ? ? A defines

the occupancy measure of a state-action pair.

maximize

?

s?S\T

?

??A

?(s, ?)x(s, ?) (3)

subject to

?s ? S \ T.?

??A

x(s, ?) =

?

s??S\T

?

??A

P(s?, ?, s)x(s?, ?) + ?s (4)?

s?S\T

?

??A

?(s, ?)x(s, ?) ? ? (5)

where ?(s, ?) =

?

s??T

P(s, ?, s?), and ?s = 1 if s = sI and

?s = 0 if s 6= sI .

For any optimal solution x to the LP in (3)–(5), then

?(s, ?) =

x(s, ?)?

??A

x(s, ?)

(6)

is an optimal strategy, and x is the occupancy measure of ?,

see 24 and 10 for details.

After finding an optimal strategy to the LP in (3)–(5), we

can compute the probability of satisfying a specification by

?

s?S\T

?

??A

?(s, ?)x(s, ?). (7)

3) Strategy repair using quasiconvex programming: Given

the human strategy, ?h ? StrMr , the aim of the autonomy

protocol is to compute the blended strategy, or the repaired

strategy ?ha that induces a similar behavior to the human

strategy while satisfying the specifications. We compute the

repaired strategy by perturbing the human strategy, which is

introduced in Definition 6. As in Definition 6, we perturb the

human strategy ?h to the repaired strategy ?ha by

?s ? S.? ? A. ?ha(s, ?) = ?h(s, ?) + ?(s, ?). (8)

We cannot add this constraint to the LP in (3)–(5) because

the variables in the LP are for the occupancy measures for

the strategy. To perturb the strategy, we use the definition of

occupancy measure to translate the constraint in (8) into the

constraint

?s ? S.? ? A. xha(s, ?)?

??A

xha(s, ?)

= ?h(s, ?) + ?(s, ?) (9)

or equivalently to the constraint

?s ? S.? ? A.

xha(s, ?) =

?

??A

xha(s, ?) (?h(s, ?) + ?(s, ?)) . (10)

Note that the constraint in (10) is not a linear constraint

because of multiplication of the perturbation variables ? and

occupancy measure variables x. Since we are interested in

minimizing the maximal deviation, we can assign a common

variable ?? ? 0, 1 for all state-action pairs to put an upper

bound on the deviation by

?s ? S.? ? A.

|xha(s, ?)?

?

??A

xha(s, ?)?h(s, ?)| ? ??

?

??A

xha(s, ?).

(11)

The constraint in (11) is also not linear constraint. In fact, it

is a quadratic constraint due to multiplication of ?? and xha.

However, the feasible set of policies scales monotonically with

??, i. e., more policies are feasible with increasing ??. Note that

for a fixed ??, the above constraint is linear, and therefore it is

a convex constraint.

We show that the formulation for the shared control synthesis

problem is given by a quasiconvex optimization problem (QCP).

The QCP formulation incorporates the occupancy variables

for the repaired strategy ?ha, and an upper bound for the

perturbation ?? of the human strategy. We show the formulation

for a single reachability specification ? = P??(?T ). Then,

we discuss how to add additional safety and expected cost

specifications in the QCP formulation.

The QCP formulation builds on the set Var of variables

including the following:

• xha(s, ?) ? 0,?) for each s ? S \ T and ? ? A

defines the occupancy measure of a state-action pair for

the repaired strategy.

• ?? ? 0, 1 gives the upper bound of the perturbation of

the strategy ?h.

The shared control synthesis problem can be formulated as the

following QCP:

minimize ?? (12)

subject to

?s ? S \ T.?

??A

xha(s, ?) =

?

s??S\T

?

??A

P(s?, ?, s)xha(s?, ?) + ?s

(13)

?s ? S.? ? A.?

s?S\T

?

??A

?(s, ?)xha(s, ?) ? ? (14)

|xha(s, ?)?

?

??A

xha(s, ?)?h(s, ?)| ? ??

?

??A

xha(s, ?).

(15)

Let us walk through the optimization (12)–(15). We minimize

the infinity norm of the perturbation in (12). The constraints

in (13) ensures that the resulting strategy ?ha from the

occupancy variables xha is well-defined. The constraint in (14)

constrains the probability of reaching the set of target states

T is smaller than or equal to ? to satisfy the specification

P??(?T ). We perturb the human strategy ?h to the repaired

strategy ?ha as in Definition 6 in (15) by putting an upper

bound on the maximal perturbation.

The problem in (12)–(15) not a convex optimization problem

because of the nonconvex constraint in (15). However, for a

fixed ??, the problem in (12)–(15) is a LP and the set of feasible

policies scales monotonically in ??. Therefore, the problem

in (12)–(15) is a quasiconvex optimization problem. We can

solve the QCP in (12)–(15) by employing a bisection method

on ??. A method to solve quasiconvex optimization problems is

given in 4, Algorithm 4.1. We adopt the Algorithm 4.1 in 4

as follows:

Algorithm 1: Bisection method for quasiconvex optimiza-

tion problem.

given l = 0, u = 1, tolerance ? > 0.

repeat

1. Set ?? = (l + u)/2.

2. Solve the convex feasibility problem in (13)–(15).

3. if the problem in (13)–(15) is feasible, u := ??; else

l := ??.

until u? l ? ?.

The above algorithm requires exactly

?

log2(

1

?

)

?

iterations,

and we can compute a policy that induces a minimal deviation

up to any given accuracy of ?.

Theorem 1. The strategy obtained from the QCP in (12)–(15)

satisfies the task specifications and it is optimal with respect

to the maximal deviation.

Proof. From a satisfying assignment to the constraints in (12)–

(15), we can compute a policy that satisfies the specification

using (6). Using Algorithm (1), we can compute the repaired

policy ?ha that deviates minimally from the human strategy

?h up to ? accuracy in

?

log2(

1

?

)

?

iterations. Therefore, by

solving the QCP (12)–(15), we can compute an optimal policy

for the shared control synthesis problem.

We note that the solution given by the QCP in (12)–(15)

computes the minimally deviating repaired strategy ?ha using

strategy repair. On the one hand, reference 15 considers

repairing the strategy with a greedy approach. Their approach

requires solving possibly an unbounded number of LPs to

compute a feasible strategy that is not necessarily optimal. On

the other hand, we only need to check feasibility of a bounded

number of LPs to compute an optimal strategy. Note that

we do not compute the autonomy strategy ?a with the QCP

in (12)–(15) directly. After computing the repaired strategy

?ha, we compute the autonomy strategy ?a according to the

Definition 5.

4) Additional specifications: The QCP in (12)–(15) com-

putes an optimal policy for a single reachability specification.

Suppose that we are given another reachability specification

?? = P???(?T ?) with T ? 6= T in addition to the reachability

specification ? = P??(?T ). We can handle this specification

by the adding ?

s?S\T ?

?

??A

??(s, ?)x(s, ?) ? ? (16)

to the QCP in (12)–(15), where ??(s, ?) =

?

s??T ?

P(s, ?, s?).

The constraint in (16) ensures that the probability of reaching

T ? is less than ??.

We handle an expected cost specification E??(?G) for G ?

S, by adding ?

s?S\G

?

??A

C(s, ?)x(s, ?) ? ? (17)

to the QCP in (12)–(15).

Similar to the probability computation for reachability

specifications, the expected cost for reaching G is given by?

s?S\G

?

??A

C(s, ?)x(s, ?). (18)

Copyright 2019 - Education WordPress Theme.

x

Hi!

I'm Kim!

Would you like to get a custom essay? How about receiving a customized one?

Check it out