IV. SYNTHESIS OF THE AUTONOMY PROTOCOL
A. Strategy blending
Given the human strategy ?h ? StrMr and the autonomy
strategy ?a ? StrMr , a blending function computes a weighted
composition of the two strategies by favoring one or the other
strategy in each state of the MDP 16, 6, 7.
7 argues that the weight of blending shows the confidence
in how well the autonomy protocol can assist to perform the
human’s task. Put differently, the blending function should
assign a low confidence to the actions of the human, if the
actions may lead to a violation of the specifications. Recall
Fig. 1 and the example in the introduction. In the cells of the
gridworld where some actions may result in a collusion with
the vacuum cleaner with high probability, it makes sense to
assign a high weight in the autonomy strategy.
We pick the blending function as a state-dependent function
that weighs the confidence in both the human strategy and the
autonomy strategy at each state of an MDP Mr 16, 6, 7.
Definition 5 (Linear blending). Given an MDP Mr =
(S, sI ,A,P), two strategies ?h, ?a ? StrMr , and a blending
function b : S ? 0, 1, the blended strategy ?ha ? StrMr for
all states s ? S, and actions ? ? A is
?ha(s, ?) = b(s) · ?h(s, ?) + (1? b(s)) · ?a(s, ?) .
For each s ? S, the value of b(s) represents the “weight” of
?h at s, meaning how much emphasis the blending function
puts on the human strategy at state s. For example, referring
back to Fig. 1, the critical cells of the gridworld correspond
to certain states of the MDP Mr. At these states, we assign
a very low confidence in the human strategy. For instance at
such a state s ? S, we might have b(s) = 0.1, meaning the
blended strategy in state s puts more emphasis on the autonomy
B. Strategy repair
In this section, we describe our approach to the shared control
synthesis problem. We formulate the specific problem of repair-
ing the human strategy based on quasiconvex programming,
which can be solved by checking feasibility of a number of
convex (here even linear) programming problems. We show
that the repaired strategy satisfies the task specifications while
deviating minimally from the human strategy. Finally, we
compute the autonomy strategy ?a that may then be subject
We start by formalizing the concept of strategy perturbation.
Then, we introduce the dual linear programming (LP) formula-
tion to synthesize a strategy in a MDP. Finally, we present an
LP formulation of the strategy repair problem and show that
the solution indeed satisfies the specifications while deviating
from the human’s strategy minimally.
1) Perturbation of strategies: As mentioned in the introduc-
tion, the blended strategy should deviate minimally from the
human strategy. To measure the quantity of such a deviation,
we introduce the concept of perturbation, which was used
in 5. To modify a (randomized) strategy, we employ additive
perturbation by increasing or decreasing the probabilities of
action choices in each state. We also ensure that for each state,
the resulting strategy is a well-defined distribution over the
Definition 6 (Strategy perturbation). Given an MDP Mr and
a strategy ? ? StrMr , an (additive) perturbation ? is a function
? : S ×A? ?1, 1 with?
?(s, ?) = 0 ?s ? S.
The perturbation value at state s for action ? is ?(s, ?).
Overloading the notation, the perturbed strategy ?(?) is given
?(?)(s, ?) = ?(s, ?) + ?(s, ?) ?s ? S.? ? A. (2)
2) Dual linear programming formulation for MDPs: In this
section, we refer to the LP formulation to compute a policy
that maximizes the probability of satisfying a reachability
specification P??(?T ) in a MDP 24, 10.
The LP formulation builds on the set Var of variables
including the following:
• x(s, ?) ? 0,?) for each s ? S \ T and ? ? A defines
the occupancy measure of a state-action pair.
?(s, ?)x(s, ?) (3)
?s ? S \ T.?
x(s, ?) =
P(s?, ?, s)x(s?, ?) + ?s (4)?
?(s, ?)x(s, ?) ? ? (5)
where ?(s, ?) =
P(s, ?, s?), and ?s = 1 if s = sI and
?s = 0 if s 6= sI .
For any optimal solution x to the LP in (3)–(5), then
?(s, ?) =
is an optimal strategy, and x is the occupancy measure of ?,
see 24 and 10 for details.
After finding an optimal strategy to the LP in (3)–(5), we
can compute the probability of satisfying a specification by
?(s, ?)x(s, ?). (7)
3) Strategy repair using quasiconvex programming: Given
the human strategy, ?h ? StrMr , the aim of the autonomy
protocol is to compute the blended strategy, or the repaired
strategy ?ha that induces a similar behavior to the human
strategy while satisfying the specifications. We compute the
repaired strategy by perturbing the human strategy, which is
introduced in Definition 6. As in Definition 6, we perturb the
human strategy ?h to the repaired strategy ?ha by
?s ? S.? ? A. ?ha(s, ?) = ?h(s, ?) + ?(s, ?). (8)
We cannot add this constraint to the LP in (3)–(5) because
the variables in the LP are for the occupancy measures for
the strategy. To perturb the strategy, we use the definition of
occupancy measure to translate the constraint in (8) into the
?s ? S.? ? A. xha(s, ?)?
= ?h(s, ?) + ?(s, ?) (9)
or equivalently to the constraint
?s ? S.? ? A.
xha(s, ?) =
xha(s, ?) (?h(s, ?) + ?(s, ?)) . (10)
Note that the constraint in (10) is not a linear constraint
because of multiplication of the perturbation variables ? and
occupancy measure variables x. Since we are interested in
minimizing the maximal deviation, we can assign a common
variable ?? ? 0, 1 for all state-action pairs to put an upper
bound on the deviation by
?s ? S.? ? A.
xha(s, ?)?h(s, ?)| ? ??
The constraint in (11) is also not linear constraint. In fact, it
is a quadratic constraint due to multiplication of ?? and xha.
However, the feasible set of policies scales monotonically with
??, i. e., more policies are feasible with increasing ??. Note that
for a fixed ??, the above constraint is linear, and therefore it is
a convex constraint.
We show that the formulation for the shared control synthesis
problem is given by a quasiconvex optimization problem (QCP).
The QCP formulation incorporates the occupancy variables
for the repaired strategy ?ha, and an upper bound for the
perturbation ?? of the human strategy. We show the formulation
for a single reachability specification ? = P??(?T ). Then,
we discuss how to add additional safety and expected cost
specifications in the QCP formulation.
The QCP formulation builds on the set Var of variables
including the following:
• xha(s, ?) ? 0,?) for each s ? S \ T and ? ? A
defines the occupancy measure of a state-action pair for
the repaired strategy.
• ?? ? 0, 1 gives the upper bound of the perturbation of
the strategy ?h.
The shared control synthesis problem can be formulated as the
minimize ?? (12)
?s ? S \ T.?
xha(s, ?) =
P(s?, ?, s)xha(s?, ?) + ?s
?s ? S.? ? A.?
?(s, ?)xha(s, ?) ? ? (14)
xha(s, ?)?h(s, ?)| ? ??
Let us walk through the optimization (12)–(15). We minimize
the infinity norm of the perturbation in (12). The constraints
in (13) ensures that the resulting strategy ?ha from the
occupancy variables xha is well-defined. The constraint in (14)
constrains the probability of reaching the set of target states
T is smaller than or equal to ? to satisfy the specification
P??(?T ). We perturb the human strategy ?h to the repaired
strategy ?ha as in Definition 6 in (15) by putting an upper
bound on the maximal perturbation.
The problem in (12)–(15) not a convex optimization problem
because of the nonconvex constraint in (15). However, for a
fixed ??, the problem in (12)–(15) is a LP and the set of feasible
policies scales monotonically in ??. Therefore, the problem
in (12)–(15) is a quasiconvex optimization problem. We can
solve the QCP in (12)–(15) by employing a bisection method
on ??. A method to solve quasiconvex optimization problems is
given in 4, Algorithm 4.1. We adopt the Algorithm 4.1 in 4
Algorithm 1: Bisection method for quasiconvex optimiza-
given l = 0, u = 1, tolerance ? > 0.
1. Set ?? = (l + u)/2.
2. Solve the convex feasibility problem in (13)–(15).
3. if the problem in (13)–(15) is feasible, u := ??; else
l := ??.
until u? l ? ?.
The above algorithm requires exactly
and we can compute a policy that induces a minimal deviation
up to any given accuracy of ?.
Theorem 1. The strategy obtained from the QCP in (12)–(15)
satisfies the task specifications and it is optimal with respect
to the maximal deviation.
Proof. From a satisfying assignment to the constraints in (12)–
(15), we can compute a policy that satisfies the specification
using (6). Using Algorithm (1), we can compute the repaired
policy ?ha that deviates minimally from the human strategy
?h up to ? accuracy in
iterations. Therefore, by
solving the QCP (12)–(15), we can compute an optimal policy
for the shared control synthesis problem.
We note that the solution given by the QCP in (12)–(15)
computes the minimally deviating repaired strategy ?ha using
strategy repair. On the one hand, reference 15 considers
repairing the strategy with a greedy approach. Their approach
requires solving possibly an unbounded number of LPs to
compute a feasible strategy that is not necessarily optimal. On
the other hand, we only need to check feasibility of a bounded
number of LPs to compute an optimal strategy. Note that
we do not compute the autonomy strategy ?a with the QCP
in (12)–(15) directly. After computing the repaired strategy
?ha, we compute the autonomy strategy ?a according to the
4) Additional specifications: The QCP in (12)–(15) com-
putes an optimal policy for a single reachability specification.
Suppose that we are given another reachability specification
?? = P???(?T ?) with T ? 6= T in addition to the reachability
specification ? = P??(?T ). We can handle this specification
by the adding ?
??(s, ?)x(s, ?) ? ? (16)
to the QCP in (12)–(15), where ??(s, ?) =
P(s, ?, s?).
The constraint in (16) ensures that the probability of reaching
T ? is less than ??.
We handle an expected cost specification E??(?G) for G ?
S, by adding ?
C(s, ?)x(s, ?) ? ? (17)
to the QCP in (12)–(15).
Similar to the probability computation for reachability
specifications, the expected cost for reaching G is given by?
C(s, ?)x(s, ?). (18)
Copyright 2019 - Education WordPress Theme.