Page 56
that each $A \in \mathcal{A}$ be assigned a unique cost $\mu_E(A)$. To satisfy this requirement we can let $\mathcal{A} = \mathcal{A}_1 \times \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers $\{1, 2, 3, \ldots\}$. Then distinct elements of $\mathcal{A}$, namely $(A_1, t_1), (A_1, t_2), \ldots, (A_1, t_k)$, correspond to the successive trials of $A_1$, and the cost $Q(t_i)$ of trial $t_i$ can be assigned as required,
$$\mu_E(A_1, t_i) = Q(t_i).$$
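The point of pairing each policy with a trial index is that repeated trials of one policy become distinct structures, each able to carry its own cost. The sketch below illustrates this under stated assumptions: a Python dict stands in for $\mu_E$, and the helper `record_trial` and the policy label `"A1"` are hypothetical names chosen for illustration only.

```python
from typing import Dict, Hashable, Tuple

# A structure in A = A1 x N: a policy paired with the index of one trial of it.
Structure = Tuple[Hashable, int]


def record_trial(costs: Dict[Structure, float], policy: Hashable,
                 trial: int, observed_cost: float) -> Structure:
    """Set mu_E((policy, trial)) = Q(trial) for a single trial of `policy`."""
    structure = (policy, trial)
    # Distinct trials are distinct elements of A, so each gets its own cost.
    if structure in costs:
        raise ValueError(f"cost already assigned to {structure!r}")
    costs[structure] = observed_cost
    return structure


mu_E: Dict[Structure, float] = {}
record_trial(mu_E, "A1", 1, 4.0)
record_trial(mu_E, "A1", 2, 2.5)  # same policy, new trial, different cost
print(mu_E)  # {('A1', 1): 4.0, ('A1', 2): 2.5}
```

Had the structure been the bare policy `"A1"`, the second trial would have overwritten the first cost; the trial index is what keeps $\mu_E$ single-valued.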
An adaptive plan $\tau$ will modify the policy at intervals on the basis of observed costs. With the definition of $\mathcal{A}$ just given, this means that if $A_1$ is tried at time $t$ and is to be retained for trial at time $t + 1$,
$$\tau\bigl(I(t), (A_1, t)\bigr) = (A_1, t + 1);$$
on the other hand, if a new policy $A_1'$ is to be tried,
$$\tau\bigl(I(t), (A_1, t)\bigr) = (A_1', t + 1).$$
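The two cases of the plan's transition can be sketched as a single function of the input $I(t)$ and the current structure $(A_1, t)$. The text does not say when a plan should retain or switch, so the rule here (keep the policy while its observed cost stays at or below a `threshold`, otherwise ask a `propose` function for a new policy) is purely a hypothetical assumption, as are the names `make_plan`, `threshold`, and `propose`.

```python
from typing import Callable, Hashable, Tuple

Structure = Tuple[Hashable, int]  # (policy A1, trial index t)


def make_plan(threshold: float,
              propose: Callable[[Hashable], Hashable]
              ) -> Callable[[float, Structure], Structure]:
    """Build a hypothetical plan tau(I(t), (A1, t)).

    The retain/switch rule below is an illustrative assumption,
    not a rule given in the text.
    """
    def tau(info: float, structure: Structure) -> Structure:
        policy, t = structure
        if info <= threshold:
            return (policy, t + 1)        # retain A1 for trial at t + 1
        return (propose(policy), t + 1)   # switch to a new policy A1'
    return tau


tau = make_plan(threshold=3.0, propose=lambda p: p + "'")
print(tau(2.0, ("A1", 5)))  # ('A1', 6)
print(tau(4.0, ("A1", 5)))  # ("A1'", 6)
```

In both branches the trial index advances by one; only the policy component may change, which mirrors the two displayed cases.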
A sophisticated adaptive plan will probably retain a measure of the average performance of the various policies tried, so that $\mathcal{A}$ would be further extended by a component $\mathcal{M}$ (see section 2.2) to $\mathcal{A}_1 \times \mathbb{N} \times \mathcal{M}$. A still more sophisticated plan will progressively reduce uncertainty about the environment by deliberately selecting elements of $C$ to elicit critical information, perhaps constructing a model of $f_E$. Then, by exploiting the model's predictions, $\tau$ can adjust the sequence $\langle C(1), C(2), \ldots \rangle$ toward better performance as measured by the function $J$. At this level the illustration concerning searches, pattern recognition, and statistical inference applies in toto. If the plan is to be a payoff-only plan, then
$$I(t) = Q(t)$$
and $\mathcal{M}(t + 1)$ is updated by using $Q(t)$ in a recalculation of the average performance of $\mathcal{A}_1(t)$, the policy in use at time $t$.
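For a payoff-only plan, the memory update amounts to folding each new cost $Q(t)$ into a running average for the current policy. The sketch below assumes, hypothetically, that $\mathcal{M}$ is stored as a dict from policy to (number of trials, average cost) and uses the standard incremental mean; the names `Memory` and `update_memory` are illustrative only.

```python
from typing import Dict, Hashable, Tuple

# Hypothetical layout for M: policy -> (number of trials, average cost).
Memory = Dict[Hashable, Tuple[int, float]]


def update_memory(memory: Memory, policy: Hashable, cost: float) -> Memory:
    """Return M(t+1) from M(t) by folding Q(t) into the average for policy A1(t)."""
    n, mean = memory.get(policy, (0, 0.0))
    n += 1
    mean += (cost - mean) / n  # incremental (running) mean, no history stored
    updated = dict(memory)     # leave M(t) itself unchanged
    updated[policy] = (n, mean)
    return updated


M: Memory = {}
for q in (4.0, 2.0, 3.0):  # payoff-only: the plan sees only I(t) = Q(t)
    M = update_memory(M, "A1", q)
print(M)  # {'A1': (3, 3.0)}
```

The incremental form needs only the count and current mean, so the memory component stays small no matter how many trials have been made.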
Finally, the function $J$ determines a ranking for every control sequence $\langle C(1), C(2), \ldots \rangle$, whether or not it is generated by a single policy. That is, an adaptive plan $\tau$ confronted with a law of motion $f_E$ may try several policies, thereby generating a control sequence that no single $A_1 \in \mathcal{A}_1$ could generate. However, every control action $C(t)$ has a definite cost $Q(t)$. Thus the trajectory $\langle C(1), C(2), \ldots \rangle$ through $C$ generated by $\tau$ can be ranked according to $J$. In this way $J$ determines a criterion for ranking any $\tau \in \mathcal{T}$ in any $E \in \mathcal{E}$. As a specific example, consider the case where the objective is minimization of cumulative error. By assigning maximum payoff to the target region and reducing the payoff of other states in proportion to the associated error, the performance of a plan $\tau$ can be measured in terms of the cumulative payoff function $U_E(\tau, t)$. The greater $U_E(\tau, t)$, the smaller the cumulative error to time $t$.
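The cumulative-error example can be made concrete by summing per-state payoffs along each trajectory and ranking trajectories by the total. This is a sketch under assumptions: the linear payoff `max_payoff - k * |error|`, the constants `max_payoff = 1.0` and `k = 0.1`, and the plan names and error sequences are all hypothetical, chosen only to show that a larger $U_E(\tau, t)$ corresponds to a smaller cumulative error.

```python
from typing import Dict, List, Sequence


def payoff(error: float, max_payoff: float = 1.0, k: float = 0.1) -> float:
    """Payoff of one state: maximal on target (error 0), reduced in proportion to error."""
    return max_payoff - k * abs(error)


def cumulative_payoff(errors: Sequence[float], max_payoff: float = 1.0,
                      k: float = 0.1) -> float:
    """U_E(tau, t): sum of per-state payoffs along a trajectory up to time t."""
    return sum(payoff(e, max_payoff, k) for e in errors)


# Hypothetical error trajectories produced by two plans in the same environment.
trajectories: Dict[str, List[float]] = {
    "plan_a": [3.0, 1.0, 0.0, 0.0],
    "plan_b": [3.0, 2.0, 1.0, 1.0],
}
ranking = sorted(trajectories,
                 key=lambda name: cumulative_payoff(trajectories[name]),
                 reverse=True)  # highest cumulative payoff first
print(ranking)  # ['plan_a', 'plan_b']
```

Because each payoff is a constant minus a multiple of the error, $U_E$ over $t$ steps equals $t \cdot$ `max_payoff` minus `k` times the cumulative absolute error, so ranking by $U_E$ is the same as ranking by cumulative error in reverse.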

 