that each element A be assigned a unique cost µE(A). To satisfy this requirement we can let the set of such elements be the product of the set of policies with the set of natural numbers {1, 2, 3, . . .}. Then unique elements of this set, namely (1A, t1), (1A, t2), . . ., (1A, tk), correspond to the successive trials of 1A, and the cost Q(ti) of trial ti can be assigned as required: µE((1A, ti)) = Q(ti).

An adaptive plan τ will modify the policy at intervals on the basis of observed costs. With the definition just given this means that, if 1A is tried at time t and is to be retained for trial at time t + 1, the plan's next choice is the element (1A, t + 1); on the other hand, if a new policy 1A' is to be tried, the next choice is (1A', t + 1).

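The retain-or-replace step described above can be sketched in Python. This is a minimal illustration rather than the text's own formulation: the (policy, trial index) pair follows the text, but the function name `next_structure`, the string policy labels, and the threshold rule for deciding when to switch are assumptions introduced here.

```python
from typing import Tuple

# A structure is a (policy label, trial index) pair, as in the text.
Structure = Tuple[str, int]


def next_structure(current: Structure, cost: float,
                   threshold: float, alternative: str) -> Structure:
    """Return the structure to try at the next time-step.

    Hypothetical rule (an assumption, not from the text): keep the current
    policy while its observed cost stays at or below `threshold`, otherwise
    switch to `alternative`. Either way the trial index advances by one.
    """
    policy, t = current
    chosen = policy if cost <= threshold else alternative
    return (chosen, t + 1)
```

For instance, `next_structure(("A", 3), 0.5, 1.0, "B")` retains policy `"A"` for trial 4, while a cost of `2.0` in the same call would switch to `"B"`.
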
A sophisticated adaptive plan will probably retain a measure of the average performance of the various policies tried, so that the set of structures would be further extended by a memory component (see section 2.2). A still more sophisticated plan will progressively reduce uncertainty about the environment by deliberately selecting elements of C to elicit critical information, perhaps constructing a model of fE. Then, by exploiting predictions of the model, τ can adjust the sequence to better performance as measured by the function J. At this level the illustration concerning searches, pattern recognition, and statistical inference applies in toto.

If the plan is to be a payoff-only plan, then the observed cost Q(t) is the only information it receives at time t, and the memory component at time t + 1 is updated by using Q(t) in a recalculation of the average performance of the policy tried at time t.

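One way to picture a memory of average performance is an incremental running mean per policy, updated from each observed cost. The sketch below is only an illustration under stated assumptions: the class name `AverageMemory`, its methods, and the choice of "lowest average cost" as the selection rule are hypothetical and do not come from the text.

```python
from collections import defaultdict


class AverageMemory:
    """Hypothetical memory component: average observed cost per policy."""

    def __init__(self) -> None:
        self.count = defaultdict(int)    # number of trials of each policy
        self.mean = defaultdict(float)   # running average cost of each policy

    def update(self, policy: str, cost: float) -> float:
        """Fold one observed cost into the policy's running average."""
        self.count[policy] += 1
        n = self.count[policy]
        # Incremental mean: new_mean = old_mean + (cost - old_mean) / n
        self.mean[policy] += (cost - self.mean[policy]) / n
        return self.mean[policy]

    def best(self) -> str:
        """Policy with the lowest average cost so far (assumed criterion).

        Raises ValueError if no policy has been tried yet.
        """
        return min(self.mean, key=self.mean.get)
```

The incremental form avoids storing the full cost history, which matches the idea of a payoff-only plan that carries forward only a summary of past observations.
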
Finally, the function J determines a ranking for every control sequence, whether or not it is generated by a single policy. That is, an adaptive plan τ confronted with a law of motion fE may try several policies, thereby generating a control sequence which no single policy 1A could generate. However, every control action C(t) has a definite cost Q(t). Thus the trajectory through C generated by τ can be ranked according to J. In this way J determines a criterion for ranking any plan in any environment. As a specific example, consider the case where the object is minimization of cumulative error. By assigning maximum payoff to the target region and reducing the payoff of other states in proportion to the associated error, the performance of a plan τ can be measured in terms of the cumulative payoff function UE(τ, t). The greater UE(τ, t), the less the cumulative error to time t.
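The relation between cumulative payoff and cumulative error can be made concrete with a small sketch. The linear payoff rule and the names `payoff`, `cumulative_payoff`, `max_payoff`, and `scale` are assumptions chosen for illustration; the text specifies only that payoff is maximal in the target region and falls in proportion to the error.

```python
from typing import Iterable


def payoff(error: float, max_payoff: float = 1.0, scale: float = 1.0) -> float:
    """Payoff of one state: maximal at zero error, reduced in proportion to error."""
    return max_payoff - scale * error


def cumulative_payoff(errors: Iterable[float],
                      max_payoff: float = 1.0, scale: float = 1.0) -> float:
    """Sum of per-step payoffs over the errors observed up to the current time."""
    return sum(payoff(e, max_payoff, scale) for e in errors)
```

Under this assumed rule, the cumulative payoff over T steps equals T times `max_payoff` minus `scale` times the cumulative error, so for a fixed horizon a larger cumulative payoff corresponds exactly to a smaller cumulative error.
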