5. GENERALIZED INFORMATION CRITERION FOR COMMUNICATION OPTIMIZATION

From the viewpoint of the generalized communication model, systems for weather forecasting, disease diagnosis, pattern recognition, and the like are all communication systems. To optimize these systems, we need a proper assessing criterion. The criterion that first comes to mind is probably the correctness rate. However, this criterion is generally unreasonable. For example, if one always predicts “Tomorrow will not be rainy” or “You do not have AIDS”, the correctness rates of these predictions will be more than 70% and 99% respectively, yet such predictions have no value. Similarly, if the stock index goes up today and the prediction “The stock index will go up tomorrow” is offered, the correctness rate would be over 60% based on historical data, but to the investor this prediction carries little value. To predict a numeric value like a stock index, the squared-error criterion is often used; yet it has a similar defect.

From the general perception of the term “information”, using an information measure as the assessing criterion is reasonable. However, Shannon's information measure is also not suitable as the assessing criterion. It is for this reason that classical information theory does not adopt an information criterion for assessing detection. Figure 4 and the following examples show that the generalized information measure is an effective assessing criterion in most cases.

5.1 Pattern Recognition and Weather Forecasts

Classical pattern recognition theory has not provided a suitable assessing criterion for two cases: one where patterns are divided by fuzzy boundaries, and another where patterns are not necessarily disjoint from each other. Yet researchers frequently encounter both cases. Weather forecasting has similar problems. For example, there are the predictions “Tomorrow will be rainy” and “Tomorrow will be very rainy”. The latter is a fuzzy judgment and is not disjoint from the former. The generalized information measure can be a more objective and more universal assessing criterion for pattern recognition and weather forecasting.

Let A={x₁, x₂, ..., x_m}, B={y₁, y₂, ..., y_n}, and C={z | z is an observed datum}, where z may be continuous, represent respectively the set of objects, the set of pattern judgments, and the set of feature vectors or observed data. When Z=z’∈C is given and the subjective probability forecast Q(X|z’) is assumed to be equal to the fact P(X|z’), how is the best judgment to be selected? If the selected judgment is too fuzzy, it will carry very little information; if it is too clear, the information might become negative. Using the following equation:

I(X; y_j) = \sum_{i=1}^{m} Q(x_i|z’) \log \frac{Q(x_i|A_j)}{Q(x_i)}, (36)

we can calculate the expected average amount of information provided by judgment y_j. We select different sentences to see which sentence provides the most average information. The sentence with the most information is most acceptable.

It can be proven that I(X;y_j) reaches its maximum when Equation (37) below holds.

Q(x_i|A_j)=Q(x_i|z’), i=1, 2, ..., m (37)

From this equation, we can infer that a sentence can provide the most information if its logical probability function or belief-degree function satisfies Equation (38) below.

Q(A_j|x_i)=C’Q(z’|x_i), i=1, 2, ..., m (C’ is a constant), (38)

Equation (38) means that the two function curves Q(A_j|X) and Q(z’|X) are similar in shape. From this conclusion, when the forecasted rainfall is rather certain, a sentence with a smaller extension will be the best selection; when the forecasted rainfall is very uncertain, a sentence with a fuzzier extension, like “There will be light or moderate rain”, will be the best selection.
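The sentence-selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes the expected information of a sentence is the forecast-weighted average of log[Q(A_j|x_i)/Q(A_j)], with Q(A_j) obtained by summing Q(x_i)Q(A_j|x_i). The rainfall levels, prior, membership values, and the names `expected_information` and `best_sentence` are all hypothetical.

```python
import math

def expected_information(q_x, q_x_given_z, membership):
    """Expected information in bits of one sentence, given forecast Q(X|z')."""
    # Logical probability of the sentence: Q(A_j) = sum_i Q(x_i) Q(A_j|x_i)
    q_a = sum(p * m for p, m in zip(q_x, membership))
    total = 0.0
    for p_z, m in zip(q_x_given_z, membership):
        if p_z == 0.0:
            continue  # outcomes the forecast rules out contribute nothing
        if m == 0.0:
            return float("-inf")  # the sentence rules out a possible outcome
        # By Bayes' rule, Q(x_i|A_j)/Q(x_i) = Q(A_j|x_i)/Q(A_j)
        total += p_z * math.log2(m / q_a)
    return total

def best_sentence(q_x, q_x_given_z, sentences):
    """Name of the sentence with the largest expected information."""
    return max(
        sentences,
        key=lambda name: expected_information(q_x, q_x_given_z, sentences[name]),
    )

# Hypothetical rainfall levels: none, light, moderate, heavy
PRIOR = [0.4, 0.3, 0.2, 0.1]
# Assumed membership functions Q(A_j|x_i) of two candidate sentences
SENTENCES = {
    "light rain": [0.05, 1.0, 0.2, 0.05],
    "light or moderate rain": [0.05, 1.0, 1.0, 0.1],
}
SHARP_FORECAST = [0.02, 0.9, 0.06, 0.02]   # rather certain Q(X|z')
VAGUE_FORECAST = [0.05, 0.45, 0.45, 0.05]  # very uncertain Q(X|z')

print(best_sentence(PRIOR, SHARP_FORECAST, SENTENCES))
print(best_sentence(PRIOR, VAGUE_FORECAST, SENTENCES))
```

With these illustrative numbers the sharp forecast selects the narrower sentence and the vague forecast selects the fuzzier one. Under the vague forecast the narrower sentence even scores slightly below zero, which matches the remark that a judgment that is too clear can carry negative information.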

We may regard the curve Q(z’|X) as a value taken by the source and the curve Q(A_j|X) as a value taken by the destination. In the communication that the classical theory deals with, the values taken by source and destination are points; however, in the generalized communication we consider now, outputs from the source and inputs to the destination are lines. The more nearly identical the areas covered by the two curves, and the smaller those areas, the more information is conveyed. The above method of selecting sentences also appears suitable for quantum detection in photon communication (Helstrom, 1976).

5.2 Prediction for Coding and Prediction of Stock Index

In communication, such as speech communication, we need to code a sequence of numbers with as short a code length as possible so that a receiver can decode it without (or with less) distortion. An effective method is predictive coding. With this method, we predict X=x_t with \hat{x}_t according to x_{t-1}, x_{t-2}, ..., x_{t-k} (\hat{x}_t is the predictive value of x_t), then transmit D_t=x_t-\hat{x}_t instead of x_t. At the receiving end, the receiver can calculate \hat{x}_t in the same way and obtain x_t=D_t+\hat{x}_t. Since D_t has a smaller range of uncertainty than x_t when the prediction is good, its Shannon entropy is smaller than that of x_t; hence only a shorter average code length is needed.
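The encode and decode steps of predictive coding can be sketched as follows. This is a minimal illustration under simple assumptions: the predictor uses only the previous sample (k=1, predicted value equal to x_{t-1}), the first sample is predicted as 0, and the test signal is a ramp chosen so that the predictor works well. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def encode(x):
    """Return residuals D_t = x_t - prediction, predicting x_t by x_{t-1}."""
    residuals = []
    prev = 0  # assumed prediction for the very first sample
    for value in x:
        residuals.append(value - prev)
        prev = value
    return residuals

def decode(residuals):
    """Rebuild x_t = D_t + prediction using the same predictor as the sender."""
    x = []
    prev = 0
    for d in residuals:
        prev = prev + d
        x.append(prev)
    return x

signal = list(range(16))    # slowly varying ramp 0, 1, ..., 15
residuals = encode(signal)  # mostly the constant step 1

print(entropy_bits(signal), entropy_bits(residuals))
print(decode(residuals) == signal)
```

The residual sequence has far fewer distinct values than the ramp, so its empirical entropy is lower and it can be coded with fewer bits. For a signal that the predictor tracks poorly, the residual entropy need not be smaller.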

Up to now, the assessing criterion of the prediction or predictive rules is subjectively selected. A popular criterion is the squared-error criterion. For example, in linear prediction:

\hat{x}_t = a₁x_{t-1} + a₂x_{t-2} + ... + a_k x_{t-k}, (39)

a group of coefficients a₁, a₂, ..., a_k is most acceptable if it makes Equation (40) below reach a minimum:

f = \sum_{t=1}^{T} (x_t - \hat{x}_t)^2. (40)

In Equation (40), T is the length of the sequence of numbers and t varies from 1 to T. Using the Lagrangian multiplier method and setting the partial derivative of f with respect to each a_r for r=1, 2, ..., k to zero, we obtain a group of equations:

\sum_{t=1}^{T} (x_t - \sum_{s=1}^{k} a_s x_{t-s}) x_{t-r} = 0, r=1, 2, ..., k, (41)

from which we can solve for a₁, a₂, ..., a_k.
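Solving for the coefficients under the squared-error criterion can be sketched in plain Python. This is an illustration rather than a reference implementation: it builds the normal equations obtained by setting the derivative of the summed squared error to zero and solves them by Gaussian elimination. It assumes the sum runs only over samples that have k predecessors, and the names `fit_linear_predictor` and `solve` are hypothetical.

```python
def solve(matrix, vector):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(vector)
    aug = [[float(v) for v in matrix[i]] + [float(vector[i])] for i in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * solution[c] for c in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_linear_predictor(x, k):
    """Coefficients a_1..a_k that minimize the summed squared prediction error."""
    # Each row holds the k past samples (x_{t-1}, ..., x_{t-k}) for one t
    rows = [[x[t - s] for s in range(1, k + 1)] for t in range(k, len(x))]
    targets = [x[t] for t in range(k, len(x))]
    # Normal equations: sum_s a_s * sum_t x_{t-r} x_{t-s} = sum_t x_{t-r} x_t
    gram = [[sum(row[r] * row[s] for row in rows) for s in range(k)] for r in range(k)]
    rhs = [sum(row[r] * y for row, y in zip(rows, targets)) for r in range(k)]
    return solve(gram, rhs)

# A sequence obeying x_t = x_{t-1} + x_{t-2}, so the best coefficients are (1, 1)
sequence = [1, 1, 2, 3, 5, 8, 13, 21]
coefficients = fit_linear_predictor(sequence, k=2)
print(coefficients)
```

Because the example sequence follows an exact two-term recurrence, the fitted coefficients reproduce it with zero error. On noisy data the same procedure returns the least-squares compromise.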

With the above criterion, the assessment is the same whether one correctly predicts an occasional event or a recurring event. This criterion is very similar to the criterion of personal selection “no error means virtue”. It encourages conservative predictions rather than information-rich predictions. For predictions of weather and stock indices, such a criterion lacks usefulness. The criterion of personal selection we need should be “more attainments and fewer errors mean virtue”. The generalized information criterion exhibits such a quality (Figure 4).

In predictive coding, we need to measure the information conveyed by \hat{x}_t about x_t. Let X stand for x_t, Z for the vector X^k=(x_{t-1}, x_{t-2}, ..., x_{t-k}), and Y for \hat{x}_t; then the predictive information is the generalized mutual information between Y and X.

From Section 4.2, we can see that using the generalized information as the criterion for assessing the prediction is consistent with the aim of compressing the average code length. This criterion needs fuzzy prediction. For example, we use y_j to denote the sentence “X approximately equals x_j” or “X belongs to fuzzy set A_j”. We can first derive Q(X|A_j) from Q(A_j|X) and Q(X) and then code X according to Q(X|A_j) so that the average code length is near the generalized conditional entropy H(X|Y). Considering sequential information as t varies from 1 to T, we have the generalized mutual information

. (42)

Assume Q(A_j|X) is a hill-like function:

, (43)

where y_t is defined by Equation (39); d_t is predicted by equation

d_t = b₁x_{t-1} + b₂x_{t-2} + ... + b_k x_{t-k}. (44)

Let the partial derivative of I(X;Y) with respect to a_r, b_r for r=1, 2, ..., k be zero; we have 2k equations

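\partial I(X;Y) / \partial a_r = 0, r=1, 2, ..., k, (45)

\partial I(X;Y) / \partial b_r = 0, r=1, 2, ..., k. (46)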

From this group of equations we can solve for the 2k coefficients. If d_t takes one of m' values, then we only need m' coding dictionaries for coding X. By contrast, it is generally impossible to code X directly according to P(X|X^k), for which m^k coding dictionaries are needed. It can be proven that if d_t is constant and Q(X|A_t) is a normal function, then Equation (42) reduces to Equation (40), which means the squared-error distortion criterion is a special case of the generalized information criterion.
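The reduction to the squared-error criterion can be made explicit with a short derivation. The sketch below rests on two assumptions not spelled out above: that the sequential information sums the log-ratio of Q(x_t|A_t) to Q(x_t) over t, and that Q(x_t|A_t) is a normal density centred on the predicted value, written \hat{x}_t here, with constant width d.

```latex
\begin{align*}
Q(x_t \mid A_t) &= \frac{1}{d\sqrt{2\pi}}
  \exp\!\left[-\frac{(x_t-\hat{x}_t)^2}{2d^2}\right], \\
\sum_{t=1}^{T} \ln \frac{Q(x_t \mid A_t)}{Q(x_t)}
  &= -\frac{1}{2d^2}\sum_{t=1}^{T}(x_t-\hat{x}_t)^2
     \;-\; T\ln\!\left(d\sqrt{2\pi}\right)
     \;-\; \sum_{t=1}^{T}\ln Q(x_t).
\end{align*}
```

Only the first term on the right depends on the coefficients a₁, ..., a_k, and it is the summed squared error scaled by a negative constant. Under these assumptions, maximizing the information is therefore the same as minimizing the squared error.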

For the prediction of a stock index, we may use a similar method to select the most information-rich prediction. For the prediction of other sequential signals, such as rainfall or product quality, the generalized information criterion would yield more satisfying results than the average squared-error criterion.

5.3 Signal Detection and Disease Diagnosis

In classical communication theory, detection means making a judgment y_j=“X=x_j” according to Z=z’. Generally, X and Y are discrete while Z is continuous (Figure 6 shows binary detection, a special case of detection). How to determine the line of demarcation between judgments depends on which detection criterion is used. In classical communication theory, a human-defined profit and loss (simply called loss) is used as the criterion. That is, for each pair of x_i and y_j, we define a loss function c(x_i, y_j). The detection that makes the average loss, Equation (47), reach a minimum is the most acceptable:

E[c(X, Y)] = \sum_{i} \sum_{j} P(x_i, y_j) c(x_i, y_j). (47)

The function c(x_i,y_j) can only be determined from experience or subjective selection. In most cases, the squared error (x_i-y_j)² is used as the loss function.

The generalized information I(x_i; y_j) can be used as an assessing function for the detection instead of the loss function in most cases where information value has not been taken into consideration. That is to say, a judgment y_j that makes Equation (48) below reach a maximum is the most acceptable.

I(X; y_j) = \sum_{i} P(x_i|z’) I(x_i; y_j) = \sum_{i} P(x_i|z’) \log \frac{Q(A_j|x_i)}{Q(A_j)}. (48)

In Equation (48), Q(A_j|x_i) is the probability of confusing x_i with x_j (refer to Figure 1).

Disease diagnosis is a similar case. For example, a medical assayer judges whether a patient’s test datum for some disease, say AIDS, is positive. In this case, X∈A={x₀=no-AIDS, x₁=AIDS}; Y∈B={y₀=“The assay is negative”, y₁=“The assay is positive”}; Z is the test datum (Figure 6). This binary detection is very similar to the detection of 0-1 codes in electric communication.

Figure 6 Binary detection for digital communication and disease diagnosis

Now consider Bayesian detection in the generalized information theory, with a binary source as an example. In classical communication theory, Bayesian detection, derived through a complicated deduction, proceeds as follows (Rosie, 1978): if

\frac{P(z’|x_0)}{P(z’|x_1)} > \frac{P(x_1)(c_{10}-c_{11})}{P(x_0)(c_{01}-c_{00})}, c_ij=c(x_i, y_j), i, j=0, 1, (49)

then we judge “X=x₀” or let Y=y₀; otherwise we judge “X=x₁” or let Y=y₁. Now we use the generalized information measure as the criterion for the detection. From Equation (48), the judgment should be that if I(X; y₀)>I(X; y₁), i.e.

\frac{P(z’|x_0)}{P(z’|x_1)} > \frac{P(x_1)(I_{11}-I_{10})}{P(x_0)(I_{00}-I_{01})}, I_ij=I(x_i;y_j), i, j=0, 1 (50)

then we let Y=y₀; otherwise, let Y=y₁.

How do we determine I_ij? One way is first to determine the confusion probability Q(A_j|x_i) for i, j=0, 1. Another way is to use statistics, for example from the physicians to whom the assay results are sent, to obtain the past objective probability P*(X) and the objective conditional probability P*(X|y_j). Let Q(X)=P*(X) and Q(X|A_j)=P*(X|y_j), from which we can calculate I_ij.
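The first route, starting from confusion probabilities, can be sketched for the binary case as follows. This is only an illustration under stated assumptions: I_ij is taken to be log[Q(A_j|x_i)/Q(A_j)] with Q(A_j) obtained by summing Q(x_i)Q(A_j|x_i), the judgment is whichever has the larger posterior-weighted information, and the prevalence and confusion values are made up. The function names are hypothetical.

```python
import math

def information_matrix(q_x, confusion):
    """I_ij = log2(Q(A_j|x_i) / Q(A_j)) in bits, for i, j in {0, 1}."""
    # Logical probability of each judgment: Q(A_j) = sum_i Q(x_i) Q(A_j|x_i)
    q_a = [sum(q_x[i] * confusion[i][j] for i in range(2)) for j in range(2)]
    return [[math.log2(confusion[i][j] / q_a[j]) for j in range(2)] for i in range(2)]

def decide(posterior, info):
    """Return 0 for y0 or 1 for y1, whichever has more expected information."""
    expected = [sum(posterior[i] * info[i][j] for i in range(2)) for j in range(2)]
    return 0 if expected[0] > expected[1] else 1

# Assumed prior Q(X): x0 = no disease, x1 = disease
Q_X = [0.99, 0.01]
# Assumed confusion probabilities Q(A_j|x_i); row i is the true state
CONFUSION = [
    [1.0, 0.01],  # healthy: "negative" fully true, "positive" rarely true
    [0.05, 1.0],  # ill: "negative" rarely true, "positive" fully true
]
INFO = information_matrix(Q_X, CONFUSION)

print(INFO)
print(decide([0.99, 0.01], INFO))  # posterior P(X|z') strongly favours x0
print(decide([0.5, 0.5], INFO))    # posterior evenly split
```

With these numbers a correct "positive" earns several bits because the disease is rare, while a correct "negative" earns almost nothing, so the rule switches to "positive" well before the posterior probability of disease reaches one half.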

Generally, people think that the absolute value of the loss from a wrong judgment is greater than the absolute value of the profit from a correct judgment. The generalized information criterion happens to have this asymmetrical feature (see Section 6.2). Hence, in most cases, the generalized information criterion is enough for the assessments and the consideration of profit and loss or information value is unnecessary.

If A and B are also continuous, then the detection will become an estimation. We can regard estimation as a special case of detection as m approaches infinity.