5.       GENERALIZED INFORMATION CRITERION FOR COMMUNICATION OPTIMIZATION

 

From the viewpoint of the generalized communication model, systems for weather forecasting, disease diagnosis, pattern recognition, and the like are all communication systems. To optimize these systems, we need a proper assessing criterion. The criterion that first comes to mind is probably the correctness rate. However, this criterion is generally unreasonable. For example, if one always predicts "Tomorrow will not be rainy" or "You do not have AIDS", the correctness rates of these predictions will be more than 70% and 99% respectively, yet predictions of this type have no value. Similarly, if the stock index goes up today and the prediction "The stock index will go up tomorrow" is offered, the correctness rate would be over 60% based on historical data, but to the investor this prediction carries little value. To predict a numeric value such as a stock index, the squared-error criterion is often used; yet it has a similar defect.

 

From the general understanding of the term "information", using an information measure as the assessing criterion is reasonable. However, Shannon's information measure is also unsuitable as the assessing criterion. It is for this reason that classical information theory does not adopt an information criterion for assessing detection. Figure 4 and the following examples show that the generalized information measure is an effective assessing criterion in most cases.

5.1    Pattern Recognition and Weather Forecasts

Classical pattern recognition theory has not provided a suitable assessing criterion for two cases: one in which patterns are divided by fuzzy boundaries, and another in which patterns are not necessarily disjoint from each other. Both cases are frequently encountered by researchers. Weather forecasting has similar problems. For example, there are the predictions "Tomorrow will be rainy" and "Tomorrow will be very rainy". The latter is a fuzzy judgment and is not disjoint from the former. The generalized information measure can serve as a more objective and more universal assessing criterion for pattern recognition and weather forecasting.

Let A={x_1, x_2, ..., x_m}, B={y_1, y_2, ..., y_n}, and C={z | z is an observed datum}, where z may be continuous, represent respectively the set of objects, the set of pattern judgments, and the set of feature vectors or observed data. When Z=z'∈C is given and the subjective probability forecast Q(X|z') is assumed to be equal to the fact P(X|z'), how is the best judgment to be selected? If the selected judgment is too fuzzy, it will carry very little information; if it is too precise, the information might become negative. Using the following equation:

    

, (36)

we can calculate the expected average amount of information provided by judgment y_j. We try different sentences to see which one provides the most average information; the sentence that provides the most is the most acceptable.

It can be proven that I(X; y_j) reaches its maximum when Equation (37) below holds.

Q(x_i|A_j) = Q(x_i|z'), i=1, 2, ..., m            (37)
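A short derivation may make the role of condition (37) clearer. Equation (36) is not legible in this copy, so the Kullback-type form on the first line below is an assumption, as is the use of Q(x_i|z') as the averaging weight. Under that assumption the average information splits into a term that does not depend on the sentence and a divergence term that vanishes exactly when (37) holds:

```latex
\begin{align*}
I(X; y_j)
  &= \sum_{i=1}^{m} Q(x_i \mid z') \log \frac{Q(x_i \mid A_j)}{Q(x_i)}
     && \text{(assumed form of Eq. 36)} \\
  &= \sum_{i=1}^{m} Q(x_i \mid z') \log \frac{Q(x_i \mid z')}{Q(x_i)}
   - \sum_{i=1}^{m} Q(x_i \mid z') \log \frac{Q(x_i \mid z')}{Q(x_i \mid A_j)} .
\end{align*}
% The first sum does not depend on A_j. The second sum is a Kullback-Leibler
% divergence, which is non-negative and equals zero if and only if
% Q(x_i | A_j) = Q(x_i | z') for every i, which is condition (37).
```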

From this equation, we can infer that a sentence provides the most information if its logical probability function, or belief-degree function, satisfies Equation (38) below.

 

Q(A_j|x_i) = C'Q(z'|x_i), i=1, 2, ..., m  (C' is a constant).   (38)
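The step from (37) to (38) can be sketched with Bayes' formula applied to both sides of (37). The only assumption is that Q(x_i|A_j) and Q(x_i|z') are both obtained from the same prior Q(X) in the usual Bayesian way:

```latex
\begin{align*}
Q(x_i \mid A_j) = \frac{Q(x_i)\, Q(A_j \mid x_i)}{Q(A_j)},
\qquad
Q(x_i \mid z') = \frac{Q(x_i)\, Q(z' \mid x_i)}{Q(z')} .
\end{align*}
% Setting the two expressions equal and cancelling Q(x_i) > 0 gives
\begin{align*}
Q(A_j \mid x_i) = \frac{Q(A_j)}{Q(z')}\, Q(z' \mid x_i) = C'\, Q(z' \mid x_i),
\qquad C' = \frac{Q(A_j)}{Q(z')} ,
\end{align*}
% where C' does not depend on i.
```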

Equation (38) means that the two function curves Q(A_j|X) and Q(z'|X) are similar in shape. It follows that when the forecast rainfall is rather certain, a sentence with a smaller extension will be the best selection; yet when the forecast rainfall is very uncertain, a sentence with a fuzzier extension, such as "There will be light or moderate rain", will be the best selection.
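The selection rule can be illustrated with a minimal Python sketch. It assumes that the average information of a sentence is the sum over i of P(x_i|z') log2[Q(A_j|x_i)/Q(A_j)], with Q(A_j) the prior-weighted average of Q(A_j|x_i); Equation (36) is not legible in this copy, so this form is an assumption. The rainfall levels, prior, truth functions, and forecasts are hypothetical numbers chosen only for illustration.

```python
import math

def average_information(posterior, prior, truth):
    """Average information (bits) of one sentence, given the forecast P(X|z').

    truth[i] plays the role of Q(A_j|x_i); the formula is an assumed form of Eq. (36).
    """
    q_a = sum(p * t for p, t in zip(prior, truth))  # logical probability Q(A_j)
    total = 0.0
    for p_post, t in zip(posterior, truth):
        if p_post == 0.0:
            continue  # impossible outcomes contribute nothing
        if t == 0.0:
            return -math.inf  # sentence excludes an outcome that may occur
        total += p_post * math.log2(t / q_a)
    return total

def best_sentence(posterior, prior, sentences):
    """Return the name of the sentence with the largest average information."""
    return max(sentences,
               key=lambda name: average_information(posterior, prior, sentences[name]))

# Hypothetical rainfall levels: none, light, moderate, heavy.
prior = [0.4, 0.3, 0.2, 0.1]
sentences = {
    "moderate rain": [0.05, 0.3, 1.0, 0.3],
    "light or moderate rain": [0.1, 1.0, 1.0, 0.3],
}
certain_forecast = [0.02, 0.08, 0.85, 0.05]
uncertain_forecast = [0.05, 0.45, 0.40, 0.10]

print(best_sentence(certain_forecast, prior, sentences))
print(best_sentence(uncertain_forecast, prior, sentences))
```

With these numbers the narrower sentence is selected for the sharp forecast and the fuzzier sentence for the spread-out forecast, in line with the conclusion drawn from Equation (38).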

We may regard the curve Q(z'|X) as a value taken by the source and the curve Q(A_j|X) as a value taken by the destination. In the communication that the classical theory deals with, the values taken by the source and the destination are points; in the generalized communication we consider now, however, the outputs from the source and the inputs to the destination are curves. The more closely the areas covered by the two curves coincide, and the smaller those areas are, the more information is conveyed. The above method of selecting sentences also seems suitable for quantum detection in photon communication (Helstrom, 1976).

5.2       Prediction for Coding and Prediction of Stock Index

In communication, such as speech communication, we need to code a sequence of numbers with as short a code length as possible so that a receiver can decode it without (or with less) distortion. An effective method is predictive coding. With this method, we predict X=x_t with x̂_t according to x_{t-1}, x_{t-2}, ..., x_{t-k} (x̂_t is the predicted value of x_t), and then transmit D_t = x_t - x̂_t instead of x_t. At the receiving end, the receiver can calculate x̂_t in the same way and obtain x_t = D_t + x̂_t. Since D_t has a smaller range of uncertainty than x_t, its Shannon entropy is smaller than that of x_t, and hence a shorter average code length suffices.
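The round trip described above can be sketched in a few lines of Python. The predictor used here simply repeats the previous sample; it is a hypothetical stand-in for the predictor based on x_{t-1}, ..., x_{t-k}, and the sample signal is made up for illustration.

```python
import math
from collections import Counter

def predict(history):
    """Hypothetical one-tap predictor: repeat the previous sample (0 at the start)."""
    return history[-1] if history else 0

def encode(samples):
    """Transmit residuals D_t = x_t - x_hat_t instead of the samples x_t."""
    residuals, history = [], []
    for x in samples:
        residuals.append(x - predict(history))
        history.append(x)
    return residuals

def decode(residuals):
    """Rebuild x_t = D_t + x_hat_t, computing x_hat_t exactly as the sender did."""
    samples = []
    for d in residuals:
        samples.append(d + predict(samples))
    return samples

def empirical_entropy(values):
    """Shannon entropy (bits per symbol) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

signal = [10, 11, 12, 13, 14, 15, 16, 17, 16, 15, 14, 13, 12, 11, 10, 11]
residuals = encode(signal)
print(decode(residuals) == signal)
print(empirical_entropy(signal), empirical_entropy(residuals))
```

For this slowly varying signal the residuals take only three distinct values, so their empirical entropy is lower than that of the raw samples; a signal that the predictor tracks poorly would not show the same gain.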

Up to now, the assessing criterion for predictions or predictive rules has been selected subjectively. A popular criterion is the squared-error criterion. For example, in linear prediction:

x̂_t = a_1 x_{t-1} + a_2 x_{t-2} + ... + a_k x_{t-k},   (39)

a group of coefficients a_1, a_2, ..., a_k is most acceptable if it makes Equation (40) below reach a minimum:

.      (40)

In Equation (40), T is the length of the sequence of numbers and t varies from 1 to T. Using the Lagrangian multiplier method and setting the partial derivative of f with respect to each a_r, r=1, 2, ..., k, to zero, we obtain a group of equations:

   r=1, 2, ..., k,   (41)

from which we can solve for a_1, a_2, ..., a_k.
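As an illustration of the squared-error fit, the following Python sketch builds and solves the normal equations that result from setting the partial derivatives of the summed squared error to zero. Equations (40) and (41) are not legible in this copy, so the exact form used here (a plain sum of squared errors over the samples that have k predecessors) is an assumption, and the function names are hypothetical.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [list(map(float, matrix[i])) + [float(vector[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_linear_predictor(samples, k):
    """Least-squares a_1..a_k for x_hat_t = a_1*x_{t-1} + ... + a_k*x_{t-k}."""
    # Each row holds (x_{t-1}, ..., x_{t-k}) for one t that has k predecessors.
    rows = [[samples[t - r] for r in range(1, k + 1)] for t in range(k, len(samples))]
    targets = samples[k:]
    # Normal equations: sum_t (x_t - x_hat_t) * x_{t-r} = 0 for r = 1..k.
    gram = [[sum(row[r] * row[s] for row in rows) for s in range(k)] for r in range(k)]
    rhs = [sum(row[r] * y for row, y in zip(rows, targets)) for r in range(k)]
    return solve(gram, rhs)

# A sequence that obeys x_t = x_{t-1} + x_{t-2} exactly, so the fit should
# recover coefficients close to (1, 1).
coefficients = fit_linear_predictor([1, 2, 3, 5, 8, 13, 21, 34], k=2)
print(coefficients)
```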

With the above criterion, the assessment is the same whether one correctly predicts an occasional event or a recurring event. This criterion is very similar to the criterion for selecting people that says "no error means virtue". It encourages conservative predictions rather than information-rich predictions. For predictions of weather and stock indices, such a criterion is of little use. The criterion for selecting people that we need should be "more attainments and fewer errors mean virtue". The generalized information criterion has this quality (Figure 4).
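The contrast with the correctness rate can be made concrete with a small Python sketch. It assumes that the information of judgment y_j about x_i is log2[Q(A_j|x_i)/Q(A_j)], where Q(A_j|x_i) is a confusion probability and Q(A_j) is its prior-weighted average; this definition comes from outside this section and is an assumption here. The prior and confusion probabilities are hypothetical.

```python
import math

def generalized_information(prior, confusion, i, j):
    """Information (bits) that judgment y_j conveys when the fact is x_i.

    confusion[j][i] plays the role of Q(A_j|x_i); the formula is an assumed
    definition, not one stated in this section.
    """
    q_aj = sum(prior[h] * confusion[j][h] for h in range(len(prior)))
    return math.log2(confusion[j][i] / q_aj)

# Hypothetical example: x_0 = no rain (common), x_1 = rain (rare).
prior = [0.8, 0.2]
confusion = [
    [1.0, 0.1],  # Q(A_0|x_0), Q(A_0|x_1) for the sentence "no rain"
    [0.1, 1.0],  # Q(A_1|x_0), Q(A_1|x_1) for the sentence "rain"
]

correct_common = generalized_information(prior, confusion, 0, 0)
correct_rare = generalized_information(prior, confusion, 1, 1)
wrong_rain = generalized_information(prior, confusion, 0, 1)
missed_rain = generalized_information(prior, confusion, 1, 0)

# Average information of always saying "no rain", which is right 80% of the time.
always_no_rain = prior[0] * correct_common + prior[1] * missed_rain
print(correct_common, correct_rare, wrong_rain, missed_rain, always_no_rain)
```

With these made-up numbers a correct prediction of the rare event scores higher than a correct prediction of the common one, wrong predictions score below zero, and the constant forecast has a negative average despite its 80% correctness rate.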

In predictive coding, we need to measure the information conveyed by x̂_t about x_t. Let X stand for x_t, Z for the vector X^k=(x_{t-1}, x_{t-2}, ..., x_{t-k}), and Y for x̂_t; then the predictive information is the generalized mutual information between Y and X.

From Section 4.2, we can see that using the generalized information as the criterion for assessing predictions is consistent with the aim of compressing the average code length. This criterion requires fuzzy predictions. For example, we use y_j to denote the sentence "X approximately equals x_j" or "X belongs to fuzzy set A_j". We can first derive Q(X|A_j) from Q(A_j|X) and Q(X), and then code X according to Q(X|A_j) so that the average code length is close to the generalized conditional entropy H(X|Y). Considering sequential information as t varies from 1 to T, we have the generalized mutual information

  .    (42)

Assume Q(Aj|X) is a hill-like function:  

,     (43)

where y_t, the predicted value, is defined by Equation (39), and d_t is predicted by the equation

d_t = b_1 x_{t-1} + b_2 x_{t-2} + ... + b_k x_{t-k}.  (44)

Setting the partial derivatives of I(X;Y) with respect to a_r and b_r, r=1, 2, ..., k, to zero, we have 2k equations

   (45)   (46)

From this group of equations we can solve for the 2k coefficients. If d_t takes one of m' values, then we need only m' coding dictionaries for coding X. By contrast, it is generally impossible to code X directly according to P(X|X^k), for which m^k coding dictionaries are needed. It can be proven that if d_t is constant and Q(X|A_t) is a normal function, then Equation (42) reduces to Equation (40), which means that the squared-error distortion criterion is a special case of the generalized information criterion.
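The reduction to the squared-error criterion can be sketched as follows. Equations (42) and (43) are not legible in this copy, so two assumptions are made: the generalized mutual information is taken to be the time average of log[Q(x_t|A_t)/Q(x_t)], and Q(X|A_t) is taken to be a normal density centred on the predicted value y_t with constant width d.

```latex
\begin{align*}
I(X;Y)
  &= \frac{1}{T} \sum_{t=1}^{T} \log \frac{Q(x_t \mid A_t)}{Q(x_t)}
     && \text{(assumed form of Eq. 42)} \\
Q(x_t \mid A_t)
  &= \frac{1}{d\sqrt{2\pi}} \exp\!\left[ -\frac{(x_t - y_t)^2}{2 d^2} \right]
     && \text{(assumed normal function, } d_t = d \text{ constant)} \\
I(X;Y)
  &= -\frac{1}{T} \sum_{t=1}^{T} \log Q(x_t)
     \;-\; \log\!\left( d\sqrt{2\pi} \right)
     \;-\; \frac{1}{2 d^2 T} \sum_{t=1}^{T} (x_t - y_t)^2 .
\end{align*}
% The first two terms do not depend on the coefficients a_1, ..., a_k, so
% maximizing I(X;Y) over them is equivalent to minimizing the sum of squared
% errors between x_t and its predicted value y_t.
```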

For the prediction of a stock index, we may use a similar method to select the most information-rich prediction. For the prediction of other sequential signals, such as rainfall or product quality, the generalized information criterion would yield more satisfactory results than the average squared-error criterion.

 

5.3       Signal Detection and Disease Diagnosis 

In classical communication theory, detection means making a judgment y_j = "X=x_j" according to Z=z'. Generally, X and Y are discrete while Z is continuous (Figure 6 shows binary detection, a special case of detection). Where the demarcation line between judgments is drawn depends on which detection criterion is used. In classical communication theory, an artificially defined profit and loss (simply called loss) is used as the criterion. That is, for each pair of x_i and y_j, we define a loss function c(x_i, y_j). The detection that makes the average loss, Equation (47), reach a minimum is the most acceptable:

.       (47)

The function c(x_i, y_j) can only be determined from experience or by subjective selection. In most cases, the squared error (x_i - y_j)^2 is used as the loss function.

The generalized information I(x_i; y_j) can be used as an assessing function for detection instead of the loss function in most cases where information value has not been taken into consideration. That is to say, a judgment y_j that makes Equation (48) below reach a maximum is the most acceptable.

  .    (48)  

In Equation (48), Q(Aj|xi) is the probability of confusing xi with xj (refer to Figure 1).

Disease diagnosis is a similar case. For example, a medical assayer judges whether a patient's test datum for some disease, say AIDS, is positive. In this case, X∈A={x_0=no-AIDS, x_1=AIDS}; Y∈B={y_0="The assay is negative", y_1="The assay is positive"}; Z is the test datum (Figure 6). This binary detection is very similar to the detection of 0-1 codes in electrical communication.


Figure 6  Binary detection for digital communication and disease diagnosis

 
Now consider Bayesian detection in the generalized information theory, taking a binary source as an example. In classical communication theory, Bayesian detection, derived through a complicated deduction, takes the following form (Rosie, 1978): if

 ,  cij=c(xi, yj), i, j=0,1,  (49)

then we judge "X=x_0" or let Y=y_0; otherwise we judge "X=x_1" or let Y=y_1. Now we use the generalized information measure as the criterion for detection. From Equation (48), the judgment should be: if I(X; y_0) > I(X; y_1), i.e.

 , Iij =I(xi;yj ),  i, j=0, 1   (50)

then we let Y=y0; otherwise, let Y=y1.
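The decision rule can be written as a short Python sketch. It assumes that I(X; y_j) is the average of I_ij weighted by the probabilities P(x_i|z'), since Equations (48) and (50) are not legible in this copy; the I_ij values below are hypothetical.

```python
def average_information(posterior, info, j):
    """I(X; y_j) as the P(x_i|z')-weighted average of I_ij (assumed form of Eq. 48)."""
    return sum(p * info[i][j] for i, p in enumerate(posterior))

def detect(posterior, info):
    """Return 0 (let Y=y_0) if I(X; y_0) > I(X; y_1), otherwise 1 (let Y=y_1)."""
    i_y0 = average_information(posterior, info, 0)
    i_y1 = average_information(posterior, info, 1)
    return 0 if i_y0 > i_y1 else 1

# Hypothetical information values I_ij = I(x_i; y_j), in bits.
info = [
    [0.3, -1.5],  # fact x_0: judged y_0 (correct), judged y_1 (wrong)
    [-3.0, 1.8],  # fact x_1: judged y_0 (wrong),   judged y_1 (correct)
]

print(detect([0.9, 0.1], info))
print(detect([0.7, 0.3], info))
print(detect([0.3, 0.7], info))
```

With these values the rule switches to y_1 once P(x_0|z') falls below 4.8/6.6, about 0.73, so the boundary is not at equal probabilities; the asymmetry comes entirely from the chosen I_ij.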

How do we determine I_ij? One way is first to determine the confusion probabilities Q(A_j|x_i) for i, j=0, 1. Another way is to use, for example, the statistics of the physicians to whom the assay results are sent: from them we obtain the past objective probability P*(X) and the objective conditional probability P*(X|y_j). Let Q(X)=P*(X) and Q(X|A_j)=P*(X|y_j), from which we can calculate I_ij.
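The second way can be sketched in Python as follows. The sketch assumes I_ij = log2[Q(x_i|A_j)/Q(x_i)] with Q(X)=P*(X) and Q(X|A_j)=P*(X|y_j) estimated from a table of past cases; the formula for I_ij is an assumption taken from outside this section, and the counts are hypothetical.

```python
import math

def information_matrix(joint_counts):
    """Estimate I_ij (bits) from counts of past cases.

    joint_counts[i][j] is the number of past cases with fact x_i and judgment y_j.
    Uses Q(x_i) = P*(x_i) and Q(x_i|A_j) = P*(x_i|y_j); the log-ratio form of
    I_ij is an assumed definition.
    """
    total = sum(sum(row) for row in joint_counts)
    n_judgments = len(joint_counts[0])
    column_sums = [sum(row[j] for row in joint_counts) for j in range(n_judgments)]
    info = []
    for row in joint_counts:
        p_x = sum(row) / total  # P*(x_i)
        info_row = []
        for j, count in enumerate(row):
            p_x_given_y = count / column_sums[j]  # P*(x_i|y_j)
            info_row.append(math.log2(p_x_given_y / p_x) if count > 0 else -math.inf)
        info.append(info_row)
    return info

# Hypothetical records: rows are facts (x_0 healthy, x_1 ill),
# columns are judgments (y_0 negative, y_1 positive).
counts = [
    [900, 50],
    [10, 40],
]
info = information_matrix(counts)
print(info)
```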

Generally, people think that the absolute value of the loss from a wrong judgment is greater than the absolute value of the profit from a correct judgment. The generalized information criterion happens to have this asymmetrical feature (see Section 6.2). Hence, in most cases, the generalized information criterion is sufficient for assessment, and considering profit and loss or information value is unnecessary.

If A and B are also continuous, then detection becomes estimation. We can regard estimation as a limiting case of detection as m approaches infinity.