Modeling of Time and Document Aging for
Request Prediction - One Step Further

Ernst-Georg Haffner, Uwe Roth, Thomas Engel, Christoph Meinel

Institute of Telematics
Bahnhofsstr. 30-32
54292 Trier, Germany
+49 (0) 651-97551-0

Email: {haffner|roth|engel|meinel}@ti.fhg.de

ABSTRACT

An important problem for prediction algorithms is the treatment of document aging and of changes in user request behavior. This paper proposes a way to model time in a standard prediction scenario and analyzes the results in a test environment.
The knowledge of characteristics of the generated test sessions is also applied to real user logs in order to determine explicit values for implicit access log information. With this technique, client-server communication on the Internet should be improved.

Keywords
Prediction, Internet performance, document aging.
 

1. INTRODUCTION

In reaction to the steady growth of the WWW, several methods to decrease user-perceived latency have been introduced. Prediction on the server side or the client side - carefully applied - is one possibility to improve the performance of Internet applications.
The idea of predicting future events originates from compiler construction and very large database (VLDB) design and is now being applied to Internet applications [12]. Many attempts have been made to find the best and most efficient algorithms for fulfilling prediction needs. Discrete Markov chain models were the basis for the first prediction algorithms for the Web ([1], [2]). The principle here is to remember the frequency of user requests and to apply an adequate statistical model. Later, these ideas were extended to a continuous chain approach [11] and to path profiling [14], which focuses on the order of document demands and the resulting request path. By using prediction, latency and system load may be reduced, but several risks may result from inaccurate prediction. The negative effects of incorrect prediction are discussed in [3] and [4]. Commonly accepted methods for guessing future data requests consider the difficulty of document aging only implicitly and indirectly, since the aging of files affects the request frequencies of the approaches mentioned above. Padmanabhan and Mogul mention the need for (explicit) time modeling but show no solution for the problem [13]. Some general considerations about document aging can also be found in [8].

The prediction scenario presented in this paper is based on our former approach described in [9]. The main prediction ideas and algorithms are similar to the scheme of [10]. We will strive here to refine the test scenario in order to present an adequate and explicit modeling of time and of the aging of documents. For a general discussion of performance modeling see also [7].
After describing the new model and the extended prediction algorithm (section 2), we present and analyze the test results (section 3). Section 4 verifies the model by running the prediction algorithm on real user logs and checking its results a posteriori. A summary and outlook (section 5) conclude the paper.
 

2. REFINEMENT OF THE PREDICTION MODEL

2.1 Approach in general

As in our former prediction scenario, we focus on document requests without considering either the order of requests or multiple requests per session [9]. Thus, a user session can be regarded as a binary vector. The whole test scenario therefore consists of two parts: a generation module that generates session vectors randomly, and a prediction module that takes incomplete session vectors (at the beginning of a session) and predicts the missing requests of the currently running user session.

The most interesting part of the generation module is the setting of the parameters that control the structure of the session vectors. It must be possible to model both very stringent and very random user behavior. Also, the average number of requests per session must be variable in order to obtain general and useful results. But the most important idea for the refinement of the prediction scenario presented here is an explicit modeling of time and of document aging.

It is not only the generation module that has to be adapted to implement time modeling. The prediction module has to be refined as well. The prediction module works in two different ways: in a learning phase, completed session vectors are used to update the weights of the Memory Matrix, and in a classification phase, incomplete session vectors are multiplied with the matrix to predict the missing requests (see 2.2.2).

The elements of the Memory Matrix represent frequencies of former requests and weights for future requests. To model time and document aging, only the learning phase must be refined. In section 2.2 we present a method to do this by "forgetting" former requests step by step: the longer the time span since the last request, the higher the degree of "forgetting".

2.2 Derivation of the time formulas

2.2.1 Extensions for the generation of session vectors

To extend the scenario of [9] as described in 2.1, we first recall its main ideas as far as they are needed in the following.

Every user session S is described as a binary vector of length A, S = (r_1, ..., r_M, r_{M+1}, ..., r_A), where the first M elements of S are predictable or semi-random; the other elements are random requests and are not predictable.

To control the generation of session vectors we use two important parameters: the density D, which determines the average number of requests per session, and the random factor R, which controls how strongly the request behavior varies at random.

Accordingly, the probability f for the elements of S to be set (= requested) can be denoted as (1):


 

This formula contains no time modeling factor. To implement a time factor TF in a prediction algorithm in a way that makes sense for automatically generated user requests in a test scenario, we need an additional parameter RCP, the "request change probability", for the generation of the data. The value RCP ∈ [0..0.5] simulates the different possibilities of user behavior. To express that there is no constant pattern in the request demands, RCP is set to 0.5; this means that the conditional probability of requesting a document d in a session is independent of any requests that have taken place before. If RCP is set to 0, then a change in the request behavior is impossible: if a document d is requested in one session, it must be requested in all subsequent sessions too.

A straightforward integration of the new factor RCP into formula (1) is shown in the following. As an abbreviation, we write d for the probability from (1) with which one of the first M (predictable) elements is set (2).

The new probabilities now depend on the values of the previous session (3):

The random part of the probability setting (M+1 ≤ j ≤ A) has not changed. To derive the formula for the first M probability values f of S, we consider the case where an element's value in the previous session was 1. The new probability must be higher than d for RCP < 0.5 (reaching 1 for RCP = 0), and it has to be the same as before, namely d, for RCP = 0.5:

f(r_j | r_{j-1}=1) = (probability to be "1") + (probability to be "0") × (RCP-dependent factor) = d + (1-d)×(1-2×RCP) = 1 + 2×RCP×(d-1)   (4)

Thus we get:

f(r_j | r_{j-1}=1) ∈ [d..1]   (5)

Analogously, we find:

f(r_j | r_{j-1}=0) = 2×RCP×d   (6)

f(r_j | r_{j-1}=0) ∈ [0..d]   (7)

The overall request probability must be the same as before, and it is easy to verify that the following equation (8) holds:

d×f(r_j | r_{j-1}=1) + (1-d)×f(r_j | r_{j-1}=0) = d   (8)

With the new probability values we can generate time-dependent binary vectors for the prediction test scenario.
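
To make the generation step concrete, the following Python sketch generates session vectors according to the conditional probabilities (4) and (6). The base probability d of the predictable elements and the request probability of the purely random part are simply passed in as parameters here, since formula (1) is not reproduced above; their concrete derivation from D, M, A and R follows [9].

import numpy as np

def generate_sessions(n_sessions, A, M, d, rand_p, RCP, seed=None):
    # A: session vector length, M: predictable part, d: base probability of a
    # predictable element (abbreviation from (2)), rand_p: probability of the
    # purely random part (assumed here), RCP: request change probability
    rng = np.random.default_rng(seed)
    sessions = np.zeros((n_sessions, A), dtype=int)
    prev = (rng.random(M) < d).astype(int)                # first session: probability d
    for t in range(n_sessions):
        # formulas (4) and (6): probabilities conditioned on the previous session
        p = np.where(prev == 1, 1 + 2 * RCP * (d - 1), 2 * RCP * d)
        pred = (rng.random(M) < p).astype(int)
        rand = (rng.random(A - M) < rand_p).astype(int)   # random part, unchanged model
        sessions[t, :M], sessions[t, M:] = pred, rand
        prev = pred
    return sessions

# example roughly matching the settings of section 3
vectors = generate_sessions(n_sessions=1000, A=125, M=50, d=0.2, rand_p=0.1, RCP=0.1)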

2.2.2 Adaptation of the prediction algorithm

To describe the extensions of the prediction algorithm of [9], we compare the components of the program with the new extensions.
The core of the prediction algorithm is a symmetric Memory Matrix φ. Every row and column represents a document available on the (web) server.
The classification of an (incomplete) session vector s is obtained by multiplying the matrix φ with s, resulting in a vector that represents the probabilities of the corresponding documents to be requested. Only values exceeding a system-dependent threshold lead to a prediction of the corresponding document request (details on how to determine this threshold can be found in [9]).
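
A minimal sketch of this classification step is given below. The way the raw weights of φ are turned into probability-like values (here: an estimate of the conditional request frequencies via the matrix diagonal) is an assumption made for illustration, since the exact calculation is part of [9].

import numpy as np

def classify(phi, session, threshold):
    # phi: A x A Memory Matrix, session: binary vector of the requests seen so far,
    # threshold: system-dependent minimum value for predicting a document
    seen = session.astype(bool)
    if not seen.any():
        return np.zeros_like(session)
    diag = np.diag(phi).copy()
    diag[diag == 0] = 1.0                       # avoid division by zero
    cond = phi / diag[:, None]                  # assumed estimate of P(d_j | d_i)
    scores = cond[seen].mean(axis=0)            # combine the rows of the seen requests
    predicted = (scores > threshold) & (~seen)  # predict only documents not yet requested
    return predicted.astype(int)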

A new session vector s = (s_i) is "learnt" by increasing the corresponding values ("weights") in φ = (φ_ij) (9):

φ_ij ← φ_ij + 1  if s_i = 1 and s_j = 1;  otherwise φ_ij remains unchanged.   (9)

To model time and the aging of documents, we extend the algorithm by introducing the time factor TF ∈ [0..1] (first mentioned in 2.2.1). This number describes how much the weight of a document that was not requested decreases per time unit since its last request. We denote the number of time units since the last request of a document d_j as TU_j. For web applications, a typical time unit is one day. Other time units are possible; for modeling time in our proposed model it is not important to know the exact duration of one unit. The weight of a document not requested is decreased by the time factor TF multiplied with the number of time units since its last request. Thus we get, for every row i with s_i = 1 (10):

φ_ij ← φ_ij + 1                    if s_j = 1
φ_ij ← max(0, φ_ij - TF×TU_j)      if s_j = 0   (10)

Rows of documents that were not requested in the current session (s_i = 0) remain unchanged.

The resulting Memory Matrix is not symmetric any more.
According to [9], the Memory Matrix φ serves as the basis for the prediction of documents. When φ is multiplied with a session vector, we obtain the conditional probabilities of all available documents. Usually, one starts with a session vector that has only one non-zero entry. After that, the next and most probable documents may be predicted. Even though φ is not symmetric any more, the prediction algorithm works in the same way as before.

The learning phase and formula (10) can be illustrated by a simple example. Let us consider the following settings:

A=3, TF=0.1, TU_1=TU_2=TU_3=1 day (i.e. the last request for every document is one day old).

Let us suppose the current Memory Matrix φ to be:

The session vector is given as s = (1 0 1). In that case, the resulting matrix after the learning step looks as follows:

Only the first and the last row have changed. Every element in the first and the last row that corresponds to a "1" in the session vector has been increased by 1. Every element that corresponds to a "0" has been decreased by TF=0.1. If such a value were to drop below 0, the new value would remain 0. The reason for this is that the relevance of a document which has not been requested should not negatively influence the prediction of other requests in the classification phase, since such a request could still take place later during the same session.
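
The learning step can be sketched as follows; the initial matrix in the listing is made up purely for illustration, since the original example matrix is not reproduced here.

import numpy as np

def learn(phi, session, TF, TU):
    # one learning step according to formula (10): for every row i with s_i = 1,
    # columns with s_j = 1 are increased by 1, columns with s_j = 0 are decreased
    # by TF * TU_j but never drop below 0; rows with s_i = 0 stay unchanged
    updated = phi.copy()
    for i in np.flatnonzero(session):
        updated[i, session == 1] += 1
        updated[i, session == 0] = np.maximum(
            0.0, updated[i, session == 0] - TF * TU[session == 0])
    return updated

# settings of the example: A=3, TF=0.1, one day since the last request everywhere
phi = np.array([[2.0, 1.0, 0.5],   # made-up weights, for illustration only
                [1.0, 3.0, 0.2],
                [0.5, 0.2, 1.0]])
s = np.array([1, 0, 1])
TU = np.ones(3)
print(learn(phi, s, TF=0.1, TU=TU))
# rows 0 and 2 change: their entries for the requested documents grow by 1,
# the entry for the unrequested document shrinks by 0.1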
 

3. RESULTS OF THE PREDICTION SCENARIO

3.1 Introduction to the test scenario

We ran millions of test runs with different settings for the session generation and prediction parameters. To be able to compare the results, we present in this section only settings with a session vector length of A=125, a predictable part of M=50 and a density value of D=25, while the other parameters R, RCP, the threshold and the time factor TF were varied. In section 4 we work with real logs and additional test settings to reflect the proposed ideas. Due to the time modeling factors TF and RCP it was necessary to restart every test setting several times. The reason for this becomes clear when we consider an RCP value of 0: evidently, the first M values of all session vectors within one test run are then the same. The overall statistical distribution of session values can only be captured adequately if the same test settings are run several times.

As a measure of the accuracy of the prediction algorithm we proposed the prediction quality PQ, the quotient of correctly and wrongly predicted requests (PQ = number of correct predictions / number of wrong predictions). PQ > 5 is a precondition for applying prediction as a means to improve the performance of the Client/Server communication.

3.2 Analysis of the results

3.2.1 The standard parameters A, M, D and R

In [9], the settings of A, M, D and R are analyzed. In general, the absolute value of A is not important for testing the relative effect of the other parameters, especially with regard to the PQ factor. Naturally, a high value of M leads to better prediction results, since the share of random values decreases. In our tests we analyzed different settings of M and of the quotient M/A, but in the results presented below we restrict ourselves to M/A = 0.2. The threshold value for probabilities that lead to a prediction of the corresponding document is calculated on the basis of system-dependent factors and is analyzed in [9]. For our simulation we tested several settings. If not stated otherwise, the results presented in this section are based on a threshold value of 0.95.

It is clear that for high values of the random factor R, a high M/A value yields much better results than a lower one. It is also important to understand the influence of the density D on the PQ factor. In general, D controls the absolute numbers of correctly and wrongly predicted requests, because prediction guesses are made only for the "1"s in the session vector and D determines the share of such entries. We can also confirm the observation that the prediction quality decreases more than linearly as the random factor R grows.

3.2.2 The time factor TF

Modeling time and document aging with the time factor TF improved the prediction quality. Figure 1 shows a strongly increasing PQ with an increasing time factor for low values of RCP. For high values of RCP the influence of TF disappears. The figure is based on a fixed random value of R=0.2.

Nevertheless, it should be mentioned that the absolute number of predictions is much lower for high values of TF. This can be explained by the improved prediction accuracy: even though there are fewer prediction attempts, the resulting quality is much better. This means that the time factor is able to control the prediction strategy. On the one hand, similar to the threshold, a high setting of TF leads to a pessimistic algorithm with only a few but very good predictions. On the other hand, a low value of TF or of the threshold allows many more prediction proposals, which are then of lower quality. We come back to this problem in 3.2.4.

The advantage of the time factor over the threshold value is evident: the former can be set individually for every available document, while the threshold must be calculated from system-dependent resources (see [9] for a discussion of the threshold dependencies).

3.2.3 The request change probability RCP

It was clear from the very beginning that the role of RCP is as important as that of the random factor. Both parameters control the structure and entropy of the generated session vectors. A low RCP value can improve the prediction quality even more: while decreasing values of R increase the PQ factor polynomially, a decreasing RCP can improve the quality exponentially. Figure 2 shows the prediction qualities for some settings of RCP at a random value of R=0.3.

In both figures (1 and 2) we have omitted the case RCP=0.0, because the prediction quality naturally increases so strongly for this value that even a logarithmic scaling of the values is not sufficient to obtain a readable graph. Figure 3 shows several time factor and random settings for a vanishing RCP value.

Here again, the influence of the time factor decreases with high values of the random factor.


 

3.2.4 The threshold value

Every prediction algorithm must deal with some sort of threshold: what is the minimum probability of a future user request that justifies a pre-calculation or a pre-fetch? We proposed a model where thresholds result directly from system- and document-dependent parameters [9]. This means that the threshold cannot be set arbitrarily. Nevertheless, we will see that the influence of this variable is enormous. In the next paragraph we present a possibility to overcome the problem of fixed threshold settings.

Figure 4 shows the prediction qualities depending on the time factor and threshold settings (R=0.1).

Figure 4 shows the absolute PQ values, and it is interesting to see that high thresholds lead to better prediction results, but only for time factors TF>0.0. What the graphs do not show is the absolute number of correctly predicted attempts (figure 5; the variable parameters are the same as in figure 4, with 100,000 test runs for each setting).
 

Here the graphical results are the inverse of figure 4. High threshold values prevent the prediction algorithm from making a good - or bad - guess at future requests. Thus the cost of good prediction quality is a comparatively low number of correct guesses. This aspect must be kept in mind when proposing settings for prediction algorithms. If the costs of wrongly predicted pre-fetches are too high, even good prediction qualities may not be acceptable because of the number of wrong guesses.
 

3.2.5 Inferences and dependencies of the controlling parameters R, TF, RCP

There are some interesting observations beyond the isolated analysis of the single parameters. One aspect is the growing importance of TF and R as RCP decreases toward zero (see figure 1 in 3.2.2).

The question is: what can be done with such information? Our idea was to recover those parameters from the prediction results of real user logs. In 4.1 we present those results. It is easier to find the best settings for TF if an exact, or even only a vague, estimate of factors like R or RCP is available. This leads to a kind of hybrid approach: while trying to predict user requests as accurately as possible, the same requests are analyzed a posteriori to find the best parameter settings for TF and RCP. Assuming that user behavior is, in general, statistically constant as long as the information offered by the server has not changed fundamentally, these parameter settings can be used to predict the upcoming sessions, as sketched below. Otherwise, if the parameter results are too poor and do not show any solid, constant user behavior, the prediction can be stopped completely.
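
A possible sketch of this hybrid control loop is shown below; estimate_parameters and simulate_quality are placeholders for the a posteriori analysis of section 4 and for the test scenario of section 3, and the PQ > 5 criterion is the precondition from 3.1.

def hybrid_prediction_control(recent_sessions, estimate_parameters, simulate_quality,
                              min_pq=5.0):
    # analyse the sessions just completed to estimate R, RCP and the best TF,
    # then decide whether prediction is still worthwhile
    params = estimate_parameters(recent_sessions)   # a posteriori fit (see section 4)
    expected_pq = simulate_quality(**params)        # expected quality with these settings
    if expected_pq < min_pq:
        return None                                 # behaviour too unstable: stop predicting
    return params                                   # keep predicting with the fitted settings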

4. VERIFICATION OF THE MODEL

4.1 Evaluation of real user logs

To verify our model, we extracted user requests from log files of a web server. The first step was a transformation of the logs into session vectors without considering the request order (a possible sketch of this step is given below). In accordance with the simulations of section 3, we randomly selected A=125 server documents with an unknown share of potentially predictable data, but we set M=125 to obtain results in which RCP carries high importance. The average density we found was D=6, with a high variance. To simulate different system-environment characteristics, we applied several threshold settings in the prediction model. Figure 6 shows a selection of the a posteriori prediction qualities depending on the threshold.
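
The transformation from raw access-log entries to binary session vectors could look as follows; the session heuristic (same client, requests separated by less than a timeout) and the timeout value are assumptions for illustration, not the exact procedure used for the logs above.

from collections import defaultdict

def logs_to_session_vectors(entries, documents, timeout=1800):
    # entries: (client, timestamp, url) tuples parsed from the access log
    # documents: the A selected server documents; their order defines the vector positions
    index = {doc: i for i, doc in enumerate(documents)}
    per_client = defaultdict(list)
    for client, ts, url in entries:
        per_client[client].append((ts, url))
    vectors = []
    for requests in per_client.values():
        requests.sort()
        current, last_ts = [0] * len(documents), None
        for ts, url in requests:
            if last_ts is not None and ts - last_ts > timeout:
                vectors.append(current)                 # gap too large: close the session
                current = [0] * len(documents)
            if url in index:
                current[index[url]] = 1                 # binary entry, request order ignored
            last_ts = ts
        vectors.append(current)
    return vectors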

The most obvious observation is the irregular behavior of the curves. For some time factors (e.g. TF=0.5), a lower threshold resulted in a better prediction quality, whereas normally high thresholds achieve the best results. For none of these threshold settings was the condition PQ>5 fulfilled.

It is interesting to look at the relationship between PQ and TF for a fixed threshold of 0.6. Figure 7 can thus be compared to figure 2 in 3.2.3 with an unknown implicit RCP value in the data.
 

The progression of the curve in figure 7 corresponds to a low RCP value. If the implicit RCP value were higher, the curve would look more like a straight line. In the real logs, under the conditions described, the best prediction results are achieved with a time factor of TF=0.1.

4.2 Investigation of unknown parameters

The combination of R and RCP makes it very difficult to determine their values directly from the log files. Both parameters influence the prediction quality enormously, though not in the same way. Even though it is easy to determine the average number of requests in a log file, as well as the standard deviation or the empirical variance, there is no easy or evident way to determine the settings of R or RCP without conducting test runs in the prediction scenario.

We know from section 3.2.4 that high thresholds show the most significant quality differences with respect to the setting of TF. Therefore, we used a threshold value of 0.95 to determine RCP and R, without considering system-dependent costs. Certainly, this works only a posteriori and not while the prediction algorithm is actually running.

We started our test runs with the settings described in 4.1 to find the best correspondence to the results found for the real logs. As a rule of thumb, for every fixed value of RCP a corresponding value of R can be found that yields a predefined prediction quality - and vice versa. But if we also try to match the absolute number of right (or wrong) guesses, only very few combinations of R and RCP remain.

Beginning with a coarse grid of values (R ∈ {0.0, 0.1, .., 1.0} and RCP ∈ {0.0, 0.1, .., 0.5}), it quickly becomes clear that the best values to meet our conditions lie in the area of R ∈ [0.8..0.9] and RCP < 0.1. The next step is the exact determination of both factors; here, the actually obtained prediction quality is the decisive criterion. As a result we find RCP to be 0.0703 and R to be 0.885. The time-factor-dependent curve we found in the test scenario and the corresponding values of the real logs are shown in figure 8.
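
This search can be sketched as a coarse grid search as described above, followed by a finer search around the best candidate; run_scenario is a placeholder for the test scenario of section 3 (returning prediction quality and the number of correct guesses for given parameters), and the matching tolerance is an assumption.

import numpy as np

def fit_r_and_rcp(run_scenario, target_pq, target_correct,
                  TF=0.1, threshold=0.95, tol=0.1):
    # coarse grid over R and RCP as described in the text
    candidates = [(R, RCP) for R in np.arange(0.0, 1.01, 0.1)
                           for RCP in np.arange(0.0, 0.51, 0.1)]
    best, best_err = None, float("inf")
    for R, RCP in candidates:
        pq, correct = run_scenario(R=R, RCP=RCP, TF=TF, threshold=threshold)
        if abs(correct - target_correct) > tol * target_correct:
            continue                       # absolute number of correct guesses must match
        err = abs(pq - target_pq)          # then the obtained prediction quality decides
        if err < best_err:
            best, best_err = (R, RCP), err
    # a second, finer grid around "best" would follow to pin down the exact values
    return best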

The resulting difference between the test setting and the real logs was smaller than 0.5 PQ. Several further tests must be made to classify user behavior automatically and dynamically before this technique can be used in prediction algorithms. With this we come one step closer to the aim of all prediction ideas: reducing user-perceived latency.

5. SUMMARY AND OUTLOOK

In this paper we presented a possibility to model time and document aging in a prediction environment. We extended a straightforward approach to treat time factors and request change probabilities and obtained some interesting test results.

One main idea of our study and of the implementation of the test scenario was the investigation of implicit parameters in real user logs. By analyzing structures and characteristics in the session vectors of a test scenario, we aimed to obtain explicit variable settings for real log characteristics, dependent on the data presented on the server side and on the individual user request behavior.

With this knowledge it should be simpler to design prediction algorithms that take the specific characteristics of the data on the server side and special user treatment into account. The presented technique can also help to classify access statistics for improving the information offered on the server side, even without using prediction tools.

Our current work focuses on means to combine both system parameters and request characteristics to improve the Client/Server communication.

REFERENCES

  1. Azer Bestavros, Using Speculation to Reduce Server Load and Service Time on the WWW, Proceedings of CIKM95, November 1995.
  2. Azer Bestavros, Speculative Data Dissemination and Service to Reduce Server Load, Network Traffic and Service Time in Distributed Information Systems, Proceedings of ICDE'96, New Orleans, Louisiana, March 1996.
  3. Mark Crovella, Paul Barford, The Network Effects of Prefetching, IEEE Infocom 1998.
  4. Ramón Cáceres, Fred Douglis, Anja Feldmann, Gideon Glass, Michael Rabinovich, Web Proxy Caching: The Devil is in the Details, Workshop on Internet Server Performance, June 1998, Madison, WI.
  5. Pei Cao, Sandy Irani, Cost-Aware Proxy Caching Algorithms, 1998.
  6. Edith Cohen, Balachander Krishnamurthy, Jennifer Rexford, Improving End-to-End Performance of the Web Using Server Volumes and Proxy Filters, 1998.
  7. Stephen Erickson, Danying Yi, Modelling the Performance of a Large Multi-Tiered Application, American Management Systems, AMS Center For Advanced Technology, 1999.
  8. James Griffioen, Randy Appleton, The Design, Implementation, and Evaluation of a Predictive Caching File System, CS Department, University of Kentucky, CS-264-96, 1996.
  9. Haffner, Roth, Engel, Meinel, A Semi-Random Prediction Scenario for User Requests, Proceedings of the Asia Pacific Web Conference, 1999, Online proceedings http://www.comp.polyu.edu.hk/APWEB99.
  10. Zhimei Jiang, Leonard Kleinrock, An Adaptive Network Prefetch Scheme, IEEE Journal on Selected Areas in Communications, 1998.
  11. Achim Kraiss, Gerhard Weikum, Vertical Data Migration in Large Near-Line Document Archives Based on Markov-Chain Predictions, Proceedings of the 23rd VLDB Conference, 1997.
  12. Achim Kraiss, Gerhard Weikum, Integrated document caching and prefetching in storage hierarchies based on Markov-chain predictions, The VLDB Journal, Springer-Verlag, 1998.
  13. V.N. Padmanabhan, J.C. Mogul, Using Predictive Prefetching to Improve World Wide Web Latency, Computer Communication Review, 1996.
  14. Schechter, Krishnan, Smith, Using path profiles to predict HTTP requests, Proceedings of the World Wide Web Conference, 1998.

 
 

Copyright 2000 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC 2000 March 19-21 Como, Italy
(c) 2000 ACM 1-58113-239-5/00/003...$5.00