The Influence of Caching on Web Usage Mining


J. Huysmans, B. Baesens & J. Vanthienen


Most web servers collect large amounts of data during their daily operation. Information such as which pages are requested and who is responsible for these requests is stored in log files. Analysis of these log files may yield worthwhile information on how to adapt the site to improve the user experience. However, the data in the log files is usually not stored in a format suited to analysis, and many operations are needed to transform the logs into a format that is convenient for the chosen type of analysis. After an overview of these operations, we discuss how caching of pages can skew the results of studies. We show how caching can be detected and how one can deal with it. Finally, the techniques are applied to the data of a European online wine shop.

Keywords: web usage mining, data pre-processing, data cleaning, caching, robot detection.

1 Introduction

More and more organizations depend on the web for the sale and marketing of their products, for informing customers, for contacting suppliers, and so on. Consequently, to measure the effectiveness of an advertising campaign or to make forecasts of various business variables, they can no longer rely only on traditional sources to acquire the required information. Among the newer data sources that contain worthwhile information are the log files generated by web servers. Every request made to the server is stored in these log files. An analysis of these requests can provide information on how to adapt the site. Such adaptations can improve the user experience and, in consequence, are likely to increase the profitability of the site.


