Probabilistic Modelling For Clickstream Analysis
L Di Scala & L La Rocca
Our aim is to explore the possibility of performing site-centric clickstream analysis by means of probabilistic modelling. We consider the clickstream originating from a given Web site as a Markovian sequence taking values in the site’s page-space. An extra page is added which represents the rest of the Web and is used to determine clickstream fractures (i.e. multiple visits). Different models for the memory of Web surfers and for their heterogeneity are investigated. As an example, the methodology is then applied to data originating from an e-commerce site.

1 Introduction

Almost every computer that hosts Web sites (that is, a Web server) keeps track, in its log-files, of the page requests received from all over the Internet. This results in a huge amount of raw data, possibly containing relevant information about the way people surf the hosted Web sites. We aim to show that probabilistic modelling can be helpful in analysing such data and, in this regard, we consider a variety of models, first tackling the methodological issues they raise and then fitting them to real data.

The typical information contained in a log-file, according to the widespread Common Log Format, is a list of file requests; each entry consists of (at least) the Internet Protocol address of the computer making the connection, the date of the request (complete with hour, minutes and seconds) and the Uniform Resource Locator of the file accessed. It is commonly accepted that, even before the simplest descriptive statistics are computed, the log-file should undergo some kind of data pre-processing, depending on the goals of the analysis.

In this paper, we focus our attention on the surfing paths within a given Web site, in order to both predict and classify surfing behaviour. This naturally leads to (at least) three pre-processing steps: pruning of requests, surfer identification and session identification.
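
To make the three pre-processing steps concrete, the following is a minimal Python sketch, not the procedure used in this paper. It rests on several assumptions that do not come from the text: requests for images, stylesheets and scripts are the ones pruned; a surfer is identified with an IP address; and a session is closed by a 30-minute inactivity timeout, a common heuristic that differs from the extra "rest of the Web" page described in the abstract. The names `parse_line`, `is_page_request` and `sessions_from_log` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumption: requests for these file types are page components, not pages.
PRUNED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

# Assumption: a gap longer than this between two requests starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, url) for a log entry, or None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), when, match.group("url")


def is_page_request(url):
    """Step 1 (pruning): keep only requests that look like page views."""
    path = url.split("?", 1)[0].lower()
    return not path.endswith(PRUNED_SUFFIXES)


def sessions_from_log(lines):
    """Return a list of (host, [url, ...]) pairs, one pair per session."""
    # Step 2 (surfer identification): group the pruned requests by IP address.
    by_surfer = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        host, when, url = entry
        if is_page_request(url):
            by_surfer[host].append((when, url))

    # Step 3 (session identification): split each surfer's path at long gaps.
    sessions = []
    for host, requests in by_surfer.items():
        requests.sort(key=lambda request: request[0])
        current, previous_time = [], None
        for when, url in requests:
            if previous_time is not None and when - previous_time > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(url)
            previous_time = when
        if current:
            sessions.append((host, current))
    return sessions
```

Each resulting session is a sequence of pages, which is the kind of object a Markovian clickstream model takes as input. Identifying surfers by IP address alone is crude, since proxies and shared machines merge distinct surfers, so this step in particular should be read as a placeholder.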