The Internet Encyclopedia (Volume 3)



WEB USAGE ANALYSIS FOR PERSONALIZATION

requests.” Cookies help keep track of several visits by the same customer in order to build his profile. Some marketing networks, such as DoubleClick, use cookies to track customers across many Web sites. Users can and often do disable cookies by changing the configuration parameters of their Web browsers. Another way to cope with marketing networks’ cookies is to check the cookie files regularly and delete them. Some utilities, such as CookieCop, let users automatically accept or reject certain cookies. The program runs as a proxy server and monitors all cookie-related events.

It should be pointed out, however, that rejecting cookies may disable e-commerce transactions. Many retail Web sites, for example, use cookies for shopping-cart implementation, user identification, passwords, etc. Rejecting cookies may also cause problems for companies attempting customization and personalization. In general, there is a tradeoff between privacy and personalization: the more information a user reveals, the more personalized services he obtains.

Other tracking devices, currently producing much controversy, are Web bugs, or clear GIFs. A Web bug is a hidden (or very small) image in a Web page that activates a third-party spying device without being noticed by the Web page visitors. Web bugs are usually used to track users’ purchasing and browsing habits.
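The visit-tracking role of cookies described above can be illustrated with a minimal sketch, assuming a server that assigns each new browser a random identifier and recognizes it on later visits. The cookie name `visitor_id`, the one-year lifetime, and the function `identify_visitor` are hypothetical choices made for illustration; they are not taken from any particular site or product.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name, chosen for this sketch


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    cookie_header is the raw value of the request's Cookie header (or None).
    set_cookie_value is None for a returning visitor; otherwise it is the
    value to send back in a Set-Cookie response header.
    """
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if COOKIE_NAME in jar:
        # Returning visitor: the browser sent back the cookie set earlier.
        return jar[COOKIE_NAME].value, None

    # First visit (or cookies rejected/deleted): issue a fresh identifier.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = 365 * 24 * 3600  # assumed lifetime: one year
    return visitor_id, out[COOKIE_NAME].OutputString()
```

A user who rejects or deletes the cookie is simply issued a new identifier on the next request, which is why rejecting cookies breaks profile building and cookie-based shopping carts.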

Mechanisms for User Identification

Server log data contains information about all users visiting a Web site. To associate the data with a particular user, user identification is performed. The simplest form of user identification is user registration, in which the user is usually asked to fill out a questionnaire. Registration has the advantage of being able to collect rich demographic information, which usually does not exist in servers’ logs. However, due to privacy concerns, many users choose not to browse sites requiring registration, or may provide false or incomplete information.

Another method for user identification is based on log file analysis. Log-based user identification is performed by partitioning the server log into sets of entries, each belonging to a single user. Accurate server log partitioning, however, is not always feasible due to rotating IP addresses at ISPs, missing references due to local or proxy server caching, anonymizers, etc. For example, many users can be mistakenly classified as a single user if they use a common ISP and have the same IP address. Several heuristics can be used to differentiate users sharing an IP address (Pirolli, Pitkow, & Rao, 1996). One may look for changes in the browser or operating system in the server log file. Because a user is expected to keep the same browser or operating system during his visit to a Web site, a change could mean that another visitor (with a different browser or operating system) uses the same IP address.

Another technique for user identification uses software agents loaded into browsers, which send back data. Due to privacy concerns, however, such agents are very likely to be rejected by users.

The most reliable mechanisms for automatic user identification are based on cookies. Whenever a browser contacts a Web site, it will automatically return all cookies associated with the Web site. Cookie-based user identification is reliable if the user launches each URL request from the same browser.

Another problem with user identification is that Web sites typically deal with both direct and indirect users (Ardissono & Goy, 2001). A customer is an indirect user if he visits a Web site on behalf of someone else. For example, a user may visit a Web store in order to buy a gift for a relative. In this case, the Web store must personalize gift suggestions and recommendations to the preferences of the intended beneficiary (the relative) and not to the preferences of the visitor. To overcome this problem, CDNOW (http://www.cdnow.com) offers a “gift advisor” which differentiates between direct and indirect users.
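The browser/operating-system heuristic for log-based identification can be sketched as follows. This is a simplified illustration, not the procedure of Pirolli, Pitkow, and Rao: it assumes each log entry has already been parsed into an IP address, a user-agent string (which typically encodes both browser and operating system), and a URL. The `LogEntry` fields and the function name are hypothetical.

```python
from collections import namedtuple

# Hypothetical parsed log record; real server logs need parsing first.
LogEntry = namedtuple("LogEntry", "ip user_agent url")


def partition_by_ip_and_agent(entries):
    """Group log entries into presumed users.

    Entries sharing an IP address are attributed to different users
    whenever their user-agent strings differ.
    """
    users = {}
    for entry in entries:
        key = (entry.ip, entry.user_agent)
        users.setdefault(key, []).append(entry)
    return users


log = [
    LogEntry("10.0.0.1", "Mozilla/4.0 (Windows)", "/index.html"),
    LogEntry("10.0.0.1", "Mozilla/4.0 (Macintosh)", "/index.html"),
    LogEntry("10.0.0.1", "Mozilla/4.0 (Windows)", "/catalog.html"),
]
presumed_users = partition_by_ip_and_agent(log)
```

Here the three requests from one IP address are split into two presumed users. The heuristic still merges distinct visitors who share both an IP address and an identical browser configuration, so it reduces misclassification rather than eliminating it.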

Session Identification

A user session consists of all activities performed by a user during a single visit to a Web site. Because a user may visit a Web site more than once, a server log may contain multiple sessions for a given user. Automatic session identification can be performed by partitioning log entries belonging to a single user into sequences of entries corresponding to different visits of the same user. Berendt, Mobasher, Spiliopoulou, and Wiltshire (2001) distinguish between time-oriented and navigation-oriented sessionizing.

Time-oriented sessionizing is based on timeouts. If the duration of a session or the time spent on a particular Web page exceeds some predefined threshold, it is assumed that the user has started a new session.

Navigation-oriented sessionizing takes into account the links between Web pages, the Web site topology, and the referrer information in a server log. A Web page P1 is a referrer to another page P2 if the URL request for P2 was issued by P1, i.e., the user came to P2 by clicking on a link on P1. A common referrer heuristic is based on the assumption that a user starts a new session whenever he uses a referrer different from or not accessible from previously visited pages. For example, if a user comes to page P2 with a referrer page P1 and P2 is not accessible from P1 given the Web site topology, then it is reasonable to assume that the user has started a new session. This heuristic, however, fails when the user uses the “Back” button or chooses a recent link kept by the browser.
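Both sessionizing styles can be sketched in a few lines, assuming the log has already been reduced to the requests of a single user in chronological order. The thresholds (10 minutes per page, 30 minutes per session), the function names, and the `links` dictionary describing the site topology are assumptions made for this illustration and do not come from Berendt et al.

```python
from datetime import datetime, timedelta


def sessionize_by_timeout(times,
                          max_page_stay=timedelta(minutes=10),
                          max_session=timedelta(minutes=30)):
    """Time-oriented: split sorted request times into sessions.

    A new session starts when the gap since the previous request exceeds
    max_page_stay or the session would last longer than max_session.
    """
    sessions = []
    for t in times:
        current = sessions[-1] if sessions else None
        if (current is not None
                and t - current[-1] <= max_page_stay
                and t - current[0] <= max_session):
            current.append(t)
        else:
            sessions.append([t])
    return sessions


def sessionize_by_referrer(requests, links):
    """Navigation-oriented: requests is a list of (url, referrer) pairs.

    links maps each page to the set of pages it links to. A request
    continues the current session only if its referrer was visited in
    that session and actually links to the requested page.
    """
    sessions = []
    for url, referrer in requests:
        current = sessions[-1] if sessions else None
        if (current is not None
                and referrer in current
                and url in links.get(referrer, set())):
            current.append(url)
        else:
            sessions.append([url])
    return sessions


start = datetime(2003, 7, 11, 11, 0)
visits = [start + timedelta(minutes=m) for m in (0, 5, 8, 60, 62)]
by_time = sessionize_by_timeout(visits)

topology = {"A": {"B"}, "B": {"C"}, "X": {"Y"}}
clicks = [("A", None), ("B", "A"), ("C", "B"), ("X", None), ("Y", "X")]
by_navigation = sessionize_by_referrer(clicks, topology)
```

In this example the 52-minute gap splits the timestamps into two sessions, and the request for page X, which arrives without a referrer from the current session, opens a second navigation-oriented session.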

Clickstream Analysis

Clickstream analysis is a special type of Web usage mining that provides information essential to understanding users’ behavior. The concept of clickstream usually refers to a visitor’s path through a Web site. It contains the sequence of actions entered as mouse clicks, keystrokes, and server responses as the visitor navigates through a Web site. Clickstream data can be obtained from a Web server log file, a commerce server database, or a client-side tracking application.

Most efforts in Web usage analysis are focused on discovering users’ access patterns. Understanding users’ navigation through a Web site can help provide customized content and structure tailored to their individual needs. Chen, Park, and Yu (1996) proposed an algorithm for mining maximal forward references, where a forward reference is defined as a sequence of pages requested by a user up to
