The Internet Encyclopedia (Volume 3)




WEB USAGE ANALYSIS FOR PERSONALIZATION

requests.” Cookies help keep track of several visits by the
same customer in order to build his profile. Some market-
ing networks, such as DoubleClick, use cookies to track
customers across many Web sites.
Users can and often do disable cookies by changing the
configuration parameters of their Web browsers. Another
way to cope with marketing networks’ cookies is by reg-
ularly checking the cookie files and deleting them. Some
utilities, such as CookieCop, let users automatically ac-
cept or reject certain cookies. The program runs as a proxy
server and monitors all cookie-related events.
It should be pointed out, however, that rejecting
cookies may disable e-commerce transactions. Many re-
tail Web sites, for example, use cookies for shopping-
cart implementation, user identification, passwords, etc.
Rejecting cookies may also cause problems for companies
attempting customization and personalization. In general,
there is a tradeoff between privacy and personalization.
The more information a user reveals, the more
personalized services he obtains.
Other tracking devices, currently producing much con-
troversy, are Web bugs, or clear GIFs. A Web bug is a
hidden (or very small) image in a Web page that acti-
vates a third-party spying device without being noticed
by the Web page visitors. Web bugs are usually used to
track users’ purchasing and browsing habits.

Mechanisms for User Identification
Server log data contains information about all users visit-
ing a Web site. To associate the data with a particular user,
user identification is performed. The simplest form of user
identification is user registration, in which the user is usu-
ally asked to fill out a questionnaire. Registration has the
advantage of being able to collect rich demographic in-
formation, which usually does not exist in servers’ logs.
However, due to privacy concerns, many users choose not
to browse sites requiring registration, or may provide false
or incomplete information.
Another method for user identification is based on log
file analysis. Log-based user identification is performed by
partitioning the server log into a set of entries belonging
to the same user. Accurate server log partitioning, however,
is not always feasible due to rotating IP addresses at
ISPs, missing references due to local or proxy server
caching, anonymizers, etc. For example, many users can be
mistakenly classified as a single user if they use a com-
mon ISP and have the same IP address. Several heuristics
can be used to differentiate users sharing an IP address
(Pirolli, Pitkow, & Rao, 1996). One may look for changes
in the browser or operating system in the server log file.
Because a user is expected to keep the same browser or
operating system during his visit to a Web site, a change
could mean that another visitor (with a different browser
or operating system) uses the same IP address.
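
The browser/operating-system heuristic above can be sketched as follows. This is a minimal illustration, not the procedure from the cited work: it assumes each log entry has already been parsed into a dictionary, and the field names `ip`, `agent`, and `url` are hypothetical. Entries that share an IP address but report a different user-agent string are treated as different presumed users.

```python
from collections import defaultdict


def identify_users(entries):
    """Group log entries into presumed users keyed by (IP, user agent).

    Assumption: each entry is a dict with hypothetical keys "ip" and
    "agent"; a real server log must be parsed into this shape first.
    """
    users = defaultdict(list)
    for entry in entries:
        # A different browser/OS string on the same IP address is taken
        # as evidence of a different visitor.
        key = (entry["ip"], entry["agent"])
        users[key].append(entry)
    return dict(users)


# Illustrative, made-up log entries: two visitors behind one IP address.
log = [
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0 (Windows)", "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0 (Macintosh)", "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0 (Windows)", "url": "/catalog.html"},
]

users = identify_users(log)
```

The heuristic is only as good as its assumption: two visitors with identical browser and operating system behind the same IP address would still be merged into one presumed user.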
Another technique for user identification uses software
agents loaded into browsers, which send back data. Due
to privacy concerns, however, such agents are very likely
to be rejected by users.
The most reliable mechanisms for automatic user identification
are based on cookies. Whenever a browser contacts
a Web site, it will automatically return all cookies
associated with the Web site. Cookie-based user identification
is reliable if the user launches each URL request
from the same browser.
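
As a rough sketch of how a site might use this mechanism, the function below reads the `Cookie` header a browser returns and either recognizes the visitor or issues a new identifier. The cookie name `visitor_id` and the function name are assumptions made for illustration. Parsing uses Python's standard `http.cookies` module.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name, chosen for this sketch


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    cookie_header is the raw value of the request's Cookie header, or None.
    set_cookie_value is None when the visitor was already known; otherwise
    it is the value to send back in a Set-Cookie response header.
    """
    cookie = SimpleCookie()
    if cookie_header:
        cookie.load(cookie_header)
    if COOKIE_NAME in cookie:
        # Returning visitor: the browser sent the cookie back.
        return cookie[COOKIE_NAME].value, None

    # First visit (or cookies rejected/deleted): issue a fresh identifier.
    visitor_id = uuid.uuid4().hex
    response_cookie = SimpleCookie()
    response_cookie[COOKIE_NAME] = visitor_id
    response_cookie[COOKIE_NAME]["path"] = "/"
    return visitor_id, response_cookie[COOKIE_NAME].OutputString()


# First request carries no cookie; the second returns the issued one.
first_id, set_cookie = identify_visitor(None)
second_id, nothing_to_set = identify_visitor(COOKIE_NAME + "=" + first_id)
```

A visitor who switches browsers or deletes the cookie appears as a new user, which is why this form of identification is reliable only when every request comes from the same browser.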
Another problem with user identification is that Web
sites typically deal with both direct and indirect users
(Ardissono & Goy, 2001). A customer is an indirect user if
he visits a Web site on behalf of someone else. For exam-
ple, a user may visit a Web store in order to buy a gift for a
relative. In this case, the Web store must personalize gift
suggestions and recommendations to the preferences of
the intended beneficiary (the relative) and not to the pref-
erences of the visitor. To overcome this problem, CDNOW
(http://www.cdnow.com) offers a “gift advisor” which dif-
ferentiates between direct and indirect users.

Session Identification
A user session consists of all activities performed by a user
during a single visit to a Web site. Because a user may
visit a Web site more than once, a server log may con-
tain multiple sessions for a given user. Automatic session
identification can be performed by partitioning log entries
belonging to a single user into sequences of entries cor-
responding to different visits of the same user. Berendt,
Mobasher, Spiliopoulou, and Wiltshire (2001) distinguish
between time-oriented and navigation-oriented session-
izing. Time-oriented sessionizing is based on timeout. If
the duration of a session or the time spent on a partic-
ular Web page exceeds some predefined threshold, it is
assumed that the user has started a new session.
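
A minimal sketch of time-oriented sessionizing, assuming the requests of a single user are available as timestamps in seconds. The two thresholds (600 seconds between consecutive requests, 1800 seconds for a whole session) are illustrative values, not figures from the cited work.

```python
def sessionize_by_time(timestamps, max_gap=600, max_duration=1800):
    """Split one user's request timestamps (in seconds) into sessions.

    A new session starts when the time since the previous request exceeds
    max_gap (page-stay timeout) or when the time since the first request
    of the current session exceeds max_duration (session timeout).
    Both default values are assumptions for illustration.
    """
    sessions = []
    current = []
    for t in sorted(timestamps):
        if current and (t - current[-1] > max_gap or t - current[0] > max_duration):
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions


# Split caused by long pauses between requests.
by_gap = sessionize_by_time([0, 120, 300, 2000, 2100, 4000])
# Split caused by the total session duration.
by_duration = sessionize_by_time([0, 500, 1000, 1500, 2000])
```

In the first call the pauses of 1700 and 1900 seconds exceed the page-stay timeout; in the second, no single pause is too long, but the request at 2000 seconds falls outside the 1800-second session limit.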
Navigation-based sessionizing takes into account the
links between Web pages, the Web site topology, and the
referrer information in a server log. A Web page P1 is a
referrer to another page P2 if the URL request for P2 was
issued by P1, i.e., the user came to P2 by clicking on a link on
P1. A common referrer heuristic is based on the assumption
that a user starts a new session whenever he uses a
referrer different from or not accessible from previously
visited pages. For example, if a user comes to page P2 with
a referrer page P1 and P2 is not accessible from P1 given the
Web site topology, then it is reasonable to assume that the
user has started a new session. This heuristic, however,
fails when the user uses the “Back” button or chooses a
recent link kept by the browser.
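
One simplified variant of the referrer heuristic can be sketched as follows. It assumes the requests of a single user are given in time order as (page, referrer) pairs, and it checks only whether the referrer is among the pages already visited in the current session. It does not consult the Web site topology, so it is a narrower rule than the one described above.

```python
def sessionize_by_referrer(requests):
    """Split one user's requests into sessions using referrer information.

    requests: list of (page, referrer) tuples in time order; referrer may
    be None. Simplifying assumption: a request opens a new session when its
    referrer is not a page already visited in the current session.
    """
    sessions = []
    current = []
    visited = set()
    for page, referrer in requests:
        if current and referrer not in visited:
            sessions.append(current)
            current, visited = [], set()
        current.append(page)
        visited.add(page)
    if current:
        sessions.append(current)
    return sessions


# Made-up request sequence: the fourth request arrives from an external page.
requests = [
    ("/home", None),
    ("/products", "/home"),
    ("/cart", "/products"),
    ("/news", "http://search.example/"),
    ("/news/1", "/news"),
]
sessions = sessionize_by_referrer(requests)
```

Here the request for `/news` carries a referrer that was never visited in the current session, so it opens a second session.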

Clickstream Analysis
Clickstream analysis is a special type of Web usage min-
ing that provides information essential to understanding
users’ behavior. The concept of clickstream usually refers
to a visitor’s path through a Web site. It contains the se-
quence of actions entered as mouse clicks, keystrokes,
and server responses as the visitor navigates through a
Web site. Clickstream data can be obtained from a Web
server log file, a commerce server database, or a client-side
tracking application.
Most efforts in Web usage analysis are focused on dis-
covering users’ access patterns. Understanding users’ nav-
igation through a Web site can help provide customized
content and structure tailored to their individual needs.
Chen, Park, and Yu (1996) proposed an algorithm for mining
maximal forward references, where a forward reference
is defined as a sequence of pages requested by a user up to