Evaluating Pricing Strategy Using e-Commerce Data: Evidence and Estimation Challenges

AyeshaShaikh83 · May 10, 2015

Statistical Science
2006, Vol. 21, No. 2, 131–142
DOI: 10.1214/088342306000000187
© Institute of Mathematical Statistics, 2006
Evaluating Pricing Strategy Using
e-Commerce Data: Evidence and
Estimation Challenges
Anindya Ghose and Arun Sundararajan
Abstract. As Internet-based commerce becomes increasingly widespread,
large data sets about the demand for and pricing of a wide variety of products
become available. These present exciting new opportunities for empirical
economic and business research, but also raise new statistical issues and chal-
lenges. In this article, we summarize research that aims to assess the optimal-
ity of price discrimination in the software industry using a large e-commerce
panel data set gathered from Amazon.com. We describe the key parameters
that relate to demand and cost that must be reliably estimated to accomplish
this research successfully, and we outline our approach to estimating these
parameters. This includes a method for “reverse engineering” actual demand
levels from the sales ranks reported by Amazon, and approaches to estimat-
ing demand elasticity, variable costs and the optimality of pricing choices
directly from publicly available e-commerce data. Our analysis raises many
new challenges to the reliable statistical analysis of e-commerce data and we
conclude with a brief summary of some salient ones.
Key words and phrases: Electronic commerce, pricing strategy, price dis-
crimination, versioning, quality differentiation, sales rank.
Anindya Ghose is Assistant Professor of Information,
Operations and Management Sciences, and Arun
Sundararajan is Assistant Professor of Information Systems
and Director (IT Economics) of the Center for Digital
Economy Research, Leonard N. Stern School of Business,
44 West 4th Street, New York, New York 10012, USA.
1. INTRODUCTION
The adoption of Internet-based commerce has provided
academic researchers with a wealth of new data
on demand and pricing across a number of industries.
The availability of these data and their growing use in
empirical studies of electronic commerce raises a number
of new statistical and econometric issues. In this
article, we describe how we empirically analyze and
evaluate pricing strategy in the consumer software industry
using a large-scale e-commerce data set from
Amazon.com. We describe some of the methods we
have applied to our analysis of these data, how we have
adapted them to address issues unique to e-commerce
data, and we summarize open challenges whose resolution
will help facilitate more robust empirical research
in electronic commerce.
Pricing strategy in the consumer software indus-
try (and in many other industries) often involves the
use of price discrimination, which, broadly, aims to
identify (directly or otherwise) customers who are
willing to pay more for a product and to charge
them a higher price. Beyond the notion of “first-degree”
price discrimination, which involves charging
different consumers different prices for an identical
good (Aron, Sundararajan and Viswanathan, 2006;
Choudhary et al., 2005), there are a variety of ways
that firms price-discriminate. For example, a seller may
price differently depending on whether a consumer
has purchased from the firm before (these are typically
called introductory offers). A seller may vary
the price of a product depending on how many units
of the product are purchased by an individual consumer;
this is commonly referred to as nonlinear pricing
(Sundararajan, 2004b). A seller may base the price
of a product on whether other related products are also
purchased from the same firm: this is called bundling
(Bakos and Brynjolfsson, 1999), and a seller may
choose to implement either pure bundling, under which
a set of products are sold only as a bundle, or mixed
bundling, under which both the bundle and individual
products are sold (Ghose and Sundararajan, 2005b). As
an example of the latter, Microsoft sells its Office suite
of software as a bundle of Word, Excel and PowerPoint
in addition to selling each of these products individu-
ally. A seller may create different but related versions
of a product (typically one of higher quality or with
more features) and price them differently. This is re-
ferred to as versioning and aims to price-discriminate
by exploiting differences in how much different cus-
tomers value product quality. There are multiple ver-
sions of a large number of popular desktop software
titles that differ only in their quality or number of fea-
tures (rather than in their development or release date)
and that are sold at different prices. Current examples
include Adobe Acrobat, TurboTax, Microsoft Money
and Norton AntiVirus. These are examples of software
titles for which a firm has developed a flagship version,
disabled a subset of the features or modules of
this version, and released both the higher quality ver-
sion and one or more lower quality versions simultane-
ously. A related form of price discrimination is based
on releasing successive generations of the same prod-
uct in multiple periods, with a period of time where
the old and new generations overlap; since each new
generation represents an improvement in the overall
performance of the product, the simultaneous presence
of two or more successive generations is analogous to
the presence of two or more related products of varying
quality (Ghose, Huang and Sundararajan, 2005).
The objective of a software company that price-discriminates
is to maximize the profits it generates
from the sale of its products. However, price discrimination
can often have countervailing effects on a firm’s
profits. For instance, two consequences of introducing
a lower quality version of an existing product to price-discriminate
are the loss of profits from customers
who switch from purchasing the higher quality version
to purchasing the lower quality version (commonly
termed cannibalization) and a gain in profits from new
customers, for the lower quality version, who either did
not purchase the product earlier or who purchased a
competing product. [In many ways this is similar to the
cannibalization that occurs when used products com-
pete simultaneously with new products (Ghose, Telang
and Krishnan, 2005; Ghose, Smith and Telang, 2006).]
The interplay between these consequences eventually
determines the optimality of versioning. A similar pair
of consequences, with opposing effects, characterizes
the eventual profitability of bundling. Similarly, nonlinear
pricing that discounts high usage levels too extensively
can reduce a seller’s profits.
Thus, to profit from price discrimination, a software
company must make an appropriate choice of the form
of price discrimination; it must choose its prices opti-
mally and sometimes it must determine optimal qual-
ity levels for an inferior (related) set of products or
the size of a bundle. There is no published research
with evidence that software companies in fact make
these price-discrimination choices optimally; however,
the availability of detailed price and demand data from
e-commerce sites like Amazon now makes it feasible
to empirically assess the optimality of their choices.
A first goal of our research program is therefore to
use these data to evaluate the optimality of such
price-discrimination strategies in the software industry
empirically. This is a problem of significant economic
importance.
To do so, one first needs a method to convert “sales
ranks” reported by Amazon.com into actual demand
levels. Amazon publishes a sales rank for each prod-
uct it sells, which is the rank of the product within its
category based on recent demand (more on this later).
Next, the demand system associated with our products
(i.e., how the variation in prices is associated with vari-
ation in demand) needs to be estimated. Amazon.com
does not provide any data about the variable cost of the
products it sells: we therefore also need to infer these
costs from our data (since the profit to a seller is determined
not just by price charged and quantity sold, but
also by its cost per unit). We describe our approach to
accomplish each of these goals. We briefly summarize
other estimates that contribute to our research program
and we conclude with some of the key statistical challenges
that emerge from our analysis. The flow chart in
Figure 1 indicates the steps we prescribe for estimating
the optimality of pricing.
FIG. 1. Sequence of steps to determine the optimality of software pricing from e-commerce data.
2. SUMMARY OF DATA
Our data are compiled from publicly available information
on new software prices and sales rankings
at Amazon.com, the largest on-line retailer of consumer
software. Our data are gathered using automated
Java programs to access and parse HTML and XML
pages downloaded from its web site, three times each
day, at equally spaced intervals. Our sample contains
330 products randomly selected from each of four ma-
jor categories—business and productivity, security and
utilities, graphics and development and operating sys-
tems software. (Our random sample was created by
first compiling a list of software products that were sold
on Amazon during the year and then using Excel’s ran-
dom number generator to choose from them. We chose
a sample size of 330 since this yielded what we felt was
a sufficient number of distinct titles within each major
category.) We collect all relevant data on list prices
(the manufacturer’s suggested price), new prices (the
price charged by Amazon.com), sales ranks (to be discussed
further soon), product release date, average customer
review and number of reviewers who contributed
to this average. To facilitate a clearer understanding of
how each of these pieces of information is reported to
a consumer on Amazon.com’s web site, a screen shot
of an Amazon page is illustrated in Figure 2.
Our sample consists of products that have different
versions as well as products that are sold as bundles
(in addition to the individual components). We are able
to determine this because software manufacturers use
terms like “premier,” “deluxe” or “standard” to denote
versions of the same title that vary in quality (which is
typically measured by the number of features). Simi-
larly, a product suite that contains individual products
has the term “bundle” associated with it. The details
of individual components within each bundle are pro-
vided in a bundle’s product description. Based on the
release date of the product, we can infer if it is the cur-
rent or the previous year’s edition.
We also collect data on secondary market activity, including
used prices (prices charged by sellers who have
posted second-hand copies of the product for sale) and
new prices from non-Amazon sellers (these are sellers
who are not affiliated with Amazon but are allowed to
sell goods on Amazon in exchange for a commission
on the transaction price). Table 1 provides summary
statistics of our data.
FIG. 2. How the data we gather from Amazon.com are displayed on its web site.
TABLE 1
Summary statistics
Variable                 Mean       Std. dev.   Min      Max
Sales rank               1649.61    1971.26     1        11622
List price               69.16      226.17      19.95    1799.99
Amazon price             65.53      208.57      14.95    1699.99
New non-Amazon price     17.74      23.08       10.01    209.99
Customer rating          3.14       0.99        1        5
Number of reviewers      25.72      66.3        1        606
Days release             717.7      1336.22     0        1750
We have categorized our software titles in three
ways: (i) based on those titles that have just two ver-
sions and those that have more than two versions,
(ii) based on whether the title is sold as part of a bun-
dle of other products or as a stand-alone product, and
(iii) based on whether the title is from the most recent
generation or from a previous generation. This catego-
rization is summarized in Table 2. Thus, for example,
our sample contains 32 unique titles which each have
two versions (a higher quality and a lower quality ver-
sion). Similarly, our sample contains 56 unique titles
which each have both the current and the previous gen-
eration available simultaneously. The other rows can be
interpreted in a similar way.
TABLE 2
Various product categories in sample
Product category         Number of unique titles   Total number of products
Bundles                  68                        136
Versions (2)             32                        64
Versions (>2)            19                        57
Successive generation    56                        112
Our data were collected between January 2005 and
November 2005. (For the duration of our study there
were a few instances during which the Java program
was unable to collect data all three times during the
day. In most cases this happened if the Amazon server
was not functioning properly during the time the data
were being gathered. However, this does not affect our
analysis, primarily because of the low frequency with
which prices are changed by Amazon. In general, we
find that there are far fewer price changes than changes
in sales rank; thus any missing information on prices
within the same day will have almost no impact on
our estimates of price elasticities.) The distributions of
sales ranks and retail (Amazon) prices across our products
are summarized in Figure 3(a) and (b). We also
provide a scatterplot of prices and sales ranks in Figure 3(c).
3. ESTIMATION AND PRICING
3.1 Demand Estimation
Amazon.com does not report its periodic demand
levels. Instead, it reports a sales rank for each product
sold on its site, which ranks the demand for a prod-
uct relative to other products in its category. Thus, the
lower the cardinal value of the sales rank, the higher the
demand for that particular item. Prior research (e.g.,
Chevalier and Goolsbee, 2003; Brynjolfsson, Hu and
Smith, 2003) has associated these sales ranks with de-
mand levels. To do so, the authors assume that the
rank data have a Pareto distribution (i.e., a power law).
They then convert sales ranks into periodic demand
levels by conjecturing the Pareto relationship
$\log[Q] = \alpha + \beta \log[\text{rank}]$, where $Q$ is the (unobserved) demand
for a product, rank is the (observed) sales rank of
the product and $\alpha$, $\beta$ are industry or category-specific
parameters. [Chevalier and Goolsbee (2003) reported
that evidence that the Pareto distribution fits well can
be found using the weekly Wall Street Journal book
sales index, which, unlike other bestseller lists, gives
an index of the actual quantity sold. This index is
constructed by surveying Amazon.com, BN.com, and
several large brick and mortar book chains. For discus-
sions on the use of power law distributions to describe
rank data, see Pareto (1896/1897) and Quandt (1964).]
A number of recent studies (e.g., Ghose, Smith and
Telang, 2006) pertaining to the book industry have
used estimates of $\alpha$ and $\beta$ from this prior literature.
However, these are industry-specific parameters and,
to our knowledge, there are no corresponding estimates
available for software. Furthermore, in summer 2004,
Amazon altered its sales rank system in the follow-
ing way: it eliminated its three-tier system, updating
ranks each hour for most products (rather than merely
for the top products), and it moved to a system that
uses exponential decays to give more weight in the
sales rank to newer purchases. The exact details of the
calculation are proprietary to Amazon (e.g., the half-life
of the decay). In the earlier three-tier system, there
were three distinct ranking schemes on Amazon: one
for the top selling 10,000 products, another for the
products between 10,000 and 100,000, and a third for
ranks above 100,000. Products with sales ranks between
1 and 10,000 were reranked every hour, products
in the range from 10,000 to 100,000 were reranked
once a day and products with ranks greater than
100,000 were updated once a month (Chevalier and
Goolsbee, 2003). The current system involves reranking
all products every hour.
FIG. 3. The distribution of (a) sales ranks and (b) prices across observations in our data set. The histograms are based on a single entry per product. (c) Scatterplot of retail prices and sales ranks for all products in our sample.
Toward a more current and accurate reverse engi-
neering of the ranking system to infer actual periodic
demand, we have designed and conducted an inde-
pendent analysis to convert measured sales ranks into
demand data. Retaining the assumption of a Pareto re-
lationship between demand and sales rank, we combine
a series of purchase experiments with the analysis of a
time series of sales ranks of all the 330 products in our
sample to estimate both $\alpha$ and $\beta$.
Our purchasing experiment proceeded as follows.
Over a two-week period in mid-June 2005, we col-
lected hourly sales rank data for each of the products in
our panel, yielding a time series of 336 observations for
each product. For products not ranked too high, a gen-
eral trend in these time series is an extended downward
drift in the rank value over many hours (i.e., the rank
becomes a progressively larger number), followed by
intermittent spikes which result in a large upward shift
in rank (i.e., the rank became a smaller number sud-
denly). This is illustrated for two candidate products in
Figure 4. We interpret these spikes as reflecting time
periods in which one or more purchases have occurred.
FIG. 4. The variation of sales rank with time for the higher and lower quality versions of one of the software titles we track. The charts on top illustrate sales ranks gathered during successive 8-hour periods, charted for a 6-month window of our data. The charts below graph hourly sales rank data for the specific shorter time window (between the vertical lines on the upper charts) and illustrate the sales rank spikes associated with sales. The flat portions of these lower charts reflect (short) intervals where Amazon did not update the sales ranks that it published on its web site.
This procedure yielded a data set of a certain number
of observations, which associated a weekly demand
level with each average sales rank, for two successive
weeks. Weekly unit sales ranged from 0 to 16. Using
the implied pairs of average weekly demand and average
sales rank, we then estimated the ordinary least
squares (OLS) equation
$$\log[Q+1] = \log[\alpha] + \beta \log[\text{sales rank}], \tag{1}$$
where $Q$ is average weekly demand, and sales rank
is the corresponding average sales rank. [Similar to
Brynjolfsson, Hu and Smith (2003), we used White’s
heteroskedasticity-consistent estimator (see Greene,
2000, page 463) to estimate both parameters.] The
results of these experiments yielded $\alpha = 8.352$ and
$\beta = -0.828$. The following list provides a sense of
what these estimates imply:
Weekly sales of two units correspond to an average sales rank of about 3100.
Weekly sales of 10 units correspond to an average sales rank of about 440.
Weekly sales of 25 units correspond to an average sales rank of about 150.
[The standard errors for $\alpha$ and $\beta$ were 0.042 and 0.032,
respectively, and the estimates were significant at the 1% level.
Further details of this experiment and its results are presented
in Ghose and Sundararajan (2005a).]
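The regression in equation (1) can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the code used in the study: the rank and demand pairs are hypothetical, the function name `fit_rank_to_demand` is ours, and the standard errors use the HC0 form of White's heteroskedasticity-consistent estimator, which may differ from the exact variant the study applied.

```python
import math


def fit_rank_to_demand(ranks, quantities):
    """OLS fit of log(Q + 1) = intercept + slope * log(rank).

    Returns (intercept, slope, se_intercept, se_slope). The standard
    errors are HC0 (White) heteroskedasticity-consistent estimates.
    """
    x = [math.log(r) for r in ranks]
    y = [math.log(q + 1.0) for q in quantities]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Each OLS coefficient is a weighted sum of the y values, so its
    # HC0 variance is the sum of (weight * residual) squared.
    w_slope = [(xi - x_bar) / sxx for xi in x]
    w_int = [1.0 / n - x_bar * ws for ws in w_slope]
    se_slope = math.sqrt(sum((w * e) ** 2 for w, e in zip(w_slope, resid)))
    se_int = math.sqrt(sum((w * e) ** 2 for w, e in zip(w_int, resid)))
    return intercept, slope, se_int, se_slope


# Hypothetical (average sales rank, average weekly units) pairs.
ranks = [150.0, 440.0, 900.0, 3100.0, 6000.0]
weekly_units = [14, 9, 4, 2, 0]

b0, b1, se0, se1 = fit_rank_to_demand(ranks, weekly_units)
print(f"intercept={b0:.3f} (se {se0:.3f}), slope={b1:.3f} (se {se1:.3f})")
```

A negative slope is what one expects here, since a larger rank number corresponds to lower demand.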
An interesting aspect of our approach is that it allows
one to characterize a number of economic measures of
interest (price markups and demand elasticities) based
purely on sales rank ratios and the parameter $\beta$. This
provides a framework for a wider range of future empirical
research in e-commerce and also reduces the extent
to which one’s results are affected by the error in
estimating $\alpha$.
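To see why only the slope of the Pareto mapping matters for ratios, note that if log Q equals an intercept plus a slope times log rank, then the intercept cancels in the ratio of the demands of two products, leaving the rank ratio raised to the power of the slope. The short sketch below illustrates this. The function name `demand_ratio` is ours, and the slope value is the estimate of -0.828 reported in Section 3.1.

```python
def demand_ratio(rank_i, rank_j, slope):
    """Ratio Q_i / Q_j implied by log Q = intercept + slope * log(rank).

    The intercept cancels, so only the slope and the rank ratio are needed.
    """
    return (rank_i / rank_j) ** slope


SLOPE = -0.828  # slope of the rank-to-demand mapping from Section 3.1

# A product at rank 440 versus a product at rank 3100.
ratio = demand_ratio(440.0, 3100.0, SLOPE)
print(round(ratio, 2))  # about 5.04
```

This agrees with the list in Section 3.1, where a rank of about 440 corresponds to roughly five times the weekly sales associated with a rank of about 3100.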
Our sample encompasses multiple products with
observations collected over time, and our data set
therefore has elements of both cross-sectional and
time-series data. Consistent with existing published
and current research, we analyze our observations as
panel data (for a detailed treatment of the econometric
analysis of panel data, see Wooldridge, 2002).
3.2 Estimates of Price Elasticity
Given the price variation across products and across
time, and the measures of quantity in each period that
come from the sales ranks, we can infer the price sen-
sitivity of consumers to on-line product sales. This
would require estimating own- and cross-price elastic-
ities of the products in our sample. The price elasticity
of demand is a measure of the sensitivity of demand
to price changes. Specifically, the own-price elasticity
of demand is calculated as the percentage change in de-
mand caused by a unit percentage change in a product’s
own price, and the cross-price elasticity of demand is
calculated as the percentage change in demand caused
by a unit percentage change in another product’s price.
In our context, the other product could be either a dif-
ferent version (high or low quality) of the same good
or a component of a bundle. Own-price elasticities are
generally negative—the quantity of a product sold de-
creases as its price increases. On the other hand, cross-
price elasticities can be either positive or negative. If
X and Y are substitute goods (e.g., the two versions of
a product), the cross-price elasticity of demand is posi-
tive; that is, the quantity of good X varies directly with
a change in the price of good Y. If X and Y are comple-
mentary goods (e.g., computer hardware and software),
the cross-price elasticity of demand may be negative;
that is, the quantity of good X varies inversely with a
change in the price of good Y. Figure 5 shows a plot of
the changes in prices and sales ranks for the two versions
(of high and low quality) of a specific product for
a specific period of time.
To compute own-price and cross-price elasticities,
we estimate OLS regressions which control for unobserved
heterogeneity across products and across categories
(we use the fixed effects transformation). Based
on these estimates, we subsequently compute the own-
and cross-price elasticities by weighting them with
the appropriate Pareto mapping parameter $\beta$, which
was estimated earlier. In other words, estimating the
equations using log ranks, rather than actual quanti-
ties, yields the correct elasticities, but they are scaled
up by the Pareto mapping parameter. This is simi-
lar to the approach used in prior literature (Chevalier
and Goolsbee, 2003; Ghose, Smith and Telang, 2006).
These regressions have the general form
$$\log(\text{rank}_{it}) = a + \phi \log(p_{it}) + \sum_{j \in S_i} \gamma_j \log(p_{jt}) + \lambda \log(\hat{p}_{it}) + \Omega X + \varepsilon_{it}, \tag{2}$$
where $i$ indexes the product in question (e.g., the high-quality
version of a specific title), $t$ indexes the date,
$S_i$ is the set of products whose prices affect the demand
for product $i$ (e.g., the price of a lower quality version
corresponding to a high-quality version or the price
of a bundle which contains product $i$ as one component),
$\hat{p}_{it}$ is the lowest price posted for the corresponding
non-Amazon marketplace product (the best price
across all conditions by competing sellers on Amazon’s
secondary market) and $X$ is a vector of control
variables. Our control variables include the time since
the product was released (days release), the average
customer rating (customer rating) and the number of
reviewers (number of reviewers) who have reviewed
the product. [We use the fixed effects (within) transformation
(Wooldridge, 2002, Chapter 10.5) to control
for unobserved heterogeneity across products.]
FIG. 5. The variation of sales rank with retail price for two versions of a specific software title in our data set. A line between two points indicates that the (price, sales rank) had changed from one of the points to the other in successive periods.
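The fixed effects (within) transformation mentioned above subtracts each product's own mean from its observations before running OLS, so that any product-specific intercept drops out. The sketch below shows the idea for a single regressor. It is a simplified illustration with hypothetical data and a function name of our choosing (`within_slope`), not the full specification of equation (2).

```python
from collections import defaultdict


def within_slope(product_ids, x, y):
    """Fixed-effects (within) slope for y = a_i + b * x + error.

    Observations are demeaned by product, which removes the
    product-specific intercepts a_i before the OLS slope is computed.
    """
    groups = defaultdict(list)
    for pid, xi, yi in zip(product_ids, x, y):
        groups[pid].append((xi, yi))
    num = 0.0
    den = 0.0
    for obs in groups.values():
        x_bar = sum(o[0] for o in obs) / len(obs)
        y_bar = sum(o[1] for o in obs) / len(obs)
        for xi, yi in obs:
            num += (xi - x_bar) * (yi - y_bar)
            den += (xi - x_bar) ** 2
    return num / den


# Hypothetical panel: two products, same price sensitivity (2.0),
# different product-specific intercepts (1.0 and -2.0).
ids = ["A", "A", "A", "B", "B", "B"]
log_price = [3.0, 3.1, 3.2, 4.0, 4.1, 4.2]
log_rank = [1.0 + 2.0 * p for p in log_price[:3]] + \
           [-2.0 + 2.0 * p for p in log_price[3:]]

fe_slope = within_slope(ids, log_price, log_rank)
# Treating all rows as one group gives the pooled OLS slope for comparison.
pooled_slope = within_slope(["all"] * len(ids), log_price, log_rank)
print(fe_slope, pooled_slope)
```

In this constructed example the pooled slope even has the wrong sign, because the expensive product happens to sell better; the within estimate recovers the common slope.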
We can use the results of this regression to calculate
the relevant own- and cross-price elasticities
$\eta_{ii}$ and $\eta_{ij}$, respectively. Note that $\phi$ is a measure of
how sensitive the product’s own sales rank is to the
product’s own price, while $\gamma_j$ is a measure of how sensitive
the product’s own sales rank is to the price of the
substitute product. Let $(p_i, Q_i)$ represent the price
and quantity of product $i$ and let $(p_j, Q_j)$ represent the
price and quantity of product $j$. One can easily show
that having estimated the parameters $\phi$ and each of
the $\gamma_j$’s from the above regression, the own-price elasticity
of demand for product $i$ (Mas-Colell, Whinston
and Green, 1995) is given by
$$\eta_{ii} = \beta\phi = \frac{\partial Q_i}{\partial p_i} \times \frac{p_i}{Q_i}, \tag{3}$$
where $\beta$ was estimated in Section 3.1 and the cross-price
elasticity of demand for product $i$ with respect to
product $j$ is
$$\eta_{ij} = \beta\gamma_j = \frac{\partial Q_i}{\partial p_j} \times \frac{p_j}{Q_i}. \tag{4}$$
These elasticity estimates describe how demand varies
with price, and form the basis for analyzing the optimality
of a firm’s chosen price-discrimination strategy,
since they enable us, for example, to assess how
demand would vary if the firm altered its price discrimination
by removing a version or discontinuing a
bundle. They also are inputs to the estimation process
for variable costs, as described in the following section.
(In contrast to some other demand estimation models,
we do not have data on other inputs to the marketing
mix, such as advertising. It is conceivable that sales
ranks may also be affected by off-line advertising. In
the absence of data, we are unable to capture this effect
in our model.)
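As a small worked illustration of equations (3) and (4), the sketch below multiplies the coefficients of a log-rank regression by the slope of the rank-to-demand mapping from Section 3.1. The regression coefficients used here are hypothetical placeholders, and the function and variable names are ours. Because the mapping slope is negative, a positive own-price coefficient in the rank regression translates into a negative own-price elasticity of demand.

```python
def elasticities_from_rank_regression(mapping_slope, own_coef, cross_coefs):
    """Convert log-rank regression coefficients into demand elasticities.

    mapping_slope: slope of log Q on log rank (Section 3.1).
    own_coef: coefficient on the product's own log price.
    cross_coefs: dict mapping a related product's name to the
        coefficient on that product's log price.
    """
    own = mapping_slope * own_coef
    cross = {name: mapping_slope * c for name, c in cross_coefs.items()}
    return own, cross


MAPPING_SLOPE = -0.828  # estimate reported in Section 3.1

# Hypothetical coefficients, for illustration only.
own, cross = elasticities_from_rank_regression(
    MAPPING_SLOPE, own_coef=2.0, cross_coefs={"other_version": -0.25}
)
print(own, cross)
```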
3.3 Cost Estimation
Many products in information technology (IT) industries
have an unusual cost structure: high fixed costs
of production, but near-zero or zero variable costs of
production. This cost structure characterizes a class
of technology products which are collectively termed
information goods. Put differently, the cost of producing
the first unit of an information good is very high,
yet the cost of producing each additional unit is virtually
nothing. For instance, Microsoft spends hundreds
of millions of dollars to develop each version of
its Windows operating system. Once this first copy of
the operating system has been developed, however, it
can be replicated costlessly, which leads to widespread
piracy, a factor which can be incorporated into pric-
ing following Sundararajan (2004a), but which we do
not explicitly model in this study. Early examples of in-
formation goods were computer-based information ser-
vices and software; currently, a wide variety of diverse
products—video, music, textbooks, digital art, to name
a few—share this unique cost structure.
Contrary to what is commonly assumed in the IT
economics literature, packaged consumer software is
not an “information good.” It has positive variable costs
associated with its production, packaging and distribu-
tion, and these may represent a substantial fraction of
the price of such software, especially since a number
of titles are priced under fifty dollars. Therefore, to assess
the optimality of a seller’s choice of price discrimination,
we need estimates of the variable costs of the
software titles in our data set. We estimate the variable
costs by inferring the Lerner index for each product
version $i$, defined as the ratio of the markup to the
price, that is, $(p_i - c_i)/p_i$, where $p_i$ is the retail price
and $c_i$ is the variable cost of product $i$. Markup is simply
defined as price minus marginal cost, that is, the
margin on each unit of sale.
To do this estimation reliably using e-commerce
data, we have developed an extension of the model
of Hausman (1994) that provides a way to estimate
markups using just sales rank data and prices. We begin
with the approach of Hausman (1994), who provides
the following equation to estimate the markups for
products sold by multiproduct oligopolists, weighted
by their market share. Since software firms generally
sell multiple products and compete with multiple firms
in the market, it is important to consider this formulation
of the Lerner index. Consider a set of related
products indexed by $i$. The first-order conditions
for oligopoly profit maximization yield the system of
equations
$$s_j + \sum_i \left(\frac{p_i - c_i}{p_i}\right) s_i\, \eta_{ij} = 0, \qquad j = 1, 2, \ldots, n. \tag{5}$$
Here, $s_i$ is the demand share of product $i$ (demand
share is the ratio of revenues from product $i$ to the
total revenues from all related products), $\eta_{ii}$ is product
$i$’s elasticity of demand with respect to its own price
and $\eta_{ij}$ is the cross-price elasticity of demand with respect
to the price of product $j$. We therefore have a
system of linear equations
$$s + N' m = 0, \tag{6}$$
where $s$ is the vector of revenue shares, $N$ is the matrix
of cross-price elasticities $[\eta_{ij}]$ and $m = [m_0, m_1, \ldots, m_n]$, where
$$m_i = \left(\frac{p_i - c_i}{p_i}\right) s_i \tag{7}$$
is the Lerner index of product $i$ multiplied by its product
share. The marginal costs $c_i$ of each individual
product can then be estimated by inverting $N$ to solve
the system of equations (6).
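The inversion step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in the study: the prices, revenue shares and elasticity matrix are hypothetical, the helper names `solve_linear` and `implied_costs` are ours, and `eta[i][j]` is taken to be the elasticity of demand for product i with respect to the price of product j. The code solves equation (6) for the share-weighted Lerner indices, divides by the shares, and backs out the marginal costs.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def implied_costs(prices, shares, eta):
    """Recover Lerner indices and marginal costs from equation (6).

    Equation j reads: shares[j] + sum_i m_i * eta[i][j] = 0.
    """
    n = len(prices)
    # Transpose so that row j holds the coefficients of equation j.
    eta_t = [[eta[i][j] for i in range(n)] for j in range(n)]
    m = solve_linear(eta_t, [-s for s in shares])
    lerner = [mi / si for mi, si in zip(m, shares)]
    costs = [p * (1.0 - l) for p, l in zip(prices, lerner)]
    return lerner, costs


# Hypothetical two-version example.
prices = [100.0, 60.0]
shares = [0.6, 0.4]
eta = [[-2.0, 0.3],
       [0.4, -2.5]]
lerner, costs = implied_costs(prices, shares, eta)
print(lerner, costs)
```

With a single product and an own-price elasticity of -2, the same routine returns a Lerner index of 0.5, which is the familiar inverse-elasticity rule.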
Our extension of this approach allows the estimation
of variable costs using just sales ranks, the parameter $\beta$
which we estimate in Section 3.1, and observed retail
prices. Our system of equations for a set of related
products $0, 1, \ldots, n$ with prices $p_i$ and sales ranks $R_i$
is derived from the above equations as
$$s_j = -\sum_i \left(\frac{p_i - c_i}{p_i}\, s_i\right) \eta_{ij}, \qquad j = 1, 2, \ldots, n, \tag{8}$$
which implies that
$$s_j = -\sum_i \left(\frac{p_i - c_i}{p_i}\, s_i\right) \beta\gamma_j, \tag{9}$$
which in turn implies that
$$s_j = -\sum_i \left(\frac{p_i - c_i}{p_i}\right) \beta \left(\frac{p_i}{R_j}\frac{dR_j}{dp_i}\right) s_i \tag{10}$$
or
$$s_j = -\beta \sum_i \left((p_i - c_i)\,\frac{s_i}{R_j}\frac{dR_j}{dp_i}\right), \tag{11}$$
where
$$\frac{1}{s_i} = 1 + \frac{p_i}{p_j}\left(\frac{R_j}{R_i}\right)^{\beta}. \tag{12}$$
3.4 Optimality of Pricing
In this section, we summarize how we can test the optimality of pricing
strategies by software manufacturers. Consider the case when the
software firm is producing two versions of the product: a high-quality
version and a low-quality version. The total profit from a pair of
versions $i$ and $j$ is thus

$$ \pi = k(p_i - c_i)Q_i + k(p_j - c_j)Q_j. \tag{13} $$

First-order conditions for profit maximization with respect to prices
yield the partial derivatives

$$ \frac{\partial \pi}{\partial p_i} = kQ_i + k(p_i - c_i)\frac{\partial Q_i}{\partial p_i} + k(p_j - c_j)\frac{\partial Q_j}{\partial p_i}, \tag{14} $$

$$ \frac{\partial \pi}{\partial p_j} = k(p_i - c_i)\frac{\partial Q_i}{\partial p_j} + kQ_j + k(p_j - c_j)\frac{\partial Q_j}{\partial p_j}. \tag{15} $$

If the products are priced optimally, these partial derivatives should
be equal to zero. Note that we can evaluate each term on the right-hand
side in the above equations empirically based on our data set and the
intermediate steps described in Sections 3.1, 3.2 and 3.3. Specifically,
from the data we can infer the prices ($p_i$) as well as the quantities
($Q_i$). Based on the cost estimation procedure outlined above, we can
infer the marginal costs ($c_i$) of each product. Finally, from the
demand estimation procedure, we can derive the price elasticities and,
consequently, impute the specific derivatives
$\partial Q_i / \partial p_i$. Thus, based on the signs of these partial
derivatives, we can empirically test whether the firm's products are
optimally priced, underpriced (positive derivative) or overpriced
(negative derivative).
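The sign test built on equations (14) and (15) can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names, the nested-list layout of the demand derivatives and all numbers are hypothetical, and the tolerance used to call a derivative "zero" is an arbitrary assumption.

```python
def profit_gradient(k, p, c, Q, dQ_dp):
    """Evaluate the right-hand sides of equations (14) and (15).

    p, c, Q are pairs (version i, version j); dQ_dp[a][b] is the
    derivative of Q_a with respect to p_b, with index 0 = i and 1 = j.
    """
    margin_i = p[0] - c[0]
    margin_j = p[1] - c[1]
    # Equation (14): derivative of profit with respect to p_i.
    d_pi = k * Q[0] + k * margin_i * dQ_dp[0][0] + k * margin_j * dQ_dp[1][0]
    # Equation (15): derivative of profit with respect to p_j.
    d_pj = k * margin_i * dQ_dp[0][1] + k * Q[1] + k * margin_j * dQ_dp[1][1]
    return d_pi, d_pj


def classify(derivative, tol=1e-9):
    """Map the sign of a profit derivative to a pricing verdict."""
    if derivative > tol:
        return "underpriced"   # raising the price would raise profit
    if derivative < -tol:
        return "overpriced"    # lowering the price would raise profit
    return "optimal"


# Made-up inputs, not estimates from the paper.
d_pi, d_pj = profit_gradient(
    k=1.0,
    p=(10.0, 5.0),
    c=(2.0, 1.0),
    Q=(100.0, 200.0),
    dQ_dp=[[-20.0, 5.0], [4.0, -30.0]],
)
print(classify(d_pi), classify(d_pj))
```

With estimated rather than exact inputs, a derivative will essentially never be exactly zero, so a real test would compare it with its sampling variability rather than with a fixed tolerance.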
3.5 Examples
In this section we provide the parameter estimates for two products in
our sample: the Adobe Photoshop and Premiere bundle, and Microsoft
Office. The estimates are summarized in Tables 3 and 4. Our results are
quite robust: the estimates from bootstrapping with different
repetitions are the same in magnitude and direction as the original
estimates.
TABLE 3
Parameter estimates for the Adobe Photoshop and Premiere bundle and its components^a

Variable                              Estimate (standard error)
Constant                              $-0.05$ (1.58)
$\ln(p_{\text{bundle}})$              $2.22$*** (0.48)
$\ln(p_{\text{component1}})$          $-0.19$*** (0.06)
$\ln(p_{\text{component2}})$          $-0.12$*** (0.03)
$\ln(\hat{p}_{\text{bundle}})$        $-0.24$*** (0.07)
ln(days release)                      $0.18$* (0.1)
$R^2$                                 0.42

^a Standard errors are given in parentheses. The dependent variable is
ln(sales rank) of the bundle. ***, ** and * denote significance at the
0.01, 0.05 and 0.1 levels, respectively.

TABLE 4
Parameter estimates for versions of Microsoft Office^a

Variable                              Estimate (standard error)
Constant                              $7.77$ (7.21)
$\ln(p_{\text{professional}})$        $1.91$*** (0.58)
$\ln(p_{\text{standard}})$            $-2.54$*** (0.97)
$\ln(\hat{p}_{\text{professional}})$  $-0.36$*** (0.11)
ln(days release)                      $0.01$*** (0.003)
$R^2$                                 0.32

^a Standard errors are given in parentheses. The dependent variable is
ln(sales rank) of the high-quality version, which is Microsoft Office
Professional. ***, ** and * denote significance at the 0.01, 0.05 and
0.1 levels, respectively.

Note that, as expected, the sign of the own-price elasticity is positive
while the signs of the cross-price elasticities are negative (recall
that an increase in sales rank implies a decrease in sales). The other
control variable (days release) suggests that, as expected, sales of
products decrease over time. The numbers indicate that the own-price
elasticity is significantly higher than the cross-price elasticities for
the Adobe bundle with respect to each of its two components
($p_{\text{component1}}$ and $p_{\text{component2}}$). Interestingly, we
find that the cross-price elasticity of the high-quality version of
Microsoft Office with respect to the low-quality version
($p_{\text{standard}}$) is actually higher in magnitude than the
own-price elasticity. This highlights that consumer demand for Microsoft
Office Professional is very sensitive to the price of Microsoft Office
Standard. We also see little variation in the extent to which competing
prices from non-Amazon sellers influence demand at Amazon; in both
cases, the cross-price elasticities (from competing sellers) are
significantly lower than the own-price elasticities. [These parameter
estimates are obtained from OLS regression models of the kind mentioned
earlier. Also, because of the structure of this industry, quantity and
price are not jointly determined; thus we do not face the endogeneity
concerns that would normally arise in demand regressions. With regard to
Amazon's own price, because software titles are produced in large
quantities prior to going to market, the quantity of new products Amazon
can sell is predetermined (and usually virtually infinite) at the time
Amazon sets its price. This follows similar approaches taken in the
literature for demand estimation of Internet product sales using the
above assumed functional form for demand (Ghose, Smith and Telang, 2006;
Ghose and Sundararajan, 2005b; Ghose, Huang and Sundararajan, 2005).]
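To make the estimation step concrete, the sketch below fits a log-log OLS regression of ln(sales rank) on log prices and ln(days release), then reruns it on resampled rows as a simple bootstrap robustness check. Everything here is an assumption made for illustration: the data are simulated, the "true" coefficients are arbitrary, and resampling whole rows (a pairs bootstrap) is one common scheme rather than necessarily the one used for Tables 3 and 4. It also ignores any dependence across time in panel data.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y = X @ beta + error (X has a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def bootstrap_se(X, y, n_boot=500, seed=0):
    """Pairs bootstrap: resample rows with replacement and refit each time."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[b] = ols(X[idx], y[idx])
    return draws.std(axis=0, ddof=1)


# Simulated data; the coefficients below are arbitrary, not the paper's estimates.
rng = np.random.default_rng(42)
n = 400
ln_p_own = rng.normal(6.0, 0.2, n)
ln_p_other = rng.normal(5.5, 0.2, n)
ln_days = np.log(rng.integers(1, 700, n))
true_beta = np.array([1.0, 2.0, -0.5, 0.1])

X = np.column_stack([np.ones(n), ln_p_own, ln_p_other, ln_days])
ln_rank = X @ true_beta + rng.normal(0.0, 0.3, n)

beta_hat = ols(X, ln_rank)
se_hat = bootstrap_se(X, ln_rank)
for name, b, se in zip(["const", "ln_p_own", "ln_p_other", "ln_days"], beta_hat, se_hat):
    print(f"{name:12s} {b: .3f} ({se:.3f})")
```

Because the dependent variable is a rank, a positive coefficient on the own price corresponds to lower sales, which matches the sign convention discussed above.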
As an example of a test for the optimality of pricing, we take the case
of Microsoft Office. Using the estimates for own- and cross-price
elasticities derived for both the high- and low-quality versions (see,
e.g., Table 4, which reports the estimates for the high-quality version)
in equations (14) and (15), we find that the estimated derivative of
profits with respect to $p_{\text{professional}}$ is $-75.2$ and with
respect to $p_{\text{standard}}$ is $-334.9$. The actual magnitudes of
these estimates do not lend themselves easily to interpretation.
However, their signs suggest that both versions of Office are
overpriced, since they are priced at a point where the slope of the
profit function is negative. [These estimates are based on maximizing
the profits of the channel as a whole (i.e., the sum of the profits of
the retailer and the software manufacturer). Since we do not separate
the optimization problems of each of these firms, our estimates do not
identify whose actions need to be changed to rectify this mispricing.]
4. CONCLUSION
Our objective in this paper is to outline an approach to
analyzing the optimality of the pricing strategies of
software firms using e-commerce panel data. While we have
shown how the widespread availability of e-commerce data
presents a number of novel empirical research opportunities,
it is important to point out that there are significant
new challenges faced by researchers who aim to analyze
these data in a statistically valid and economically
meaningful way. Although our context is price discrim-
ination in software, the methods we use apply equally
well to e-commerce data about any consumer product
category.
A key statistical challenge in demand estimation of
this kind is that the time structure of e-commerce data
is not well understood. Granted, one can control for
systematic seasonal effects (such as time of the day
or month of the year that the data were collected) and
for major event effects (such as the release of a new
version of Windows), and check one’s data for au-
tocorrelation. However, e-commerce is still at a rela-
tively early stage of its evolution and the fraction of
retail demand fulfilled by e-commerce sites continues
to grow over time. This is driven by an increase in
both the number of consumers who shop on-line and
in the fraction of their purchases made on-line. Each
of these factors may affect the relationship of observed
e-commerce demand and price, which in turn suggests
that e-commerce data may have a complex underlying
time structure.
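As a small illustration of the autocorrelation check mentioned above, one can compute the lag-1 autocorrelation and the Durbin-Watson statistic of regression residuals ordered in time. A Durbin-Watson value near 2 is consistent with no first-order autocorrelation, while values well below 2 point to positive autocorrelation. The sketch uses simulated residuals, and the helper names are hypothetical. It does not address the deeper nonstationarity concerns raised in this section.

```python
import numpy as np


def lag1_autocorrelation(residuals):
    """Sample autocorrelation of the residuals at lag 1."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e * e))


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared first differences over sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


def simulate_ar1(rho, n, seed=0):
    """Simulate e_t = rho * e_{t-1} + shock_t with standard normal shocks."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(size=n)
    e = np.empty(n)
    e[0] = shocks[0]
    for t in range(1, n):
        e[t] = rho * e[t - 1] + shocks[t]
    return e


# Simulated residual series: one uncorrelated, one strongly autocorrelated.
white = simulate_ar1(rho=0.0, n=2000, seed=1)
persistent = simulate_ar1(rho=0.8, n=2000, seed=1)
print(durbin_watson(white), lag1_autocorrelation(white))
print(durbin_watson(persistent), lag1_autocorrelation(persistent))
```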
Furthermore, new theory that models the time struc-
ture of such e-commerce data in a more precise way,
and techniques that identify and account for time vari-
ation, may enable future research to assess whether the
demand process that generates such observations is sta-
tionary and whether the e-commerce market in ques-
tion is, in fact, in equilibrium. This is a challenge not
just for retail panel data, but for other forms of data
generated by consumers who interact with e-commerce
sites, such as bidding/reputation data from on-line auc-
tions. Current research that studies the time structure
of bid paths on eBay (Bapna, Jank and Shmueli, 2004;
Jank and Shmueli, 2006) may be a first step toward
understanding similar data generation processes.
A different challenge relates to the extent to which
one can conclude that inferences from data sets such
as ours are representative of the characteristics of an
industry (in our case, consumer software) in general.
Clearly, this is likely to be less of an issue as a larger
fraction of commerce is conducted electronically. We
have benchmarked our price and demand distributions
with a comparable data set from Buy.com, another
large software retailer. However, the frequency with
which the latter site updates its sales ranks is different
from that of Amazon, and statistical techniques that en-
able one to assess how representative our intraday data
are based on benchmark data with a different granular-
ity would be helpful.
In addition to the demand and cost estimates we
have described in this paper, our research program
also involves developing econometric estimates of how
consumers perceive the relative quality levels of related
products and comparing them to estimates based on
self-reported quality assessments from Amazon.com
and subjective assessments by CNET editors. Since
aggregate customer feedback measures from eBay,
Amazon.com and various other review sites are fre-
quently used in e-commerce research as measures of
some form of quality, statistical techniques that facili-
tate assigning appropriate cardinal values to
e-commerce ratings data generated by consumers and
editors would contribute to the foundations of this line
of research. [The details of this study are available in
Ghose and Sundararajan (2005a).]
To summarize, we have described a sequence of
related studies that use e-commerce panel data to eval-
uate the optimality of different forms of price dis-
crimination in the software industry. By describing our
data, detailing our approach to estimating some impor-
tant parameters and summarizing some of the issues
that researchers face when conducting such statistical
analyses on e-commerce data, we have aimed to stim-
ulate thought about statistical challenges that arise
when conducting research based on these increasingly
widely used data sets. We hope that this summary will
encourage future work that identifies and addresses
these challenges, thereby strengthening the statistical
foundations of this exciting and rapidly evolving new
research area.
ACKNOWLEDGMENTS
We thank seminar participants at New York University
and participants at the First Annual Symposium on
Statistical Challenges in E-Commerce for their
comments, Wolfgang Jank and Galit Shmueli for their
detailed feedback on an earlier draft of this paper,
and Rong Zheng for outstanding research assistance
in data collection. This research was partially sup-
ported by a Summer 2005 grant from the NET Institute
(www.NETinst.org).
REFERENCES
ARON, R., SUNDARARAJAN, A. and VISWANATHAN, S. (2006).
Intelligent agents in electronic markets for information goods:
Customization, preference revelation and pricing. Decision Sup-
port Systems 41 764–786.
BAKOS, Y. and BRYNJOLFSSON, E. (1999). Bundling information
goods: Pricing, profits and efficiency. Management Sci. 45
1613–1630.
BAPNA, R., JANK, W. and SHMUELI, G. (2004). Price forma-
tion and its dynamics in online auctions. Working paper, Smith
School of Business, Univ. Maryland. Available at www.smith.
umd.edu/faculty/wjank/auctionDynamics.pdf.
BRYNJOLFSSON, E., HU, Y. and SMITH, M. (2003). Consumer
surplus in the digital economy: Estimating the value of in-
creased product variety at online booksellers. Management Sci.
49 1580–1596.
CHEVALIER, J. and GOOLSBEE, A. (2003). Measuring prices and
price competition online: Amazon.com and Barnes and No-
ble.com. Quantitative Marketing and Economics 1 203–222.
CHOUDHARY, V., GHOSE, A., MUKHOPADHYAY, T. and
RAJAN, U. (2005). Personalized pricing and quality differen-
tiation. Management Sci. 51 1120–1130.
GHOSE, A., HUANG, K. and SUNDARARAJAN, A. (2005). Ver-
sions, successive generations and pricing strategies in software
markets: Theory and evidence. Working paper, Stern School of
Business, New York Univ.
GHOSE, A., SMITH, M. and TELANG, R. (2006). Internet ex-
changes for used books: An empirical analysis of product can-
nibalization and welfare impact. Information Systems Research
17 3–19.
GHOSE, A. and SUNDARARAJAN, A. (2005a). Software version-
ing and quality degradation? An exploratory study of the evi-
dence. Working Paper CeDER-05-20, Stern School of Business,
New York Univ.
GHOSE, A. and SUNDARARAJAN, A. (2005b). Pricing security
software: Theory and evidence. In Proc. Fourth Workshop on
the Economics of Information Security. Harvard Univ.
GHOSE, A., TELANG, R. and KRISHNAN, R. (2005). Effect of
electronic secondary markets on the supply chain. J. Manage-
ment Information Systems 22(2) 91–120.
GREENE, W. H. (2000). Econometric Analysis, 4th ed. Prentice-
Hall, Upper Saddle River, NJ.
HAUSMAN, J. (1994). Valuation of new goods under perfect and
imperfect competition. Working paper 4970, National Bureau
Economic Research.
JANK, W. and SHMUELI, G. (2006). Functional data analysis in
electronic commerce research. Statist. Sci. 21 155–166.
MAS-COLELL, A., WHINSTON, M. and GREEN, J. (1995).
Microeconomic Theory. Oxford Univ. Press, New York.
PARETO, V. (1896/1897). Cours d’économie politique professé à
l’Université de Lausanne 1, 2. F. Rouge, Lausanne.
QUANDT, R. E. (1964). Statistical discrimination among alterna-
tive hypotheses and some economic regularities. J. Regional Sci.
5 1–23.
SUNDARARAJAN, A. (2004a). Managing digital piracy: Pricing
and protection. Information Systems Research 15 287–308.
SUNDARARAJAN, A. (2004b). Nonlinear pricing of information
goods. Management Sci. 50 1660–1673.
WOOLDRIDGE, J. (2002). Econometric Analysis of Cross Section
and Panel Data. MIT Press, Cambridge, MA.
