Accounting, Organizations and Society, Vol. 20, No. 1, pp. 35-53, 1995
Elsevier Science Ltd
Printed in Great Britain
0361-3682/95 $9.50+0.00
0361-3682(93)E0004-Z
RECONSIDERING THE STATUS OF TESTS OF SIGNIFICANCE:
AN ALTERNATIVE CRITERION OF ADEQUACY*
R. MURRAY LINDSAY
University of Saskatchewan
Abstract
Tests of significance are regarded as essential for the establishment of facts. This paper questions this, and
an alternative approach is presented which centres on replication. Replication provides the crucial test of the
reliability and validity of facts, hypotheses and theories. It leads, when successful, to generalizable and
predictable results. The new criterion is based on obtaining "significant sameness" in related studies, in
contrast to obtaining a significant difference in a single isolated study. This means determining whether the
same model holds over many sets of data, and not what model fits best for one particular data set.
With the growing emphasis on the behavioral and
social sciences and given the great dependence of
these sciences upon statistical methods, one must
take seriously the claim, from respectable quarters,
that the statistical methods currently employed are
fundamentally misconceived (Giere, 1969, p. 372).

A major conclusion of a study of all empirical
budgeting and control articles published
in three accounting journals, Accounting,
Organizations and Society, The Accounting
Review and Journal of Accounting Research,
during the 1970-1987 period was that running
a test of significance (hereinafter ToS) has
become equated with scientific rigour and the
touchstone for establishing knowledge, i.e. a
methodological criterion (Lindsay, 1988). This
conclusion was not unexpected. It confirmed
for management accounting what had already
been observed elsewhere: that the ToS is the
most important method for inference in the
social sciences (Johnstone, 1986; see also
Walster & Cleary, 1970; Acree, 1978; Guttman,
1985; Oakes, 1986; Gigerenzer, 1987). The
procedure determines how we form hypotheses,
conduct experiments and analyze results. It is
treated as the sine qua non of the "scientific
method" (Chow, 1988, p. 105; Gigerenzer et al.,
1989). In short, tests of significance opera-
tionalize what Subotnik (1988, p. 97) calls the
"Principle of Quantitative Unassailability", the
positivistic principle underlying our conception
of knowledge.
The aim of the present analysis is to question
the appropriateness of using the procedure in
management accounting research, particularly
in its methodological capacity of supposedly
granting special status to “statistically significant”
results. Progress is being impeded as a result (cf.
Mintzberg, 1979). To this end, a number of
arguments, derived from the survey of past
research as well as from statistical, methodo-
logical, and philosophical sources, are presented.
Thereafter, a different (but not new) approach
to conducting research is discussed. It is argued
that replication (to establish whether the result
holds under different conditions, leading to
generalization) must become the critical cri-
terion of adequacy. Specifically, rather than
focusing on obtaining significant differences in
isolated studies, accounting researchers must
* The author has benefited from comments received from presenting the paper at the Perspectives in Accounting Conference,
University of Rhode Island, and the University of Calgary. In addition, helpful comments were received from Colin Boyd,
George Murphy, Ken Peasnell, Robert Yaansah and, especially, Andrew Ehrenberg.
concentrate on finding “significant sameness”
from a series of related studies.
THE CASE FOR RETHINKING THE STATUS
OF TESTS OF SIGNIFICANCE
There are a number of points which either
speak directly against the use of the ToS
procedure or clarify its role and practical
difficulties. Together, these not only raise doubt
as to whether ToS are generally useful, but even
lead to the view that their use can positively
impede progress.
The role ToS play and do not play
ToS can play a role in the acquisition of new
facts, but it is limited.¹ Running a ToS serves
only to ask one question: "How easy is it to
explain the difference between the data and
what is expected on the null hypothesis, on the
basis of chance variation alone" (Freedman et
al., 1978, p. 502). Most people agree that the
level of significance p derived from a ToS
represents a measure of the sample evidence
against the null hypothesis, whereby a low p-
value (typically 0.05 or smaller) suggests that
the null is not credible (Barnett, 1982; Cox,
1982; Johnstone, 1986).² To use Fisher's (1959,
p. 75) rationale: either an exceptionally rare
chance has occurred, or the null hypothesis is
not true. In conclusion, a ToS tells us only
whether the observed result is probably real,
i.e. unlikely to be the product of a biased
sample arising from chance or random error,
as if the whole population in question had been
measured.
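
To make this limited role concrete, here is a minimal sketch (fabricated data; it assumes Python with NumPy and SciPy, which are of course not part of the original paper):

```python
# Minimal sketch (fabricated data): a ToS answers only Freedman et al.'s
# question -- how easily can chance variation alone explain the gap
# between the data and what the null hypothesis leads us to expect?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=50.0, scale=10.0, size=30)  # null: no difference
treated = rng.normal(loc=56.0, scale=10.0, size=30)  # true 6-unit shift

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p suggests the null is not credible. It is NOT the probability
# that the null is true, nor the probability that the result replicates.
```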
Now to what a ToS does not do. A ToS does
not indicate:

- the probability of the null hypothesis
(Kempthorne, 1952, p. 12; Fisher, 1959,
p. 43; Freedman et al., 1978, p. 442; Mayo,
1985, p. 503); nor
- the probability of the researcher's alternative
hypothesis of interest (in the sense of 1 -
p) (Oakes, 1986, chapter 1); nor
- which hypothesis is better supported by the
data (Anscombe, 1963; Kyburg, 1974, pp.
58-59; Spielman, 1974, p. 213; Barnard,
1986, p. 500); nor
- whether there is a high probability of the
results repeating (Carver, 1978; Guttman,
1985; Oakes, 1986, pp. 18-19); nor
- whether the result is generalizable (Lindsay
& Ehrenberg, 1993); nor
- whether the result is of scientific importance
(Cox, 1977, p. 61; Freedman et al., 1978,
chapter 29; Lindsay, 1993a).
All of these interpretations are fallacies
underlying the flourishing of ToS in the social
sciences (Finifter, 1972, pp. 155-166; Kyburg,
¹ It needs to be made explicit that the reference to ToS in this paper refers to the widely practised null hypothesis testing
variant. The null represents the hypothesis of no relationship or difference, with any observed difference simply reflecting
chance variation rather than a real difference. Typically, the researcher's aim is to obtain evidence to demonstrate that
the null is inadequate to explain the observations, so that a case can be made to support the researcher's hypothesis of
interest (denoted as the alternative hypothesis).
² This evidential interpretation, first promoted by R. A. Fisher, is denied by the Neyman-Pearson formulation of ToS
(normally referred to as a "hypothesis test"). For Jerzy Neyman, in particular, the result of a ToS is a decision between
two alternative courses of action, to either "accept" or "reject" H0, following rules of behaviour which provide
mathematically determined limits for making errors over the "long run" (i.e. not in the single case). That is, we are
provided with a guide on how to act, not on what to believe. In practice, a hybrid theory of statistical inference is typically
used, whereby ToS follow Neyman formally, but Fisher philosophically (Johnstone, 1986, p. 496; Gigerenzer, 1987,
pp. 18ff.). Bayesians, however, argue that the ToS procedure (on either interpretation) is fundamentally misconceived and
inappropriate for basic scientific research. This brief synopsis is presented because it is Gigerenzer's (1987) thesis that
the virtual neglect, in almost all statistical textbooks, of unresolved extant controversies and alternative theories
of inference, along with the anonymous presentation of statistical ideas, has led to the illusion that the hybrid theory is
an immutable truth of mathematics, providing for the mechanization of inference from data to hypothesis.
1974, p. 58; Cronbach & Snow, 1977, p. 52;
Acree, 1978; Carver, 1978; Guttman, 1985;
Mayo, 1985, p. 503; Oakes, 1986, chapter 1;
Gigerenzer, 1987). Moreover, they have helped
to institutionalize ToS as something we cannot
do without (Gigerenzer et al., 1989, p. 209).
Yet they are all incorrect, and we would do well
to heed Lykken’s reminder of the limited role
played by ToS.
The moral of this story is that the finding of statistical
significance is perhaps the least important attribute of a
good experiment: it is never a sufficient condition for
concluding that a theory has been corroborated, that a
useful empirical fact has been established with reason-
able confidence, or that an experimental report ought
to be published (1968/1970, p. 278; emphasis in original).
Statistical considerations associated with the
use of ToS
Statistical significance is ancillary to para-
meter estimation. Regardless of whether our
interest is in theory construction, treatment
comparison, or determining the practical import-
ance of a result, the parameter estimate (and its
interval) is the primary statistic of interest;
statistical significance is ancillary (Cochran &
Cox, 1950; Kempthorne, 1952, pp. 24, 27; Yates,
1964, p. 320; O'Grady, 1982, p. 775).
This point has an immediate practical implica-
tion. The pernicious effects of statistical power
can result in the null hypothesis being rejected
when, for all practical purposes, it is true
(Lindsay, 1993a). This occurs because the level
of significance p is a function of both the size
of effect obtained and the sample size. There-
fore, without consideration of the size of effect
obtained, there is the possibility of building a
structure of scientific conclusions on a founda-
tion of triviality (see Dunnette, 1966). Harsha
& Knapp (1990, pp. 53-54) provide a quantita-
tive example of this danger as it relates to
auditing research.
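
The point can also be seen in a small simulation (a sketch with invented numbers, not the Harsha & Knapp example; Python with SciPy is assumed):

```python
# Sketch (illustrative numbers only): the same trivial effect -- a 0.02
# standard-deviation shift -- moves from "non-significant" to "highly
# significant" purely because the sample grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
effect = 0.02  # negligible for any practical purpose

for n in (100, 10_000, 1_000_000):
    x = rng.normal(effect, 1.0, size=n)
    t, p = stats.ttest_1samp(x, popmean=0.0)
    print(f"n = {n:>9,}: mean = {x.mean():+.4f}, p = {p:.4f}")
# With n = 1,000,000 the null of "no effect" is rejected decisively,
# yet the estimated effect itself shows the finding is trivial.
```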
The requirement of random sampling.
Samples must be randomly selected in order to
use ToS validly. Determining a precise prob-
ability calculation (p-value) requires that the
test statistic be considered in relation to the
sample space of all possible sample outcomes
based on (hypothetical) "repeated sampling"
from the same population. Without some form
of random sampling, whereby each element in
the population has a known probability for
being selected, the stochastic process associated
with each observation cannot be defined and
therefore the sampling distribution is indeter-
minate, or at least uncertain. In other words,
without introducing a known chance mechanism
into a study’s design, there is no way of defining
what chance means (Freedman et al., 1978).
Random sampling methods provide the founda-
tion for classical statistical inference (i.e. based
on a relative frequency conception of prob-
ability). This is a point on which all master
statisticians agree. For example, consider the
remarks of Alan Stuart, co-author (with Maurice
Kendall) of the authoritative exposition on
classical statistics:
If you feel at times that the statistician, in his insistence
upon random sampling methods, is merely talking
himself into a job, you should chasten yourself with the
reflection that in the absence of random sampling, the
whole apparatus of inference from sample to population
falls to the ground (1984, p. 23).
R. A. Fisher, founder of the modern theory
of experimentation (Kempthorne, 1983), is
equally adamant in stating that samples must be
randomly selected, otherwise "our tests of
significance would be worthless" (1947, p. 435,
emphasis added).
Nonetheless, despite such admonitions, investi-
gators frequently perform ToS on samples which
are not randomly chosen (Aldag & Stearns,
1988, p. 259) and, lacking knowledge to the
contrary, assume that their observations are
“like” a random sample from some population,
however ill-specified. Properly speaking, only a
(subjective) Bayesian framework can incorporate
such an assumption into its logic (Kendall &
Stuart, 1977, p. 226).³
1 Reichenbach’s ( 1949, p. 354) comments capture the author’s sentiments as to the propriety of the notorious principle
of indiiference: “To transform the absence of a reason [for thinking that the sample is biased] into a positive reason
represents a feat of oratorical art that is worthy of an attorney for the defense but is not permissible in the court of logic.”
The virtue of random sampling is not that it
provides unbiased samples on each and every
occasion (this it cannot do; see Urbach, 1985),
but rather that it can deliver an objective
promise for doing so in the "long run" (Stuart,
1984, p. 23). This promise is supported by both
mathematical theory and the abundance of
empirical checks testifying to the fact that it
works (Freedman et al., 1978, p. 415; Mayo,
1987). In contrast, non-random sampling
methods are notorious for producing biased
samples. According to Yule & Kendall (1950):
Experience has shown that the human being is an
extremely poor instrument for the conduct of a random
selection. Wherever there is any scope for personal
choice or judgment on the part of the observer, bias is
almost certain to creep in. Nor is this a quality which
can be removed by conscious effort or training. Nearly
every human being has, as part of his psychological make-
up, a tendency away from true randomness in his choices
(p. 373).
The use of non-random sampling methods
destroys the physical-mathematical isomorphism
which is necessary in providing the procedure
with its objective basis. To compromise this
objectivity is to annul the fundamental purpose
of why ToS are used (cp. Cox, 1977, p. 61; Fisher,
1959, p. 43; Winch & Campbell, 1969/1970,
pp. 205-206; Kruskal, 1979, p. 84; Ryan, 1985,
p. 525).
Lindsay’s (1988) survey of management
accounting research revealed that the random
sampling requirement is not being met in the
majority of studies. Of the 41 studies providing
sufficient information to make a determination,
26 (63%) failed to utilize random sampling
procedures, with many using a sample of
convenience.⁴ In the typical study, respondents
were selected by some senior organization
officer. As one researcher explained: "... the
companies were either reluctant or unable to
provide complete lists for a random selection".
The problem is that the researcher could be
dealing with a small, ad hoc population, if
the officer included everyone meeting the
researcher's criteria, or with a sample where the
researcher has no objective method for deter-
mining its representativeness. In the former
case, any use of ToS is absurd: how can there
be any sampling error when the “sample” and
the population are one and the same? In the
latter case, the lack of random sampling causes
the whole edifice of orthodox statistical inference
to collapse.⁵ In such situations the use of ToS is
either "a waste of time" (Cochran & Cox, 1950,
p. 411) or "an act of intellectual desperation"
(Freedman et al., 1978, p. 501).
The tension between α and statistical power
The combination of the difficulty of obtaining
large sample sizes (Burgstahler & Sundem, 1989,
p. 90) with the small size of effects being
pursued in behavioral research (Cohen, 1977;
Haase et al., 1982; Lindsay, 1988) implies that
the statistical power in many studies will be
unacceptably low. This expectation has been
confirmed empirically in a subset of manage-
ment accounting research (Lindsay, 1993a). To
cope with this reality researchers may have to
increase α in order to offset the otherwise low
power and obtain a balanced test which is
sensitive to both kinds of errors (α and β)
(Lindsay, 1993a).
Conversely, the high number of ToS run per
⁴ Random sampling was virtually non-existent in the survey studies examined. Only four of 30 studies used it. In contrast,
all 11 laboratory studies employed random sampling via the use of randomization.
⁵ Lukka & Kasanen (1993) extend this analysis in arguing that the concept of the "population" to which the sample results
will generalize is typically ill-defined in accounting research, despite the critical importance of doing so. The basic point
is that statistical generalizations are fine when the aim is to estimate a parameter for one particular population. However,
in scientific research, where the aim is to compare many populations, statistical generalization is impossible (Deming,
1975). That is, it is not feasible to sample statistically from some undefined "super-population" of all the past, present,
and future locations, researchers, and methods of investigation (but see Hagood, 1941/1970). Nor, as we shall argue later,
is it efficient, even if it were possible.
study⁶ provides pressure to reduce α in order
for the experimentwise Type I error (i.e. the
probability of obtaining one or more false
rejections of the null hypothesis in a given
study) to approach reasonable levels (Ryan,
1959).⁷ However, this results in the statistical
power of the tests decreasing.
Thus we see the consequences of the inevi-
table tension between a and statistical power.
This tension would seem to cast doubt on
whether ToS can be meaningfully used in a
single study when sample sizes are small.
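
The arithmetic behind this tension can be sketched directly (a simplification assuming independent tests; the 40-odd major ToS per article reported in Lindsay's survey make the larger values of k far from hypothetical):

```python
# With k independent tests each run at level alpha, the experimentwise
# Type I error is 1 - (1 - alpha)**k.
alpha = 0.05
for k in (1, 5, 20, 40):
    experimentwise = 1 - (1 - alpha) ** k
    print(f"k = {k:>2} tests: P(at least one false rejection) = {experimentwise:.2f}")
# k = 20 already gives about 0.64. Controlling this by shrinking alpha
# (e.g. a Bonferroni alpha/k) lowers the power of each individual test.
```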
Methodological considerations
ToS do not eliminate the need for judgement.
It is often thought that running a ToS introduces
objectivity into the research process by mechaniz-
ing the researcher's inference (Acree, 1978;
Gigerenzer, 1987; Gigerenzer et al., 1989). The
slavish worship of the 0.05 level would appear
to be a manifestation of this viewpoint (see
Skipper et al., 1967/1970). However, this notion
of objectivity is patently false.
According to R. A. Fisher, the level of
significance p fails to provide a logical measure
of the evidence against the null hypothesis
unless the sample data exhaust all the relevant
information bearing on the inference (i.e. no
other knowledge exists) (Spielman, 1974; Acree,
1978; Johnstone, 1987). But this condition is
seldom applicable: “for when in scientific
inquiry, even in the most preliminary phases,
are we so devoid of outside evidence” (Acree,
1978, p. 413).
Values of p therefore need to be combined
informally with information about the method
of investigation and its conduct (see next
subsection) and with prior knowledge in arriv-
ing at a conclusion. This is why Fisher regarded
a p-value "as a piece of evidence that the
scientist would somehow weigh, along with all
other relevant pieces, in summarizing his current
opinion about a hypothesis ..." (Cochran, 1967,
p. 1461). This also explains why Fisher stated
that "no scientific worker has a fixed level of
significance at which from year to year, and in
all circumstances, he rejects hypotheses; he
rather gives his mind to each particular case in
the light of his evidence and his ideas” (Fisher,
1959, p. 42).
Similarly, Neyman & Pearson (1928, p. 176),
Fisher's intellectual rivals, concluded that a
significance test is not decisive, but is only one
guide among many. Indeed, in 1933 Neyman &
Pearson (1933/1967) abandoned all prospects
of having a theory of inference and adopted a
behaviouristic orientation in order to preserve
the integrity of their framework (see footnote
2).
In summary, ToS do not enable the mechaniza-
tion of inference. The p-value obtained in a study
should be regarded as just another piece of
evidence to be added informally to the rest; it
does not represent a summary index of all the
evidence. Judgement will always be required in
the practice of science.⁸ Therefore, significance
tests must be relegated to incidental rather
than decisive status; they cannot be relied
’ Lindsay’s survey found that a total of 3082 To.5 were performed in the 43 articles surveyed. Of these, 187 1 focused
directly on the status of the major hypotheses investigated. This translates into an average (median) 43.5 (20) major ToS
per article. For survey studies, the mean (median) number of major ToS per article was 53 (31.5). Note, these statistics
are conservative: researchers clearly undertake more ToS than they report.
⁷ It is important to appreciate just how often chance occurrences can and do arise (see Kruskal, 1979) and why researchers
must protect against probability pyramiding. For example, numerous studies have demonstrated that impressive model R²s
and substantial t-statistics are almost inevitably produced when the researcher uses search procedures (e.g. stepwise
regression) for selecting the "best" subset of explanatory variables, even when the dependent and independent variables
are completely unrelated! (Rencher & Pun, 1980; Freedman, 1983; Lovell, 1983; Flack & Chang, 1987).
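
A minimal illustration of that selection effect (a sketch with pure noise, not one of the cited studies; Python with SciPy is assumed):

```python
# Pure noise: y and all 50 candidate predictors are unrelated. Keeping
# only the "best" predictor nearly guarantees an impressive naive p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 50, 50
y = rng.normal(size=n)
X = rng.normal(size=(n, k))

results = [stats.pearsonr(X[:, j], y) for j in range(k)]
r_best, p_best = max(results, key=lambda rp: abs(rp[0]))
print(f"best of {k} noise predictors: r = {r_best:+.2f}, naive p = {p_best:.3f}")
# The naive p ignores the 49 discarded candidates; judged against 0.05 it
# looks "significant" in most runs -- probability pyramiding in miniature.
```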
⁸ Advocates of classical statistical methods often criticize Bayesians for incorporating a subjective element (i.e. the prior)
into their analyses because it is personal or non-objective, and it is an argument that they lose on each occasion (see
Savage, 1961; Carlson, 1976; Kempthorne, 1979; Urbach, 1985; Witmer & Clayton, 1986).
upon to provide the answer (Acree, 1978,
p. 413).
The validity of the model. The validity of
using a p-value as a measure of the sample
evidence against the null hypothesis depends
entirely on the soundness of the chance model
specified (implicitly or explicitly) by investigators
(Freedman et al., 1978). This means that no
confounding "third variables" exist, that measure-
ment and sampling errors are random, and that
an appropriate probability distribution is used. However,
establishing the validity of a chance model
is beyond human capability (Leontief, 1971;
Kingman, 1989). As a consequence, it is
impossible to prove anything with a ToS
(Barnard, 1948). A ToS can only inform that a
difference appears real; it cannot explain what
caused the difference. Only a series of well-
designed studies can point to such an explanation.
Thus we return to the point that reaching
conclusions in science goes well beyond statisti-
cal inference, the gap being filled by our
(fallible) subject matter knowledge.⁹
Degree of corroboration and statistical
significance. Confirmatory evidence differs
widely with respect to the degree of corrobora-
tion it can provide a hypothesis. A main tenet
of Popper's (1959, p. 269; 1963, p. 112)
philosophy is that a theory is better corrobo-
rated the less expected, or the less probable,
the theory’s prediction is relative to our
“background knowledge” (what we accept as
unproblematic while testing the theory).
Popper’s logic underlying the use of “severe
tests” is related to understanding the growth of
scientific knowledge. For the purposes of this
paper, it is sufficient to state that for any
confirmatory result the logical possibility always
exists that other mutually incompatible hypo-
theses will explain the data equally well
(Maxwell, 1975, pp. 124-125). However, for
practical purposes, this possibility is lowest
when predictions are extremely precise or so
counter-intuitive that no other plausible theory
can currently explain them (Cook, 1983, p. 85).
An example will serve to clarify this point.
Suppose X claims to possess a theory that can
predict London’s rainfall one year in advance. If
X successfully predicts “it will rain in London
in June”, little credence would be given to X’s
theory. However, if X successfully predicts that
"June will have 17 rainy days" we would likely
pay X more heed, although we might attribute
the prediction to an educated guess based, say,
on prior years' rainfall statistics.
But if X successfully predicts the exact 17 dates
of rainfall in June, then most of us would feel
that X is on to something. What other theory
could account for such an accurate prediction?
If one accepts Popper’s viewpoint on this
matter, which almost everyone does, then the
null hypothesis testing procedure must be
considered to be nearly useless because it
represents an extremely weak test. Two points
in support of this view will be examined.
First, a statistically significant result can only
suggest that the hypothesis of random variation
is untenable in accounting for the discrepancy
between the data and what is expected on the
null hypothesis. In addition to the researcher’s
hypothesis, other plausible explanations almost
always remain to provide rival explanations to
account for the difference, and these must be
ruled out one by one.¹⁰ In agronomy, where
⁹ Ravetz (1971, p. 96) reports an example which illustrates that the use of an incorrect model can occur even in well-
established experimental fields. Two chemists, in the course of a straightforward kinetics study, discovered that an important
component of one of their reactions was the glass wall of the flask containing it: the glass was not inert against sodium
hydroxide!
¹⁰ For example, in quasi-experimental investigations Winch & Campbell (1969/1970) list 14 threats (rival hypotheses),
in addition to the hypothesis of chance, that must be ruled out before a researcher can begin to have confidence in a
specified relationship. To cast this discussion in Bayesian terms, these rival explanations would be understood by a Bayesian
as a host of competitor hypotheses that would go in the denominator of Bayes' formula, the consequence of which is that
the degree of confirmation for the investigator's hypothesis will likely be reduced, perhaps substantially.
ToS prospered under the tutelage of R. A. Fisher,
the number of plausible rival explanations is
reduced by the prevalent use of experimental
control and randomization. Yet even here the
single, critical experiment is rare (Yates, 1951).
In general, recognition of the fundamental
differences which exist between behavioural
accounting and agronomy should serve to
prevent us from taking any comfort from the
fact that ToS have been useful in agronomy
(Meehl, 1978).
Most studies in accounting are observational
(Hopwood, 1989, p. 6; Birnberg & Shields, 1989,
pp. 36-37; Waterhouse & Richardson, 1989).¹¹
This design offers little protection against the
many confounding sources of variance which
provide rival explanations.¹² In addition,
our measurements (e.g. supervisory evaluation
style, participation, environmental uncertainty,
job performance) are not nearly as reliable and
valid as their counterparts in agronomy. Further-
more, theories in the behavioural sciences
usually involve hypothetical constructs. Finally,
auxiliary theories, whose accuracy (“truthful-
ness”) is by no means guaranteed, are relied
upon to derive empirical predictions. All of
these considerations can make refutation of the
null hypothesis in the accounting context far
removed from the hypothesis that the researcher
wishes to confirm. This situation differs widely
from agronomy, where theories are usually first-
level observational inductive statements and
where a high degree of control can be exercised
(Meehl, 1986, p. 333). For example, compare
the logical distance between each pair of
statements in the two following situations:
- "The fertilized plots yielded more bushels of
wheat" (evidence) versus "Fertilizing explains
why more wheat grew" (conclusion);
- "The introduction of budgetary participation
resulted in increased job performance"
(evidence) versus "Expectancy theory
explains why employees were motivated
to perform better" (conclusion) (see, for
example, Ronen & Livingstone, 1975).
The corroborative value for the theory afforded
by confirmatory evidence in each of these two
research contexts is categorically different.
The second point in our examination extends
the basic point made above. The null hypothesis
of no effect is (quasi-)always false in the social
sciences (see Nunnally, 1960; Bakan, 1967/
1970; Meehl, 1967/1970, 1986; Lykken, 1968/
1970; Hays, 1973; Deming, 1975). Even if the
treatment variable has no effect, it is highly
unlikely that the treatment and control groups
will be exactly the same with regard to other
variables that may affect the dependent variable.
This is because "everything is correlated with
everything". Randomization is helpful, but it
provides no guarantee in each individual study
(Meehl, 1967/1970, 1986; Seidenfeld, 1982;
Urbach, 1985). Consequently, the model under-
lying the ToS may be misspecified, and any
rejection of the null becomes solely a function
of the statistical power of the test.
The implication of this result has been
developed by Paul Meehl (1967/1970) in his
classic article “Theory Testing in Psychology
and Physics: A Methodological Paradox”. Meehl
shows that as the power of an experiment
approaches unity, a prior probability approach-
ing 0.50 obtains for rejection of a directional
null hypothesis, even if the researcher’s theory
is totally without merit. The corroborative
value of this practice is analogous to predicting
that it will rain on June 2, given the background
information that on average London receives 15
days of rainfall in June. Some test!
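
Meehl's paradox is easy to simulate (a sketch with invented "crud" numbers; it assumes Python with SciPy 1.6+ for the one-sided test):

```python
# Every null is false by a small "crud" amount whose sign is unrelated to
# the theory. With n huge (power ~ 1), a worthless directional prediction
# is nevertheless "confirmed" about half the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials, hits = 100_000, 200, 0

for _ in range(trials):
    crud = rng.choice([-0.05, 0.05])      # direction independent of theory
    x = rng.normal(crud, 1.0, size=n)
    t, p = stats.ttest_1samp(x, 0.0, alternative="greater")  # theory: mean > 0
    hits += (p < 0.05)
print(f"directional null rejected in {hits / trials:.0%} of trials")  # ~50%
```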
In contrast, consider the physical sciences.
There the researcher’s theory provides point
value predictions which become established as
I1 Lindsay’s (1988) survey revealed that 32 of the 43 studies examined were observational.
¹² Statistical control methods can be used providing we are aware of important "third variables" and can validly measure
them (Cook & Campbell, 1979, p. 9; Kinney, 1986). However, the potential benefits obtained from doing so must be
considered in relation to the loss of degrees of freedom that results.
the hypothesis under test (i.e. equivalent to the
null in the social sciences). As Meehl explains,
the effect of increasing a test’s power is to
decrease the acceptance region around the
predicted value and thus to decrease the prior
probability of obtaining a corroborating result.
The failure to reject the hypothesis can there-
fore be interpreted as providing a high degree
of corroboration for the researcher’s hypothesis.
As the comparison highlights, the real problem
with significance tests stems from linking what
the investigator wants to prove with the diffuse
alternative hypothesis (Freedman et al., 1978,
p. 492). In accounting, rejection of the null
has no necessary corroborative value for the
researcher’s hypothesis: a multitude of plausible
hypotheses could also be consistent with the
data, even with very low levels of significance
(Giere, 1975).¹³ This concern is not philo-
sophical pedantry; it is very real for behavioural
accounting research. Moreover, the problem is
augmented because of the bias against publish-
ing “negative” results, i.e. studies failing to attain
statistical significance (Lindsay, 1994). Publish-
ing negative result studies could partially temper
the weaknesses implicit in null hypothesis
testing by revealing spurious results.
In conclusion, a ToS represents an extremely
weak test for any substantive theory in manage-
ment accounting to pass; consequently, in and
of itself it offers no real credentials to knowledge
claims. Thus it is imprudent to quasi-identify
rejection of the null with support for the
researcher’s hypothesis.
Significance tests impede knowledge acquisition
The exploratory nature of management
accounting research. Management accounting
research is largely exploratory in nature (Kaplan,
1986; Kren & Liao, 1988; Burgstahler & Sundem,
1989; Lindsay, 1993a). This is not a critical assess-
ment; theoretical advances take time, even in
the physical sciences (Giere, 1976). As Lakatos
(1970, p. 151) explains, "it may take decades
of theoretical work to arrive at the first novel
facts and still more time to arrive at interestingly
testable versions of the research programmes,
at the stage when refutations are no longer
foreseeable in the light of the programme itself”.
The recognition of our true state of develop-
ment is important because exploratory research
is not the same sort of activity as “testing”.
According to Finch (1981), the purpose of
exploratory work is to determine the extent to
which the data at hand exhibit certain effects of
interest (“stubborn facts”), the population(s) to
which they apply, and the behavioural processes
that might explain them. In this context, data
play a role in the formation of hypotheses, not
in the determination of their credibility. In con-
trast, the “testing” viewpoint presupposes that
the relevant hypotheses underlying reproducible
effects from specified populations are already
known. Confusing the two states results in
researchers using hypothesis tests rather than
exploratory data analysis procedures which may
better reveal important patterns in the data (see,
for example, Tukey, 1977; Mosteller & Tukey,
1977).¹⁴ In addition, researchers are diverted
from asking the right questions,¹⁵ estimating
¹³ Based on their written comments, some researchers would appear to assign vaguely the substantive theory the probability
of 1 - α. As Meehl (1986, p. 333) tells it: "People commit, without being aware of it, the fallacy of thinking 'If the theory
weren't true, then there is only a probability of 0.05 of this big a difference arising'". Such an interpretation is incorrect.
The p-value represents the probability of obtaining a result equal to or greater than the observed result, calculated on
the basis that the null hypothesis is true.
¹⁴ A portion of Cooper & Zeff's (1992) criticism of Kinney's (1986) article entitled "Empirical Accounting Research
Design for Ph.D. Students" follows this line of argument.
¹⁵ For example, in an article entitled "The Religion of Statistics as Practised in Medical Journals", Salsburg (1985) describes
how the practice of focusing on p-values has deterred researchers from emphasizing and studying the effects of treatments,
and from framing their studies in a manner which is conducive to the provision of meaningful information to the practitioner
physician (e.g. treatment response time and patterns for particular subsets of patients) and framer of public policy (e.g.
the cost-benefit analysis of a treatment).
parameters, deriving predictive models (see
Ehrenberg & Bound, 1993) and determining
the specific conditions under which the result
holds.¹⁶
To elaborate on this issue, the conception of
knowledge underlying mainstream accounting
research is positivistic in orientation (Colville,
1981; Chua, 1986; Scapens, 1992). The concep-
tion of knowledge held directs the questions
asked, and positivism, among other things,
has resulted in our emphasis on discovering
empirical associations (Morrison & Henkel,
1970b, p. 309; Harré & Secord, 1972).
The dominant contingency theory paradigm
is a case in point. Rather than simply testing for
the statistical significance of r (where a low p-
value merely means that r is probably not equal
to zero), we need to investigate and report how
the variables are related (i.e. by how much y
varies with x).¹⁷ Further, we need to determine
the conditions under which the relationship
holds in the course of developing causal
explanations (i.e. why and when does x cause
y to vary in a specified way), which is likely best
facilitated via qualitative-type studies. In qualita-
tive research the focus is on understanding
the meanings organizational actors attach to
accounting, the action they take with respect to
it, and the context dependence of their actions
(Colville, 1981). Ezzamel’s (1987, p. 34) com-
ments with respect to contingency theory are
telling:
The overemphasis on empirical investigation [typically
through using standard statistical techniques on bivariate
relationships] at the expense of theoretical development
is largely responsible for the lack of conceptual clarity
in the contingency framework which in turn contri-
buted to the confusing and contradictory nature of the
empirical results.
In concluding, the hypothesis-testing para-
digm is a sensible framework only if you are
dealing with unmistakable effects from known
populations and what is in question is the extent
to which the data confirm one substantive
hypothesis over another (Finch, 1981, p. 142).
We are far from this stage in most managerial
areas and our research practices must reflect
this fact. Indeed, contrary to what is taught in
most statistical textbooks, full-blown confirma-
tion studies are relatively rare in science (Harré
& Secord, 1972, pp. 69-70; Ehrenberg & Bound,
1993).
Analytical versus statistical generalization
Some of the above misconceptions about
research strategy come from the failure to
appreciate the difference between analytical and
statistical generalization. Sampling from as many
organizations as possible in order to increase
the generalizability (or robustness) of the results
is widely considered to be sound research prac-
tice. ToS are considered to be desirable because
they can be used to support inferences when
less than 100% of the population is surveyed.
For example, a researcher might wish to test the
hypothesis that subordinate budgetary partici-
pation increases job performance. Suppose the
researcher randomly samples individuals from
one hundred organizations and determines that
the difference between high and low participa-
tion respondents on job performance is statistic-
ally significant at the 0.01 level. The researcher
then concludes that this low p-value strongly
supports the participation hypothesis.
¹⁶ To take an example from the participatory budgeting literature, Hopwood (1976, p. 79) notes that "many researchers
have been concerned with the broad overall problem of either proving or disproving the general argument [that
participation is associated with improved performance] rather than specifying the conditions for various results. Like many
interesting ideas, the believers feel that they are universally applicable".
¹⁷ Merely reporting the observed correlation, say r = 0.5, does not tell how the variables are related. If, in another study,
an r of 0.5 is obtained, the actual relationship there could either be the same as before, or different (i.e. have the same
slope and intercept as before, or not). Or if a different correlation were obtained, say r = 0.8, the relationship could be
the same or not (just the degree of scatter might differ). In short, there is no way of telling whether the quantitative
relationships agree, and hence generalize, or not (Lindsay & Ehrenberg, 1993).
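
The footnote's point can be made concrete with fabricated data (a sketch only; the noise levels are chosen just to hold r near 0.5):

```python
# Two studies can both report r = 0.5 while the underlying relationships
# y = a + b*x differ, so the correlations alone cannot show agreement.
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)

def study(slope, noise_sd):
    y = 1.0 + slope * x + rng.normal(0.0, noise_sd, size=n)
    r = np.corrcoef(x, y)[0, 1]
    b = np.polyfit(x, y, 1)[0]          # fitted slope
    return r, b

for slope in (1.0, 4.0):
    r, b = study(slope, noise_sd=slope * np.sqrt(3))  # keeps r near 0.5
    print(f"true slope {slope}: r = {r:.2f}, fitted slope = {b:.2f}")
# Matching r's, very different quantitative relationships.
```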
Such a conclusion, which is archetypical in
management accounting research, is confused
because there are really two inferences being
made. The low p-value supports the statistical
generalization or inference that the aggregate
measures of central tendency differ systematic-
ally in the overall population (i.e. going from
the sample to the population). However, a low
p-value has no logical bearing on the scientific
or analytic inference; namely, going from the
population finding to the hypothesis concerning
participation. That is, aggregate data cannot be
used to infer general propositions. Specifically,
the researcher cannot infer that budgetary
participation results in increased job perfor-
mance for every employee or even for every
organization included in the population. Nor
does this study enable the prediction of which
individuals will benefit from participation and
those who will not; nor does it explain the
process of how participation affects people. All
of these considerations are analytical generaliza-
tions, requiring knowledge of how the treatment
(i.e. budgetary participation) interacts with
various organizational conditions and personal
parameters (see Bakan, 1967/1970, pp. 244-246;
Harré & Secord, 1972, pp. 56-58; Oakes, 1986;
Yin, 1989). Consequently, statistical generaliza-
tions provide no useful information for managers
to act upon (Harré & Secord, 1972; Deming,
1975; Scapens, 1992).
The distinction between analytical and statis-
tical generalization is therefore very important.
Our present knowledge about organizations
(e.g. technology, environmental uncertainty,
size, culture, strategy, and type of organiza-
tional subunit: production, marketing and R&D)
indicates that organizations and even sub-
organizational units are not “epistemologically
homogeneous”, i.e. similar to one another on
variables that are causally connected to the
outcome variable. Consequently, replicating
earlier results in different organizations is far
from assured given the likelihood of different
variables and/or intensities of variables con-
founding the relationship of interest (Miller,
1981). This is Umapathy's conclusion after
assessing the conflicting results obtained in
budgeting research and their inapplicability to
practice.
Budgetary controls constitute an organizational sub-
system, and are linked to other subsystems. Hence, a
change in one of the budgeting variables will result in
a predetermined change in performance only if all other
budgeting variables and other organizational subsystems
do not change. Hence, budgetary techniques or pro-
cesses used by an effective organization may not work
satisfactorily in other organizations (1987, p. 171).
Indeed, matters become immensely more
complex and difficult to discern once we intro-
duce the idiosyncrasies of people, and consider
the fact that humans are more properly viewed
as ascribing meanings and interpretations to
situations, rather than as mechanistic objects
which follow a stimulus-response model of
causality (Harré & Secord, 1972; Colville, 1981).
This point is not unique to organizational
research, of course; it applies to all analytic
problems where interest centres on predicting
whether a specific change in a process or
procedure is desirable in contrast to enumera-
tive studies which simply attempt to describe
one particular population (and where a 100%
sample provides the complete answer).¹⁸
Deming provides us with the following example
taken from agriculture.
[T]o test two treatments in an agricultural experiment
by randomizing the treatments in a sample of blocks
drawn from a frame that consisted of all the arable blocks
in the world would give a result that is nigh useless, as
a sample of any practical size would be so widely
dispersed over so many conditions of soil, rainfall, and
climate, that no useful inference could be drawn. The
estimate of the difference B - A would be only an average
over the whole world, and would not pin-point the types
of soil in which B might be distinctly better than A
(Deming, 1975, p. 150).
Thus we see that the random selection of
subjects from an identifiable population for the
¹⁸ See Deming (1950, 1975) for an in-depth discussion on the differences between analytic and enumerative studies.
purpose of achieving representativeness, so
important for (orthodox) statistical inference,
can be counter-productive for learning about
and explaining organizational processes (cf.
Deming, 1950, 1975; Sayer, 1984; Oakes, 1986;
Yin, 1989).¹⁹ Instead, the complexity of organiza-
tions would seem to suggest that they be
examined one at a time (Mintzberg, 1979;
Emmanuel et al., 1990, pp. 377-378), at least
until the research community feels that enough
conditions have been examined to be able
to predict successfully when a result will and
will not hold.
This result should come as a welcome relief
to management accounting researchers. As
Deming (1975, p. 149) explains, "Much of man's
knowledge in science has been learned through
use of judgment-samples in analytic studies.
Rothamsted and other experimental stations are
places of convenience. So is a hospital, or a
clinic, and the group of patients therein that we
may examine.” Similarly, randomization should
not be considered to be indispensable in arriving
at conclusions about causal relations (see
Kyburg, 1961, 1965; Seidenfeld, 1979, 1982;
Cook & Campbell, 1979, pp. 342-344; Darling,
1980; Urbach, 1985).²⁰ Finally, the traditional
experimental precept of varying only one
condition at a time is not only difficult, it is not
always desirable. The aim should be to deter-
mine whether the same result occurs despite
differences in conditions (Lindsay & Ehrenberg,
1993).
ToS are misunderstood and misused in
practice. Significance tests are widely misunder-
stood and misused in practice (Cox, 1981; see
also Morrison & Henkel, 1970a). This situation,
in combination with the significance test's
privileged methodological status, has resulted
in its limited usefulness being swamped by the
detrimental consequences arising from its mis-
use and misinterpretation. For example, the
literature is replete with commentary on how
researchers "search" for statistical significance
in ways which totally destroy any meaningful
use they might have (see, for example, Kinney,
1986; Lindsay, 1994).
Lindsay’s (1988) survey results are particularly
worrisome in this regard. Researchers and/or
editors:

- displayed a prejudice against publishing and/
or submitting negative result studies (see
also Lindsay, 1994);
- published very few replications (only four
replications in 43 studies);
- often failed to be concerned with omitted
variables when interpreting p-values in
observational studies (i.e. the model validity
issue);
- frequently earmarked the importance of a
finding and an hypothesis' degree of con-
firmation by the level of significance p (see
also Lindsay, 1994);
- often failed to calculate and/or assess the
results in terms of the effect size obtained
(see also Lindsay, 1993a, 1994);
- often used nonrandom samples;
- never considered the pyramiding effects on
Type I errors from undertaking multiple ToS;
and
¹⁹ Deming (1993, p. 104) is much more pointed in his conclusion. He writes: "Tests of significance, t-test, chi-square, are
useless as inference - i.e., useless for aid in prediction. Test of hypothesis has been for half a century a bristling obstruction
to understanding statistical inference."
²⁰ The statistical philosopher Teddy Seidenfeld (1982) takes a rather extreme position on this point. The following passage
by him is cited in the aim of motivating discussion on what hitherto in the social sciences has been a relatively unchallenged
desideratum of science. "What has been shown is that randomization, itself, provides no good solution to the problem of
fair samples. [However,] there is an optimistic note to be found in these critical remarks: if randomization is as unfounded
a procedure as I have tried to argue it is, then what makes difficult well designed observational (and retrospective) studies
is not that randomization is unavailable. If the designs that typically include randomization permit interesting conclusions,
e.g., conclusions about causal efficacy as opposed to mere associations, then it cannot be the introduction of randomization
which makes the difference. Perhaps, if we concentrate more on aspects of good designs other than their randomized
components we will learn how better to infer from observational data" (pp. 288-289).
- often ran studies with far too little power
and in all cases failed to calculate and
incorporate actual power levels in deriving
conclusions (see also Lindsay, 1993a).
Taken together, these survey results are
extremely worrisome and cast doubt upon the
reliability and validity of the findings reported
in the literature surveyed.
THE WAY FORWARD
Two of the more strongly held (and closely
related) misconceptions about science are that
theories are invented, worked through and
finalized overnight, and that experiments pro-
duce instant rationality (see Lakatos, 1970;
Ravetz, 1971; Suppe, 1977). Management
accounting researchers are no exception to
this observation. As Burgstahler & Sundem
(1989, p. 86) observe in their review of
behavioural accounting research (BAR), account-
ing researchers operate on the basis of “many
small, one-shot research projects”. Given this
conception of science it is not happenstance
that various commentators have stated that BAR
is "unfocused" (Hofstedt, 1976), "lacking con-
tinuity" (Collins, 1978, p. 331), "shapeless",
“fragmented” (Colville, 1981, p. 120), and
“disjointed” (Caplan, 1989, p. 115).
Contemporary philosophy of science is com-
pletely at odds with this account. The philo-
sophers Chalmers (1982, p. 81), Hacking (1983,
p. 216), Kuhn (1970), Lakatos (1970), Laudan
(1981), Newton-Smith (1981, p. 227), Putnam
(1981) and Suppe (1977) view the process of
science as beginning with a vague idea, known
to be defective and quite literally false, which
undergoes active development within a research
program. On this view a theory can only be, at
best, an approximation to the truth (Bunge,
1967, p. 388; Durbin, 1987, pp. 178-179);
consequently, it becomes pointless to speak
of “verification”, “inductive confirmation”, or
“refutation” (Hacking, 1983, p. 15). What is to
the point, as Suppe (1977, pp. 706-707)
explains, is to undertake empirical observation
in order to develop comprehensive explanatory
theories. It is to this end that science ordinarily
uses data to “test” its current theories.
In the practice of successful science the
emphasis is on obtaining reproducible results
from performing several studies under different
conditions, perhaps with different instruments
and/or methods at different sites and with
different researchers (Nelder, 1986; Scapens,
1990; Lindsay & Ehrenberg, 1993). The aim is
to generalize results via performing "close" and
"differentiated" replications.²¹ This approach
entails a vastly different orientation. As Nelder
(1986, p. 113) explains it: "Looking for repro-
ducible results is a search for significant
sameness [in a series of studies], in contrast to
the emphasis on the significant difference from
a single experiment" (emphasis in original). In
application, this means determining whether the
same model holds for many sets of data
(Ehrenberg & Bound, 1993).²² Once a stable
effect has been obtained and its empirical
domain established, the next step is to develop
theoretical explanations and causal understand-
ing of the observable phenomena (Ehrenberg &
Bound, 1993).²³ These theories can then be
further investigated with the aim of assessing
their adequacy and making improvements, as
well as by extending the theoretical domain into
new problem contexts (Lindsay, 1993b).
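
In code, the contrast might be sketched as follows (hypothetical replications generated from one known process; the model y = 1 + 2x and all numbers are invented for illustration):

```python
# "Significant sameness": instead of asking whether r differs from zero
# in one study, ask whether one model holds across several data sets
# gathered under different conditions.
import numpy as np

rng = np.random.default_rng(5)

def replication(n, x_mean, noise_sd):
    x = rng.normal(x_mean, 1.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, noise_sd, size=n)   # same process
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

print("replication   slope  intercept")
for i, (n, xm, sd) in enumerate([(60, 0, 1), (120, 3, 2), (90, -2, 1.5)], 1):
    b, a = replication(n, xm, sd)
    print(f"{i:>11}   {b:5.2f}  {a:9.2f}")
# Roughly the same fitted coefficients under differing n, x ranges, and
# scatter: the result generalizes. Divergent coefficients would mark the
# limits of the generalization.
```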
This prescription may seem over-simplistic.
But one cannot deny the unfortunate paradox
that exists with respect to the use of replication
²¹ See Lindsay & Ehrenberg (1993) for the logic of designing replicated studies.
²² Note how "testing" a specific model brings us closer to the physical science situation described earlier.
²³ This sequential ordering between obtaining facts and then discovering theory is somewhat artificial. The two are often
intertwined (see Kuhn, 1970).
in the social sciences. Unlike the physical
sciences, the social sciences, being immature,
have little in the form of a pedigree of knowledge
or a body of reliable methods of inquiry from
which to establish new facts. Hence, one would
think that replication would be accorded critical
importance in the social sciences relative to the
natural sciences. However, as Campbell (1969,
pp. 427-428) explains, this is not the case.
Whereas in the natural sciences important
findings get repeated, either deliberately or
in the course of successive experimentation,
hundreds of times, the incidence in the social
sciences is extremely low (Wilson et al., 1973;
Glenn, 1983; Abdolmohammadi et al., 1985;
Nelder, 1986; Burgstahler & Sundem, 1989;
Leftwich, 1990; Hubbard & Armstrong, 1994).
Indeed, in the social sciences repetition is con-
sidered to be “inferior” (Umapathy, 1987, p. 170)
and of “low prestige” (Campbell, 1986, p. 122).
As Lindsay & Ehrenberg (1993) conclude,
successful replication is the bedrock of scientific
knowledge. It tells us whether we have a result
at all. It also tells us the range of conditions
under which the result can be expected to hold
and applied to practical problems. In addition,
varying the conditions between different replica-
tions not only extends the scope of generaliza-
tion and determines its limits, it also tells us
about some of the factors which do, or do not,
affect the result causally.
Replication (broadly interpreted to include
generalizing results over many sets of data) is
therefore a "crucial idea" (Freedman et al.,
1978, p. A-23). Accounting researchers must
come to appreciate that a single study is nearly
meaningless and useless in itself (“Student”,
quoted in Pearson, 1939; Yates, 1951; Popper,
1959; Fisher, 1966; Campbell, 1969; Ravetz,
1971; Kempthorne, 1978; Ehrenberg, 1983;
Guttman, 1985; Nelder, 1986).
Finally, the consequences of failing to operate
within a research program orientation in man-
agement accounting are now beginning to show.
Lindsay & Ehrenberg (1993) analysed the
literature examining the effects of managers'
reliance on accounting performance measures
to evaluate subordinate job performance (one
of the more highly researched areas in mana-
gerial accounting). This review indicated that
the cumulative results add up to little by way
of generalizable findings. The reality is that we
have little advice to offer management (see also
the reviews by Kaplan, 1986; Umapathy, 1987;
and Kren & Liao, 1988). The major design flaw
observed in these studies was that they con-
sisted primarily of extensions into new measures
and conditions and even new constructs before
it had been shown that the earlier results could
be directly and routinely reproduced. Similarly,
Shields & Young (1993), in their review of the
participative budgeting literature, state that no
unifying empirically validated explanation or
framework has been found (despite close to 40
years of research effort). Again, the design flaw
in this group of studies parallels the one
reported above: lack of close replications. In this
regard, the authors note that the participation
studies are difficult to interpret because of their
diversity in terms of theoretical frameworks,
variables, functional relationships and results
(see also Kren & Liao, 1988).
CONCLUSION
A discipline’s criteria of adequacy (method-
ology) are directly relevant to the level of
progress it attains; they make scientific facts
possible (Ravetz, 1971, pp. 155-156; Lindsay,
1993b). It is hoped that this analysis will
convince readers that the ToS procedure should
be stripped of its special methodological status.
It offers no real credentials to knowledge claims.
Instead, the criterion should be whether the
same model holds for many sets of data.
At this point it may be useful to remind
readers of the situation existing in the physical
sciences. Major developments in these sciences
took place long before significance tests were
available, and they continue to be made without
the heavy reliance that characterizes their use
in the behavioural sciences (Morrison & Henkel,
1970b, p. 311). In this regard, we would do well
to consider the timely reminder provided by
Herbert Simon:
In observing what are sometimes called the successful
sciences - physics, chemistry, and so on - I find that
statistical testing in general plays a very much smaller
role there than in the behavioral sciences. As a matter
of fact, the modern statistical theory of testing is a
misinterpretation of the reasons for introducing statistics
in the natural sciences. They were used to talk about
precision of measurements rather than to decide
whether we had the right theory or not (as cited in
Normann, 1973, emphasis supplied).²⁴
Unlike other critical commentaries, this paper
does not call for the outright abandonment
of ToS. For example, the distinguished psychol-
ogist Paul Meehl is highly contemptuous of the
procedure. He writes that
the almost universal reliance on merely refuting the null
hypothesis as the standard method for corroborating
substantive theories in the soft [behavioural] areas is a
terrible mistake, is basically unsound, poor scientific
strategy, and one of the worst things that ever happened
in the history of psychology (1978, p. 817).
Although the author is sympathetic to such
viewpoints, the reality of the situation cannot
be ignored: researchers and even the discipline
itself (in the sense that ToS provide the pretence
of maturity) have a vested interest in the current
state of affairs (Lindsay, 1993b, p. 245); con-
sequently, change can be expected to take a
considerable period of time, if results associated
with the significance test controversy in other
disciplines are at all typical (see, for example,
the anthology by Morrison & Henkel, 1970a).
A different strategy has therefore been adopted. In the short term the aim is to educate users regarding the procedure itself and to improve statistical practice, to argue for the legitimacy of other approaches (e.g. case studies) and, most importantly, to urge the need for performing replications within a research program orientation, whereby the goal is to determine whether the same model holds over many sets of data. Though falling well short of providing any guarantee of success, the focus on replication is consistent with the best accounts we have of the "scientific method" (Tukey, 1989; cited in Ehrenberg & Bound, 1993). In the long run it is hoped that the results from following this alternative will speak for themselves.
BIBLIOGRAPHY
Abdolmohammadi, M. J., Menon, K., Oliver, T. W. & Umapathy, S., The Role of the Doctoral Dissertation in Accounting Research Careers, Issues in Accounting Research (1985) pp. 59-76.
Acree, M. C., Theories of Statistical Inference in Psychological Research: A Historico-Critical Study, Unpublished Ph.D. dissertation, Clark University (1978).
Aldag, R. J. & Stearns, T. M., Issues in Research Methodology, Journal of Management (1988) pp. 253-276.
Anscombe, F. J., Tests of Goodness of Fit, Journal of the Royal Statistical Society B (1963) pp. 81-94.
Bakan, D., The Test of Significance in Psychological Research, chapter 1, On Method, pp. 1-29 (San Francisco: Jossey-Bass, 1967). Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 231-251 (London: Butterworths, 1970).
Barnard, G. A., Discussion on "The Validity of Comparative Experiments" (by F. J. Anscombe), Journal of the Royal Statistical Society A (1948) pp. 201-202.
Barnard, G. A., Discussion on Johnstone (1986), pp. 499-502.
Barnett, V., Comparative Statistical Inference, 2nd Edn (New York: Wiley, 1982).
Birnberg, J. G. & Shields, J. F., Three Decades of Behavioral Accounting Research: A Search for Order, Behavioral Research in Accounting (1989) pp. 23-74.
Bunge, M., Scientific Research, Vol. 1 (New York: Springer, 1967).
Burgstahler, D. & Sundem, G. L., The Evolution of Behavioral Accounting Research in the United States, 1968-1987, Behavioral Research in Accounting (1989) pp. 75-108.
Campbell, D. T., Reforms as Experiments, American Psychologist (1969) pp. 409-429.
Campbell, D. T., Science's Social System of Validity-enhancing Collective Belief Change and the Problems of the Social Sciences, in Fiske, D. W. & Shweder, R. A. (eds), Metatheory in Social Science: Pluralisms and Subjectivities, pp. 108-135 (Chicago: University of Chicago Press, 1986).
Caplan, E. H., Behavioral Accounting - A Personal View, Behavioral Research in Accounting (1989) pp. 109-123.
Carlson, R., Discussion: The Logic of Tests of Significance, Philosophy of Science (1976) pp. 116-128.
Carver, R. P., The Case Against Statistical Significance Testing, Harvard Educational Review (1978) pp. 378-399.
Chalmers, A., What is this Thing Called Science?, 2nd Edn (Milton Keynes: Open University Press, 1982).
Chow, S. L., Significance Test or Effect Size?, Psychological Bulletin (1988) pp. 105-110.
Chua, W. F., Radical Developments in Accounting Thought, Accounting Review (October 1986) pp. 601-632.
Cochran, W. G., Footnote to An Appreciation of R. A. Fisher, Science (1967) pp. 1460-1462.
Cochran, W. G. & Cox, G. M., Experimental Designs (New York: Wiley, 1950).
Cohen, J., Statistical Power Analysis for the Behavioral Sciences, Revised Edn (New York: Academic Press, 1977).
Collins, F., The Interaction of Budget Characteristics and Personality Variables with Budgetary Response Attitudes, Accounting Review (April 1978) pp. 324-335.
Colville, I., Reconstructing "Behavioral Accounting", Accounting, Organizations and Society (1981) pp. 119-132.
Cook, T. D., Quasi-Experimentation: Its Ontology, Epistemology, and Methodology, in Morgan, G. (ed.), Beyond Method: Strategies for Social Research, pp. 57-73 (Beverly Hills, CA: Sage, 1983).
Cook, T. D. & Campbell, D. T., Quasi-Experimentation: Design & Analysis Issues for Field Settings (Chicago: Rand McNally, 1979).
Cooper, W. W. & Zeff, S. A., Kinney's Design for Accounting Research, Critical Perspectives on Accounting (1992) pp. 87-92.
Cox, D. R., The Role of Significance Tests, Scandinavian Journal of Statistics (1977) pp. 49-70.
Cox, D. R., Theory and General Principle in Statistics (Presidential Address), Journal of the Royal Statistical Society A (1981) pp. 289-297.
Cox, D. R., Statistical Significance Tests, British Journal of Clinical Pharmacology (1982) pp. 325-331.
Cronbach, L. J. & Snow, R. E., Aptitudes and Instructional Methods (New York: Irvington, 1977).
Deming, W. E., Some Theory of Sampling (New York: Wiley, 1950).
Deming, W. E., On Probability As a Basis for Action, American Statistician (1975) pp. 146-152.
Deming, W. E., The New Economics (Cambridge, MA: MIT Press, 1993).
Dorling, J., A Personalist's Analysis of Statistical Hypotheses and Some Other Rejoinders to Giere's Anti-Positivist Metaphysics, in Cohen, L. J. & Hesse, M. (eds), Applications of Inductive Logic (Oxford: Clarendon Press, 1980).
Dunnette, M. D., Fads, Fashions, and Folderol in Psychology, American Psychologist (1966) pp. 343-352.
Durbin, J., Statistics and Statistical Science: The Address of the President (with Proceedings), Journal of the Royal Statistical Society A (1987) pp. 177-192.
Ehrenberg, A. S. C., We Must Preach What Is Practised, American Statistician (1983) pp. 248-250.
Ehrenberg, A. S. C. & Bound, J. A., Predictability and Prediction (with Discussion), Journal of the Royal Statistical Society A (1993) pp. 167-206.
Emmanuel, C. R., Otley, D. T. & Merchant, K., Accounting for Management Control, 2nd Edn (London: Chapman and Hall, 1990).
Ezzamel, M., Organisation Design: An Overview, in Ezzamel, M. & Hart, H. (eds), Advanced Management Accounting: An Organisational Emphasis, pp. 19-39 (London: Cassell Educational, 1987).
Finch, P. D., On the Role of Description in Statistical Enquiry, British Journal for the Philosophy of Science (1981) pp. 127-144.
Finifter, B. M., The Generation of Confidence: Evaluating Research Findings by Random Subsample Replication, in Costner, H. L. (ed.), Sociological Methodology, pp. 112-175 (San Francisco: Jossey-Bass, 1972).
Fisher, R. A., Development of the Theory of Experimental Design, Proceedings of the International Statistical Conferences (1947) pp. 434-439.
Fisher, R. A., Statistical Methods and Scientific Inference, 2nd Edn (Edinburgh: Oliver and Boyd, 1959).
Fisher, R. A., The Design of Experiments, 8th Edn (Edinburgh: Oliver and Boyd, 1966).
Flack, V. F. & Chang, P. C., Frequency of Selecting Noise Variables in Subset Regression Analysis: A Simulation Study, American Statistician (1987) pp. 84-86.
Freedman, D., A Note on Screening Regression Equations, American Statistician (1983) pp. 152-155.
Freedman, D., Pisani, R. & Purves, R., Statistics (New York: Norton, 1978).
Giere, R. N., Bayesian Statistics and Biased Procedures, Synthese (1969) pp. 371-387.
Giere, R. N., Popper and the Non-Bayesian Tradition: Comments on Richard Jeffrey, Synthese (1975) pp. 119-132.
Giere, R. N., Empirical Probability, Objective Statistical Methods, and Scientific Inquiry, in Harper, W. L. & Hooker, C. A. (eds), Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, Vol. II, pp. 63-101 (Dordrecht: Reidel, 1976).
Gigerenzer, G., Probabilistic Thinking and the Fight Against Subjectivity, in Kruger, L., Gigerenzer, G. & Morgan, M. S. (eds), The Probabilistic Revolution: Ideas in the Sciences, Vol. 2, pp. 11-34 (Cambridge, MA: MIT Press, 1987).
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Kruger, L., The Empire of Chance: How Probability Changed Science and Everyday Life (Cambridge: Cambridge University Press, 1989).
Glenn, N. D., Replications, Significance Tests and Confidence in Findings in Survey Research, Public Opinion Quarterly (1983) pp. 261-269.
Guttman, L., The Illogic of Statistical Inference for Cumulative Science, Applied Stochastic Models and Data Analysis (1985) pp. 3-10.
Haase, R. F., Waechter, D. M. & Solomon, G. S., How Significant is a Significant Difference? Average Effect Size of Research in Counselling Psychology, Journal of Counselling Psychology (1982) pp. 58-65.
Hacking, I., Representing and Intervening (Cambridge: Cambridge University Press, 1983).
Hagood, M. J., The Notion of a Hypothetical Universe, in Morrison, D. E. & Henkel, R. E. (eds), The Significance Test Controversy - A Reader, pp. 65-78 (London: Butterworths, 1970a). First published in Hagood, M. J., Statistics for Sociologists (New York: Reynal & Hitchcock, 1941).
Harré, R. & Secord, P. F., The Explanation of Social Behaviour (Oxford: Blackwell, 1972).
Harsha, P. D. & Knapp, M. C., The Use of Within- and Between-subjects Experimental Designs in Behavioral Accounting Research: A Methodological Note, Behavioral Research in Accounting (1990) pp. 50-62.
Hays, W. L., Statistics, 2nd Edn (New York: Holt, Rinehart & Winston, 1973).
Hofstedt, T. R., Behavioral Accounting Research: Pathologies, Paradigms and Prescriptions, Accounting, Organizations and Society (1976) pp. 43-58.
Hopwood, A., Accounting and Human Behaviour (Englewood Cliffs, NJ: Prentice-Hall, 1976).
Hopwood, A., Behavioral Accounting in Retrospect and Prospect, Behavioral Research in Accounting (1989) pp. 1-22.
Hubbard, R. & Armstrong, J. S., Replications and Extensions in Marketing: Rarely Published but Quite Contrary, International Journal of Research in Marketing (1994) pp. 233-248.
Johnstone, D. J., Tests of Significance in Theory and Practice, Statistician (1986) pp. 491-498.
Johnstone, D. J., Tests of Significance Following R. A. Fisher, British Journal for the Philosophy of Science (December 1987) pp. 481-499.
Kaplan, R. S., The Role for Empirical Research in Management Accounting, Accounting, Organizations and Society (1986) pp. 429-452.
Kempthorne, O., The Design and Analysis of Experiments (New York: Wiley, 1952).
Kempthorne, O., Logical, Epistemological and Statistical Aspects of Nature-Nurture Data Interpretation, Biometrics (1978) pp. 1-23.
Kempthorne, O., Sampling Inference, Experimental Inference, and Observation Inference, Sankhya (1979) pp. 115-145.
Kempthorne, O., A Review of R. A. Fisher: An Appreciation (Fienberg, S. E. & Hinkley, D. V., eds), Journal of the American Statistical Association (1983) pp. 482-490.
Kendall, M. G. & Stuart, A., The Advanced Theory of Statistics, Vol. 1, 4th Edn (London: Charles Griffin, 1977).
Kingman, J., Statistical Responsibility, Journal of the Royal Statistical Society A (1989) pp. 277-285.
Kinney, W. R., Jr, Empirical Accounting Research Design for Ph.D. Students, Accounting Review (April 1986) pp. 338-350.
Kren, L. & Liao, W. M., The Role of Accounting Information in the Control of Organizations: A Review of the Evidence, Journal of Accounting Literature (1988) pp. 280-309.
Kruskal, W., Comment on "Field Experimentation in Weather Modification" by R. E. Braham, Journal of the American Statistical Association (1979) pp. 84-86.
Kuhn, T., The Structure of Scientific Revolutions, 2nd Edn (Chicago: University of Chicago Press, 1970).
Kyburg, H. E., Jr, Probability and the Logic of Rational Belief (Middletown, CT: Wesleyan University Press, 1961).
Kyburg, H. E., Jr, Philosophy of Science: A Formal Approach (New York: Macmillan, 1968).
Kyburg, H. E., Jr, The Logical Foundations of Statistical Inference (Dordrecht: Reidel, 1974).
Lakatos, I., Falsification and the Methodology of Scientific Research Programmes, in Lakatos, I. & Musgrave, A. (eds), Criticism and the Growth of Knowledge, pp. 91-196 (Cambridge: Cambridge University Press, 1970).
Laudan, L., A Problem-Solving Approach to Scientific Progress, in Hacking, I. (ed.), Scientific Revolutions, pp. 144-155 (Oxford: Oxford University Press, 1981).
Leftwich, R., Aggregation of Test Statistics: Statistics vs. Economics, Journal of Accounting and Economics (1990) pp. 37-44.
Leontief, W., Theoretical Assumptions and Nonobserved Facts (Presidential Address), American Economic Review (1971) pp. 1-7.
Lindsay, R. M., The Use of Tests of Significance in Accounting Research: A Methodological, Philosophical and Empirical Enquiry, Unpublished Ph.D. dissertation, University of Lancaster (October 1988).
Lindsay, R. M., Incorporating Power Into Our Statistical Tests: A Methodological and Empirical Inquiry, Behavioral Research in Accounting (1993a) pp. 211-236.
Lindsay, R. M., Achieving Scientific Knowledge: The Rationality of Scientific Method, in Lyas, C. A., Mumford, M., Peasnell, K. V. & Stamp, P. C. E. (eds), Justifying Accounting Standards: Some Philosophical Dimensions, pp. 221-254 (London: Routledge, 1993b).
Lindsay, R. M., Publication System Biases Associated with the Statistical Testing Paradigm, Contemporary Accounting Research (Summer 1994).
Lindsay, R. M. & Ehrenberg, A. S. C., The Design of Replicated Studies, American Statistician (August 1993) pp. 217-228.
Lovell, M. C., Data Mining, Review of Economics and Statistics (1983) pp. 1-12.
Lukka, K. & Kasanen, E., Generalisability in Accounting Research, Presented at the European Accounting Association, Turku, Finland (April 1993).
Lykken, D. T., Statistical Significance in Psychological Research, Psychological Bulletin (1968) pp. 151-159. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 267-279 (London: Butterworths, 1970).
Maxwell, G., Induction and Empiricism: A Bayesian-Frequentist Alternative, in Maxwell, G. & Anderson, R. M. (eds), Induction, Probability and Confirmation, pp. 106-165 (Dordrecht: Reidel, 1975).
Mayo, D. G., Behaviouristic, Evidentialist, and Learning Models of Statistical Testing, Philosophy of Science (1985) pp. 493-516.
Mayo, O., Comment on "Randomization and the Design of Experiments" by P. Urbach, Philosophy of Science (1987) pp. 592-596.
Meehl, P. E., Theory Testing in Psychology and Physics: A Methodological Paradox, Philosophy of Science (1967) pp. 103-115. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 252-266 (London: Butterworths, 1970).
Meehl, P. E., Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology, Journal of Consulting and Clinical Psychology (1978) pp. 806-834.
Meehl, P. E., What Social Scientists Don't Understand, in Fiske, D. W. & Shweder, R. A. (eds), Metatheory in Social Science: Pluralisms and Subjectivities, pp. 315-338 (Chicago: University of Chicago Press, 1986).
Miller, D., Toward a New Contingency Approach: The Search for Organizational Gestalts, Journal of Management Studies (1981) pp. 1-26.
Mintzberg, H., An Emerging Strategy of "Direct" Research, Administrative Science Quarterly (December 1979) pp. 582-589.
Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader (London: Butterworths, 1970a).
Morrison, D. E. & Henkel, R. E., Significance Tests in Behavioral Research and Beyond, in Morrison, D. E. & Henkel, R. E. (eds), The Significance Test Controversy - A Reader, pp. 305-311 (London: Butterworths, 1970b).
Mosteller, F. & Tukey, J. W., Data Analysis and Regression: A Second Course in Statistics (Reading, MA: Addison-Wesley, 1977).
Nelder, J. A., Statistics, Science and Technology (Presidential Address), Journal of the Royal Statistical Society A (1986) pp. 109-121.
Newton-Smith, W. H., The Rationality of Science (London: Routledge & Kegan Paul, 1981).
Neyman, J. & Pearson, E. S., On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference (Part I), Biometrika A (1928) pp. 175-240.
Neyman, J. & Pearson, E. S., On the Problem of the Most Efficient Tests of Statistical Hypotheses, Philosophical Transactions of the Royal Society A (1933) pp. 289-337. Reprinted in Neyman, J. & Pearson, E. S., Joint Statistical Papers (Cambridge: Cambridge University Press, 1967).
Normann, R., A Personal Quest for Methodology (Scandinavian Institutes for Administrative Research, 1973).
Nunnally, J., The Place of Statistics in Psychology, Educational and Psychological Measurement (1960) pp. 641-650.
Oakes, M., Statistical Inference: A Commentary for the Social and Behavioral Sciences (New York: Wiley, 1986).
O'Grady, K. E., Measures of Explained Variance: Cautions and Limitations, Psychological Bulletin (1982) pp. 766-777.
Pearson, E. S., "Student" as Statistician, Biometrika (1939) pp. 210-250.
Popper, K. R., The Logic of Scientific Discovery (London: Hutchinson, 1959).
Popper, K. R., Conjectures and Refutations (London: Routledge & Kegan Paul, 1963).
Putnam, H., The "Corroboration" of Theories, in Hacking, I. (ed.), Scientific Revolutions, pp. 60-79 (Oxford: Oxford University Press, 1981).
Ravetz, J. R., Scientific Knowledge and Its Social Problems (New York: Oxford University Press, 1971).
Reichenbach, H., The Theory of Probability, 2nd Edn (Berkeley: University of California Press, 1949).
Rencher, A. C. & Pun, F. C., Inflation of R2 in Best Subset Regression, Technometrics (1980) pp. 49-53.
Ronen, J. & Livingstone, J. L., Motivational Impact of Budgets, Accounting Review (October 1975) pp. 671-685.
Ryan, T. A., Multiple Comparisons in Psychological Research, Psychological Bulletin (1959) pp. 26-47.
Ryan, T. A., Ensemble-adjusted p Values: How Are They to be Weighted?, Psychological Bulletin (1985) pp. 521-526.
Salsburg, D. S., The Religion of Statistics as Practised in Medical Journals, American Statistician (August 1985) pp. 220-223.
Savage, L. J., The Foundations of Statistics Reconsidered, in Neyman, J. (ed.), Proceedings of the Fourth Berkeley Symposium on Mathematics and Probability, Vol. 1, pp. 575-586 (Berkeley: University of California Press, 1961).
Sayer, A., Method in Social Science: A Realist Approach (London: Hutchinson, 1984).
Scapens, R. W., Researching Management Accounting Practice: The Role of Case Study Methods, British Accounting Review (1990) pp. 259-281.
Scapens, R. W., The Role of Case Study Methods in Management Accounting Research: A Personal Reflection and Reply, British Accounting Review (1992) pp. 369-383.
Seidenfeld, T., Philosophical Problems of Statistical Inference: Learning from R. A. Fisher (Dordrecht: Reidel, 1979).
Seidenfeld, T., Levi on the Dogma of Randomization in Experiments, in Bogdan, R. J. (ed.), Profiles: Henry E. Kyburg, Jr & Isaac Levi (Dordrecht: Reidel, 1982).
Shields, M. D. & Young, S. M., Antecedents and Consequents of Participative Budgeting: Evidence on the Effects of Asymmetrical Information, Journal of Management Accounting Research (Fall 1993) pp. 265-280.
Skipper, J. K., Jr, Guenther, A. L. & Nass, G., The Sacredness of .05: A Note Concerning the Uses of Statistical Levels of Significance in Social Science, American Sociologist (February 1967) pp. 16-18. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 155-160 (London: Butterworths, 1970a).
Spielman, S., The Logic of Tests of Significance, Philosophy of Science (1974) pp. 211-226.
Stuart, A., The Ideas of Sampling (London: Charles Griffin, 1984).
Subotnik, D., Wisdom or Widgets: Whither the School of "Business"?, Abacus (1988) pp. 95-106.
Suppe, F., Critical Introduction and Afterword, in Suppe, F. (ed.), The Structure of Scientific Theories, 2nd Edn, pp. 3-241, 617-730 (Urbana: University of Illinois Press, 1977).
Tukey, J. W., Exploratory Data Analysis (Reading, MA: Addison-Wesley, 1977).
Tukey, J. W., The Philosophy of Multiple Comparisons, Miller Memorial Lecture, Stanford University (1989).
Umapathy, S., Unfavorable Variances in Budgeting: Analysis and Recommendations, in Ferris, K. R. & Livingstone, J. L. (eds), Management Planning and Control, Revised Edn, pp. 163-176 (Beavercreek, OH: Century VII, 1987).
Urbach, P., Randomization and the Design of Experiments, Philosophy of Science (1985) pp. 256-273.
Walster, G. W. & Cleary, T. A., A Proposal for a New Editorial Policy in the Social Sciences, American Statistician (April 1970) pp. 16-19.
Waterhouse, J. & Richardson, A., Behavioral Research Implications of the New Management Accounting Environment, Working paper, University of Alberta (1989).
Wilson, F. D., Smoke, G. L. & Martin, J. D., The Replication Problem in Sociology: A Report and a Suggestion, Sociological Inquiry (1973) pp. 141-149.
Winch, R. F. & Campbell, D. T., Proof? No. Evidence? Yes. The Significance of Tests of Significance, American Sociologist (1969) pp. 140-143. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 199-206 (London: Butterworths, 1970).
Witmer, J. A. & Clayton, M. K., On Objectivity and Subjectivity in Statistical Inference: A Response to Mayo, Synthese (1986) pp. 369-379.
Yates, F., The Influence of Statistical Methods for Research Workers on the Development of the Science of Statistics, Journal of the American Statistical Association (1951) pp. 19-34.
Yates, F., Sir Ronald Fisher and the Design of Experiments, Biometrics (1964) pp. 307-321.
Yin, R. K., Case Study Research, Revised Edn (Newbury Park, CA: Sage, 1989).
Yule, G. U. & Kendall, M. G., An Introduction to the Theory of Statistics, 14th Edn (London: Charles Griffin, 1950).
Tests of significance are regarded as essential for the establishment of facts. This paper questions this, and
an alternative approach is presented which centres on replication. Replication provides the crucial test of the
reliability and validity of facts, hypotheses and theories. It leads, when successful, to generaiizable and
predictable results. The new criterion is based on obtaining “significant sameness” in related studies, in
contrast to obtaining a signilicant difference in a single isolated study. This means determining whether the
same model holds over many sets of data, and not what model fits best for one particular data set.
Pergamon
Accounting Ohganizatfons and Society. Vol. 20, No. 1, pp. 35-53, 1995
Ekvier Scicncr Ltd
Printed in Great Britain
0361-36B2/95 19.50+0.00
0361~3682(93)E0004-Z
RECONSIDERING THE STATUS OF TESTS OF SIGNIFICANCE:
AN ALTERNATNE CRITERION OF ADEQUACY*
R. MURRAY LINDSAY
University of Saskatchewan
Abstract
Tests of significance are regarded as essential for the establishment of facts. This paper questions this, and
an alternative approach is presented which centres on replication. Replication provides the crucial test of the
reliability and validity of facts, hypotheses and theories. It leads, when successful, to generaiizable and
predictable results. The new criterion is based on obtaining “significant sameness” in related studies, in
contrast to obtaining a signilicant difference in a single isolated study. This means determining whether the
same model holds over many sets of data, and not what model fits best for one particular data set.
A major conclusion of a study of all empirical
With the growing emphasis on the behavioral and
social sciences and given the great dependence of
these sciences upon statistical methods, one must
take seriously the claim, from respectable quarters,
that the statistical methods currently employed are
fundamentally misconceived (Giere, 1969, p. 372).
budgeting and control articles published
in three accounting journals, Accounting,
Organizations and Society, The Accounting
Review and J ournal of Accounting Research,
during the 197Ck1987 period was that running
a test of significance (hereinafter ToS) has
become equated with scientific rigour and the
touchstone for establishing knowledge, i.e. a
methodological criterion (Lindsay, 1988). This
conclusion was not unexpected. It confirmed
for management accounting what had already
been observed elsewhere: that the To!3 is the
most important method for inference in the
social sciences (‘Johnstone, 1986; see also
Walster & Cleat-y, 1970; Acree, 1978; Guttman,
1985; Oakes, 1986; Gigerenzer, 1987). The
procedure determines how we form hypotheses,
conduct experiments and analyze results. It is
treated as the sine qua non of the “scientific
method’ (Chow, 1988, p. 105; Gigerenzer et al.,
1989). In short, tests of signilicance opera-
tionalize what Subotnik (1988, p. 97) calls the
“Principle of Quantitative Unassailability”, the
positivistic principle underlying our conception
of knowledge.
The aim of the present analysis is to question
the appropriateness of using the procedure in
management accounting research, particularly
in its methodological capacity of supposedly
granting special status to “statistically significant”
results. Progress is being impeded as a result (cf.
Mintzberg, 1979). To this end, a number of
arguments, derived from the survey of past
research as well as from statistical, methodo-
logical, and philosophical sources, are presented.
Thereafter, a different (but not new) approach
to conducting‘research is discussed. It is argued
that replication (to establish whether the result
holds under different conditions, leading to
generalization) must become the critical cri-
terion of adequacy. Specifically, rather than
focusing on obtaining significant differences in
isolated studies, accounting researchers must
* The author has benefited fromcomments received frompresenting the paper at the Perspectives in Accounting Conference,
University of Rhode Island, and the University of Calgary. In addition, helpful comments were received from Cohn Boyd,
George Murphy, Ken Peasnell, Robert Yaansah and, especially, Andrew Ehrenberg.
35
36 R. MURRAY LINDSAY
concentrate on finding “significant sameness”
from a series of rel ated studies.
THE CASE FOR RETHINKING THE STATUS
OF TESTS OF SIGNIFICANCE
There are a number of points which either
speak directly against the use of the ToS
procedure or clarify its role and practical
difhculties. Together, these not only raise doubt
as to whether ToS are generally useful, but even
lead to the view that their use can positively
impede progress.
The role TOS pl ay and do not pl ay
ToS can play a role in the acquisition of new
facts, but it is limited.’ Running a ToS serves
only to ask one question: “How easy is it to
explain the difference between the data and
what is expected on the null hypothesis, on the
basis of chance variation alone” (Freedman et
al , 1978, p. 502). Most people agree that the
level of significance p derived from a ToS
represents a measure of the sample evidence
against the null hypothesis, whereby a low p-
value (typically 0.05 or smaller) suggests that
the nulI is not credible (Barnett, 1982; Cox,
1982; Johnstone, 1986).2 To use Fisher’s (1959,
p. 75) rationale: either an exceptionally rare
chance has occurred, or the null hypothesis is
not true. In conclusion, a ToS tells us whether
the observed result is in fact probably real, as if
the whole population in question had been
measured, and that it is unlikely to have been
the result of obtaining a biased sample due to
chance or random error.
Now to what a ToS does not do. A ToS does
not indicate:
the probability of the null hypothesis
(Kempthorne, 1952, p. 12; Fisher, 1959,
p. 43; Freedman et al , 1978, p. 442; Mayo,
1985, p. 503); nor
the probability of the researcher’s alternative
hypothesis of interest (in the sense of 1 -
p) (Oakes, 1986, chapter 1 >; nor
which hypothesis is better supported by the
data (Anscombe, 1963; Kyburg, 1974, pp.
58-59; Spielman, 1974, p. 213; Barnard,
1986, p. 500); nor
whether there is a high probability of the
results repeating (Carver, 1978; Guttman,
1985; Oakes, 1986, pp. 18-19); nor
whether the result is generahzable (Lindsay
& Ehrenberg, 1993); nor
whether the result is of scientific importance
(Cox, 1977, p. 61; Freedman et al , 1978,
chapter 29; Lindsay, 1993a).
AI1 of these interpretations are fallacies
underlying the flourishing of T,oS in the social
sciences (Finifter, 1972, pp. 155-166; Kyburg,
’ It needs to be made explicit that the reference to ToS in this paper refers to the widely practised null hypothesis testing
variant, The null represents the hypothesis of no relationship or diierence, with any observed difference simply reflecting
chance variation rather than a real diRerence. Typically, the researcher’s aim is to obtain evidence to demonstrate that
the null is inadequate to explain the observations, so that a case can be made to support the researcher’s hypothesis of
interest (denoted as the alternative hypothesis).
a This evidential interpretation, tirst promoted by R. A. Fisher, is denied by the Neyman-Pearson formulation of ToS
(normally referred to as a “hypothesis test”). For Jerzy Neyman, in particular, the result of a ToS is a deci si on between
two alternative courses of action, to either “accept” or “reject” HO, following rules of behaviour which provide
mathematically determined limits for making errors over the “long run” (i.e. not in the single case). That is, we are
provided with a guide on how to act, not on what to believe. In practice, a hybrid theory of statistical inference is typically
used, whereby ToS follow Neyman formally, but Fisher philosophically (Johnstone, 1986, p. 496; Gigerenzer, 1987,
pp. 18ff.). Bayesians, however, argue that the ToS procedure (on either interpretation) is fundamentally misconceived and
inappropriate for basic scientific research. This brief synopsis is presented because it is Gigerenzer’s (1987) thesis that
the vi rtual neglect of almost all statistical textbooks discussing unresolved extant controversies and alternative theories
of inference, along with the anonymous presentation of statistical ideas, has led to the illusion that the hybrid theory is
an immutable truth of mathematics, providing for the mechanization of inference from data to hypothesis.
RECONSIDERING THE STATUS
1974, p. 58; Cronbach & Snow, 1977, p, 52;
Acree, 1978; Carver, 1978; Guttman, 1985;
Mayo, 1985, p. 503; Oakes, 1986, chapter 1;
Gigerenzer, 1987). Moreover, they have helped
to institutionalize ToS as something we cannot
do without (Gigerenzer et al ., 1989, p. 209).
Yet they are all incorrect, and we would do well
to heed Lykken’s reminder of the limited role
played by ToS.
The moral of this story is that the Iinding of statistical
significance is perhaps the least important attribute of a
good experiment: It is neOer a suificient condition for
concluding that a theory has been corroborated, that a
useful empirical fact has been established with reason-
able confidence-or that an experimental report ought
to be published (1968/1970, p. 278; emphasis in original).
Stati sti cal consi derati ons associ ated wi th the
use of ToS
Stati sti cal si gni fi cance i s anci l l a y to para-
meter esti mati on. Regardless of whether our
interest is in theory construction, treatment
comparison, or determining the practical import-
ance of a result, parameter estimation (and its
interval) is the primary statistic of interest;
statistical significance is ancillary (Cochran &
Cox, 1950; Kempthome, 1952, pp. 24,27;Yates,
1964, p. 320; O’Grady, 1982, p. 775).
This point has an immediate practical implica-
tion. The pernicious effects of statistical power
can result in the null hypothesis being rejected,
when for all practical purposes it is true
(Lindsay, 1993a). This occurs because the level
of significance p is a function of both the size
of effect obtained and the sample size. There-
fore, without consideration of the size of effect
obtained, there is the possibility of building a
structure of scientific conclusions on a founda-
tion of triviality (see Dunnette, 1966). Harsha
& Knapp (1990, pp. 53-54) provide a quantita-
tive example of this danger as it relates to
auditing research.
The requi rement of random sampl i ng.
Samples must be randomly selected in order to
use ToS validly. Determining a precise prob-
OF TESTS OF SIGNIFICANCE 37
ability calculation (p-value) requires that the
test statistic be considered in relation to the
sample space of all possible sample outcomes
based on (hypothetical)) “repeated sampling”
from the same population. Without some form
of random sampling, whereby each element in
the population has a known probability for
being selected, the stochastic process associated
with each observation cannot be defined and
therefore the sampling distribution is indeter-
minate, or at least uncertain. In other words,
without introducing a known chance mechanism
into a study’s design, there is no way of defining
what chance means (Freedman et al., 1978).
Random sampling methods provide the founda-
tion for classical statistical inference (i.e. based
on a relative frequency conception of prob-
ability). This is a point on which all master
statisticians agree. For example, consider the
remarks of Alan Stuart, co-author (with Maurice
Kendall) of the authoritative exposition on
classical statistics:
If you feel at times that the statistician, in his insistence
upon random sampling methods, is merely talking
himself into a job, you should chasten yourself with the
reflection that in the absence of random sampling, the
whole apparatus of inference from sample to population
falls to the ground (1984, p. 23).
R. A. Fisher, founder of the modem theory
of experimentation (Kempthorne, 1983) is
equally adamant in stating that samples must be
randomly selected, otherwise “our tests of
significance would be worthl ess” (1947, p. 435,
emphasis added).
Nonetheless, despite such admonitions, investi-
gators frequently perform ToS on samples which
are not randomly chosen (Aldag & Stearns,
1988, p. 259) and, lacking knowledge to the
contrary, assume that their observations are
“like” a random sample from some population,
however ill-specified. Properly speaking, only a
(subjective) Bayesian framework can incorporate
such an assumption into its logic (Kendall &
Stuart, 1977, p. 226).3
1 Reichenbach’s ( 1949, p. 354) comments capture the author’s sentiments as to the propriety of the notorious principle
of indiiference: “To transform the absence of a reason [for thinking that the sample is biased] into a positive reason
represents a feat of oratorical art that is worthy of an attorney for the defense but is not permissible in the court of logic.”
38 R. MURRAY LINDSAY
The virtue of random sampling is not that it
provides unbiased samples on each and every
occasion, this it cannot do (see Urbach, 1985)
but rather that it can deliver an objective
promise for doing so in the “long run” (Stuart,
1984, p. 23). This promise is supported by both
mathematical theory and the abundance of
empirical checks testifying to the fact that it
works (Freedman et al ., 1978, p. 415; Mayo,
1987). In contrast, non-random sampling
methods are notorious for producing biased
samples. According to Yule & Kendall (1950):
Experience has shown that the human being is an
extremely poor instrument for the conduct of a random
selection. Wherever there is any scope for personal
choice or judgment on the part of the observer, bias is
almost certain to creep in. Nor is this a quality which
can be removed by conscious effort or training. Nearly
every human being has, as part of his psychological make-
up, a tendency away from true randomness in his choices
(p. 373).
The use of non-random sampling methods
destroys the physical-mathematical isomorphism
which is necessary in providing the procedure
with its objective basis. To compromise this
objectivity is to annul the fundamental purpose
ofwhy ToS are used (cp. Cox, 1977, p. 61; Fisher,
1959, p. 43; Winch & Campbell, 1969/ 1970,
pp. 205-206; Kruskal, 1979, p. 84; Ryan, 1985,
p. 525).
Lindsay’s (1988) survey of management
accounting research revealed that the random
sampling requirement is not being met in the
majority of studies. Of the 4 1 studies providing
sufficient information to make a determination,
26 (63% ) failed to utilize random sampling
procedures, with many using a sample of
convenience.* In the typical study, respondents
were selected by some senior organization
officer. As one researcher explained: “, . . the
companies were either reluctant or unable to
provide complete lists for a random selection”.
The problem is that the researcher could be
dealing with a small, ud hoc population, if
the officer included everyone meeting the
researcher’s criteria, or with a sample where the
researcher has no obj ecti ve method for deter-
mining its representativeness. In the former
case, any use of ToS is absurd: how can there
be any sampling error when the “sample” and
the population are one and the same? In the
latter case, the lack of random sampling causes
the whole edifice of orthodox statistical inference
to collapse.5 In such situations the use of ToS is
either “a waste of time” (Cochran & Cox, 1950,
p. 411) or “an act of intellectual desperation”
(Freedman et al , 1978, p. 501).
The tensi on between a and statistical power
The combination of the difliculty of obtaining
large sample sizes (Burgstahler & Sundem, 1989,
p. 90) with the small size of effects being
pursued in behavioral research (Cohen, 1977;
Haase et al , 1982; Lindsay, 1988) implies that
the statistical power in many studies will be
unacceptably low. This expectation has been
confirmed empirically in a subset of manage-
ment accounting research (Lindsay, 1993a). To
cope with this reality researchers may have to
increase a in order to offset the otherwise low
power and obtain a balanced test which is
sensitive to both kinds of errors (a and 0)
(Lindsay, 1993a).
Conversely, the high number of ToS run per
4 Random sampling was virtually non-existent in the survey studies examined. Only four of 30 studies used it. In contrast,
all 11 laboratory studies employed random sampling via the use of randomization.
’ Lukka & Kasanen (1993) extend this analysis in arguing that the concept of the “population” to which the sample results
will generalize is typically RI-defined in accounting research, despite the critical importance of doing so. The basic point
is that statistical generalizations are fine when the aim is to estimate a parameter for one particular population. However,
in scientiiic research, where the aim is to compare many populations, statistical generalization is imposible (Deming,
1975). That is, it is not feasible to sample statlsticaIIy from some undefined “super-population” of ail the past, present,
and future locations, researchers, and methods of Investigation (but see Hagood, 1941/1970). Nor, as we shall argue later,
is it efficient, even if it were possible.
RECONSIDERING THE STATUS OF TESTS OF SIGNIFICANCE 39
studf provides pressure to reduce a in order
for the experimentwise Type I error (i.e. the
probability of obtaining one or more false
rejections of the null hypothesis in a given
study) to approach reasonable levels (Ryan,
1959)’ However, this results in the statistical
power of the tests decreasing.
Thus we see the consequences of the inevi-
table tension between a and statistical power.
This tension would seem to cast doubt on
whether ToS can be meaningfully used in a
single study when sample sizes are small.
Methodological considerations
ToS do not eliminate the needforjudgement.
It is often thought that running a ToS introduces
objectivity into the research process by mechaniz-
ing the researcher’s inference (Acree, 1978;
Gigerenzer, 1987; Gigerenzer et aZ., 1989). The
slavish worship of the 0.05 level would appear
to be a manifestation of this viewpoint (see
Skipper etal., 1967/ 1970). However, this notion
of objectivity is patently false.
According to R. A. Fisher, the level of
significance p fails to provide a logical measure
of the evidence against the null hypothesis
unless the sample data exhaust alE the relevant
information bearing on the inference (i.e. no
other knowledge exists) (Spielman, 1974; Acree,
1978; Johnstone, 1987). But this condition is
seldom applicable: “for when in scientific
inquiry, even in the most preliminary phases,
are we so devoid of outside evidence” (Acree,
1978, p. 413).
Values of p therefore need to be combined
informally with information about the method
of investigation and its conduct (see next
subsection) and with prior knowledge in arriv-
ing at a conclusion. This is why Fisher regarded
a p-value “as a piece of evidence that the
scientist would somehow weigh, along with all
other relevant pieces, in summarizing his current
opinion about a hypothesis . .” (Cochran, 1967,
p. 1461). This also explains why Fisher stated
that “no scientific worker has a Iixed level of
significance at which from year to year, and in
all circumstances, he rejects hypotheses; he
rather gives his mind to each particular case in
the light of his evidence and his ideas” (Fisher,
1959, p. 42).
Similarly, Neyman & Pearson (1928, p, 176)
Fisher’s intellectual rivals, concluded that a
significance test is not decisive, but is only one
guide among many. Indeed, in 1933 Neyman &
Pearson (1933/ 1967) abandoned all prospects
of having a theory of inference and adopted a
behaviouristic orientation in order to preserve
the integrity of their framework (see footnote
2).
In summary, ToS do not enable the mechaniza-
tion of inference. Thep-value obtained in a study
should be regarded as just another piece of
evidence to be added informally to the rest; it
does not represent a summary index of all the
evidence. Judgement will always be required in
the practice of science.8 Therefore, significance
tests must become relegated to incidental rather
than decisive status; they cannot be relied
’ Lindsay’s survey found that a total of 3082 To.5 were performed in the 43 articles surveyed. Of these, 187 1 focused
directly on the status of the major hypotheses investigated. This translates into an average (median) 43.5 (20) major ToS
per article. For survey studies, the mean (median) number of major ToS per article was 53 (31.5). Note, these statistics
are conservative: researchers clearly undertake more ToS than they report.
’ It is important to appreciate just how often chance occurrences can and do arise (see Kruskai, 1979) and why researchers
must protect againt probability pyramiding. For example, numerous studies have demonstrated that impressive model R’s
and substantial t-statistics are almost inevitably produced when the researcher uses search procedures (e.g. stepwise
regression) for selecting the “best” subset of explanatory variables, even when the dependent and independent variables
are completely unrelated! (Rencher & Pun, 1980; Freedman, 1983; LoveIl, 1983; Flack & Chang, 1987).
‘Advocates of classical statistical methods often criticize Bayesians for incorporating a subjective element (i.e. the prior)
into their analyses because it is personal or non-objective, and it is an argument that they lose on each occasion (see
Savage, 1961; Carlson, 1976; Kempthorne, 1979; Urbach, 1985; Witmer & Clayton, 1986).
40 R. MURRAY LINDSAY
upon to provide the answer (Acree, 1978,
p. 413).
The validity of the model. The validity of
using a p-value as a measure of the sample
evidence against the null hypothesis depends
entirely on the soundness of the chance model
specified (implicitly or explicitly) by investigators
(Freedman et al, 1978). This means no con-
founding “third variables” exist, measurement
and sampling errors are random, and using an
appropriate probability distribution. However,
establishing the validity of a chance model
is beyond human capability (Leontief, 1971;
Kingman, 1989). As a consequence, it is
impossible to prove anything with a ToS
(Barnard, 1948). A ToS can only inform that a
difference appears real; it cannot explain what
caused the difference. Only a series of well-
designed studies can point to such an explanation.
Thus we return to the point that reaching
conclusions in science goes well beyond statisti-
cal inference, the gap being filled by our
(fallible) subject matter knowledge.9
Degree of corroboration and statistical
signi@cance. Confirmatory evidence differs
widely with respect to the degree of corrobora-
tion it can provide a hypothesis. A main tenet
of Popper’s (1959, p. 269, 1963, p. 112)
philosophy is that a theory is better corrobo-
rated the less expected, or the less probable,
the theory’s prediction is relative to our
“background knowledge” (what we accept as
unproblematic while testing the theory).
Popper’s logic underlying the use of “severe
tests” is related to understanding the growth of
scientific knowledge. For the purposes of this
paper, it is sufLicient to state that for any
confirmatory result the logical possibility always
exists that other mutually incompatible hypo-
theses will explain the data equally well
(Maxwell, 1975, pp. 124-125). However, for
practical purposes, this possibility is lowest
when predictions are extremely precise or so
counter-intutive that no other plausible theory
can currently explain them (Cook, 1983, p. 85).
An example will serve to clarify this point.
Suppose X claims to possess a theory that can
predict London’s rainfall one year in advance. If
X successfully predicts “it will rain in London
in June”, little credence would be given to X’s
theory. However, ifX successfully predicts that
“June will have 17 rainy days” we would likely
pay X more heed, although we may be inclined
to attribute the prediction to be an educated
guess, say based on prior years’ rainfall statistics.
But ifX successfully predict the exact 17 dates
of rainfall in June, then most of us would feel
that X is on to something. What other theory
could account for such an accurate prediction?
If one accepts Popper’s viewpoint on this
matter, which almost everyone does, then the
null hypothesis testing procedure must be
considered to be nearly useless because it
represents an extremely weak test. Two points
in support of this view will be examined.
First, a statistically significant result can only
suggest that the hypothesis of random variation
is untenable in accounting for the discrepancy
between the data and what is expected on the
null hypothesis. In addition to the researcher’s
hypothesis, other plausible explanations almost
always remain to provide rival explanations to
account for the difference, and these must be
ruled out one by one. ‘” In agronomy, where
9 Ravetz (1971, p. 96) reports an example which illustrates that the use of an incorrect model can occur even in well-
established experimental fields. Two chemists, in the course of a straightforward kinetics study, discovered that an important
component of one of their reactions was the glass wall of the flask containing it: the glass was not inert against sodium
hydroxide!
“For example, in quasi-experimental investigations Winch & Campbell (1969/1970) list 14 threats (rival hypotheses),
in addition to the hypothesis of chance, that must be ruled out before a researcher can begin to have confidence in a
specified relationship. To cast this discussion in Bayesian terms, these rival explanations would be understood by a Bayesian
as a host of competitor hypotheses that would go in the denominator of Bayes’ formula, the consequence of which is that
the degree of confirmation for the investigator’s hypothesis will likely be reduced, perhaps substantially.
RECONSIDERING THE STATUS
ToS prospered under the tutelage of R. A. Fisher,
the number of plausible rival explanations is
reduced by the prevalent use of experimental
control and randomization. Yet even here the
single, critical experiment is rare (Yates, 1951).
In general, recognition of the fundamental
differences which exist between behavioural
accounting and agronomy should serve to
prevent us from taking any comfort from the
fact that ToS have been useful in agronomy
(Meehl, 1978).
Most studies in accounting are observational
(Hopwood, 1989, p. 6; Birnberg & Shields, 1989,
pp. 3637; Waterhouse & Richardson, 1989). l1
This design offers little protection against the
many confounding sources of variance which
provide rival explanations.‘* In addition,
our measurements (e.g. supervisory evaluation
style, participation, environmental uncertainty,
job performance) are not nearly as reliable and
valid as their counterparts in agronomy. Further-
more, theories in the behavioural sciences
usually involve hypothetical constructs. Finally,
auxiliary theories, whose accuracy (“truthful-
ness”) is by no means guaranteed, are relied
upon to derive empirical predictions. All of
these considerations can make refutation of the
null hypothesis in the accounting context far
removed from the hypothesis that the researcher
wishes to conlirm. This situation ditfers widely
from agronomy, where theories are usually first-
level observational inductive statements and
where a high degree of control can be exercised
(Meehl, 1986, p. 333). For example, compare
the logical distance between each pair of
statements in the two following situations:
0
0
“The fertilized plots yielded more bushels of
wheat” (evidence) versus “Ferdking explains
why more wheat grew” (conclusion);
“The introduction of budgetary participation
resulted in increased job performance”
OF TESTS OF SIGNIFICANCE 41
(evidence) versus “Expectancy theory
explains why employees were motivated
to perform better” (conclusion) (see, for
example, Ronen & Livingstone, 1975).
The corroborative value for the theory afforded
by confirmatory evidence in each of these two
research contexts is categorically different.
The second point in our examination extends
the basic point made above. The null hypothesis
of no effect is (quasi-) always false in the social
sciences (see Nunnally, 1960; Bakan, 1967/
1970; Meehl, 196711970; 1986; Lykken, 1968/
1970; Hays, 1973; Deming, 1975). Even if the
treatment variable has no effect, it is highly
unlikely that the treatment and control groups
will be exactly the same in regards to other
variables that may aiIect the dependent variable.
This is because “everything is correlated with
everything”. Randomization is helpful, but it
provides no guarantee in each i ndi vfdual study
(Meehl, 196711970, 1986; Seidenfeld, 1982;
Urbach, 1985). Consequently, the model under-
lying the ToS may be misspecilied, and any
rejection of the null becomes solely a function
of the statistical power of the test.
The implication of this result has been
developed by Paul Meehl (1967/1970) in his
classic article “Theory Testing in Psychology
and Physics: A Methodological Paradox”. Meehl
shows that as the power of an experiment
approaches unity, a prior probability approach-
ing 0.50 obtains for rejection of a directional
null hypothesis, even if the researcher’s theory
i s total l y wi thout meri t. The corroborati ve
value of this practice is analogous to predicting
that it will rain on June 2, given the background
information that on average London receives 15
days of rainfall in June. Some test!
In contrast, consider the physical sciences.
There the researcher’s theory provides point
value predictions which become established as
I1 Lindsay’s (1988) survey revealed that 32 of the 43 studies examined were observational.
” Statistical control methods can be used providing we are aware of important “third-variables” and can validly measure
them (Cook & Campbell, 1979, p. 9; Kinney, 1986). However, the potential benefits obtained from doing so must be
considered in relation to the loss of degrees of freedom that results.
42 R MURRAY LI NDSAY
the hypothesis under test (i.e. equivalent to the
null in the social sciences). As Meehl explains,
the effect of increasing a test’s power is to
decrease the acceptance region around the
predicted value and thus to decrease the prior
probability of obtaining a corroborating result.
The failure to reject the hypothesis can there-
fore be interpreted as providing a high degree
of corroboration for the researcher’s hypothesis.
As the comparison highlights, the real problem
with significance tests stems from linking what
the investigator wants to prove with the diffuse
alternative hypothesis (Freedman et al, 1978,
p. 492). In accounting, rejection of the null
has no necessuv corroborative value for the
researcher’s hypothesis: a multitude of plausible
hypotheses could also be consistent with the
data, even with very low levels of significance
(Giere, 1975).13 This concern is not philo-
sophical pedantry; it is very real for behavioural
accounting research. Moreover, the problem is
augmented because of the bias against publish-
ing “negative” results, i.e. studies failing to attain
statistical significance (Lindsay, 1994). Publish-
ing negative result studies could pardally temper
the weaknesses implicit in null hypothesis
testing by revealing spurious results.
In conclusion, a ToS represents an extremely
weak test for any substantive theory in manage-
ment accounting to pass; consequently, in and
of itselfit offers no real credentials to knowledge
claims. Thus it is imprudent to quasi-identify
rejection of the null with support for the
researcher’s hypothesis.
Significane tests impede knowledge acq&ition
The exploratory mature of management
accounting research. Management accounting
research is largely exploratory in nature (Kaplan
1986; Kren & Liao, 1988; Burgstahler & Sundem,
1989; Lindsay, 1993a). This is not a critical assess-
ment; theoretical advances take time, even in
the physical sciences (Giere, 1976). As Lakatos
( 1970, p. 151) explains, “it may take decades
of theoretical work to arrive at the first novel
facts and still more time to arrive at interestingly
testable versions of the research programmes,
at the stage when refutations are no longer
foreseeable in the light of the programme itself”.
The recognition of our true state of develop-
ment is important because exploratory research
is not the same sort of activity as “testing”.
According to Finch ( 1981), the purpose of
exploratory work is to determine the extent to
which the data at hand exhibit certain effects of
interest (“stubborn facts”), the population(s) to
which they apply, and the behavioural processes
that might explain them. In this context, data
play a role in the formation of hypotheses, not
in the determination of their credibility. In con-
trast, the “testing” viewpoint presupposes that
the relevant hypotheses underlying reproducible
effects from specilied populations are already
known. Confusing the two states results in
researchers using hypothesis tests rather than
exploratory data analysis procedures which may
better reveal important patterns in the data (see,
for example, Tukey, 1977; Mosteller & Tukey,
1977).‘* In addition, researchers are diverted
from asking the right questions,15 estimating
I3 Based on their written comments, some researchers would appear to assign vaguely the substantive theory the probability
of 1 - a. As Meehl(l986, p. 333) tells it: “People commit, without being aware of it, the fallacy of thinking ‘If the theory
nxzren’r true, then there is only a probability of 0.05 of this big a difference arising’ .“. Such an interpretation is incorrect.
The p-value represents the probability of obtaining a result equal to or greater than the observed result, calculated on
the basis that the nuN hypothesis is true.
l4 A portion of Cooper & ZelYs (1992) criticism of Kinney’s (1986) article entitled “Empirical Accounting Research
Design for Ph.D. Students” follows this line of argument.
I5 For example, in an article entitled “The Religion of Statistics as Practised in Medical Journals”, Salsburg ( 1985) describes
how the practice of focusing onp-values has deterred researchers from emphasizing and studying the effects of treatments,
and horn framing their studies in a manner which is conducive to the provision of meaningful information to the practitioner
physician (e.g. treatment response time and patterns for particular subsets of patients) and framer of public policy (e.g.
the cost-benefit analysis of a treatment).
RECONSIDEIUNG THE STATUS OF TESTS OF SIGNIFICANCE 43
parameters, deri vi ng predictive models (see
Ehrenberg & Bound, 1993) and determining
the specific conditions under which the result
holds.‘6
To elaborate on this issue, the conception of knowledge underlying mainstream accounting research is positivistic in orientation (Colville, 1981; Chua, 1986; Scapens, 1992). The conception of knowledge held directs the questions asked and positivism, among other things, has resulted in our emphasis on discovering empirical associations (Morrison & Henkel, 1970b, p. 309; Harré & Secord, 1972).
The dominant contingency theory paradigm is a case in point. Rather than simply testing for the statistical significance of r (where a low p-value merely means that r is probably not equal to zero), we need to investigate and report how the variables are related (i.e. by how much y varies with x).17 Further, we need to determine the conditions under which the relationship holds in the course of developing causal explanations (i.e. why and when does x cause y to vary in a specified way), which is likely best
facilitated via qualitative-type studies. In qualita-
tive research the focus is on understanding
the meanings organizational actors attach to
accounting, the action they take with respect to
it, and the context dependence of their actions
(Colville, 1981). Ezzamel’s (1987, p. 34) com-
ments with respect to contingency theory are
telling:
The overemphasis on empirical investigation [typically through using standard statistical techniques on bivariate relationships] at the expense of theoretical development is largely responsible for the lack of conceptual clarity in the contingency framework which in turn contributed to the confusing and contradictory nature of the empirical results.
In concluding, the hypothesis-testing para-
digm is a sensible framework only if you are
dealing with unmistakable effects from known
populations and what is in question is the extent
to which the data confirm one substantive
hypothesis over another (Finch, 1981, p. 142).
We are far from this stage in most managerial
areas and our research practices must reflect
this fact. Indeed, contrary to what is taught in
most statistical textbooks, full-blown confirmation studies are relatively rare in science (Harré & Secord, 1972, pp. 69-70; Ehrenberg & Bound,
1993).
Analytical versus statistical generalization
Some of the above misconceptions about
research strategy come from the failure to
appreciate the difference between analytical and
statistical generalization. Sampling from as many
organizations as possible in order to increase
the generalizability (or robustness) of the results
is widely considered to be sound research prac-
tice. ToS are considered to be desirable because
they can be used to support inferences when
less than 100% of the population is surveyed.
For example, a researcher might wish to test the
hypothesis that subordinate budgetary partici-
pation increases job performance. Suppose the
researcher randomly samples individuals from
one hundred organizations and determines that
the difference between high and low participa-
tion respondents on job performance is statistic-
ally significant at the 0.01 level. The researcher
then concludes that this low p-value strongly
supports the participation hypothesis.
16 To take an example from the participatory budgeting literature, Hopwood (1976, p. 79) notes that "many researchers have been concerned with the broad overall problem of either proving or disproving the general argument [that participation is associated with improved performance] rather than specifying the conditions for various results. Like many interesting ideas, the believers feel that they are universally applicable".
17 Merely reporting the observed correlation, say r = 0.5, does not tell how the variables are related. If, in another study, an r of 0.5 is obtained, the actual relationship there could either be the same as before, or different (i.e. have the same slope and intercept as before, or not). Or if a different correlation were obtained, say r = 0.8, the relationship could be the same or not (just the degree of scatter might differ). In short, there is no way of telling whether the quantitative relationships agree, and hence generalize, or not (Lindsay & Ehrenberg, 1993).
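A small simulation (constructed here to illustrate the footnote's point; the slopes and noise levels are invented) shows how two data sets can share r = 0.5 while embodying quite different linear relationships:

    import numpy as np

    rng = np.random.default_rng(1)

    def fit(slope, noise_sd, n=5000):
        # Generate y = slope*x + noise and report the correlation and fitted line.
        x = rng.normal(size=n)
        y = slope * x + rng.normal(scale=noise_sd, size=n)
        r = np.corrcoef(x, y)[0, 1]
        b, a = np.polyfit(x, y, 1)  # fitted slope and intercept
        return r, b, a

    # r depends on the signal-to-noise ratio, not on the slope alone, so the
    # noise is scaled (sd = slope * sqrt(3)) to hold r near 0.5 in both cases.
    for slope in (0.5, 2.0):
        r, b, a = fit(slope, noise_sd=slope * np.sqrt(3))
        print(f"true slope {slope}: r = {r:.2f}, fitted y = {b:.2f}x + {a:+.2f}")

Both runs report r of about 0.5, yet the fitted slopes differ by a factor of four: the correlation alone cannot say whether the quantitative relationships agree.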
Such a conclusion, which is archetypical in management accounting research, is confused because there are really two inferences being made. The low p-value supports the statistical generalization or inference that the aggregate measures of central tendency differ systematically in the overall population (i.e. going from the sample to the population). However, a low p-value has no logical bearing on the scientific or analytic inference; namely, going from the population finding to the hypothesis concerning participation. That is, aggregate data cannot be used to infer general propositions. Specifically, the researcher cannot infer that budgetary participation results in increased job performance for every employee or even for every organization included in the population. Nor does this study enable the prediction of which individuals will benefit from participation and which will not, and nor does it explain the process of how participation affects people. All of these considerations are analytical generalizations, requiring knowledge of how the treatment (i.e. budgetary participation) interacts with various organizational conditions and personal parameters (see Bakan, 1967/1970, pp. 244-246; Harré & Secord, 1972, pp. 56-58; Oakes, 1986; Yin, 1989). Consequently, statistical generalizations provide no useful information for managers to act upon (Harré & Secord, 1972; Deming, 1975; Scapens, 1992).
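The gap between the two inferences can be seen in a simulation (the numbers are invented for this discussion): an aggregate difference can be overwhelmingly significant even though the treatment effect varies so much across organizations that participation actually hurts in a sizeable minority of them.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Hypothetical population: the effect of participation on performance
    # varies by organization (mean +4, sd 6), so some effects are negative.
    n_orgs = 100
    gains = rng.normal(loc=4.0, scale=6.0, size=n_orgs)

    t, p = stats.ttest_1samp(gains, popmean=0.0)
    print(f"aggregate test: t = {t:.2f}, p = {p:.2g}")   # strongly significant
    print(f"organizations where participation hurt: {np.mean(gains < 0):.0%}")

The low p-value licenses only the statistical generalization (the average gain is positive); it says nothing about which organizations or individuals will benefit, which is the analytic question.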
The distinction between analytical and statis-
tical generalization is therefore very important.
Our present knowledge about organizations
(e.g. technology, environmental uncertainty,
size, culture, strategy, and type of organiza-
tional subunit: production, marketing and R&D)
indicates that organizations and even sub-
organizational units are not “epistemologically
homogeneous”, i.e. similar to one another on
variables that are causally connected to the
outcome variable. Consequently, replicating
earlier results in different organizations is far
from assured given the likelihood of different
variables and/or intensities of variables con-
founding the relationship of interest (Miller,
1981). This is Umapathy's conclusion after
assessing the conflicting results obtained in
budgeting research and their inapplicability to
practice.
Budgetary controls constitute an organizational subsystem, and are linked to other subsystems. Hence, a change in one of the budgeting variables will result in a predetermined change in performance only if all other budgeting variables and other organizational subsystems do not change. Hence, budgetary techniques or processes used by an effective organization may not work satisfactorily in other organizations (1987, p. 171).
Indeed, matters become immensely more complex and difficult to discern once we introduce the idiosyncrasies of people, and consider the fact that humans are more properly viewed as ascribing meanings and interpretations to situations, rather than as mechanistic objects which follow a stimulus-response model of causality (Harré & Secord, 1972; Colville, 1981).
This point is not unique to organizational
research, of course; it applies to all analytic
problems where interest centres on predicting
whether a specific change in a process or
procedure is desirable in contrast to enumera-
tive studies which simply attempt to describe
one particular population (and where a 100%
sample provides the complete answer).18
Deming provides us with the following example
taken from agriculture.
[T]o test two treatments in an agricultural experiment by randomizing the treatments in a sample of blocks drawn from a frame that consisted of all the arable blocks in the world would give a result that is nigh useless, as a sample of any practical size would be so widely dispersed over so many conditions of soil, rainfall, and climate, that no useful inference could be drawn. The estimate of the difference B - A would be only an average over the whole world, and would not pin-point the types of soil in which B might be distinctly better than A (Deming, 1975, p. 150).
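A toy computation (the soil types and yield figures are invented) shows what the world-wide average conceals:

    import numpy as np

    # Hypothetical per-block yield differences (B - A) by soil type.
    soils = {
        "clay": np.array([5.1, 4.7, 5.6, 4.9]),      # B clearly better
        "sand": np.array([-4.8, -5.3, -4.5, -5.1]),  # A clearly better
        "loam": np.array([0.2, -0.1, 0.3, -0.4]),    # no real difference
    }

    all_diffs = np.concatenate(list(soils.values()))
    print(f"world-wide average of B - A: {all_diffs.mean():+.2f}")  # near zero
    for soil, d in soils.items():
        print(f"  {soil}: mean B - A = {d.mean():+.2f}")

The global estimate hovers near zero while B is distinctly better than A on clay; only the condition-specific analysis is of any use in practice.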
Thus we see that the random selection of subjects from an identifiable population for the
18 See Deming (1950, 1975) for an in-depth discussion on the differences between analytic and enumerative studies.
purpose of achieving representativeness, so important for (orthodox) statistical inference, can be counter-productive for learning about and explaining organizational processes (cf. Deming, 1950, 1975; Sayer, 1984; Oakes, 1986; Yin, 1989).19 Instead, the complexity of organizations would seem to suggest that they be examined one at a time (Mintzberg, 1979; Emmanuel et al., 1990, pp. 377-378), at least until the research community feels that enough conditions have been examined to be able to predict successfully when a result will and will not hold.
This result should come as a welcome relief
to management accounting researchers. As
Deming (1975, p. 149) explains, "Much of man's
knowledge in science has been learned through
use of judgment-samples in analytic studies.
Rothamsted and other experimental stations are
places of convenience. So is a hospital, or a
clinic, and the group of patients therein that we
may examine.” Similarly, randomization should
not be considered to be indispensable in arriving
at conclusions about causal relations (see
Kyburg, 1961, 1968; Seidenfeld, 1979, 1982; Cook & Campbell, 1979, pp. 342-344; Darling, 1980; Urbach, 1985).20 Finally, the traditional experimental precept of varying only one condition at a time is not only difficult, it is not always desirable. The aim should be to determine whether the same result occurs despite differences in conditions (Lindsay & Ehrenberg, 1993).
ToS are misunderstood and misused in practice. Significance tests are widely misunder-
stood and misused in practice (Cox, 1981; see
also Morrison & Henkel, 1970a). This situation,
in combination with the significance test’s
privileged methodological status, has resulted
in its limited usefulness being swamped by the
detrimental consequences arising from its mis-
use and misinterpretation. For example, the
literature is replete with commentary on how
researchers "search" for statistical significance in ways which totally destroy any meaningful
use they might have (see, for example, Kinney,
1986; Lindsay, 1994).
Lindsay’s (1988) survey results are particularly
worrisome in this regard. Researchers and/or
editors:
• displayed a prejudice against publishing and/or submitting negative result studies (see also Lindsay, 1994);
• published very few replications (only four replications in 43 studies);
• often failed to be concerned with omitted variables when interpreting p-values in observational studies (i.e. the model validity issue);
• frequently earmarked the importance of a finding and an hypothesis' degree of confirmation by the level of significance p (see also Lindsay, 1994);
• often failed to calculate and/or assess the results in terms of the effect size obtained (see also Lindsay, 1993a, 1994);
• often used nonrandom samples;
• never considered the pyramiding effects on Type I errors from undertaking multiple ToS (a brief worked illustration follows this list); and
19 Deming (1993, p. 104) is much more pointed in his conclusion. He writes: "Tests of significance, t-test, chi-square, are useless as inference - i.e., useless for aid in prediction. Test of hypothesis has been for half a century a bristling obstruction to understanding statistical inference."
20 The statistical philosopher Teddy Seidenfeld (1982) takes a rather extreme position on this point. The following passage by him is cited with the aim of motivating discussion on what hitherto in the social sciences has been a relatively unchallenged desideratum of science. "What has been shown is that randomization, itself, provides no good solution to the problem of fair samples. [However,] there is an optimistic note to be found in these critical remarks: if randomization is as unfounded a procedure as I have tried to argue it is, then what makes difficult well designed observational (and retrospective) studies is not that randomization is unavailable. If the designs that typically include randomization permit interesting conclusions, e.g., conclusions about causal efficacy as opposed to mere associations, then it cannot be the introduction of randomization which makes the difference. Perhaps, if we concentrate more on aspects of good designs other than their randomized components we will learn how better to infer from observational data" (pp. 288-289).
• often ran studies with far too little power and in all cases failed to calculate and incorporate actual power levels in deriving conclusions (see also Lindsay, 1993a).
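The pyramiding of Type I errors mentioned above follows from elementary probability (a standard calculation, not a result from the survey itself): with k independent tests each run at level alpha and all null hypotheses true, the chance of at least one spurious "significant" finding is 1 - (1 - alpha)^k.

    # Family-wise Type I error rate for k independent tests at level alpha,
    # assuming every null hypothesis is in fact true.
    alpha = 0.05
    for k in (1, 5, 10, 20):
        fwer = 1 - (1 - alpha) ** k
        print(f"{k:2d} tests: P(at least one false positive) = {fwer:.2f}")
    # Prints roughly 0.05, 0.23, 0.40 and 0.64: by twenty tests the odds
    # favour finding "significance" somewhere even when nothing is there.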
Taken together, these survey results are
extremely worrisome and cast doubt upon the
reliability and validity of the findings reported
in the literature surveyed.
THE WAY FORWARD
Two of the more strongly held (and closely
related) misconceptions about science are that
theories are invented, worked through and finalized overnight, and that experiments produce instant rationality (see Lakatos, 1970; Ravetz, 1971; Suppe, 1977). Management
accounting researchers are no exception to
this observation. As Burgstahler & Sundem (1989, p. 86) observe in their review of behavioural accounting research (BAR), accounting researchers operate on the basis of "many small, one-shot research projects". Given this conception of science it is not happenstance that various commentators have stated that BAR is "unfocused" (Hofstedt, 1976), "lacking continuity" (Collins, 1978, p. 331), "shapeless", "fragmented" (Colville, 1981, p. 120), and "disjointed" (Caplan, 1989, p. 115).
Contemporary philosophy of science is com-
pletely at odds with this account. The philo-
sophers Chalmers (1982, p. 81), Hacking (1983, p. 216), Kuhn (1970), Lakatos (1970), Laudan (1981), Newton-Smith (1981, p. 227), Putnam (1981) and Suppe (1977) view the process of science as beginning with a vague idea, known to be defective and quite literally false, which undergoes active development within a research program. On this view a theory can only be, at best, an approximation to the truth (Bunge, 1967, p. 388; Durbin, 1987, pp. 178-179); consequently, it becomes pointless to speak of "verification", "inductive confirmation", or "refutation" (Hacking, 1983, p. 15). What is to
the point, as Suppe (1977, pp. 706-707)
explains, is to undertake empirical observation
in order to develop comprehensive explanatory
theories. It is to this end that science ordinarily
uses data to “test” its current theories.
In the practice of successful science the emphasis is on obtaining reproducible results from performing several studies under different conditions, perhaps with different instruments and/or methods at different sites and with different researchers (Nelder, 1986; Scapens, 1990; Lindsay & Ehrenberg, 1993). The aim is to generalize results via performing "close" and "differentiated" replications.21 This approach entails a vastly different orientation. As Nelder (1986, p. 113) explains it: "Looking for reproducible results is a search for significant sameness [in a series of studies], in contrast to the emphasis on the significant difference from a single experiment" (emphasis in original). In application, this means determining whether the same model holds for many sets of data (Ehrenberg & Bound, 1993).22 Once a stable effect has been obtained and its empirical domain established, the next step is to develop theoretical explanations and causal understanding of the observable phenomena (Ehrenberg & Bound, 1993).23 These theories can then be further investigated with the aim of assessing their adequacy and making improvements, as well as by extending the theoretical domain into new problem contexts (Lindsay, 1993b).
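As a deliberately simplified, invented illustration of the "significant sameness" criterion: instead of asking whether a slope differs from zero in one study, one asks whether the same slope describes several independent data sets, for instance by comparing each study's estimate with the precision-weighted pooled value.

    import numpy as np

    rng = np.random.default_rng(3)

    # Five hypothetical replications of the same x-y relationship
    # (true slope 2.0), differing in sample size and noise level.
    studies = []
    for n, noise in [(40, 1.0), (60, 1.5), (30, 0.8), (80, 2.0), (50, 1.2)]:
        x = rng.normal(size=n)
        y = 2.0 * x + rng.normal(scale=noise, size=n)
        slope, _ = np.polyfit(x, y, 1)
        se = noise / (np.std(x) * np.sqrt(n))  # approximate s.e. of the slope
        studies.append((slope, se))

    # Significant sameness: does each study's slope agree with the
    # precision-weighted pooled slope to within about two standard errors?
    weights = np.array([1 / se ** 2 for _, se in studies])
    pooled = np.average([b for b, _ in studies], weights=weights)
    print(f"pooled slope: {pooled:.2f}")
    for i, (b, se) in enumerate(studies, 1):
        verdict = "consistent" if abs(b - pooled) < 2 * se else "discrepant"
        print(f"study {i}: slope {b:.2f} (s.e. {se:.2f}) -> {verdict}")

The question asked of each new data set is not "is the effect significant?" but "is it the same?".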
This prescription may seem over-simplistic.
But one cannot deny the unfortunate paradox
that exists with respect to the use of replication
21 See Lindsay & Ehrenberg (1993) for the logic of designing replicated studies.
22 Note how "testing" a specific model brings us closer to the physical science situation described earlier.
23 This sequential ordering between obtaining facts and then discovering theory is somewhat artificial. The two are often intertwined (see Kuhn, 1970).
in the social sciences. Unlike the physical
sciences, the social sciences, being immature,
have little in the form of a pedigree of knowledge
or a body of reliable methods of inquiry from
which to establish new facts. Hence, one would
think that replication would be accorded critical
importance in the social sciences relative to the
natural sciences. However, as Campbell (1969,
pp. 427-428) explains, this is not the case.
Whereas in the natural sciences important
findings get repeated, either deliberately or
in the course of successive experimentation,
hundreds of times, the incidence in the social
sciences is extremely low (Wilson et al., 1973;
Glenn, 1983; Abdolmohammadi et al., 1985;
Nelder, 1986; Burgstahler & Sundem, 1989;
Leftwich, 1990; Hubbard & Armstrong, 1994).
Indeed, in the social sciences repetition is con-
sidered to be “inferior” (Umapathy, 1987, p. 170)
and of “low prestige” (Campbell, 1986, p. 122).
As Lindsay & Ehrenberg (1993) conclude,
successful replication is the bedrock of scientific
knowledge. It tells us whether we have a result
at all. It also tells us the range of conditions
under which the result can be expected to hold
and applied to practical problems. In addition,
varying the conditions between different replica-
tions not only extends the scope of generaliza-
tion and determines its limits, it also tells us
about some of the factors which do, or do not,
affect the result causally.
Replication (broadly interpreted to include
generalizing results over many sets of data) is
therefore a "crucial idea" (Freedman et al.,
1978, p. A-23). Accounting researchers must
come to appreciate that a single study is nearly
meaningless and useless in itself (“Student”,
quoted in Pearson, 1939; Yates, 1951; Popper,
1959; Fisher, 1966; Campbell, 1969; Ravetz, 1971; Kempthorne, 1978; Ehrenberg, 1983;
Guttman, 1985; Nelder, 1986).
Finally, the consequences of failing to operate
within a research program orientation in man-
agement accounting are now beginning to show.
Lindsay & Ehrenberg (1993) analysed the literature examining the effects of managers'
reliance on accounting performance measures
to evaluate subordinate job performance (one
of the more highly researched areas in mana-
gerial accounting). This review indicated that
the cumulative results add up to little by way
of generalizable findings. The reality is that we
have little advice to offer management (see also
the reviews by Kaplan, 1986; Umapathy, 1987;
and Kren & Liao, 1988). The major design flaw
observed in these studies was that they con-
sisted primarily of extensions into new measures
and conditions and even new constructs before
it had been shown that the earlier results could
be directly and routinely reproduced. Similarly,
Shields & Young (1993), in their review of the
participative budgeting literature, state that no
unifying empirically validated explanation or
framework has been found (despite close to 40
years of research effort). Again, the design flaw
in this group of studies parallels the one
reported above: lack of close replications. In this
regard, the authors note that the participation
studies are difficult to interpret because of their
diversity in terms of theoretical frameworks,
variables, functional relationships and results
(see also Kren & Liao, 1988).
CONCLUSION
A discipline’s criteria of adequacy (method-
ology) are directly relevant to the level of
progress it attains; they make scientific facts
possible (Ravetz, 1971, pp. 155-156; Lindsay,
1993b). It is hoped that this analysis will
convince readers that the ToS procedure should
be stripped of its special methodological status.
It offers no real credentials to knowledge claims.
Instead, the criterion should be whether the
same model holds for many sets of data.
At this point it may be useful to remind
readers of the situation existing in the physical
sciences. Major developments in these sciences
took place long before significance tests were
available, and they continue to be made without
the heavy reliance that characterizes their use
in the behavioural sciences (Morrison & Henkel,
1970b, p. 311). In this regard, we would do well
to consider the timely reminder provided by
Herbert Simon:
In observing what are sometimes called the successful sciences - physics, chemistry, and so on - I find that statistical testing in general plays a very much smaller role there than in the behavioral sciences. As a matter of fact, the modern statistical theory of testing is a misinterpretation of the reasons for introducing statistics in the natural sciences. They were used to talk about precision of measurements rather than to decide whether we had the right theory or not (as cited in Normann, 1973, emphasis supplied).24
Unlike other critical commentaries, this paper
does not call for the outright abandonment
of ToS. For example, the distinguished psychol-
ogist Paul Meehl is highly contemptuous of the
procedure. He writes that
the almost universal reliance on merely refuting the null
hypothesis as the standard method for corroborating
substantive theories in the soft [behavioural] areas is a
terrible mistake, is basically unsound, poor scientific
strategy, and one of the worst things that ever happened
in the history of psychology (1978, p. 817).
Although the author is sympathetic to such
viewpoints, the reality of the situation cannot
be ignored: researchers and even the discipline
itself (in the sense that ToS provide the pretence of maturity) have a vested interest in the current state of affairs (Lindsay, 1993b, p. 245); con-
sequently, change can be expected to take a
considerable period of time, if results associated
with the significance test controversy in other
disciplines are at all typical (see, for example,
the anthology by Morrison & Henkel, 1970a).
A different strategy has therefore been adopted.
In the short term the aim is to educate users
regarding the procedure itself and to improve
statistical practice, to argue for the legitimacy
of other approaches (e.g. case studies), and,
most importantly, to exhort the need for per-
forming replications within a research program
orientation, whereby the goal is to determine
whether the same model holds over many
sets of data. Though falling well short of
providing any guarantee for success, the focus
on replication is consistent with the best
accounts we have of the “scientific method”
(Tukey, 1989; cited in Ehrenberg & Bound,
1993). In the long run it is hoped that the results
from following this alternative will speak for
themselves.
BIBLIOGRAPHY
Abdolmohammadi, M. J., Menon, K., Oliver, T. W. & Umapathy, S., The Role of the Doctoral Dissertation in Accounting Research Careers, Issues in Accounting Research (1985) pp. 59-76.
Acree, M. C., Theories of Statistical Inference in Psychological Research: A Historico-Critical Study, Unpublished Ph.D. dissertation, Clark University (1978).
Aldag, R. J. & Stearns, T. M., Issues in Research Methodology, Journal of Management (1988) pp. 253-276.
Anscombe, F. J., Tests of Goodness of Fit, Journal of the Royal Statistical Society B (1963) pp. 81-94.
Bakan, D., The Test of Significance in Psychological Research, chapter 1, On Method, pp. 1-29 (San Francisco: Jossey-Bass, 1967). Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 231-251 (London: Butterworths, 1970).
Barnard, G. A., Discussion on "The Validity of Comparative Experiments" (by F. J. Anscombe), Journal of the Royal Statistical Society A (1948) pp. 201-202.
Barnard, G. A., Discussion on Johnstone (1986), pp. 499-502.
Barnett, V., Comparative Statistical Inference, 2nd Edn (New York: Wiley, 1982).
Birnberg, J. G. & Shields, J. F., Three Decades of Behavioral Accounting Research: A Search for Order, Behavioral Research in Accounting (1989) pp. 23-74.
Bunge, M., Scientific Research, Vol. 1 (New York: Springer, 1967).
Burgstahler, D. & Sundem, G. L., The Evolution of Behavioral Accounting Research in the United States, 1968-1987, Behavioral Research in Accounting (1989) pp. 75-108.
24 I am grateful to Charles Christenson for providing me with this reference.
Campbell, D. T., Reforms as Experiments, American Psychologist (1969) pp. 409-429.
Campbell, D. T., Science's Social System of Validity-enhancing Collective Belief Change and the Problems of the Social Sciences, in Fiske, D. W. & Shweder, R. A. (eds), Metatheory in Social Science: Pluralisms and Subjectivities, pp. 108-135 (Chicago: University of Chicago Press, 1986).
Caplan, E. H., Behavioral Accounting - A Personal View, Behavioral Research in Accounting (1989) pp. 109-123.
Carlson, R., Discussion: The Logic of Tests of Significance, Philosophy of Science (1976) pp. 116-128.
Carver, R. P., The Case Against Statistical Significance Testing, Harvard Educational Review (1978) pp. 378-399.
Chalmers, A., What is this Thing Called Science?, 2nd Edn (Milton Keynes: Open University Press, 1982).
Chow, S. L., Significance Test or Effect Size?, Psychological Bulletin (1988) pp. 105-110.
Chua, W. F., Radical Developments in Accounting Thought, Accounting Review (October 1986) pp. 601-632.
Cochran, W. G., Footnote to An Appreciation of R. A. Fisher, Science (1967) pp. 1460-1462.
Cochran, W. G. & Cox, G. M., Experimental Designs (New York: Wiley, 1950).
Cohen, J., Statistical Power Analysis for the Behavioral Sciences, Revised Edn (New York: Academic Press, 1977).
Collins, F., The Interaction of Budget Characteristics and Personality Variables with Budgetary Response Attitudes, Accounting Review (April 1978) pp. 324-335.
Colville, I., Reconstructing "Behavioral Accounting", Accounting, Organizations and Society (1981) pp. 119-132.
Cook, T. D., Quasi-Experimentation: Its Ontology, Epistemology, and Methodology, in Morgan, G. (ed.), Beyond Method: Strategies for Social Research, pp. 57-73 (Beverly Hills, CA: Sage, 1983).
Cook, T. D. & Campbell, D. T., Quasi-Experimentation: Design & Analysis Issues for Field Settings (Chicago: Rand McNally, 1979).
Cooper, W. W. & Zeff, S. A., Kinney's Design for Accounting Research, Critical Perspectives on Accounting (1992) pp. 87-92.
Cox, D. R., The Role of Significance Tests, Scandinavian Journal of Statistics (1977) pp. 49-70.
Cox, D. R., Theory and General Principle in Statistics (Presidential Address), Journal of the Royal Statistical Society A (1981) pp. 289-297.
Cox, D. R., Statistical Significance Tests, British Journal of Clinical Pharmacology (1982) pp. 325-331.
Cronbach, L. J. & Snow, R. E., Aptitudes and Instructional Methods (New York: Irvington, 1977).
Deming, W. E., Some Theory of Sampling (New York: Wiley, 1950).
Deming, W. E., On Probability As a Basis for Action, American Statistician (1975) pp. 146-152.
Deming, W. E., The New Economics (Cambridge, MA: MIT Press, 1993).
Darling, J., A Personalist's Analysis of Statistical Hypotheses and Some Other Rejoinders to Giere's Anti-Positivist Metaphysics, in Cohen, L. J. & Hesse, M. (eds), Applications of Inductive Logic (Oxford: Clarendon Press, 1980).
Dunnette, M. D., Fads, Fashions, and Folderol in Psychology, American Psychologist (1966) pp. 343-352.
Durbin, J., Statistics and Statistical Science: The Address of the President (with Proceedings), Journal of the Royal Statistical Society A (1987) pp. 177-192.
Ehrenberg, A. S. C., We Must Preach What Is Practised, American Statistician (1983) pp. 248-250.
Ehrenberg, A. S. C. & Bound, J. A., Predictability and Prediction (with Discussion), Journal of the Royal Statistical Society A (1993) pp. 167-206.
Emmanuel, C. R., Otley, D. T. & Merchant, K., Accounting for Management Control, 2nd Edn (London: Chapman and Hall, 1990).
Ezzamel, M., Organisation Design: An Overview, in Ezzamel, M. & Hart, H. (eds), Advanced Management Accounting: An Organisational Emphasis, pp. 19-39 (London: Cassell Educational, 1987).
Finch, P. D., On the Role of Description in Statistical Enquiry, British Journal for the Philosophy of Science (1981) pp. 127-144.
Finifter, B. M., The Generation of Confidence: Evaluating Research Findings by Random Subsample Replication, in Costner, H. L. (ed.), Sociological Methodology, pp. 112-175 (San Francisco: Jossey-Bass, 1972).
Fisher, R. A., Development of the Theory of Experimental Design, Proceedings of the International Statistical Conferences (1947) pp. 434-439.
Fisher, R. A., Statistical Methods and Scientific Inference, 2nd Edn (Edinburgh: Oliver and Boyd, 1959).
Fisher, R. A., The Design of Experiments, 8th Edn (Edinburgh: Oliver and Boyd, 1966).
Flack, V. F. & Chang, P. C., Frequency of Selecting Noise Variables in Subset Regression Analysis: A Simulation Study, American Statistician (1987) pp. 84-86.
Freedman, D., A Note on Screening Regression Equations, American Statistician (1983) pp. 152-155.
Freedman, D., Pisani, R. & Purves, R., Statistics (New York: Norton, 1978).
Giere, R. N., Bayesian Statistics and Biased Procedures, Synthese (1969) pp. 371-387.
Giere, R. N., Popper and the Non-Bayesian Tradition: Comments on Richard Jeffrey, Synthese (1975) pp. 119-132.
Giere, R. N., Empirical Probability, Objective Statistical Methods, and Scientific Inquiry, in Harper, W. L. & Hooker, C. A. (eds), Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, Vol. II, pp. 63-101 (Dordrecht: Reidel, 1976).
Gigerenzer, G., Probabilistic Thinking and the Fight Against Subjectivity, in Kruger, L., Gigerenzer, G. & Morgan, M. S. (eds), The Probabilistic Revolution: Ideas in the Sciences, Vol. 2, pp. 11-34 (Cambridge, MA: MIT Press, 1987).
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Kruger, L., The Empire of Chance: How Probability Changed Science and Everyday Life (Cambridge: Cambridge University Press, 1989).
Glenn, N. D., Replications, Significance Tests and Confidence in Findings in Survey Research, Public Opinion Quarterly (1983) pp. 261-269.
Guttman, L., The Illogic of Statistical Inference for Cumulative Science, Applied Stochastic Models and Data Analysis (1985) pp. 3-10.
Haase, R. F., Waechter, D. M. & Solomon, G. S., How Significant is a Significant Difference? Average Effect Size of Research in Counselling Psychology, Journal of Counselling Psychology (1982) pp. 58-65.
Hacking, I., Representing and Intervening (Cambridge: Cambridge University Press, 1983).
Hagood, M. J., The Notion of a Hypothetical Universe, in Morrison, D. E. & Henkel, R. E. (eds), The Significance Test Controversy - A Reader, pp. 65-78 (London: Butterworths, 1970a). First published in Hagood, M. J., Statistics for Sociologists (New York: Reynal & Hitchcock, 1941).
Harré, R. & Secord, P. F., The Explanation of Social Behaviour (Oxford: Blackwell, 1972).
Harsha, P. D. & Knapp, M. C., The Use of Within- and Between-subjects Experimental Designs in Behavioral Accounting Research: A Methodological Note, Behavioral Research in Accounting (1990) pp. 50-62.
Hays, W. L., Statistics, 2nd Edn (New York: Holt, Rinehart & Winston, 1973).
Hofstedt, T. R., Behavioral Accounting Research: Pathologies, Paradigms and Prescriptions, Accounting, Organizations and Society (1976) pp. 43-58.
Hopwood, A., Accounting and Human Behaviour (Englewood Cliffs, NJ: Prentice-Hall, 1976).
Hopwood, A., Behavioral Accounting in Retrospect and Prospect, Behavioral Research in Accounting (1989) pp. 1-22.
Hubbard, R. & Armstrong, J. S., Replications and Extensions in Marketing: Rarely Published but Quite Contrary, International Journal of Research in Marketing (1994) pp. 233-248.
Johnstone, D. J., Tests of Significance in Theory and Practice, Statistician (1986) pp. 491-498.
Johnstone, D. J., Tests of Significance Following R. A. Fisher, British Journal for the Philosophy of Science (December 1987) pp. 481-499.
Kaplan, R. S., The Role for Empirical Research in Management Accounting, Accounting, Organizations and Society (1986) pp. 429-452.
Kempthorne, O., The Design and Analysis of Experiments (New York: Wiley, 1952).
Kempthorne, O., Logical, Epistemological and Statistical Aspects of Nature-Nurture Data Interpretation, Biometrics (1978) pp. 1-23.
Kempthorne, O., Sampling Inference, Experimental Inference, and Observation Inference, Sankhya (1979) pp. 115-145.
Kempthorne, O., A Review of R. A. Fisher: An Appreciation (Fienberg, S. E. & Hinkley, D. V., eds), Journal of the American Statistical Association (1983) pp. 482-490.
Kendall, M. G. & Stuart, A., The Advanced Theory of Statistics, Vol. 1, 4th Edn (London: Charles Griffin, 1977).
Kingman, J., Statistical Responsibility, Journal of the Royal Statistical Society A (1989) pp. 277-285.
Kinney, W. R., Jr, Empirical Accounting Research Design for Ph.D. Students, Accounting Review (April 1986) pp. 338-350.
Kren, L. & Liao, W. M., The Role of Accounting Information in the Control of Organizations: A Review of the Evidence, Journal of Accounting Literature (1988) pp. 280-309.
Kruskal, W., Comment on "Field Experimentation in Weather Modification" by R. E. Braham, Journal of the American Statistical Association (1979) pp. 84-86.
Kuhn, T., The Structure of Scientific Revolutions, 2nd Edn (Chicago: University of Chicago Press, 1970).
Kyburg, H. E., Jr, Probability and the Logic of Rational Belief (Middletown, CT: Wesleyan University Press, 1961).
Kyburg, H. E., Jr, Philosophy of Science: A Formal Approach (New York: MacMillan, 1968).
Kyburg, H. E., Jr, The Logical Foundations of Statistical Inference (Dordrecht: Reidel, 1974).
Lakatos, I., Falsification and the Methodology of Scientific Research Programmes, in Lakatos, I. & Musgrave, A. (eds), Criticism and the Growth of Knowledge, pp. 91-196 (Cambridge: Cambridge University Press, 1970).
Laudan, L., A Problem-Solving Approach to Scientific Progress, in Hacking, I. (ed.), Scientific Revolutions, pp. 144-155 (Oxford: Oxford University Press, 1981).
Leftwich, R., Aggregation of Test Statistics: Statistics vs. Economics, Journal of Accounting and Economics (1990) pp. 37-44.
Leontief, W., Theoretical Assumptions and Nonobserved Facts (Presidential Address), American Economic Review (1971) pp. 1-7.
Lindsay, R. M., The Use of Tests of Significance in Accounting Research: A Methodological, Philosophical and Empirical Enquiry, Unpublished Ph.D. dissertation, University of Lancaster (October 1988).
Lindsay, R. M., Incorporating Power Into Our Statistical Tests: A Methodological and Empirical Inquiry, Behavioral Research in Accounting (1993a) pp. 211-236.
Lindsay, R. M., Achieving Scientific Knowledge: The Rationality of Scientific Method, in Lyas, C. A., Mumford, M., Peasnell, K. V. & Stamp, P. C. E. (eds), Justifying Accounting Standards: Some Philosophical Dimensions, pp. 221-254 (London: Routledge, 1993b).
Lindsay, R. M., Publication System Biases Associated with the Statistical Testing Paradigm, Contemporary Accounting Research (Summer, 1994).
Lindsay, R. M. & Ehrenberg, A. S. C., The Design of Replicated Studies, American Statistician (August 1993) pp. 217-228.
Lovell, M. C., Data Mining, Review of Economics and Statistics (1983) pp. 1-12.
Lukka, K. & Kasanen, E., Generalisability in Accounting Research, Presented at the European Accounting Association, Turku, Finland (April 1993).
Lykken, D. T., Statistical Significance in Psychological Research, Psychological Bulletin (1968) pp. 151-159. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 267-279 (London: Butterworths, 1970).
Maxwell, G., Induction and Empiricism: A Bayesian-Frequentist Alternative, in Maxwell, G. & Andersson, R. M. (eds), Induction, Probability and Confirmation, pp. 106-165 (Dordrecht: Reidel, 1975).
Mayo, D. G., Behaviouristic, Evidentialist, and Learning Models of Statistical Testing, Philosophy of Science (1985) pp. 493-516.
Mayo, O., Comment on "Randomization and the Design of Experiments" by P. Urbach, Philosophy of Science (1987) pp. 592-596.
Meehl, P. E., Theory Testing in Psychology and Physics: A Methodological Paradox, Philosophy of Science (1967) pp. 103-115. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader (London: Butterworths, 1970) pp. 252-266.
Meehl, P. E., Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology, Journal of Consulting and Clinical Psychology (1978) pp. 806-834.
Meehl, P. E., What Social Scientists Don't Understand, in Fiske, D. W. & Shweder, R. A. (eds), Metatheory in Social Science: Pluralisms and Subjectivities, pp. 315-338 (Chicago: University of Chicago Press, 1986).
Miller, D., Toward a New Contingency Approach: The Search for Organizational Gestalts, Journal of Management Studies (1981) pp. 1-26.
Mintzberg, H., An Emerging Strategy of "Direct" Research, Administrative Science Quarterly (December 1979) pp. 582-589.
Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader (London: Butterworths, 1970a).
Morrison, D. E. & Henkel, R. E., Significance Tests in Behavioral Research and Beyond, in Morrison, D. E. & Henkel, R. E. (eds), The Significance Test Controversy - A Reader, pp. 305-311 (London: Butterworths, 1970b).
Mosteller, F. & Tukey, J. W., Data Analysis and Regression: A Second Course in Statistics (Reading, MA: Addison-Wesley, 1977).
Nelder, J. A., Statistics, Science and Technology (Presidential Address), Journal of the Royal Statistical Society A (1986) pp. 109-121.
Newton-Smith, W. H., The Rationality of Science (London: Routledge & Kegan Paul, 1981).
Neyman, J. & Pearson, E. S., On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference (Part I), Biometrika A (1928) pp. 175-240.
Neyman, J. & Pearson, E. S., On the Problem of the Most Efficient Tests of Statistical Hypotheses, Philosophical Transactions of the Royal Society A (1933) pp. 289-337. Reprinted in Neyman, J. & Pearson, E. S., Joint Statistical Papers (Cambridge: Cambridge University Press, 1967).
Normann, R., A Personal Quest for Methodology (Scandinavian Institutes for Administrative Research, 1973).
Nunnally, J., The Place of Statistics in Psychology, Educational and Psychological Measurement (1960) pp. 641-650.
Oakes, M., Statistical Inference: A Commentary for the Social and Behavioral Sciences (New York: Wiley, 1986).
O'Grady, K. E., Measures of Explained Variance: Cautions and Limitations, Psychological Bulletin (1982) pp. 766-777.
Pearson, E. S., "Student" as Statistician, Biometrika (1939) pp. 210-250.
Popper, K. R., The Logic of Scientific Discovery (London: Hutchinson, 1959).
Popper, K. R., Conjectures and Refutations (London: Routledge & Kegan Paul, 1963).
Putnam, H., The "Corroboration" of Theories, in Hacking, I. (ed.), Scientific Revolutions, pp. 60-79 (Oxford: Oxford University Press, 1981).
Ravetz, J. R., Scientific Knowledge and Its Social Problems (New York: Oxford University Press, 1971).
Reichenbach, H., The Theory of Probability, 2nd Edn (Berkeley: University of California Press, 1949).
Rencher, A. C. & Pun, F. C., Inflation of R2 in Best Subset Regression, Technometrics (1980) pp. 49-53.
Ronen, J. & Livingstone, J. L., Motivational Impact of Budgets, Accounting Review (October 1975) pp. 671-685.
Ryan, T. A., Multiple Comparisons in Psychological Research, Psychological Bulletin (1959) pp. 26-47.
Ryan, T. A., Ensemble-adjusted p Values: How Are They to be Weighted?, Psychological Bulletin (1985) pp. 521-526.
Salsburg, D. S., The Religion of Statistics as Practised in Medical Journals, American Statistician (August 1985) pp. 220-223.
Savage, L. J., The Foundations of Statistics Reconsidered, in Neyman, J. (ed.), Proceedings of the Fourth Berkeley Symposium on Mathematics and Probability, Vol. 1, pp. 575-586 (Berkeley: University of California Press, 1961).
Sayer, A., Method in Social Science: A Realist Approach (London: Hutchinson, 1984).
Scapens, R. W., Researching Management Accounting Practice: The Role of Case Study Methods, British Accounting Review (1990) pp. 259-281.
Scapens, R. W., The Role of Case Study Methods in Management Accounting Research: A Personal Reflection and Reply, British Accounting Review (1992) pp. 369-383.
Seidenfeld, T., Philosophical Problems of Statistical Inference: Learning from R. A. Fisher (Dordrecht: Reidel, 1979).
Seidenfeld, T., Levi on the Dogma of Randomization in Experiments, in Bogdan, R. J. (ed.), Profiles: Henry E. Kyburg, Jr & Isaac Levi (Dordrecht: Reidel, 1982).
Shields, M. D. & Young, S. M., Antecedents and Consequents of Participative Budgeting: Evidence on the Effects of Asymmetrical Information, Journal of Management Accounting Research (Fall 1993) pp. 265-280.
Skipper, J. K., Jr, Guenther, A. L. & Nass, G., The Sacredness of .05: A Note Concerning the Uses of Statistical Levels of Significance in Social Science, American Sociologist (February 1967) pp. 16-18. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 155-160 (London: Butterworths, 1970a).
Spielman, S., The Logic of Tests of Significance, Philosophy of Science (1974) pp. 211-226.
Stuart, A., The Ideas of Sampling (London: Charles Griffin, 1984).
Subotnik, D., Wisdom or Widgets: Whither the School of "Business"?, Abacus (1988) pp. 95-106.
Suppe, F., Critical Introduction and Afterword, in Suppe, F. (ed.), The Structure of Scientific Theories, 2nd Edn, pp. 3-241, 617-730 (Urbana: University of Illinois Press, 1977).
Tukey, J. W., Exploratory Data Analysis (Reading, MA: Addison-Wesley, 1977).
Tukey, J. W., The Philosophy of Multiple Comparisons, Miller Memorial Lecture, Stanford University (1989).
Umapathy, S., Unfavorable Variances in Budgeting: Analysis and Recommendations, in Ferris, K. R. & Livingstone, J. L. (eds), Management Planning and Control, Revised Edn, pp. 163-176 (Beavercreek, OH: Century VII, 1987).
Urbach, P., Randomization and the Design of Experiments, Philosophy of Science (1985) pp. 256-273.
Walster, G. W. & Cleary, T. A., A Proposal for a New Editorial Policy in the Social Sciences, American Statistician (April 1970) pp. 16-19.
Waterhouse, J. & Richardson, A., Behavioral Research Implications of the New Management Accounting Environment, Working paper, University of Alberta (1989).
Wilson, F. D., Smoke, G. L. & Martin, J. D., The Replication Problem in Sociology: A Report and a Suggestion, Sociological Inquiry (1973) pp. 141-149.
Winch, R. F. & Campbell, D. T., Proof? No. Evidence? Yes. The Significance of Tests of Significance, American Sociologist (1969) pp. 140-143. Reprinted in Morrison, D. E. & Henkel, R. E., The Significance Test Controversy - A Reader, pp. 199-206 (London: Butterworths, 1970).
Winner, J. A. & Clayton, M. K., On Objectivity and Subjectivity in Statistical Inference: A Response To Mayo, Synthese (1986) pp. 369-379.
Yates, F., The Influence of Statistical Methods for Research Workers on the Development of the Science of Statistics, Journal of the American Statistical Association (1951) pp. 19-34.
Yates, F., Sir Ronald Fisher and the Design of Experiments, Biometrics (1964) pp. 307-321.
Yin, R. K., Case Study Research, Revised Edn (Newbury Park, CA: Sage, 1989).
Yule, G. U. & Kendall, M. G., An Introduction to the Theory of Statistics, 14th Edn (London: Charles Griffin, 1950).