Comprehensive Summaries of Uppsala Dissertations from the Faculty of Social Sciences 126
ESSAYS ON MODEL ASSISTED SURVEY PLANNING
BY
ANDERS HOLMBERG
ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2003
PAPERS INCLUDED IN THE THESIS
This thesis is based on the following papers, which are referred to in the text by their Roman numerals:
I Holmberg, A. and Swensson, B. (2001). On Pareto πps sampling: Reflections on unequal probability sampling strategies. In Theory of Stochastic Processes vol. 7(23), no. (1-2): 142–155.
II Holmberg, A. (2001). On the choice of strategy in unequal probability sampling. In Proceedings of the Annual meeting of the American Statistical Association, August 5–9, 2001, Section on Survey Research Methods. American Statistical Association, CD media (Amended version).
III Holmberg, A. (2002). A multiparameter perspective on the choice of sampling design in surveys. In Statistics in Transition vol. 5, no. (6): 969–994.
IV Holmberg, A., Flisberg, P. and Rönnqvist, M. (2002). On the choice of optimal design in business surveys with several important study variables. In R&D Report 2002:3, Statistiska Centralbyrån (SCB), metodenheten (Submitted).
Reprints were made with the permission of the publishers.
Contents
Acknowledgements
1 Introduction
1.1 Concepts of survey planning in this thesis
1.1.1 On the use of auxiliary, or supplementary, information
1.2 Using models to assist survey planning
1.3 General definitions
1.3.1 Sampling design and inclusion probabilities
1.3.2 Descriptive planning stage models
1.4 Auxiliary variables and the strategy choice
1.4.1 The estimator choice
1.4.2 The GREG estimator for population totals
1.4.3 Use of auxiliary variables in the sampling design, the single parameter case
1.4.4 Probability proportional-to-size sampling (πps sampling)
1.4.4 The Anticipated Variance
1.4.5 Planning values and guesstimates
1.5 The multivariate sampling problem
2 Summary of the papers
2.1 Reflections on unequal probability sampling strategies
2.2 On the choice of strategy in unequal probability sampling
2.3 A multiparameter perspective on the choice of sampling design in surveys
2.4 On the choice of optimal sampling design in business surveys with several important study variables
2.5 Minor comments and concluding remarks
References
Acknowledgements
It is a special pleasure to write this part of the thesis, namely the expression of my indebtedness to people and institutions whose help has made this work possible. This thesis marks the end of my graduate studies, and over the years many have directly or indirectly contributed to the outcome. However, as always the author accepts the full responsibility for any inadequacies. The most important person to whom I wish to express my true and deepest gratitude is Professor Bengt Swensson. From the beginning, he has had a major influence on this work and on my development within the field of statistics. With great knowledge, insight, scrutiny and patience he has read and commented on numerous manuscripts. This thesis and the papers on which it is based would not have developed to their present standard, had it not been for Bengt. I also wish to thank Bengt’s wife Lena for her patience and understanding of my need to sometimes consult Bengt outside office hours. Bengt has also co-authored one paper. Special credit must be given to him as well as to Patrik Flisberg and Mikael Rönnqvist for co-authoring another paper. Their efforts have had a direct impact on paper IV and it has been stimulating and rewarding working with them. This work is the fruit of efforts I have made both at the Department of Statistics, Örebro University, and Statistics Sweden’s R&D Department. Consequently, there are many people who have actively and morally supported my efforts. I am especially grateful to my present and former managers at Statistics Sweden, Eva Elvers, Agneta Lissmats, Evert Blom and Lars Lyberg, for their encouragement and for allowing me to spend parts of my working hours on this thesis. In a similar way, I am also indebted to Olle Häggbom and Erik Wallgren at Örebro University. Olle, as head of department, helped by finding means for adequate funding and Erik, as director of studies, helped by planning and administering reasonable teaching schedules.
I also gratefully acknowledge the financial support provided by Adolf Lindgrens Stiftelse.
In addition, I want to thank personnel at Uppsala University, my formal supervisor Anders Christoffersson (who took important initial steps regarding my admission as a graduate student) and Gunilla Klaar as well as Anders Ågren (for appreciated help on practical as well as administrative matters throughout the years). There are many colleagues and friends I wish to thank for showing understanding and interest in my work. In particular, I thank Martin Axelson. Over the years we have shared statistical problems, offices, train compartments and leisure, and without his sense of humour and readiness to discuss both serious and tiny matters, this work would have been less joyful. As dear colleagues and friends (also teachers) at Örebro University, Olle Carlsson, Hans Hedén, Åke Holmén (who, alas, did not live to read this), Niklas Karlsson and Erik Wallgren have given me invaluable experience in various fields of statistics as well as explaining how it is taught and practised. One of the privileges at Statistics Sweden has been the close access to inspiring discussions with skilful survey statisticians. I am obliged to Claes Andersson (for sharing his vast experience and all his tips and solutions helpful to me when I have got stuck in daily work), to Lennart Nordberg (for encouraging detailed discussions and for providing me with realistic data sets), to Sixten Lundström and Håkan L. Lindström (for their excellent tutoring which inspired my career choice) and to Carl-Erik Särndal (for always showing an interest in my work, giving me constructive criticism, discussing innovative ideas and sharing his knowledge). Collectively, I also acknowledge the support from my colleagues at the R&D Department and those I have closely worked with from other departments of Statistics Sweden as well as at Örebro University. Without your assistance and fine co-operation in different projects, this work would have been harder. 
Finally, I wish to give tribute to my family and closest friends, especially my parents who did not live to see this, my two sons John and Filip and (at the final stages) Britta. To have your trust and support has been priceless. Thank you! Vadstena, March 2003 Anders Holmberg
1 Introduction
Whenever there is a need to obtain information on characteristics of a large finite population of people, households, business enterprises or other elementary units, it is natural to consider a sample survey. In most cases, a carefully planned and well-executed sample survey will provide more precise and timelier information at a lower cost than a complete enumeration would. Obviously, taking a sample means that less time to collect and process data is required, but the huge and unwieldy organization required for a complete enumeration often also implies less control of data collection, resulting in many errors. The present thesis treats some of the statistical questions that arise in the planning process of a sample survey, in particular questions concerning the combined choice of a sampling design, a sample selection scheme and estimator(s).
1.1 Concepts of survey planning in this thesis
To plan a medium-sized or large sample survey is a complex undertaking involving a large number of considerations, decisions and choices on the part of a team of – among others – subject matter experts, survey measurement specialists, computer specialists and survey statisticians. In the discussion of sample survey planning in the present thesis the focus is on some of the statistical tasks and the assumption is that much planning work already has been done, which means that far from all aspects of survey planning will be covered. For example, neither tasks such as the formulation of a subject matter problem and its translation to a statistical problem, nor the choice of measurement device and editing procedures (although they are all fundamental in the planning process) will be discussed here. The point of departure for the discussion in the thesis is as follows: a sample survey is to be taken in order to estimate certain well-defined characteristics (parameters) of a well-defined finite population for which an up-to-date good sampling frame exists, i.e. a frame which besides information that makes it possible to get observational contact with selected elements, contains useful data on a set of auxiliary variables. The focus will be on the case where the frame is a direct listing of the population elements, making direct sampling of elements possible; examples are business registers kept by many countries, and registers of the total population found in some European countries. A typical survey has many study variables and aims to estimate many parameters. Common parameters are totals (e.g. the total monthly value of imported commodities) and functions of totals such as means (e.g. the average income of adult wage-earners), proportions (e.g. the proportion of cultivated area devoted to wheat) and ratios (e.g. the ratio of rental income of commercial premises to that of rented area). Hence, one important goal for the survey statistician in the planning phase of the survey is to find a strategy – a combination of sampling design and estimators for the most important parameters – which gives the best possible (most precise) parameter estimates for a given cost, or conversely to obtain estimates with a desired precision at lowest possible cost. Survey sampling theory offers instruments to achieve such a goal. However, for the purpose of coming to a decision on the best feasible strategy, survey theory should be applied as early as possible in the planning process, and it must be balanced by practical and budgetary restrictions.
1.1.1 On the use of auxiliary, or supplementary, information
In a broad sense, auxiliary information – supplementary information is an often-used essentially synonymous expression – is any information of assistance about the population available prior to a survey. For instance, in questionnaire design the language and the ordering of questions could be tailored to suit some characteristics of the respondents, and when data collection methods are specified, decisions about the measurement instrument and decisions on how to implement technical equipment could be greatly improved. In the present thesis, ‘supplementary information’ will be used in a broad sense. In contrast, ‘auxiliary information’ will be reserved for the following specific kind of information, consisting of two related components: (i) data on a set of variables (which will be called auxiliary variables) available at the planning phase as well as at the estimation phase for every population element; (ii) appropriate a priori knowledge of the structure of relationships between important study variables and (subsets of) the auxiliary variables. In the search for estimators of high precision, survey sampling theory strongly emphasizes the use of auxiliary information. Often, the sampling frame is equipped with at least a few auxiliary variables that may be useful, while subject matter experts and the survey statistician may have knowledge of the relationship between study variables and auxiliary variables from
similar surveys of similar populations. If no such useful knowledge is available, it will often be worthwhile to make a pilot survey. The use of auxiliary information can be traced back to the early days when the foundations of sampling theory were laid. In Neyman’s classical paper from 1934, one of the major contributions is the explicit use of auxiliary information. Neyman suggested a partition of the population into strata followed by the selection of a probability sample in each stratum. Cochran (1942) is an early reference which introduces auxiliary information in the estimation procedure. Cochran discussed the regression method of estimation in detail. Since Neyman, much of the development of sampling theory has evolved around the use of auxiliary information and auxiliary variables. In an introduction to Neyman’s paper, Tore Dalenius states:
“As demonstrated by the developments in the last half-century, supplementary information may be exploited for all aspects of the sample design, the definition of sampling units, the selection design and the estimation method.” (Dalenius 1992)
Because auxiliary variables can be used in a wide variety of ways at the design stage as well as at the estimation stage, efforts to use this information to find efficient estimators lead to many techniques for sampling and estimation. Results on optimum allocation in stratified sampling, on probability proportional-to-size sampling, and on regression or calibration estimators all rely on properties of auxiliary variables. Recent references that illustrate this wide variety are Laaksonen (2002), who attempts to systematize auxiliary variables into different types, and Estevao and Särndal (2002), who show
that there are ten different cases where auxiliary information may be used for calibration in two-phase sampling. In all of the papers in the present thesis, auxiliary information plays a key role in the planning process. The first paper studies two issues related to the planning stage: (i) the sensitivity of model assumptions concerning the relation between the size measure and a study variable in without replacement probability proportional-to-size sampling (πps sampling), and (ii) properties of practicable sample selection schemes for fixed size πps sampling. These two issues are also addressed in the second paper, which furthermore discusses the consequences of the presence of more than one study variable and to what extent the auxiliary information used in the design and that used in the estimators interact. The evident problem in both the first and the second paper is how to choose an overall efficient sampling design when there are several important study variables with various relationships to the available auxiliary variables. The third paper suggests a diagnostic tool to support the choice of design, and on the basis of three criteria of overall efficiency optimal designs are derived. The optimal designs presented in the third paper may not be fully satisfactory in meeting specified precision requirements for separate estimators. To achieve a design that is tailor-made to meet such requirements, optimisation must be done under restrictions. Though the underlying optimisation problem is only outlined in paper III, a solution involving non-linear programming methods is given in the fourth paper. By way of an example based on an application to a Swedish business population, the fourth paper also compares a design obtained through non-linear programming algorithms with
designs discussed in paper III as well as designs based on the same basic ideas as those discussed in the first two papers. The results presented in the thesis are applicable in multipurpose surveys with several important study variables, where auxiliary information is available and a probability proportional-to-size sampling design is considered, e.g. business and agricultural surveys repeated over time. The discussion, notation and methods presuppose that without replacement direct element sampling is used. However, with proper adjustments, the results should also be of potential use in multipurpose, multistage area sampling, where the first-stage units are chosen proportional to size. In the following, a more detailed theoretical background is given first. A summary of the major contributions of the four papers, together with some comments and remarks, then concludes.
1.2 Using models to assist survey planning
“Models are appropriately used to guide and evaluate the design of probability samples, but with large samples the inferences should not depend on the model.” Hansen, Madow and Tepping (1983)
One thing that distinguishes survey sampling theory from the mainstream of statistical theory is that inference concerns parameters of an actual finite population and not unknown parameters of an otherwise (assumed to be) known frequency function of some hypothetical population. The latter is commonly the case in general statistical theory and means that the interest lies in inference about model parameters. However, since the usual target of our inference in survey sampling is unknown parameters of the finite population, models are usually used differently.
Two main types of inferences are considered in the survey literature, design-based and model-based inference. In design-based inference, the finite population is considered fixed and the variable values are fixed. In model-based inference, the values of the variables in the finite population are assumed realizations from a superpopulation model. In this thesis, the concern is with design-based inference. The population is fixed, and the study variable values are fixed (but unknown). The stochastic structure is induced by the sampling design, and the sample itself is the random element. However, the superpopulation concept will be used as well, but differently from its use in model-based inference. Superpopulation models will be used merely (i) as a tool for giving compact descriptions of useful a priori knowledge about relations between the study variables and available auxiliary variables, e.g. for describing basic features of finite population scatterplots, and (ii) as guides in the search for efficient strategies. Hence, inference will not depend on the model, but the model will assist it in the sense that the better the superpopulation model depicts the finite population, the better we can hope to succeed in our task of estimating the finite population parameters.
1.3 General definitions
Let $U = \{1, \ldots, k, \ldots, N\}$ be our finite population consisting of $N$ elements. Consider $Q$ study variables $y_1, \ldots, y_q, \ldots, y_Q$, and let $y_{qk}$ denote the unknown value of $y_q$ for population element $k$. In addition, we suppose that there are $P$ auxiliary variables at hand. They are denoted $u_1, \ldots, u_p, \ldots, u_P$, and their values $u_{pk}$ $(p = 1, \ldots, P)$ are known for every element $k$ in the population. Furthermore, we will concentrate on the estimation of population totals, i.e. $t_{y_q} = \sum_{k \in U} y_{qk} = \sum_U y_{qk}$.
1.3.1 Sampling design and inclusion probabilities
To select a set sample $s \subset U$ of size $n$, a without replacement sampling design, $p(\cdot)$, will be used. The sampling design is the function that gives each possible sample its probability of being selected, and it is essential for determining the properties of estimators to be used in the survey. Fundamental characteristics of a given sampling design are the inclusion probabilities of first and second order. The first-order inclusion probability of element $k$ is the probability that a sample containing the $k$:th population element is realized. We denote this by $\pi_k$ $(k = 1, \ldots, N)$. The second-order inclusion probability (the probability that a sample containing elements $k$ and $l$ is realized) will be denoted by $\pi_{kl}$ $(k, l = 1, \ldots, N)$. They are mainly of importance for variance calculations. Throughout the thesis it is assumed that $\pi_k > 0$ and $\pi_{kl} > 0$ for every $k$ and $l$.
Since $p(\cdot)$ often is a complicated function it is of little direct practical use. However, knowledge of the $\pi_k$ and the $\pi_{kl}$ alone is normally sufficient for our needs, e.g. for determining expected values and variances of estimators. The planning approach to come up with a good strategy for a sample survey adopted in this thesis results in a decision on a class of preferred sampling designs characterized by a preferred set of first-order inclusion probabilities. To implement a design conforming to this set we need a sample selection scheme. Often, but not always, we can find such a scheme. Sometimes, however, there is no (practical) scheme that implements a design with a set of first-order inclusion probabilities which exactly match the preferred ones. The latter is the case for (medium to large) fixed size probability proportional-to-size sampling designs. Without replacement, probability proportional-to-size sampling designs (πps designs) are present in all four papers of this thesis. Brief summaries of the basic ideas behind these designs are given in the papers and can be found in most standard textbooks.
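The relation between a design and its inclusion probabilities can be made concrete for a population small enough to enumerate. The following is a minimal sketch, assuming simple random sampling without replacement as the design; the function names `srs_design`, `first_order` and `second_order` are illustrative and do not come from the thesis.

```python
from fractions import Fraction
from itertools import combinations


def srs_design(N, n):
    """Return p(s) for simple random sampling without replacement.

    The design is a dict mapping each possible sample (a tuple of element
    labels 1..N) to its selection probability.
    """
    samples = list(combinations(range(1, N + 1), n))
    prob = Fraction(1, len(samples))
    return {s: prob for s in samples}


def first_order(design, k):
    """pi_k: total probability of the samples that contain element k."""
    return sum(p for s, p in design.items() if k in s)


def second_order(design, k, l):
    """pi_kl: total probability of the samples that contain both k and l."""
    return sum(p for s, p in design.items() if k in s and l in s)


design = srs_design(N=4, n=2)
pi_1 = first_order(design, 1)
pi_12 = second_order(design, 1, 2)
# For a fixed size design the first-order probabilities sum to n.
pi_sum = sum(first_order(design, k) for k in range(1, 5))
```

Enumeration of this kind is only feasible for toy populations, which is one way to see why the $\pi_k$ and $\pi_{kl}$, rather than $p(\cdot)$ itself, are the working quantities.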
1.3.2 Descriptive planning stage models
There must be a useful and well-guessed relationship between the auxiliary variables and the study variables if the auxiliary variables are to be of assistance in the choice of strategy. In the following, we assume that the structure of such relationships can be caught by linear models $\xi$ such that

$$y_{qk} = \mathbf{x}_{qk}' \boldsymbol{\beta}_q + \varepsilon_{qk}$$

with $E_\xi(\varepsilon_{qk}) = 0$, $V_\xi(\varepsilon_{qk}) = \sigma_{qk}^2$ and $E_\xi(\varepsilon_{qk} \varepsilon_{ql}) = 0$ $(k \neq l)$, i.e. for $k = 1, \ldots, N$,

$$E_\xi(y_{qk}) = \mathbf{x}_{qk}' \boldsymbol{\beta}_q, \qquad V_\xi(y_{qk}) = \sigma_{qk}^2 \tag{1}$$

where $\mathbf{x}_{qk}' = (x_{q1k}, \ldots, x_{qjk}, \ldots, x_{qJk})$ is a suitable set of (positive) auxiliary variables formed from $u_1, \ldots, u_p, \ldots, u_P$, and where $\boldsymbol{\beta}_q = (\beta_{q1}, \ldots, \beta_{qj}, \ldots, \beta_{qJ})'$ and $\sigma_{q1}^2, \ldots, \sigma_{qN}^2$ are model parameters.
1.4 Auxiliary variables and the strategy choice
As mentioned previously, the available auxiliary information will influence the survey statistician’s final choice of strategy. The general problem of finding the, in some sense, best strategy has been studied by different authors. In such studies, it seems necessary to accept certain restrictions of the general problem. Sometimes restrictions are made regarding specific classes of estimators, and sometimes the emphasis is placed solely on finding an efficient sampling design with certain properties. Horvitz and Thompson (1952) and Godambe (1955) are classical references that approach the problem, restricting themselves to a class of linear estimators, whilst Neyman (1934) and Dalenius (1950) are examples of references where the sampling design has been focused upon.
1.4.1 The estimator choice
In this thesis, one restriction is that we consider the family of estimators known as GREG (generalized regression) estimators (for details see Särndal, Swensson and Wretman (1992), sections 6.4-6.7). It is a wide family of estimators, however, which includes several well-known estimators used in practice. With use of a model like the one given in equation (1) as a starting point for a GREG estimator, different model expressions give us estimators such as the post-stratified estimator and the simple ratio estimator. A GREG estimator is approximately unbiased even with poorly chosen models, but with a strong relationship between $y$ and $x$, and a fair knowledge of that relationship, the GREG estimator will, as far as efficiency is concerned, outperform the well-known Horvitz-Thompson estimator which does not use auxiliary information.
1.4.2 The GREG estimator for population totals
The GREG estimator for a population total is defined as

$$\hat{t}_{y_q r} = \hat{t}_{y_q \pi} + (\mathbf{t}_{x_q} - \hat{\mathbf{t}}_{x_q \pi})' \hat{\mathbf{B}}_q. \tag{2}$$

Here, $\hat{t}_{y_q \pi} = \sum_{k \in s} y_{qk}/\pi_k = \sum_s y_{qk}/\pi_k$ is the Horvitz-Thompson, or π estimator, $\mathbf{t}_{x_q} = (t_{x_{q1}}, \ldots, t_{x_{qj}}, \ldots, t_{x_{qJ}})'$ is a $J$-dimensional vector of $x$ totals, $\hat{\mathbf{t}}_{x_q \pi}$ is a vector of the corresponding π estimators and

$$\hat{\mathbf{B}}_q = \left( \sum_s \frac{\mathbf{x}_{qk} \mathbf{x}_{qk}'}{c_{qk} \pi_k} \right)^{-1} \sum_s \frac{\mathbf{x}_{qk} y_{qk}}{c_{qk} \pi_k} \tag{3}$$

is an estimated vector of regression coefficients, where $c_{qk}$ is a suitable constant.

The Taylor expansion (design) variance of $\hat{t}_{y_q r}$ is given by

$$V(\hat{t}_{y_q r}) = \sum\sum_U (\pi_{kl} - \pi_k \pi_l) \frac{E_{qk}}{\pi_k} \frac{E_{ql}}{\pi_l} \tag{4}$$

where $E_{qk} = y_{qk} - \mathbf{x}_{qk}' \mathbf{B}_q$ $(k = 1, \ldots, N)$ are finite population fit residuals, with $\mathbf{B}_q = \left( \sum_U \mathbf{x}_{qk} \mathbf{x}_{qk}' / c_{qk} \right)^{-1} \sum_U \mathbf{x}_{qk} y_{qk} / c_{qk}$ a finite population regression coefficient. The issue of variance estimation will not be discussed in detail in this thesis. However, for the strategies we consider, we require that at least approximately unbiased estimators for (4) are available.
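The point estimators in this section can be sketched in a few lines of code. What follows is an illustration under stated assumptions, not an implementation taken from the thesis: the helper names (`solve`, `ht_total`, `greg_total`), the toy population and the inclusion probabilities are all made up, a single study variable is used so the index $q$ is dropped, and the small linear solver has no protection against a singular system. The toy data are exactly linear in the auxiliary vector, a case in which the GREG estimate reproduces the population total while the π estimate does not.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def ht_total(values, pi):
    """Horvitz-Thompson (pi) estimator: sum over the sample of value / pi_k."""
    return sum(v / p for v, p in zip(values, pi))


def greg_total(y, x, pi, c, t_x):
    """GREG estimator of a total, using sample data and known x totals.

    y: sampled study values; x: sampled auxiliary vectors (length J each);
    pi: inclusion probabilities; c: the constants c_k; t_x: known x totals.
    """
    J = len(t_x)
    w = [1.0 / (ck * pk) for ck, pk in zip(c, pi)]
    xtx = [[sum(wk * xk[i] * xk[j] for wk, xk in zip(w, x)) for j in range(J)]
           for i in range(J)]
    xty = [sum(wk * xk[i] * yk for wk, xk, yk in zip(w, x, y)) for i in range(J)]
    b_hat = solve(xtx, xty)                      # estimated regression coefficients
    t_x_hat = [ht_total([xk[j] for xk in x], pi) for j in range(J)]
    return ht_total(y, pi) + sum((t_x[j] - t_x_hat[j]) * b_hat[j] for j in range(J))


# Hypothetical population: x_k = (1, u_k) and y_k = 2 + 3*u_k exactly.
u = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [2.0 + 3.0 * uk for uk in u]
t_x = [float(len(u)), sum(u)]          # known totals of the x-variables

sample = [0, 2, 4]                     # 0-based indices of the sampled elements
pi_s = [0.3, 0.6, 0.9]                 # their (made-up) inclusion probabilities
y_s = [y[k] for k in sample]
x_s = [[1.0, u[k]] for k in sample]
c_s = [1.0] * len(sample)

t_ht = ht_total(y_s, pi_s)
t_greg = greg_total(y_s, x_s, pi_s, c_s, t_x)
```

With real data the fit is never exact, and the gain over the π estimator depends on how small the residuals $E_k$ are.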
1.4.3 Use of auxiliary variables in the sampling design, the single parameter case
By far the most important application of auxiliary variables in the sampling design is that of stratification. As mentioned above, it goes back to Neyman and it can be justified for several reasons. General reasons given in Särndal et al. (section 3.7) are:
1. Precision requirements of estimates for certain subpopulations, domains, can be assured by using domains as strata.
2. Practical aspects such as nonresponse rates, method of measurement and the quality of auxiliary information may differ between subpopulations, and can be efficiently handled by stratification.
3. The survey organization may be divided into geographical districts, which makes it natural for administrative reasons to let each district be a stratum.
In addition, when the stratification is made on an auxiliary variable which is well-correlated with the important study variables, stratified sampling ensures a good basis for obtaining efficient estimates. Although it is elementary on its own, stratified sampling may be combined with other methods of sampling yielding rather complex sampling designs, e.g. one may consider stratified sampling with probability proportional-to-size sampling (πps sampling) within strata.
1.4.4 Probability proportional-to-size sampling (πps sampling)
As the name proportional-to-size indicates, the elements for a πps sample are selected with probabilities proportional to a size measure. The size measure, $z$, is one of the auxiliary variables $u_p$, or a function (transformation) of one or several auxiliary variables, i.e. $z = h(\mathbf{u})$. Its value has to be known for every population element, and ideally, if the simple π estimator is to be used, it should be proportional to the study variable. We illustrate the ideas behind πps sampling by first looking at the π estimator for a single study variable $y$, $\hat{t}_{y\pi} = \sum_s y_k/\pi_k$. We note that, if the inclusion probabilities $\pi_k$ could be made proportional to $y_k$, then the ratio $y_k/\pi_k$ is constant, say $c$, and for every (fixed size design) sample we would get $\hat{t}_{y\pi} = \sum_s y_k/\pi_k = nc$, which has zero variance. However, proportional inclusion probabilities cannot be derived using the study variable since all $y_k$ are unknown. Instead, we choose inclusion probabilities proportional to $z$ whose values in turn should be nearly proportional to the values of $y$. The first-order inclusion probabilities are then given by

$$\pi_k = \frac{n z_k}{\sum_U z_k} \qquad (k = 1, \ldots, N) \tag{5}$$

where $z_k$ is strictly positive, $\sum_U \pi_k = n$ and $\pi_k \leq 1$.
Occasionally, some $z$ values are very large and equation (5) violates $\pi_k \leq 1$. One common method to deal with this is to create a ‘take-all’ stratum, $A$, of those elements and choose them with certainty (i.e. $\pi_k = 1$ for $k \in A$). Then, for the remaining elements ($k \in U - A$), the $\pi_k$ are chosen to be proportional to $z$. The method presented in the fourth paper of this thesis offers another solution, where the numerical optimization method automatically fulfills the restriction $\pi_k \leq 1$.
Henceforth, without-replacement probability proportional-to-size sampling designs based on the size variable $z$ will be denoted by πps($z$), and we will assume that $\pi_k \leq 1$.
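The take-all adjustment can be sketched as a short iterative routine. This is a minimal illustration, assuming that the proportional allocation is simply recomputed on the remaining elements until no probability reaches 1; the function name `pips_inclusion_probabilities` and the example size values are made up, and element indices are 0-based in the code.

```python
def pips_inclusion_probabilities(z, n):
    """First-order inclusion probabilities proportional to z, with a take-all stratum.

    Elements whose provisional pi_k would reach or exceed 1 are moved to the
    take-all stratum A (pi_k = 1); the remaining sample size is then spread
    proportionally to z over U - A. The step repeats until all pi_k <= 1.
    Returns the list of pi_k and the sorted indices of the take-all stratum.
    """
    if any(zk <= 0 for zk in z):
        raise ValueError("size measure must be strictly positive")
    if not 0 < n <= len(z):
        raise ValueError("need 0 < n <= N")
    take_all = set()
    while True:
        rest = [k for k in range(len(z)) if k not in take_all]
        n_rest = n - len(take_all)
        z_rest = sum(z[k] for k in rest)
        too_big = [k for k in rest if n_rest * z[k] / z_rest >= 1.0]
        if not too_big:
            break
        take_all.update(too_big)
    pi = [1.0 if k in take_all else n_rest * z[k] / z_rest for k in range(len(z))]
    return pi, sorted(take_all)


# Hypothetical size measure with one dominating element.
z = [1.0, 1.0, 2.0, 4.0, 40.0]
pi, take_all = pips_inclusion_probabilities(z, n=3)
```

In this example the largest element enters the take-all stratum first; after the recomputation a second element reaches probability 1 and is moved as well, so two passes are needed.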
There are a large number of sample selection schemes to implement a πps($z$) design. For example, Brewer and Hanif (1983) list 50 schemes. However, if we exclude random size designs, it has turned out to be hard to devise a scheme for arbitrary sample size $n$ that has a number of desirable properties, e.g. (a) the actual selection of the sample is relatively simple, (b) all first-order inclusion probabilities are strictly proportional to the size variable, (c) the design admits (at least approximately) unbiased estimation of estimator variances. If we also want to be able to base the sample selection on the technique of permanent random numbers (PRN), which is desirable in large survey organizations taking many surveys, some of which are repeated over time, it will be even harder. Relatively new πps designs, attractive to practitioners, are the order sampling designs like sequential Poisson sampling (see Ohlsson (1990, 1998)) and Pareto πps (see Rosén (1997a,b) and Saavedra (1995)), and the PoMix design proposed by Kröger, Särndal and Teikari (1999, 2000). In papers I–II, the properties of strategies based on these designs are compared with model based stratified simple random sampling (mbSTSI) as proposed by Wright (1983).
When πps sampling is relevant, we have no reason to restrict ourselves to the π estimator. Since we have auxiliary information at hand we can use the GREG estimator. Let us study πps sampling in the light of a special case of model (1) and suppose that the following first and second moments of the superpopulation make sense (where $\sigma^2$ is a (possibly unknown) constant, $\gamma$ is a constant reflecting the degree of heteroscedasticity and $u_k > 0$):
?
? (? ? ? 9 ? ? ?
? ?
(\ ) = ?X (\ ) = ? X
? ? ? ? ?
2 ?
?
(6)
With this description of a population structure, the observations in a finite population scatterplot of \
? ?
and X would gather around the line ? X ,
? ?
with a larger spread around the line for larger values of X if ? > 0 . When
? = 1 and F = ? 2 = ? 2X , the GREG estimator following from this
21
model (sometimes called the common ratio model) is the well-known ratio estimator
ˆ W
? ?
?
=? X
? ?
? ?
?
\ ?
? ? ?
X ?
? ?
.
If the model structure given by (6) reasonably well portrays the finite population scatter, it indicates a nearly proportional relationship between
\
?
?
and X in the finite population. The X
? ?
(N = 1,! , 1 ) are known and a
(7)
?SV(X ) design yields
ˆ W
? ?
= ? \ /? .
? ? ?
Hence, the ratio estimator in this case coincides with the π estimator. Heteroscedastic patterns are common in actual populations. Brewer (1963, 2002) discusses this in detail and suggests that $\gamma$ often is in the interval [1, 2].
Remark: Textbooks often refer to a πps($u$) design combined with the π estimator as a very efficient strategy, particularly when the variance structure is strongly heteroscedastic, say $V_\xi(y_k) = \sigma^2 u_k^2$, i.e. $\gamma = 2$. However, from the studies in this thesis, we note that it usually is possible to find other strategies that work better.
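The coincidence of the ratio estimator and the π estimator under a fixed-size πps($u$) design can be checked numerically. The following is a minimal sketch, not taken from the thesis: the population is simulated from the common ratio model with made-up parameter values, the function names are hypothetical, and the sample is an arbitrary subset of size $n$ (the identity holds for any fixed-size sample once $\pi_k = n u_k / \sum_U u_k$).

```python
import random


def pi_estimator(y_s, pi_s):
    """pi (Horvitz-Thompson) estimator of the y total from sampled values."""
    return sum(y / p for y, p in zip(y_s, pi_s))


def ratio_estimator(y_s, u_s, pi_s, u_total):
    """Ratio estimator: known u total times the ratio of two pi estimators."""
    num = sum(y / p for y, p in zip(y_s, pi_s))
    den = sum(u / p for u, p in zip(u_s, pi_s))
    return u_total * num / den


rng = random.Random(1)          # illustrative seed
N, n = 50, 10                   # illustrative population and sample sizes
u = [rng.uniform(1, 10) for _ in range(N)]
# Simulated study variable: E(y) = 2u, V(y) = u (gamma = 1); values are assumptions.
y = [2.0 * uk + rng.gauss(0, uk ** 0.5) for uk in u]

u_total = sum(u)
pi = [n * uk / u_total for uk in u]   # inclusion probabilities proportional to u
assert max(pi) < 1

s = rng.sample(range(N), n)     # any fixed-size sample serves for the identity
ht = pi_estimator([y[k] for k in s], [pi[k] for k in s])
ratio = ratio_estimator([y[k] for k in s], [u[k] for k in s],
                        [pi[k] for k in s], u_total)
print(ht, ratio)                # the two values agree up to rounding error
```

The agreement follows because every $u_k/\pi_k$ equals $\sum_U u_k / n$, so the denominator of the ratio estimator sums to the known total of $u$.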
1.4.4 The Anticipated Variance
When we plan a survey, we need some measure that can help us to discriminate and choose between alternative strategies. Obviously, since only auxiliary information is available, this measure must be a function of auxiliary information. One such measure is the anticipated variance, defined by Isaki (1970) (see also Isaki and Fuller (1982)). If $\hat{t}_y$ is an estimator of $t_y$ and if (1) is interpreted as moments of a superpopulation from which a finite population has been selected, the anticipated variance is the variance of $\hat{t}_y - t_y$ over both the model $\xi$ and the sampling design, i.e.
$$E_\xi E_p\left[(\hat{t}_y - t_y)^2\right] - \left[E_\xi E_p(\hat{t}_y - t_y)\right]^2.$$
If the model $\xi$ is well specified, then an approximation to the anticipated variance can be written as (see Särndal et al. (1992), pp. 450–451)
$$ANV(\hat{t}_{yr}) = \sum_U \left(\pi_k^{-1} - 1\right)\sigma_k^2. \qquad (8)$$
By using equation (8), we find a justification for choosing a πps design with a certain size measure. For a sampling design such that $E_p(n_s) = n$, Result 12.2.1 in Särndal et al. (1992) shows that equation (8) is minimized if the inclusion probabilities, $\pi_k$, are chosen proportional to $\sigma_k$, i.e. a πps($\sigma$) design. The minimum value of $ANV(\hat{t}_{yr})$ is then
$$ANV_{\min}(\hat{t}_{yr}) = \sum_U \left(\pi_{(opt)k}^{-1} - 1\right)\sigma_k^2 = \frac{1}{n}\left(\sum_U \sigma_k\right)^2 - \sum_U \sigma_k^2. \qquad (9)$$
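For readers who want the intermediate step, the minimum can be recovered with a Lagrange multiplier. This is a sketch of the standard argument, assuming that all resulting $\pi_k$ are below one; the multiplier $\lambda$ is notation introduced here only for the derivation.

```latex
\begin{align*}
&\text{Minimize } \sum_U \left(\pi_k^{-1} - 1\right)\sigma_k^2
  \quad \text{subject to } \sum_U \pi_k = n. \\
&\frac{\partial}{\partial \pi_k}\left[\sum_U \left(\pi_k^{-1} - 1\right)\sigma_k^2
  + \lambda\left(\sum_U \pi_k - n\right)\right]
  = -\frac{\sigma_k^2}{\pi_k^2} + \lambda = 0
  \;\Rightarrow\; \pi_k = \frac{\sigma_k}{\sqrt{\lambda}}. \\
&\sum_U \pi_k = n \;\Rightarrow\; \pi_{(opt)k} = \frac{n\,\sigma_k}{\sum_U \sigma_k}. \\
&\sum_U \left(\pi_{(opt)k}^{-1} - 1\right)\sigma_k^2
  = \sum_U \frac{\sum_U \sigma_k}{n\,\sigma_k}\,\sigma_k^2 - \sum_U \sigma_k^2
  = \frac{1}{n}\left(\sum_U \sigma_k\right)^2 - \sum_U \sigma_k^2.
\end{align*}
```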
If we apply this result to the common ratio model in equation (6), the optimal inclusion probabilities, which attain $ANV_{\min}(\hat{t}_{yr})$, are
$$\pi_k = \pi_{(opt)k} = \frac{n u_k^{\gamma/2}}{\sum_U u_k^{\gamma/2}} \qquad (k = 1, \ldots, N),$$
hence, $\pi_k \propto z_k = u_k^{\gamma/2}$.
For fixed size designs, other authors have presented results also indicating that an optimal design is attained when $\pi_k \propto \sigma_k$: Hajek (1959) (linear design-unbiased estimators), Brewer (1963) (ratio estimation) and Cassel, Särndal and Wretman (1976, 1977) (generalized difference estimators). When the π estimator coincides with the ratio estimator (see equation (7)), Godambe (1955) also gives a similar result. Rosén (2000) used the result by Cassel, Särndal and Wretman to study optimal strategies by combining GREG estimation and Pareto πps sampling. Wright’s proposal of mbSTSI is inspired by the wish to find a simplified design with inclusion probabilities that come close to the optimal.
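The role of equations (8) and (9) in design comparison can be illustrated with a few lines of code. This is a sketch under stated assumptions: the auxiliary values are simulated, $\gamma = 1.5$ and $\sigma^2 = 1$ are arbitrary planning values, and the helper names `anv` and `pips` are hypothetical. It evaluates the approximate anticipated variance for three sets of inclusion probabilities and compares the optimal one with the closed form in (9).

```python
import random


def anv(pi, sigma2):
    """Equation (8): sum over the population of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(pi, sigma2))


def pips(size, n):
    """Inclusion probabilities proportional to a size measure, summing to n."""
    total = sum(size)
    pi = [n * z / total for z in size]
    if max(pi) >= 1.0:
        raise ValueError("size measure too skewed for this n")
    return pi


rng = random.Random(7)                  # illustrative seed
N, n, gamma = 200, 20, 1.5              # illustrative planning values
u = [rng.uniform(1, 10) for _ in range(N)]
sigma2 = [uk ** gamma for uk in u]      # variance structure of model (6), sigma^2 = 1
sigma = [s2 ** 0.5 for s2 in sigma2]

pi_opt = pips([uk ** (gamma / 2) for uk in u], n)   # proportional to sigma_k
pi_u = pips(u, n)                                   # classical size measure u
pi_equal = [n / N] * N                              # equal probabilities

anv_min_formula = sum(sigma) ** 2 / n - sum(sigma2)  # right-hand side of (9)

print("optimal:", anv(pi_opt, sigma2), "closed form:", anv_min_formula)
print("pips(u):", anv(pi_u, sigma2))
print("equal  :", anv(pi_equal, sigma2))
```

Under these assumptions the first two printed numbers agree up to rounding, and the other two designs give larger values.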
1.4.5 Planning values and guesstimates
We need a notation to distinguish quantities which are used in the planning process from values which hypothetically are considered to be true values. Let us call these planning values ‘guesstimates’, since they reflect quantifications of the surveyor’s belief or guess about the structure in the finite population. In this summary and in papers II–IV, guesstimates that are used for calculations will be marked by the tilde symbol. For example, a guesstimate of the variance structure will be denoted by $\tilde{\sigma}_k^2$ and a guesstimate of the $\gamma$ parameter discussed earlier will be denoted by $\tilde{\gamma}$. In the first paper a somewhat different notation is used.
1.5 The multivariate sampling problem
“It is often better to have reasonably good solutions of the proper problems than optimum solutions of the wrong problems.” Herbert Robbins
Most sample surveys are multivariate. When a multivariate survey is planned, the design choice is – since it affects all estimates – relatively more important than the choice of estimators. Although a good estimator sometimes can compensate for a poorly chosen design and give an acceptable precision, it is also likely that precision is lost compared to what is achievable with a better design. The theoretical results described in section 1.4 and in the survey literature are often not sufficient guides for finding a good sampling design for the multivariate situation. In most presentations, the concept of ‘optimum’ is tied to a single study variable. Examples are to be found in the theory above (choice of unequal inclusion probabilities), in the theory of optimum stratification and optimum allocation in stratified sampling and in the theory of multi-stage sampling, where the number of units in each stage can be determined optimally. A simple reason why few papers deal with the multivariate survey problem might be that there is no evident criterion of optimality. For example, a design that is optimal or close to optimal in the single parameter case, may not be the best choice in an overall multiparameter sense. Obviously, when we choose our design we must find a compromise based on some kind of multivariate consideration.
It seems as if there are three often used ways to approach the multivariate problem in practice: (i) Ignore it and reduce it to a single variable problem through a snap judgment based on ‘experience’. (ii) For each of the important parameters, study the effect of various single-variable dependent actions, and then select a design that seems to be the best compromise. (iii) Find some kind of overall multivariate criterion, perhaps mechanically by generalizing single variable concepts, and choose the action that optimises this criterion.
The first choice may not be as bad as it may sound. In many situations, a few important study variables may have similar properties, and the total loss in precision following from using one of them might be acceptable. However, at the planning stage, time should be spent carefully studying the overall effects of choosing such a simplistic approach. Otherwise, it might turn out to be a bad practice.
The second approach is better since it means that a design choice is made on the basis of considerations of the multivariate situation. An example that fits into this approach, called maximal Brewer selection, is described by Kott and Bailey (2000). It is not an optimisation-based method, but it is simple to implement and it guarantees the desired precision of all estimators considered in the planning. Various methods of averaging key measures may also be placed in this category of approaches. One is to select a stratified sample with an allocation based on the average of various single-variable optimum allocations; another, in the case of πps($\sigma$) sampling, is to choose the size measure as the average or median of $Q$ different single-variable $\sigma_k$’s.
For multivariate stratified sampling a number of authors have considered the third (optimisation-based) approach category, e.g. Dalenius (1957), Chatterjee (1968, 1972), Huddleston, Claypool and Hocking (1970), Danielsson (1975), Hughes and Rao (1979), Bethel (1989). In this thesis, the multivariate problem is also treated using an optimisation-based approach. The results on univariate optimal designs from section 1.4 are extended to the multivariate case. The advantage of using this kind of approach is that, at least theoretically, optimal solutions are obtained. Moreover, as we choose the optimisation criterion, we also attain certain control of the desired design properties, as opposed to relying on intuitive ad hoc solutions. However, as for all other approaches the success will ultimately depend on the chosen criterion and the underlying assumptions made by the survey designer. A possible drawback with an optimisation-based approach is cumbersome computations. However, in the present case the methods proposed for the multivariate situation are only slightly more complicated than single-variable methods. In addition, the methods suggested in papers III–IV are to a large degree based on computations that should be executed anyway in any serious planning effort.
2 Summary of the papers
In this section, the aims and the main conclusions of the four papers are summarized in chronological order.
2.1 Reflections on unequal probability sampling strategies*
In section 1.4.4, it was noted that an optimal design in the sense of minimizing $ANV(\hat{t}_{yr})$ (an approximation to the anticipated variance of a GREG estimator) is such that the first-order inclusion probabilities are chosen proportional to $\sigma_k$, i.e. a πps($\sigma$) design. If a πps($\sigma$) design is a preferred design, the designs (sampling schemes) studied in paper I can be applied in an attempt to implement it. However, none of the fixed size designs will yield an exact πps($\sigma$) design, and therefore it is interesting to compare their properties. Another aim is to study how robust the sampling designs are against misjudgements of $\sigma_k$ made at the planning stage.
* Paper I is a reprint of Holmberg and Swensson (2001), which (when originally published) mistakenly was entitled ‘On Pareto πps sampling: Reflections on unequal probability sampling’.
The paper compares the properties of πps sampling designs such as Poisson πps and Pareto πps, with Poisson mixture (PoMix) sampling and model based stratified simple random sampling (mbSTSI). Combined with a GREG estimator, these sampling designs are serious alternatives for survey designers who wish to use auxiliary information in the sampling design as well as at the estimation stage. We compare the design properties by defining a penalty measure as the ratio between the $ANV(\hat{t}_{yr})$ of a specific design and the minimum anticipated variance $ANV_{\min}(\hat{t}_{yr})$, i.e.,
$$AVP(\tilde{\gamma}, B) = \frac{ANV(\hat{t}_{yr})(\tilde{\gamma}, B)}{ANV_{\min}(\hat{t}_{yr})}. \qquad (9)$$
By using (9), we can study the increase in approximate anticipated variance obtained when a non-optimal πps($z$) design is used. $\tilde{\gamma}$ is a guesstimate of the degree of heteroscedasticity in the supposed variance structures, and $B$ is used to differentiate between the studied PoMix designs. $B$ is a (transformation) parameter called the Bernoulli width. (If $B = 0$, PoMix is equal to the original design induced by the sampling scheme, e.g. Pareto or Poisson sampling.) For mbSTSI a corresponding penalty measure without $B$ is used.
In tables 1-5 of the paper, we see the effects of different assumed variance structures in terms of the penalty measure. We note the following: (i) Compared to the πps designs, mbSTSI seems to be less sensitive to a large value of $|\tilde{\gamma} - \gamma|$ (the absolute error of the heteroscedastic pattern guesstimate). (ii) The variance increase from using PoMix sampling is larger when $\tilde{\gamma} < \gamma$ than when $\tilde{\gamma} > \gamma$. In fact, it is only when $\tilde{\gamma} > \gamma$ that a value other than $B = 0$ is of assistance, i.e. PoMix is only of potential interest if the survey designer in the planning process happens to overestimate $\gamma$. (iii) However, when $\tilde{\gamma} > \gamma$ there exists an optimal $B$ which makes the variance penalty of a PoMix design very small. Moreover, when $\tilde{\gamma} > \gamma$, PoMix sampling with $B = 0.3f$ (where $f = n/N$, i.e. the sampling fraction) is less sensitive to deviations of $\tilde{\gamma}$ from $\gamma$ than direct πps($z^{\tilde{\gamma}/2}$) sampling. Finally, (iv) since different study variables may have different degrees of heteroscedasticity, planning for good variance properties of one estimator may be bought at a high price for estimators of other study variable totals.
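The idea behind the penalty measure can be sketched for the simplest case, a direct πps design whose size measure is built from a misjudged heteroscedasticity parameter. The code below is an illustration only: it does not implement PoMix or mbSTSI, it does not reproduce the tables of paper I, and the population values, the function names and the grid of guesses are assumptions.

```python
def anv(pi, sigma2):
    """Approximate anticipated variance: sum of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(pi, sigma2))


def pips(size, n):
    """Inclusion probabilities proportional to a size measure, summing to n."""
    total = sum(size)
    pi = [n * z / total for z in size]
    if max(pi) >= 1.0:
        raise ValueError("size measure too skewed for this n")
    return pi


def penalty(gamma_guess, gamma_true, u, n):
    """Ratio of the ANV under the guessed design to the minimum ANV.

    The true variance structure is taken as u_k ** gamma_true; the constant
    sigma^2 cancels in the ratio and is therefore set to one.
    """
    sigma2 = [uk ** gamma_true for uk in u]
    used = anv(pips([uk ** (gamma_guess / 2) for uk in u], n), sigma2)
    best = anv(pips([uk ** (gamma_true / 2) for uk in u], n), sigma2)
    return used / best


N, n = 100, 10                              # illustrative sizes
u = [1.0 + 0.1 * k for k in range(N)]       # made-up auxiliary values
for guess in (0.0, 0.5, 1.0, 1.5, 2.0):     # guesses around a true gamma of 1
    print(guess, round(penalty(guess, 1.0, u, n), 4))
```

A penalty of one means that the guess was correct; values above one show the relative variance increase caused by the misjudgement.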
2.2 On the choice of strategy in unequal probability sampling
The second paper is a follow-up study of the first paper. Again, Pareto πps, PoMix and mbSTSI are compared. However, in this paper different strategy choices in a bivariate situation are looked at. There are two key parameters and strong auxiliary information at hand for the design as well as for estimation purposes.
Besides a discussion of Pareto πps, PoMix and mbSTSI, there is an examination of some other questions. When we have strong auxiliary information, will there be any significant variance decrease from using a πps($\sigma$) design and a GREG estimator instead of (i) using auxiliary information for estimation purposes only, (ii) using a classical (non-optimal) πps design combined with the π estimator, as discussed in connection with equation (7) in section 1.4.3? Moreover, is it possible to more or less fully compensate a non-optimal design choice by a suitable estimator choice? As a final point, which of the studied strategies is the best in an overall perspective?
As in paper I, the comparisons indicate that we cannot rule out the use of mbSTSI. By means of results from a Monte Carlo study, we note that among the studied GREG estimators it does not seem to matter much which estimator we choose. Using auxiliary information in the estimator only is usually better than using auxiliary information in the design only. However, using auxiliary information in both is the best strategy, yielding the smallest variances. PoMix works better than Pareto πps when the heteroscedastic pattern is stronger. Nevertheless, for most strategies both these sampling designs yield somewhat higher variances than mbSTSI.
In paper II as well as in the first paper, we see that we obtain small variances by using the auxiliary information and choosing a πps($\sigma$) design. However, if we specify the variance structure badly and, as a result, choose a non-optimal design, the variance will increase and a good estimator choice is not enough to fully compensate. Another concern appears when we have several study variables with different variance structures. Then, a πps($\sigma$) design based on a single study variable, $y$, might work well for that variable, but for other variables the variances may become larger than acceptable. In the third and fourth papers there is a discussion of designs adapted to the multivariate situation.
2.3 A multiparameter perspective on the choice of sampling design in surveys
When we choose a strategy in the multivariate situation, we can always choose different estimators for different study variable totals, $t_{y_q}$ ($q = 1, \ldots, Q$). However, we can only choose one design and that choice will affect all the parameter estimates of the survey. One goal might be to try to find a design such that the estimator variances deviate as little as possible from the estimator variances resulting from single-variable optimal designs. A good design in a multivariate situation should not yield very good precision for some estimators at the expense of unacceptably high variances for others. In other words, we should search for a ‘best’ compromise.
Three approaches based on different criteria are presented in order to achieve a compromise design for a multivariate situation. In principle, all methods presented are generalizations of the single variable solution outlined in section 1.4.4. Hence, they are based on approximations of the anticipated variance of the GREG estimator and they are optimal under the given criterion. For each study variable, auxiliary information is used to specify a linear model with a specific variance structure, as in equation (1). Thereafter, we derive the anticipated variances, $ANV(\hat{t}_{y_q r})$ ($q = 1, \ldots, Q$), and present three approaches based on functions of $ANV(\hat{t}_{y_q r})$, to attain a compromise design.
The approach that seems most promising is a design that minimizes a loss function, the anticipated overall relative efficiency loss (AOREL), defined as
$$AOREL = \sum_{q=1}^{Q} H_q \frac{ANV(\hat{t}_{y_q r}) - ANV_{\min}(\hat{t}_{y_q r})}{ANV_{\min}(\hat{t}_{y_q r})} \qquad (10)$$
where $H_q$ ($q = 1, \ldots, Q$) are weights (summing up to unity) that reflect the relative importance of the parameters to be estimated. The measure in equation (10) is similar to the measure discussed by Dalenius (1957, chapter 9) to find a compromise allocation for stratified sampling in the multiparameter case. Furthermore, we show that equation (10) is minimized if the design is such that the first-order inclusion probabilities are given by
$$\pi_k = \frac{n\left[\sum_{q=1}^{Q} H_q \sigma_{qk}^2 \Big/ \sum_{j \in U}\left(\pi_{(opt)qj}^{-1} - 1\right)\sigma_{qj}^2\right]^{1/2}}{\sum_{k \in U}\left[\sum_{q=1}^{Q} H_q \sigma_{qk}^2 \Big/ \sum_{j \in U}\left(\pi_{(opt)qj}^{-1} - 1\right)\sigma_{qj}^2\right]^{1/2}}. \qquad (11)$$
This expression is not as cumbersome as it may appear. It involves the single-variable optimal inclusion probabilities $\pi_{(opt)qk}$ and the variance structure $\sigma_{qk}^2$. In a carefully planned multiparameter survey, guesstimates, i.e. $\tilde{\pi}_{(opt)qk}$ and $\tilde{\sigma}_{qk}^2$, of both these quantities should be computed in any case.
We study the suggested compromise approaches for a practical case. Subsequently, auxiliary information is used to make computations and illustrate suitable planning phase analysis. A diagnostic table is used for comparisons between the designs at our disposal and for a final choice of design.
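The compromise design that minimizes the anticipated overall relative efficiency loss can be sketched in a few lines. The code is an illustration under assumptions, not the computation of paper III: two hypothetical study variables are given made-up variance structures (one strongly heteroscedastic, one homoscedastic), the importance weights are equal, and all function names are invented for the example. The size measure for unit $k$ is the square root of the weighted sum of $\sigma_{qk}^2 / ANV_{\min,q}$ over the study variables.

```python
def anv(pi, sigma2):
    """Approximate anticipated variance: sum of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(pi, sigma2))


def anv_min(sigma2, n):
    """Minimum ANV for a single variable: (sum sigma_k)^2 / n - sum sigma_k^2."""
    return sum(s2 ** 0.5 for s2 in sigma2) ** 2 / n - sum(sigma2)


def pips(size, n):
    """Inclusion probabilities proportional to a size measure, summing to n."""
    total = sum(size)
    pi = [n * z / total for z in size]
    if max(pi) >= 1.0:
        raise ValueError("size measure too skewed for this n")
    return pi


def aorel(pi, sigma2_by_var, weights, n):
    """Weighted sum of relative efficiency losses over the study variables."""
    return sum(h * (anv(pi, s2) - anv_min(s2, n)) / anv_min(s2, n)
               for h, s2 in zip(weights, sigma2_by_var))


def compromise_pi(sigma2_by_var, weights, n):
    """Compromise inclusion probabilities: size_k = sqrt(sum_q H_q s2_qk / min_q)."""
    mins = [anv_min(s2, n) for s2 in sigma2_by_var]
    N = len(sigma2_by_var[0])
    size = [sum(h * s2[k] / m for h, s2, m in zip(weights, sigma2_by_var, mins)) ** 0.5
            for k in range(N)]
    return pips(size, n)


N, n = 100, 10                                  # illustrative sizes
u = [1.0 + 0.1 * k for k in range(N)]           # made-up auxiliary values
sigma2_a = [uk ** 2 for uk in u]                # variable a: gamma = 2
sigma2_b = [4.0] * N                            # variable b: constant variance
variables, H = [sigma2_a, sigma2_b], [0.5, 0.5]

pi_comp = compromise_pi(variables, H, n)
pi_a = pips([s2 ** 0.5 for s2 in sigma2_a], n)  # optimal for variable a alone
pi_b = pips([s2 ** 0.5 for s2 in sigma2_b], n)  # optimal for variable b alone

for name, pi in (("compromise", pi_comp), ("optimal for a", pi_a), ("optimal for b", pi_b)):
    print(name, round(aorel(pi, variables, H, n), 4))
```

In this toy setting each single-variable optimal design has zero loss for its own variable but a noticeable loss for the other one, while the compromise spreads a smaller total loss over both.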
2.4 On the choice of optimal sampling design in business surveys with several important study variables
A general aim of a multiparameter survey is to find a design that meets the desired levels of precision for all important target parameters simultaneously. If we choose our design according to one of the criteria presented in the third paper, we will achieve a minimization of a summary measure without taking into account that precision requirements may be different for different estimates. In paper IV, the approaches of paper III are generalized (and slightly extended) by applying non-linear programming algorithms.
Non-linear programming methods have previously been used in the case of similar problems related to optimal allocation in stratified sampling. Cochran (1977, section 5A4) provides a list of early references; Chromy (1987) and Bethel (1989) are later references. For Poisson πps sampling and the π estimator, Sigman and Monsour (1995) sketched a procedure using non-linear programming similar to that of paper IV. Saavedra (1999) applied these ideas using the algorithm proposed by Chromy to determine probabilities to be used for Pareto πps sampling in a price and volume petroleum product survey.
The general problem studied in paper IV can be described as follows. A sampling design is to be determined to minimize a function $f$ for all $Q$ estimator variances and fulfil specified restrictions, $v_q$, made on function $g$ of each estimator variance (or variance approximation), i.e. minimize
$$f\left(g(V(\hat{t}_{y_1 r})), g(V(\hat{t}_{y_2 r})), \ldots, g(V(\hat{t}_{y_Q r}))\right)$$
with restrictions
$$g(V(\hat{t}_{y_q r})) \le v_q \qquad (q = 1, \ldots, Q). \qquad (12)$$
For the general problem above, the paper suggests solutions for the cases when $f$ is either the weighted arithmetic or geometric mean and $g$ is one of the functions given by the three approaches in paper III, i.e. (i) a variance, (ii) a relative variance, or (iii) the (recommended) relative efficiency loss. Moreover, since the methods are planning-stage tools, the unknown estimator variances in (12) are replaced by $ANV(\hat{t}_{y_q r})$ ($q = 1, \ldots, Q$).
The theoretical background of the applied non-linear programming model is given as well as an application using a Swedish business population. The studies indicate that the methods work both theoretically and practically, hence giving us the best compromise design in a multivariate situation, at least for not too large populations. The two foremost points of the paper are to suggest a flexible solution regarding how to use auxiliary information exhaustively in the design planning, and to provide diagnostic support for the final design choice.
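To give a feeling for what the restrictions in (12) do, the sketch below replaces the non-linear programming algorithms of paper IV with a crude grid search, which is a deliberate simplification and not the method of the paper. It assumes two hypothetical study variables, takes $g$ as the relative efficiency loss and $f$ as a weighted arithmetic mean, and only searches among designs whose size measure is a weighted compromise between the two variance structures. All numbers, names and restriction levels are made up.

```python
def anv(pi, sigma2):
    """Approximate anticipated variance: sum of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(pi, sigma2))


def anv_min(sigma2, n):
    """Minimum ANV for one variable: (sum sigma_k)^2 / n - sum sigma_k^2."""
    return sum(s2 ** 0.5 for s2 in sigma2) ** 2 / n - sum(sigma2)


def rel_loss(pi, sigma2, n):
    """g: relative efficiency loss of a design for one study variable."""
    return anv(pi, sigma2) / anv_min(sigma2, n) - 1.0


def design_for_weight(h, s2_a, s2_b, n):
    """Candidate design: size_k = sqrt(h*s2_a/min_a + (1-h)*s2_b/min_b)."""
    m_a, m_b = anv_min(s2_a, n), anv_min(s2_b, n)
    size = [(h * a / m_a + (1.0 - h) * b / m_b) ** 0.5 for a, b in zip(s2_a, s2_b)]
    total = sum(size)
    return [n * z / total for z in size]


N, n = 100, 10                                # illustrative sizes
u = [1.0 + 0.1 * k for k in range(N)]         # made-up auxiliary values
s2_a = [uk ** 2 for uk in u]                  # hypothetical variance structures
s2_b = [4.0] * N
H = (0.7, 0.3)                                # hypothetical weights in f
v = (0.20, 0.20)                              # hypothetical restrictions v_q

best = None                                   # (f value, h, (g_a, g_b))
for step in range(101):
    h = step / 100.0
    pi = design_for_weight(h, s2_a, s2_b, n)
    g = (rel_loss(pi, s2_a, n), rel_loss(pi, s2_b, n))
    if all(gq <= vq for gq, vq in zip(g, v)):          # restrictions (12)
        f_value = H[0] * g[0] + H[1] * g[1]            # weighted arithmetic mean
        if best is None or f_value < best[0]:
            best = (f_value, h, g)

print(best)
```

A proper solver would search over all admissible inclusion probabilities rather than over a one-parameter family, so the grid result should be read only as a feasibility illustration.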
2.5 Minor comments and concluding remarks
Finally, some minor but interesting observations made during the work on this thesis are offered. The practical applicability of the studied methods has been an underlying guide in the work on these four papers. The results and methods of the last two papers are general in the sense that they allow a flexible use of the auxiliary information, estimators and sampling schemes. On the issue of finding a ‘best’ strategy for multiparameter surveys, the design should be in focus, primarily because the overall loss in precision due to a poorly chosen design seems to be relatively larger than the loss obtained by using a less efficient member (or members) of the GREG estimator family. We can implement the presented compromise designs by taking advantage of the recent developments in πps sampling, which provide us with practically feasible sampling schemes, for instance the Pareto πps sampling scheme. The first and the second paper give us experience of advantages and drawbacks of such schemes.
As far as the estimators are concerned, the results have been presented on the basis of the class of GREG estimators. Different GREG estimators are explicitly compared in paper II, but in the other papers the specific expression of the GREG estimator is of secondary importance, and in the third and fourth papers it has been tacitly understood that the models are well specified and that the survey designer can make a good estimator choice. In our setting, i.e. with well-chosen models and reasonably large, direct element samples, the designs presented will provide efficient estimates combined with a GREG estimator. It is not likely that there is much to gain in variance reduction by using estimators outside the GREG family, for instance calibration estimators (Deville and Särndal (1992), Lundström and Särndal (1999) and Estevao and Särndal (2000)) or the (less practical) optimal regression estimator (see Rao (1992, 1994, 1997) and Montanari (1987, 1998)).
Although the anticipated variance and the $ANV(\hat{t}_{yr})$ are conceptually different from the design variance of an estimator (e.g. $V(\hat{t}_{yr})$ in equation (4) in section 1.4.2), it is of interest to note the similarities between table 2 and table 3 given in paper III (as well as similarities between the tables in paper IV). They suggest that planning stage computations such as those illustrated could be a helpful tool to anticipate how different design choices affect estimator variances.
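As an illustration of how a set of planned inclusion probabilities can be turned into a fixed-size sample, the following sketch shows Pareto πps selection with permanent random numbers in the form usually attributed to Rosén (1997a,b): each unit gets a ranking variable equal to the odds of its random number divided by the odds of its target inclusion probability, and the $n$ units with the smallest ranking variables are selected. The function name, the simulated size measure and the seed are assumptions, and the realised inclusion probabilities only approximate the targets.

```python
import random


def pareto_pips_sample(size, n, prn):
    """Select n units by Pareto order sampling.

    size: positive size measures, one per population unit
    n:    fixed sample size
    prn:  permanent random numbers in [0, 1), one per unit
    Returns the sorted sample indices and the target inclusion probabilities.
    """
    total = sum(size)
    lam = [n * z / total for z in size]          # target inclusion probabilities
    if max(lam) >= 1.0:
        raise ValueError("target probabilities must be below one")
    # Ranking variable: odds of the random number over odds of the target.
    ranking = [(r / (1.0 - r)) / (l / (1.0 - l)) for r, l in zip(prn, lam)]
    order = sorted(range(len(size)), key=ranking.__getitem__)
    return sorted(order[:n]), lam


rng = random.Random(2003)                        # illustrative seed
N, n = 100, 10                                   # illustrative sizes
size = [rng.uniform(1, 10) for _ in range(N)]    # made-up size measure
prn = [rng.random() for _ in range(N)]           # stored once, reused across surveys

sample, lam = pareto_pips_sample(size, n, prn)
print(sample)
```

Reusing the same `prn` list for a later survey with an updated size measure gives a positively coordinated sample, which is the practical attraction of PRN-based schemes mentioned in section 1.4.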
References
Bethel, J. (1989). Sample Allocation in Multivariate Surveys. Survey Methodology, 47–57.
Brewer, K.R.W. (1963). Ratio estimation and finite populations: Some results deducible from the assumption of an underlying stochastic process. Australian Journal of Statistics, 93–105.
Brewer, K.R.W. (2002). Combined Survey Sampling Inference: Weighing Basu’s Elephants. Arnold, London.
Brewer, K.R.W. and Hanif, M. (1983). Sampling with unequal probabilities. Springer-Verlag, New York.
Cassel, C.M., Särndal, C-E. and Wretman, J. (1976). Some results on generalized difference estimators and generalized regression estimators for finite populations. Biometrika, 615–620.
Cassel, C.M., Särndal, C-E. and Wretman, J. (1977). Foundations of Inference in Survey Sampling. Wiley & Sons, New York.
Chatterjee, S. (1968). Multivariate stratified surveys. Journal of the American Statistical Association, 530–534.
Chatterjee, S. (1972). A study of optimum allocation in multivariate stratified surveys. Skandinavisk aktuarietidskrift, 73–80.
Chromy, J. (1987). Design Optimization with Multiple Objectives. Proceedings of the Section on Survey Research Methods, American Statistical Association, 194–199.
Cochran, W.G. (1942). Sampling theory when the sampling units are of unequal sizes. Journal of the American Statistical Association, 199–212.
Cochran, W.G. (1977). Sampling Techniques. Wiley, New York.
Dalenius, T. (1950). The problem of optimum stratification. Skandinavisk aktuarietidskrift, 203–213.
Dalenius, T. (1957). Sampling in Sweden. Almqvist & Wiksell, Stockholm.
Dalenius, T. (1992). Introduction to Neyman (1934) On the two different aspects of the representative method: The method of stratification and the method of purposive selection. In Kotz, S. and Johnson, N.L. (eds.) Breakthroughs in Statistics, volume II: Methodology and Distribution. Springer, New York, 114–122.
Danielsson, S. (1975). Optimal allokering vid vissa klasser av urvalsförfaranden. Ph.D. Thesis, Department of Mathematics, University of Linköping, Sweden.
Deville, J-C. and Särndal, C-E. (1992). Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 376–382.
Estevao, V. and Särndal, C-E. (2000). A Functional Form Approach to Calibration. Journal of Official Statistics, No. 4, 379–399.
Estevao, V. and Särndal, C-E. (2002). The ten cases of auxiliary information for calibration in two-phase sampling. Journal of Official Statistics, No. 2, 233–257.
Godambe, V.P. (1955). A unified theory of sampling from finite populations. Journal of the Royal Statistical Society, Series B, 269–278.
Hajek, J. (1959). Optimum strategy and other problems in probability sampling. Casopis pro Pestovani Matematiky, 387–421.
Hansen, M.H., Madow, W.G. and Tepping, B.J. (1983). An Evaluation of model-dependent and probability-sampling inferences in sample surveys. Journal of the American Statistical Association, 776–793.
Holmberg, A. and Swensson, B. (2001). On Pareto πps sampling: Reflections on unequal probability sampling. Theory of Stochastic Processes, no. 1–2, 142–155.
Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 663–685.
Huddleston, H.F., Claypool, P.L. and Hocking, R.R. (1970). Optimum allocation to strata using convex programming. Applied Statistics, 273–278.
Hughes, E. and Rao, J.N.K. (1979). Some problems of optimal allocation in sample surveys involving inequality constraints. Communications in Statistics A, 1551–1574.
Isaki, C.T. (1970). Survey designs utilizing prior information. Doctoral thesis, Iowa State University, Ames, Iowa.
Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 89–96.
Kott, P.S. and Bailey, J.T. (2000). The Theory and Practice of Maximal Brewer Selection with Poisson PRN Sampling. Proceedings of the second International Conference on Establishment Surveys, June 17–21, 2000, Buffalo, 269–279.
Kröger, H., Särndal, C-E. and Teikari, I. (1999). Poisson Mixture Sampling: A family of designs for Coordinated Selection Using Permanent Random Numbers. Survey Methodology, No 1, 3–11.
Kröger, H., Särndal, C-E. and Teikari, I. (2000). Poisson Mixture Sampling Combined with Order Sampling: a Novel use of the Permanent Random Number Technique. Manuscript submitted for publication (date 00/08/30). (Forthcoming in Journal of Official Statistics with the title Poisson Mixture Sampling Combined with Order Sampling.)
Laaksonen, S. (2002). Need for high level auxiliary data service for improving the quality of editing and imputation. Contributed paper of the UN/ECE working session in Helsinki, 27–29 May.
Lundström, S. and Särndal, C-E. (1999). Calibration as a standard method for treatment of nonresponse. Journal of Official Statistics, 305–327.
Montanari, G.E. (1987). Post-sampling efficient QR-prediction in large-scale surveys. International Statistical Review, 191–202.
Montanari, G.E. (1998). On Regression Estimation of Finite Population Means. Survey Methodology, No 1, 69–77.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 558–625.
Ohlsson, E. (1990). Sequential Poisson Sampling from a Business Register and its Application to the Swedish Consumer Price Index. Statistics Sweden, R&D Report 1990:6.
Ohlsson, E. (1998). Sequential Poisson Sampling. Journal of Official Statistics, 135–158.
Rao, J.N.K. (1992). Estimating Totals and Distribution Functions Using Auxiliary Information at the Estimation Stage. Proceedings of Workshop on Uses of Auxiliary Information in Surveys, Statistics Sweden.
Rao, J.N.K. (1994). Estimating Totals and Distribution Functions Using Auxiliary Information at the Estimation Stage. Journal of Official Statistics, 153–165.
Rao, J.N.K. (1997). Developments in sample survey theory: an appraisal. Canadian Journal of Statistics, 1–21.
Rosén, B. (1997a). Asymptotic Theory for Order Sampling. Journal of Statistical Planning and Inference, 135–158.
Rosén, B. (1997b). On sampling with Probability Proportional to Size. Journal of Statistical Planning and Inference, 159–191.
Rosén, B. (2000). Generalized Regression Estimation and Pareto πps. R&D Report 2000:5, Statistics Sweden.
Saavedra, P.J. (1995). Fixed Sample Size PPS Approximations with a Permanent Random Number. Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, American Statistical Association, 697–700.
Saavedra, P.J. (1999). Application of the Chromy Algorithm with Pareto Sampling. Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, American Statistical Association, 355–358.
Särndal, C-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer, New York.
Sigman, R.S. and Monsour, N.J. (1995). Selecting Samples from List Frames of Businesses. In Cox, B.G., Binder, D.A., Chinnappa, N., Christianson, A., Colledge, M.J. and Kott, P.S. (eds) Business Survey Methods. Wiley, New York, 153–169.
Wright, R.L. (1983). Finite Population Sampling with Multivariate Auxiliary Information. Journal of the American Statistical Association, 879–884.
32 2.4 On the choice of optimal sampling design in business surveys with several important study variables .................................................... 34 2.5 Minor comments and concluding remarks................................. 35
2
References .................................................................................................... 38
4
Acknowledgements
It is a special pleasure to write this part of the thesis, namely the expression of my indebtedness to people and institutions whose help has made this work possible. This thesis marks the end of my graduate studies, and over the years many have directly or indirectly contributed to the outcome. However, as always, the author accepts full responsibility for any inadequacies.
The most important person to whom I wish to express my true and deepest gratitude is Professor Bengt Swensson. From the beginning, he has had a major influence on this work and on my development within the field of statistics. With great knowledge, insight, scrutiny and patience he has read and commented on numerous manuscripts. This thesis and the papers on which it is based would not have developed to their present standard, had it not been for Bengt. I also wish to thank Bengt’s wife Lena for her patience and understanding of my need to sometimes consult Bengt outside office hours. Bengt has also co-authored one paper. Special credit must be given to him as well as to Patrik Flisberg and Mikael Rönnqvist for co-authoring another paper. Their efforts have had a direct impact on paper IV and it has been stimulating and rewarding working with them.
This work is the fruit of efforts I have made both at the Department of Statistics, Örebro University, and Statistics Sweden’s R&D Department. Consequently, there are many people who have actively and morally supported my efforts. I am especially grateful to my present and former managers at Statistics Sweden, Eva Elvers, Agneta Lissmats, Evert Blom and Lars Lyberg, for their encouragement and for allowing me to spend parts of my working hours on this thesis. In a similar way, I am also indebted to Olle Häggbom and Erik Wallgren at Örebro University. Olle, as head of department, helped by finding means for adequate funding and Erik, as director of studies, helped by planning and administering reasonable teaching schedules.
I also gratefully acknowledge the financial support provided by Adolf Lindgrens Stiftelse.
In addition, I want to thank personnel at Uppsala University, my formal supervisor Anders Christoffersson (who took important initial steps regarding my admission as a graduate student) and Gunilla Klaar as well as Anders Ågren (for appreciated help on practical as well as administrative matters throughout the years). There are many colleagues and friends I wish to thank for showing understanding and interest in my work. In particular, I thank Martin Axelson. Over the years we have shared statistical problems, offices, train compartments and leisure, and without his sense of humour and readiness to discuss both serious and tiny matters, this work would have been less joyful. As dear colleagues and friends (also teachers) at Örebro University, Olle Carlsson, Hans Hedén, Åke Holmén (who, alas, did not live to read this), Niklas Karlsson and Erik Wallgren have given me invaluable experience in various fields of statistics as well as explaining how it is taught and practised. One of the privileges at Statistics Sweden has been the close access to inspiring discussions with skilful survey statisticians. I am obliged to Claes Andersson (for sharing his vast experience and all his tips and solutions helpful to me when I have got stuck in daily work), to Lennart Nordberg (for encouraging detailed discussions and for providing me with realistic data sets), to Sixten Lundström and Håkan L. Lindström (for their excellent tutoring which inspired my career choice) and to Carl-Erik Särndal (for always showing an interest in my work, giving me constructive criticism, discussing innovative ideas and sharing his knowledge). Collectively, I also acknowledge the support from my colleagues at the R&D Department and those I have closely worked with from other departments of Statistics Sweden as well as at Örebro University. Without your assistance and fine co-operation in different projects, this work would have been harder. 
Finally, I wish to give tribute to my family and closest friends, especially my parents who did not live to see this, my two sons John and Filip and (at the final stages) Britta. To have your trust and support has been priceless. Thank you! Vadstena, March 2003 Anders Holmberg
1 Introduction
Whenever there is a need to obtain information on characteristics of a large finite population of people, households, business enterprises or other elementary units, it is natural to consider a sample survey. In most cases, a carefully planned and well-executed sample survey will provide more precise and timelier information at a lower cost than a complete enumeration would. Obviously, taking a sample means that less time to collect and process data is required, but the huge and unwieldy organization required for a complete enumeration often also implies less control of data collection, resulting in many errors. The present thesis treats some of the statistical questions that arise in the planning process of a sample survey, in particular questions concerning the combined choice of a sampling design, a sample selection scheme and estimator(s).
1.1 Concepts of survey planning in this thesis
To plan a medium-sized or large sample survey is a complex undertaking involving a large number of considerations, decisions and choices on the part of a team of, among others, subject matter experts, survey measurement specialists, computer specialists and survey statisticians. In the discussion of sample survey planning in the present thesis the focus is on some of the statistical tasks, and the assumption is that much planning work has already been done, which means that far from all aspects of survey planning will be covered. For example, neither tasks such as the formulation of a subject matter problem and its translation to a statistical problem, nor the choice of measurement device and editing procedures (although they are all fundamental in the planning process) will be discussed here.
The point of departure for the discussion in the thesis is as follows: a sample survey is to be taken in order to estimate certain well-defined characteristics (parameters) of a well-defined finite population for which an up-to-date good sampling frame exists, i.e. a frame which, besides information that makes it possible to get observational contact with selected elements, contains useful data on a set of auxiliary variables. The focus will be on the case where the frame is a direct listing of the population elements, making direct sampling of elements possible; examples are business registers kept by many countries, and registers of the total population found in some European countries.
A typical survey has many study variables and aims to estimate many parameters. Common parameters are totals (e.g. the total monthly value of imported commodities) and functions of totals such as means (e.g. the average income of adult wage-earners), proportions (e.g. the proportion of cultivated area devoted to wheat) and ratios (e.g. the ratio of rental income of commercial premises to that of rented area). Hence, one important goal for the survey statistician in the planning phase of the survey is to find a strategy (a combination of sampling design and estimators for the most important parameters) which gives the best possible (most precise) parameter estimates for a given cost, or conversely to obtain estimates with a desired precision at the lowest possible cost. Survey sampling theory offers instruments to achieve such a goal. However, for the purpose of coming to a decision on the best feasible strategy, survey theory should be applied as early as possible in the planning process, and it must be balanced by practical and budgetary restrictions.
1.1.1 On the use of auxiliary, or supplementary, information
In a broad sense, auxiliary information (supplementary information is an often-used, essentially synonymous expression) is any information of assistance about the population available prior to a survey. For instance, in questionnaire design the language and the ordering of questions could be tailored to suit some characteristics of the respondents, and when data collection methods are specified, decisions about the measurement instrument and decisions on how to implement technical equipment could be greatly improved. In the present thesis, ‘supplementary information’ will be used in a broad sense. In contrast, ‘auxiliary information’ will be reserved for the following specific kind of information, consisting of two related components: (i) data on a set of variables (which will be called auxiliary variables) available at the planning phase as well as at the estimation phase for every population element; (ii) appropriate a priori knowledge of the structure of relationships between important study variables and (subsets of) the auxiliary variables.
In the search for estimators of high precision, survey sampling theory strongly emphasizes the use of auxiliary information. Often, the sampling frame is equipped with at least a few auxiliary variables that may be useful, while subject matter experts and the survey statistician may have knowledge of the relationship between study variables and auxiliary variables from similar surveys of similar populations. If no such useful knowledge is available, it will often be worthwhile to make a pilot survey.
The use of auxiliary information can be traced back to the early days when the foundations of sampling theory were laid. In Neyman’s classical paper from 1934, one of the major contributions is the explicit use of auxiliary information. Neyman suggested a partition of the population into strata followed by the selection of a probability sample in each stratum. Cochran (1942) is an early reference which introduces auxiliary information in the estimation procedure. Cochran discussed the regression method of estimation in detail. Since Neyman, much of the development of sampling theory has revolved around the use of auxiliary information and auxiliary variables. In an introduction to Neyman’s paper, Tore Dalenius states:
“As demonstrated by the developments in the last half-century, supplementary information may be exploited for all aspects of the sample design, the definition of sampling units, the selection design and the estimation method.” (Dalenius 1992)
Because auxiliary variables can be used in a wide variety of ways at the design stage as well as at the estimation stage, efforts to use this information to find efficient estimators lead to many techniques for sampling and estimation. Results on optimum allocation in stratified sampling, on probability proportional-to-size sampling, and on regression or calibration estimators all rely on properties of auxiliary variables. Recent references that illustrate this wide variety are Laaksonen (2002), who attempts to systematize auxiliary variables into different types, and Estevao and Särndal (2002), who show that there are ten different cases where auxiliary information may be used for calibration in two-phase sampling.
In all of the papers in the present thesis, auxiliary information plays a key role in the planning process. The first paper studies two issues related to the planning stage: (i) the sensitivity of model assumptions concerning the relation between the size measure and a study variable in without replacement probability proportional-to-size sampling (πps sampling), and (ii) properties of practicable sample selection schemes for fixed size πps sampling. These two issues are also addressed in the second paper, which furthermore discusses the consequences of the presence of more than one study variable and to what extent the auxiliary information used in the design and that used in the estimators interact. The evident problem in both the first and the second paper is how to choose an overall efficient sampling design when there are several important study variables with various relationships to the available auxiliary variables. The third paper suggests a diagnostic tool to support the choice of design, and on the basis of three criteria of overall efficiency optimal designs are derived. The optimal designs presented in the third paper may not be fully satisfactory in meeting specified precision requirements for separate estimators. To achieve a design that is tailor-made to meet such requirements, optimisation must be done under restrictions. Though the underlying optimisation problem is only outlined in paper III, a solution involving non-linear programming methods is given in the fourth paper. By way of an example based on an application to a Swedish business population, the fourth paper also compares a design obtained through non-linear programming algorithms with designs discussed in paper III as well as designs based on the same basic ideas as those discussed in the first two papers.
The results presented in the thesis are applicable in multipurpose surveys with several important study variables, where auxiliary information is available and a probability proportional-to-size sampling design is considered, e.g. business and agricultural surveys repeated over time. The discussion, notation and methods presuppose that without replacement direct element sampling is used. However, with proper adjustments, the results should also be of potential use in multipurpose, multistage area sampling, where the first-stage units are chosen proportional to size.
In the following, the necessary and more detailed theoretical background is given first. The summary then concludes with the major contributions of the four papers and some comments and remarks.
1.2 Using models to assist survey planning
“Models are appropriately used to guide and evaluate the design of probability samples, but with large samples the inferences should not depend on the model.” Hansen, Madow and Tepping (1983)
One thing that distinguishes survey sampling theory from the mainstream of statistical theory is that inference concerns parameters of an actual finite population and not unknown parameters of an otherwise (assumed to be) known frequency function of some hypothetical population. The latter is commonly the case in general statistical theory and means that the interest lies in inference about model parameters. However, since the usual target of our inference in survey sampling is unknown parameters of the finite population, models are usually used differently.
Two main types of inference are considered in the survey literature, design-based and model-based inference. In design-based inference, the finite population is considered fixed and the variable values are fixed. In model-based inference, the values of the variables in the finite population are assumed to be realizations from a superpopulation model. In this thesis, the concern is with design-based inference. The population is fixed, and the study variable values are fixed (but unknown). The stochastic structure is induced by the sampling design, and the sample itself is the random element. However, the superpopulation concept will be used as well, but differently from its use in model-based inference. Superpopulation models will be used merely (i) as a tool for giving compact descriptions of useful a priori knowledge about relations between the study variables and available auxiliary variables, e.g. for describing basic features of finite population scatterplots, and (ii) as guides in the search for efficient strategies. Hence, inference will not depend on the model, but the model will assist it in the sense that the better the superpopulation model depicts the finite population, the better we can hope to succeed in our task of estimating the finite population parameters.
1.3 General definitions
Let $U = \{1, \ldots, k, \ldots, N\}$ be our finite population consisting of $N$ elements. Consider $Q$ study variables $y_1, \ldots, y_q, \ldots, y_Q$, and let $y_{qk}$ denote the unknown value of $y_q$ for population element $k$. In addition, we suppose that there are $P$ auxiliary variables at hand. They are denoted $u_1, \ldots, u_p, \ldots, u_P$, and their values $u_{pk}$ $(p = 1, \ldots, P)$ are known for every element $k$ in the population. Furthermore, we will concentrate on the estimation of population totals, i.e. $t_{y_q} = \sum_{k \in U} y_{qk} = \sum_U y_{qk}$.
1.3.1 Sampling design and inclusion probabilities
To select a set sample $s \subset U$ of size $n$, a without replacement sampling design, $p(\cdot)$, will be used. The sampling design is the function that gives each possible sample its probability of being selected, and it is essential for determining the properties of estimators to be used in the survey. Fundamental characteristics of a given sampling design are the inclusion probabilities of first and second order. The first-order inclusion probability of element $k$ is the probability that a sample containing the $k$:th population element is realized. We denote this by $\pi_k$ $(k = 1, \ldots, N)$. The second-order inclusion probability (the probability that a sample containing elements $k$ and $l$ is realized) will be denoted by $\pi_{kl}$ $(k, l = 1, \ldots, N)$. They are mainly of importance for variance calculations. Throughout the thesis it is assumed that $\pi_k > 0$ and $\pi_{kl} > 0$ for every $k$ and $l$.
Since $p(\cdot)$ often is a complicated function it is of little direct practical use. However, knowledge of the $\pi_k$ and the $\pi_{kl}$ alone is normally sufficient for our needs, e.g. for determining expected values and variances of estimators. The planning approach to come up with a good strategy for a sample survey adopted in this thesis results in a decision on a class of preferred sampling designs characterized by a preferred set of first-order inclusion probabilities. To implement a design conforming to this set we need a sample selection scheme. Often, but not always, we can find such a scheme. Sometimes, however, there is no (practical) scheme that implements a design with a set of first-order inclusion probabilities which exactly match the preferred ones. The latter is the case for (medium to large) fixed size probability proportional-to-size sampling designs. Without replacement, probability proportional-to-size sampling designs (πps designs) are present in all four papers of this thesis. Brief summaries of the basic ideas behind these designs are given in the papers and can be found in most standard textbooks.
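The definitions above can be illustrated with a small enumeration. The following Python sketch is only an illustration under assumed inputs: the helper name `inclusion_probabilities` and the toy design (simple random sampling without replacement of 2 elements out of 4) are not taken from the thesis. It represents a design as a table of samples with their probabilities and sums those probabilities to obtain the first- and second-order inclusion probabilities.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_probabilities(design):
    """Return first- and second-order inclusion probabilities.

    `design` maps each possible sample (a frozenset of element labels)
    to its selection probability p(s).
    """
    elements = sorted(set().union(*design))
    # pi_k: total probability of all samples that contain element k
    pi = {k: sum(p for s, p in design.items() if k in s) for k in elements}
    # pi_kl: total probability of all samples that contain both k and l
    pi2 = {
        (k, l): sum(p for s, p in design.items() if k in s and l in s)
        for k, l in combinations(elements, 2)
    }
    return pi, pi2


# Toy design (assumed for illustration): simple random sampling without
# replacement of n = 2 elements from N = 4, so every sample is equally likely.
N, n = 4, 2
samples = [frozenset(c) for c in combinations(range(1, N + 1), n)]
design = {s: Fraction(1, len(samples)) for s in samples}

pi, pi2 = inclusion_probabilities(design)
print(pi)   # first-order inclusion probabilities
print(pi2)  # second-order inclusion probabilities
```

Full enumeration is feasible only for very small populations, which is one way to see why the text works with the inclusion probabilities rather than with the design function itself.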
1.3.2 Descriptive planning stage models
There must be a useful and well-guessed relationship between the auxiliary variables and the study variables if the auxiliary variables are to be of assistance in the choice of strategy. In the following, we assume that the structure of such relationships can be caught by linear models $\xi_q$ such that $y_{qk} = \mathbf{x}_{qk}'\boldsymbol{\beta}_q + \varepsilon_{qk}$ with $E_{\xi_q}(\varepsilon_{qk}) = 0$, $V_{\xi_q}(\varepsilon_{qk}) = \sigma_{qk}^2$ and $E_{\xi_q}(\varepsilon_{qk}\varepsilon_{ql}) = 0$ $(k \neq l)$, i.e. for $k = 1, \ldots, N$,
$$\begin{cases} E_{\xi_q}(y_{qk}) = \mathbf{x}_{qk}'\boldsymbol{\beta}_q \\ V_{\xi_q}(y_{qk}) = \sigma_{qk}^2 \end{cases} \tag{1}$$
where $\mathbf{x}_{qk} = (x_{q1k}, \ldots, x_{qjk}, \ldots, x_{qJ_qk})'$ is a suitable set of $J_q$ (positive) auxiliary variables formed from $u_1, \ldots, u_p, \ldots, u_P$, and where $\boldsymbol{\beta}_q = (\beta_{q1}, \ldots, \beta_{qj}, \ldots, \beta_{qJ_q})'$ and $\sigma_{q1}^2, \ldots, \sigma_{qN}^2$ are model parameters.
1.4 Auxiliary variables and the strategy choice
As mentioned previously, the available auxiliary information will influence the survey statistician’s final choice of strategy. The general problem of finding the, in some sense, best strategy has been studied by different authors. In such studies, it seems necessary to accept certain restrictions of the general problem. Sometimes restrictions are made regarding specific classes of estimators, and sometimes the emphasis is placed solely on finding an efficient sampling design with certain properties. Horvitz and Thompson (1952) and Godambe (1955) are classical references that approach the problem, restricting themselves to a class of linear estimators, whilst Neyman (1934) and Dalenius (1950) are examples of references where the sampling design has been focused upon.
1.4.1 The estimator choice
In this thesis, one restriction is that we consider the family of estimators known as GREG (generalized regression) estimators (for details see Särndal, Swensson and Wretman (1992), sections 6.4-6.7). It is a wide family of estimators, however, which includes several well-known estimators used in practice. With use of a model like the one given in equation (1) as a starting point for a GREG estimator, different model expressions give us estimators such as the post-stratified estimator and the simple ratio estimator. A GREG estimator is approximately unbiased even with poorly chosen models, but with a strong relationship between $y$ and $\mathbf{x}$, and a fair knowledge of that relationship, the GREG estimator will, as far as efficiency is concerned, outperform the well-known Horvitz-Thompson estimator which does not use auxiliary information.
1.4.2 The GREG estimator for population totals
The GREG estimator for a population total is defined as
$$\hat{t}_{y_q r} = \hat{t}_{y_q \pi} + (\mathbf{t}_{x_q} - \hat{\mathbf{t}}_{x_q \pi})'\hat{\mathbf{B}}_q. \tag{2}$$
Here, $\hat{t}_{y_q \pi} = \sum_{k \in s} y_{qk}/\pi_k = \sum_s y_{qk}/\pi_k$ is the Horvitz-Thompson, or $\pi$ estimator, $\mathbf{t}_{x_q} = (t_{x_{q1}}, \ldots, t_{x_{qj}}, \ldots, t_{x_{qJ_q}})'$ is a $J_q$-dimensional vector of $\mathbf{x}_q$ totals, $\hat{\mathbf{t}}_{x_q \pi}$ is a vector of the corresponding $\pi$ estimators and
$$\hat{\mathbf{B}}_q = \left(\sum_s \frac{\mathbf{x}_{qk}\mathbf{x}_{qk}'}{c_{qk}\pi_k}\right)^{-1} \sum_s \frac{\mathbf{x}_{qk}\, y_{qk}}{c_{qk}\pi_k} \tag{3}$$
is an estimated vector of regression coefficients, where $c_{qk}$ is a suitable constant.
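As a minimal numerical sketch of equations (2) and (3), the Python code below treats the special case of a single auxiliary variable ($J_q = 1$), so that no matrix inversion is needed. The function names and the sample data are hypothetical and chosen only for illustration. With $c_k = x_k$ the GREG estimator reduces to the ratio estimator, which the example checks numerically.

```python
def ht_total(values, pi):
    """Horvitz-Thompson (pi) estimator: sum over the sample of value / pi_k."""
    return sum(v / p for v, p in zip(values, pi))


def greg_total(y_s, x_s, pi_s, c_s, t_x):
    """GREG estimator (2) for one auxiliary variable (J = 1).

    y_s, x_s, pi_s, c_s hold the sample values of y_k, x_k, pi_k and c_k;
    t_x is the known population total of x.
    """
    # Estimated regression coefficient, equation (3) with scalar x_k
    num = sum(x * y / (c * p) for x, y, c, p in zip(x_s, y_s, c_s, pi_s))
    den = sum(x * x / (c * p) for x, c, p in zip(x_s, c_s, pi_s))
    b_hat = num / den
    # pi estimator plus a regression adjustment for the known x total
    return ht_total(y_s, pi_s) + (t_x - ht_total(x_s, pi_s)) * b_hat


# Hypothetical sample data (not from the thesis)
x_s = [2.0, 5.0, 9.0]
y_s = [4.1, 9.8, 18.5]
pi_s = [0.2, 0.5, 0.6]
t_x = 40.0

# With c_k = x_k the GREG estimator equals the ratio estimator
greg = greg_total(y_s, x_s, pi_s, c_s=x_s, t_x=t_x)
ratio = t_x * ht_total(y_s, pi_s) / ht_total(x_s, pi_s)
print(greg, ratio)
```

For several auxiliary variables the same structure applies with the scalar sums replaced by matrix sums; a linear algebra library would then be the natural choice.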
The Taylor expansion (design) variance of $\hat{t}_{y_q r}$ is given by
$$V(\hat{t}_{y_q r}) = \sum\sum_U (\pi_{kl} - \pi_k\pi_l)\,\frac{E_{qk}}{\pi_k}\,\frac{E_{ql}}{\pi_l} \tag{4}$$
where $E_{qk} = y_{qk} - \mathbf{x}_{qk}'\mathbf{B}_q$ $(k = 1, \ldots, N)$ are finite population fit residuals, with
$$\mathbf{B}_q = \left(\sum_U \frac{\mathbf{x}_{qk}\mathbf{x}_{qk}'}{c_{qk}}\right)^{-1} \sum_U \frac{\mathbf{x}_{qk}\, y_{qk}}{c_{qk}}$$
a finite population regression coefficient. The issue of variance estimation will not be discussed in detail in this thesis. However, for the strategies we consider, we require that at least approximately unbiased estimators for (4) are available.
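To make the double sum in (4) concrete, the following Python sketch evaluates it for a hypothetical population of five elements under simple random sampling without replacement, again with a single auxiliary variable. The design, the data and the function names are assumptions made for illustration. Under this design the double sum should agree with the textbook expression $N^2(1/n - 1/N)S_E^2$, where $S_E^2$ is the population variance of the residuals, and the code compares the two.

```python
def fit_residuals(y, x, c):
    """Finite population residuals E_k = y_k - x_k * B for a scalar x_k."""
    b = (sum(xk * yk / ck for xk, yk, ck in zip(x, y, c))
         / sum(xk * xk / ck for xk, ck in zip(x, c)))
    return [yk - xk * b for yk, xk in zip(y, x)]


def taylor_variance(e, pi, pi2):
    """Double sum in (4); pi2[k][l] is pi_kl, with pi2[k][k] = pi[k]."""
    size = len(e)
    return sum(
        (pi2[k][l] - pi[k] * pi[l]) * (e[k] / pi[k]) * (e[l] / pi[l])
        for k in range(size)
        for l in range(size)
    )


# Hypothetical population of N = 5 elements, SRS without replacement, n = 2
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
N, n = len(y), 2
pi = [n / N] * N
pi2 = [[n / N if k == l else n * (n - 1) / (N * (N - 1)) for l in range(N)]
       for k in range(N)]

e = fit_residuals(y, x, c=x)
v_taylor = taylor_variance(e, pi, pi2)

# Closed form for SRS without replacement: N^2 (1/n - 1/N) S_E^2
e_bar = sum(e) / N
s2_e = sum((ek - e_bar) ** 2 for ek in e) / (N - 1)
v_closed = N ** 2 * (1 / n - 1 / N) * s2_e
print(v_taylor, v_closed)
```

The double loop costs on the order of $N^2$ operations and requires all second-order inclusion probabilities, so for realistic populations a design-specific closed form or an approximation is usually preferable.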
1.4.3 Use of auxiliary variables in the sampling design, the single parameter case
By far the most important application of auxiliary variables in the sampling design is that of stratification. As mentioned above, it goes back to Neyman and it can be justified for several reasons. General reasons given in Särndal et al. (section 3.7) are:
1. Precision requirements of estimates for certain subpopulations, domains, can be assured by using domains as strata.
2. Practical aspects such as nonresponse rates, method of measurement and the quality of auxiliary information may differ between subpopulations, and can be efficiently handled by stratification.
3. The survey organization may be divided into geographical districts, which makes it natural for administrative reasons to let each district be a stratum.
In addition, when the stratification is made on an auxiliary variable which is well-correlated with the important study variables, stratified sampling ensures a good basis for obtaining efficient estimates. Although it is elementary on its own, stratified sampling may be combined with other methods of sampling yielding rather complex sampling designs, e.g. one may consider stratified sampling with probability proportional-to-size sampling (πps sampling) within strata.
1.4.4 Probability proportional-to-size sampling (πps sampling)
As the name proportional-to-size indicates, the elements for a πps sample are selected with probabilities proportional to a size measure. The size measure, $z$, is one of the auxiliary variables $u_p$, or a function (transformation) of one or several auxiliary variables, i.e. $z = h(\mathbf{u})$. Its value has to be known for every population element, and ideally, if the simple $\pi$ estimator is to be used, it should be proportional to the study variable.
We illustrate the ideas behind πps sampling by first looking at the $\pi$ estimator for a single study variable $y$, $\hat{t}_{y\pi} = \sum_s y_k/\pi_k$. We note that, if the inclusion probabilities $\pi_k$ could be made proportional to $y_k$, then the ratio $y_k/\pi_k$ is constant, say $c$, and for every (fixed size design) sample we would get $\hat{t}_{y\pi} = \sum_s y_k/\pi_k = nc$, which has zero variance. However, proportional inclusion probabilities cannot be derived using the study variable since all $y_k$ are unknown. Instead, we choose inclusion probabilities proportional to $z$ whose values in turn should be nearly proportional to the values of $y$. The first-order inclusion probabilities are then given by
$$\pi_k = \frac{n z_k}{\sum_U z_k} \qquad (k = 1, \ldots, N) \tag{5}$$
where $z_k$ is strictly positive, $\sum_U \pi_k = n$ and $\pi_k \leq 1$.
Occasionally, some $z$ values are very large and equation (5) violates $\pi_k \leq 1$. One common method to deal with this is to create a ‘take-all’ stratum, $A$, of those elements and choose them with certainty (i.e. $\pi_k = 1$ for $k \in A$). Then, for the remaining elements ($k \in U - A$), the $\pi_k$ are chosen to be proportional to $z$. The method presented in the fourth paper of this thesis offers another solution, where the numerical optimization method automatically fulfills the restriction $\pi_k \leq 1$.
Henceforth, without-replacement probability proportional-to-size sampling designs based on the size variable $z$ will be denoted by πps($z$), and we will assume that $\pi_k \leq 1$.
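Equation (5) and the take-all adjustment can be sketched in a few lines of Python. The function name, the repeated recomputation after elements are moved to the take-all stratum, and the size values are assumptions made for this illustration; the thesis only states the general idea. Elements whose probability under (5) would exceed one are moved to the take-all stratum, and (5) is then reapplied to the remaining elements with the remaining sample size.

```python
def pips_inclusion_probabilities(z, n):
    """First-order inclusion probabilities proportional to z, capped at 1.

    Returns (pi, take_all), where take_all lists the indices that are
    selected with certainty.
    """
    if not 0 < n <= len(z):
        raise ValueError("need 0 < n <= N")
    if min(z) <= 0:
        raise ValueError("size measures must be strictly positive")
    take_all = set()
    while True:
        rest = [k for k in range(len(z)) if k not in take_all]
        n_rest = n - len(take_all)           # sample size left to allocate
        z_rest = sum(z[k] for k in rest)     # size total outside take-all
        # Elements for which equation (5) would give pi_k > 1
        too_large = [k for k in rest if n_rest * z[k] > z_rest]
        if not too_large:
            break
        take_all.update(too_large)
    pi = [1.0 if k in take_all else n_rest * z[k] / z_rest
          for k in range(len(z))]
    return pi, sorted(take_all)


# Hypothetical size measure with one dominating element
z = [1.0, 2.0, 3.0, 4.0, 50.0]
pi, take_all = pips_inclusion_probabilities(z, n=3)
print(pi, take_all)
```

The recomputation is repeated because removing one very large element can push another element above one in the next round.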
There are a large number of sample selection schemes to implement a πps($z$) design. For example, Brewer and Hanif (1983) list 50 schemes. However, if we exclude random size designs, it has turned out to be hard to devise a scheme for arbitrary sample size $n$ that has a number of desirable properties, e.g. (a) the actual selection of the sample is relatively simple, (b) all first-order inclusion probabilities are strictly proportional to the size variable, (c) the design admits (at least approximately) unbiased estimation of estimator variances. If we also want to be able to base the sample selection on the technique of permanent random numbers (PRN), which is desirable in large survey organizations taking many surveys, some of which are repeated over time, it will be even harder.
Relatively new πps designs, attractive to practitioners, are the order sampling designs like sequential Poisson sampling (see Ohlsson (1990, 1998)) and Pareto πps (see Rosén (1997a,b) and Saavedra (1995)), and the PoMix design proposed by Kröger, Särndal and Teikari (1999, 2000). In papers I–II, the properties of strategies based on these designs are compared with model based stratified simple random sampling (mbSTSI) as proposed by Wright (1983).
When πps sampling is relevant, we have no reason to restrict ourselves to the $\pi$ estimator. Since we have auxiliary information at hand we can use the GREG estimator. Let us study πps sampling in the light of a special case of model (1) and suppose that the following first and second moments of the superpopulation make sense (where $\sigma^2$ is a (possibly unknown) constant, $\gamma$ is a constant reflecting the degree of heteroscedasticity and $u_k > 0$):
$$\begin{cases} E_\xi(y_k) = \beta u_k \\ V_\xi(y_k) = \sigma^2 u_k^{\gamma} \end{cases} \tag{6}$$
With this description of a population structure, the observations in a finite population scatterplot of $y_k$ and $u_k$ would gather around the line $\beta u_k$, with a larger spread around the line for larger values of $u_k$ if $\gamma > 0$. When $\gamma = 1$ and $c_k = \sigma_k^2 = \sigma^2 u_k$, the GREG estimator following from this model (sometimes called the common ratio model) is the well-known ratio estimator
$$\hat{t}_{yr} = \sum_U u_k \, \frac{\sum_s y_k/\pi_k}{\sum_s u_k/\pi_k}.$$
If the model structure given by (6) reasonably well portrays the finite population scatter, it indicates a nearly proportional relationship between $y_k$ and $u_k$ in the finite population. The $u_k$ $(k = 1, \ldots, N)$ are known and a πps($u$) design yields
$$\hat{t}_{yr} = \sum_s y_k/\pi_k. \tag{7}$$
Hence, the ratio estimator in this case coincides with the $\pi$ estimator. Heteroscedastic patterns are common in actual populations. Brewer (1963, 2002) discusses this in detail and suggests that $\gamma$ often is in the interval [1, 2].
Remark. Textbooks often refer to a πps($u$) design combined with the $\pi$ estimator as a very efficient strategy, particularly when the variance structure is strongly heteroscedastic, say $V_\xi(y_k) = \sigma^2 u_k^2$, i.e. $\gamma = 2$. However, from the studies in this thesis we note that it usually is possible to find other strategies that work better.
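As an illustration of the order sampling idea and of equation (7), the Python sketch below selects a fixed size sample by Pareto πps sampling and then evaluates the $\pi$ estimator and the ratio estimator. The ranking variable follows the usual description of Pareto sampling (the odds of a uniform random number divided by the odds of the target inclusion probability, with the $n$ smallest values selected). The data, the seed and the function name are hypothetical, and the target probabilities are used in place of the true inclusion probabilities, which for Pareto πps they match only approximately.

```python
import random


def pareto_pips_sample(lam, n, rng):
    """Select n indices by Pareto order sampling with target probabilities lam."""
    ranking = []
    for k, lam_k in enumerate(lam):
        u_k = rng.random()  # uniform random number on [0, 1)
        # Ranking variable: odds of the random number over odds of lam_k
        q_k = (u_k * (1.0 - lam_k)) / (lam_k * (1.0 - u_k))
        ranking.append((q_k, k))
    # The n elements with the smallest ranking values form the sample
    return sorted(k for _, k in sorted(ranking)[:n])


# Hypothetical size variable u and study variable y (not from the thesis)
u = [3.0, 5.0, 8.0, 12.0, 20.0, 7.0, 9.0, 16.0]
y = [6.5, 9.0, 17.0, 25.0, 38.0, 15.0, 17.5, 33.0]
n = 3
t_u = sum(u)
lam = [n * uk / t_u for uk in u]  # targets from equation (5), all below 1

sample = pareto_pips_sample(lam, n, random.Random(2003))

# pi estimator and ratio estimator, using lam as approximate pi_k
ht = sum(y[k] / lam[k] for k in sample)
ratio = t_u * ht / sum(u[k] / lam[k] for k in sample)
print(sample, ht, ratio)
```

Because the targets are exactly proportional to $u$, the denominator of the ratio estimator equals the known total of $u$ for every sample, so the two estimates coincide as stated in (7). Keeping the random numbers fixed per element across surveys is what makes this kind of scheme compatible with permanent random numbers.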
1.4.4 The Anticipated Variance
When we plan a survey, we need some measure that can help us to discriminate and choose between alternative strategies. Obviously, since only auxiliary information is available, this measure must be a function of auxiliary information. One such measure is the anticipated variance, defined by Isaki (1970) (see also Isaki and Fuller (1982)). If $\hat{t}_{y_q r}$ is an estimator of $t_{y_q}$ and if (1) is interpreted as moments of a superpopulation from which a finite population has been selected, the anticipated variance is the variance of $\hat{t}_{y_q r} - t_{y_q}$ over both the model $\xi$ and the sampling design, i.e.
$$E_\xi E_p\left[(\hat{t}_{y_q r} - t_{y_q})^2\right] - \left[E_\xi E_p(\hat{t}_{y_q r} - t_{y_q})\right]^2.$$
If the model $\xi$ is well specified, then an approximation to the anticipated variance can be written as (see Särndal et al. pp 450–451),
$$ANV(\hat{t}_{y_q r}) = \sum_U (\pi_k^{-1} - 1)\,\sigma_{qk}^2. \tag{8}$$
By using equation (8), we find a justification for choosing a πps design with a certain size measure. For a sampling design such that $E_p(n_s) = n$, Result 12.2.1 in Särndal et al. shows that equation (8) is minimized if the inclusion probabilities, $\pi_k$, are chosen proportional to $\sigma_{qk}$, i.e. a πps($\sigma_q$) design. The minimum value of $ANV(\hat{t}_{y_q r})$ is then
$$ANV_{\min}(\hat{t}_{y_q r}) = \sum_U (\pi_{k(opt)}^{-1} - 1)\,\sigma_{qk}^2 = \frac{1}{n}\left(\sum_U \sigma_{qk}\right)^2 - \sum_U \sigma_{qk}^2. \tag{9}$$
If we apply this result to the common ratio model in equation (6), the optimal inclusion probabilities that attain $ANV_{\min}(\hat{t}_{yr})$ are
$$\pi_k = \pi_{k(opt)} = \frac{n u_k^{\gamma/2}}{\sum_U u_k^{\gamma/2}} \qquad (k = 1, \ldots, N)$$
hence, $\pi_k \propto z_k = u_k^{\gamma/2}$.
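Equations (8) and (9) are easy to check numerically. The Python sketch below uses hypothetical planning values for $u_k$, $\gamma$ and $\sigma^2$ (none of them from the thesis) and compares the approximate anticipated variance of three designs with the same expected sample size: the optimal design with $\pi_k \propto u_k^{\gamma/2}$, a πps($u$) design, and equal inclusion probabilities $n/N$.

```python
def anticipated_variance(pi, sigma2):
    """Approximate anticipated variance, equation (8)."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(pi, sigma2))


def pips_from_power(u, power, n):
    """Inclusion probabilities proportional to u_k ** power (assumed <= 1)."""
    size = [uk ** power for uk in u]
    total = sum(size)
    return [n * zk / total for zk in size]


# Hypothetical planning values for the common ratio model (6)
u = [1.0, 2.0, 4.0, 8.0, 16.0, 3.0, 6.0, 5.0]
n, gamma, sigma_sq = 2, 1.0, 0.5
sigma2 = [sigma_sq * uk ** gamma for uk in u]   # model variances
sigma = [s2 ** 0.5 for s2 in sigma2]

pi_opt = pips_from_power(u, gamma / 2.0, n)     # pi_k proportional to sigma_k
pi_u = pips_from_power(u, 1.0, n)               # pips(u) design
pi_eq = [n / len(u)] * len(u)                   # equal probabilities

anv_opt = anticipated_variance(pi_opt, sigma2)
anv_u = anticipated_variance(pi_u, sigma2)
anv_eq = anticipated_variance(pi_eq, sigma2)
anv_min = sum(sigma) ** 2 / n - sum(sigma2)     # right-hand side of (9)
print(anv_opt, anv_min, anv_u, anv_eq)
```

With $\gamma = 1$ the optimal size measure is $\sqrt{u_k}$ rather than $u_k$, so the πps($u$) design gives a larger value of (8) than the optimum, which is in line with the remark that strategies other than πps($u$) combined with the $\pi$ estimator can work better.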
For fixed size designs, other authors have presented results also indicating that an optimal design is attained when $\pi_k \propto \sigma_k$: Hajek (1959) (linear design-unbiased estimators), Brewer (1963) (ratio estimation) and Cassel, Särndal and Wretman (1976, 1977) (generalized difference estimators). When the $\pi$ estimator coincides with the ratio estimator (see equation (7)), Godambe also gives a similar result. Rosén (2000) used the result by Cassel, Särndal and Wretman to study optimal strategies by combining GREG estimation and Pareto πps sampling. Wright’s proposal of mbSTSI is inspired by the wish to find a simplified design with inclusion probabilities that come close to the optimal.
1.4.5 Planning values and guesstimates
We need a notation to distinguish quantities which are used in the planning process from values which hypothetically are considered to be true values. Let us call these planning values ‘guesstimates’, since they reflect quantifications of the surveyor’s belief or guess about the structure in the finite population. In this summary and in papers II–IV, guesstimates that are used for calculations will be marked by the tilde symbol. For example, a guesstimate of the variance structure will be denoted by $\tilde{\sigma}_k^2$ and a guesstimate of the $\gamma$ parameter discussed earlier will be denoted by $\tilde{\gamma}$. In the first paper a somewhat different notation is used.
1.5 The multivariate sampling problem
“It is often better to have reasonably good solutions of the proper problems than optimum solutions of the wrong problems.” Herbert Robbins
Most sample surveys are multivariate. When a multivariate survey is planned, the design choice is – since it affects all estimates – relatively more important than the choice of estimators. Although a good estimator sometimes can compensate for a poorly chosen design and give an acceptable precision, it is also likely that precision is lost compared to what is achievable with a better design. The theoretical results described in section 1.4 and in the survey literature are often not sufficient guides for finding a good sampling design for the multivariate situation. In most presentations, the concept of ‘optimum’ is tied to a single study variable. Examples are to be found in the theory above (choice of unequal inclusion probabilities), in the theory of optimum stratification and optimum allocation in stratified sampling and in the theory of multi-stage sampling, where the number of units in each stage can be determined optimally. A simple reason why few papers deal with the multivariate survey problem might be that there is no evident criterion of optimality. For example, a design that is optimal or close to optimal in the single parameter case, may not be the best choice in an overall multiparameter sense. Obviously, when we choose our design we must find a compromise based on some kind of multivariate consideration.
It seems as if there are three often used ways to approach the multivariate problem in practice:
(i) Ignore it and reduce it to a single variable problem through a snap judgment based on ‘experience’.
(ii) For each of the important parameters, study the effect of various single-variable dependent actions, and then select a design that seems to be the best compromise.
(iii) Find some kind of overall multivariate criterion, perhaps mechanically by generalizing single variable concepts, and choose the action that optimises this criterion.
The first choice may not be as bad as it may sound. In many situations, a few important study variables may have similar properties, and the total loss in precision following from using one of them might be acceptable. However, at the planning stage, time should be spent carefully studying the overall effects of choosing such a simplistic approach. Otherwise, it might turn out to be a bad practice. The second approach is better since it means that a design choice is made on the basis of considerations of the multivariate situation. An example that fits into this approach, called maximal Brewer selection, is described by Kott and Bailey (2000). It is not an optimisation-based method, but it is simple to implement and it guarantees the desired precision of all estimators considered in the planning. Various methods of averaging key measures may also be placed in this category of approaches. One is to select a stratified sample with an allocation based on the average of various single-variable optimum allocations; another, in the case of $\pi$ps($\sigma$) sampling, is to choose the size measure as the average or median of $Q$ different single-variable $\sigma_k$’s.
For multivariate stratified sampling a number of authors have considered the third (optimisation-based) approach category, e.g. Dalenius (1957), Chatterjee (1968, 1972), Huddleston, Claypool and Hocking (1970), Danielsson (1975), Hughes and Rao (1979), Bethel (1989). In this thesis, the multivariate problem is also treated using an optimisation-based approach. The results on univariate optimal designs from section 1.4 are extended to the multivariate case. The advantage of using this kind of approach is that, at least theoretically, optimal solutions are obtained. Moreover, as we choose the optimisation criterion, we also attain certain control of the desired design properties, as opposed to relying on intuitive ad hoc solutions. However, as for all other approaches the success will ultimately depend on the chosen criterion and the underlying assumptions made by the survey designer. A possible drawback with an optimisation-based approach is cumbersome computations. However, in the present case the methods proposed for the multivariate situation are only slightly more complicated than single-variable methods. In addition, the methods suggested in papers III–IV are to a large degree based on computations that should be executed anyway in any serious planning effort.
2 Summary of the papers
In this section, the aims and the main conclusions of the four papers are summarized in chronological order.
2.1 Reflections on unequal probability sampling strategies*
In section 1.4.4, it was noted that an optimal design in the sense of minimizing $\mathrm{ANV}(\hat{t})$ (an approximation to the anticipated variance of a GREG estimator) is such that the first-order inclusion probabilities are chosen proportional to $\sigma_k$, i.e. a $\pi$ps($\sigma$) design. If a $\pi$ps($\sigma$) design is a preferred design, the designs (sampling schemes) studied in paper I can be applied in an attempt to implement it. However, none of the fixed size designs will yield an exact $\pi$ps($\sigma$) design, and therefore it is interesting to compare their properties. Another aim is to study how robust the sampling designs are against misjudgements of $\sigma_k$ made at the planning stage.
* Paper I is a reprint of Holmberg and Swensson (2001), which (when originally published) was mistakenly entitled ‘On Pareto $\pi$ps sampling: Reflections on unequal probability sampling’.
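The idea of aiming at a $\pi$ps($\sigma$) design and implementing it with a fixed size scheme can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the size values `z`, the choice $\sigma_k = z_k^{1/2}$, the function names and the random seed are all hypothetical, and Pareto sampling is written in its usual form with ranking variables $[U_k/(1-U_k)]/[\lambda_k/(1-\lambda_k)]$ and selection of the $n$ smallest. As noted above, the inclusion probabilities realised by such a scheme only approximate the target values.

```python
import random


def pips_probabilities(size, n):
    """Target inclusion probabilities proportional to `size`, summing to n.

    Assumes n * size_k / sum(size) < 1 for every unit (no take-all units).
    """
    total = sum(size)
    probs = [n * s / total for s in size]
    if max(probs) >= 1.0:
        raise ValueError("a unit would get probability >= 1; treat take-all units first")
    return probs


def pareto_pips_sample(probs, n, rng):
    """Pareto sampling: keep the n units with the smallest ranking variables."""
    ranking = []
    for k, lam in enumerate(probs):
        u = rng.random()  # uniform on [0, 1)
        # Ranking variable [U_k / (1 - U_k)] / [lambda_k / (1 - lambda_k)].
        q = (u / (1.0 - u)) / (lam / (1.0 - lam))
        ranking.append((q, k))
    ranking.sort()
    return sorted(k for _, k in ranking[:n])


# Hypothetical planning values: sigma_k = z_k ** (gamma / 2) with gamma = 1.
z = [4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
sigma = [zk ** 0.5 for zk in z]
target = pips_probabilities(sigma, n=3)
sample = pareto_pips_sample(target, n=3, rng=random.Random(2003))
```

The sample always has the fixed size $n$, which is the practical attraction of this kind of scheme compared with Poisson $\pi$ps sampling.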
The paper compares the properties of $\pi$ps sampling designs such as Poisson $\pi$ps and Pareto $\pi$ps, with Poisson mixture (PoMix) sampling and model-based simple random stratified sampling (mbSTSI). Combined with a GREG estimator, these sampling designs are serious alternatives for survey designers who wish to use auxiliary information in the sampling design as well as at the estimation stage. We compare the design properties by defining a penalty measure as the ratio between the $\mathrm{ANV}(\hat{t})$ of a specific design and the minimum anticipated variance $\mathrm{ANV}_{\min}(\hat{t})$, i.e.,

$$\mathrm{AVP}(\tilde{\gamma}, B) = \frac{\mathrm{ANV}(\hat{t})(\tilde{\gamma}, B)}{\mathrm{ANV}_{\min}(\hat{t})}. \tag{9}$$
By using (9), we can study the increase in approximate anticipated variance obtained when a non-optimal $\pi$ps($z^{\tilde{\gamma}/2}$) design is used. $\tilde{\gamma}$ is a guesstimate of the degree of heteroscedasticity in the supposed variance structures, and $B$ is used to differentiate between the studied PoMix designs. $B$ is a (transformation) parameter called the Bernoulli width. (If $B = 0$, PoMix is equal to the original design induced by the sampling scheme, e.g. Pareto or Poisson sampling.) For mbSTSI a corresponding penalty measure without $B$ is used. In tables 1–5 of the paper, we see the effects of different assumed variance structures in terms of the penalty measure. We note the following:
(i) Compared to the $\pi$ps designs, mbSTSI seems to be less sensitive to a large value of $|\tilde{\gamma} - \gamma|$ (the absolute error of the heteroscedastic pattern guesstimate).
(ii) The variance increase from using PoMix sampling is larger when $\tilde{\gamma} < \gamma$ than when $\tilde{\gamma} > \gamma$. In fact, it is only when $\tilde{\gamma} > \gamma$ that a value other than $B = 0$ is of assistance, i.e. PoMix is only of potential interest if the survey designer in the planning process happens to overestimate $\gamma$.
(iii) However, when $\tilde{\gamma} > \gamma$ there exists an optimal $B$ which makes the variance penalty of a PoMix design very small. Moreover, when $\tilde{\gamma} > \gamma$, PoMix sampling with $B = 0.3f$ (where $f = n/N$, i.e. the sampling fraction) is less sensitive to deviations of $\tilde{\gamma}$ from $\gamma$ than direct $\pi$ps($z^{\tilde{\gamma}/2}$) sampling.
(iv) Finally, since different study variables may have different degrees of heteroscedasticity, planning for good variance properties of one estimator may be bought at a high price for estimators of other study variable totals.
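A penalty measure of the kind in (9) is easy to evaluate at the planning stage. The following sketch covers only the case without a Bernoulli width (i.e. a direct $\pi$ps design), and it rests on assumptions that are not taken from paper I: the anticipated variance is approximated by the standard expression $\sum_U (1/\pi_k - 1)\sigma_k^2$, the true variance structure is taken as $\sigma_k^2 = z_k^{\gamma}$, and the size values and function names are hypothetical.

```python
def proportional_probs(size, n):
    """Inclusion probabilities proportional to `size`, scaled to sum to n."""
    total = sum(size)
    probs = [n * s / total for s in size]
    if max(probs) >= 1.0:
        raise ValueError("a unit would get probability >= 1")
    return probs


def anv(probs, sigma2):
    """Approximate anticipated variance: sum over U of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(probs, sigma2))


def variance_penalty(z, n, gamma_true, gamma_guess):
    """Ratio of the ANV under the guessed design to the minimum ANV."""
    sigma2 = [zk ** gamma_true for zk in z]
    # Optimal design: pi_k proportional to sigma_k = z_k ** (gamma / 2).
    optimal = proportional_probs([zk ** (gamma_true / 2.0) for zk in z], n)
    # Design actually used: built from the guesstimate of gamma.
    used = proportional_probs([zk ** (gamma_guess / 2.0) for zk in z], n)
    return anv(used, sigma2) / anv(optimal, sigma2)


# Hypothetical size measure and a few guesstimates around a true gamma of 1.
z = [4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
penalties = {
    guess: variance_penalty(z, n=3, gamma_true=1.0, gamma_guess=guess)
    for guess in (0.0, 0.5, 1.0, 1.5, 2.0)
}
```

The penalty equals one when the guesstimate is correct and exceeds one otherwise, so a table of such values over a grid of guesstimates gives a quick picture of how sensitive a design is to a misjudged variance structure.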
2.2 On the choice of strategy in unequal probability sampling
The second paper is a follow-up study of the first paper. Again, Pareto $\pi$ps, PoMix and mbSTSI are compared. However, in this paper different strategy choices in a bivariate situation are looked at. There are two key parameters and strong auxiliary information at hand for the design as well as for estimation purposes. Besides a discussion of Pareto $\pi$ps, PoMix and mbSTSI, there is an examination of some other questions. When we have strong auxiliary information, will there be any significant variance decrease from using a $\pi$ps($\sigma$) design and a GREG estimator instead of (i) using auxiliary information for estimation purposes only, or (ii) using a classical (non-optimal) $\pi$ps design combined with the $\pi$ estimator, as discussed in connection with equation (7) in section 1.4.3? Moreover, is it possible to more or less fully compensate a non-optimal design choice by a suitable estimator choice? As a final point, which of the studied strategies is the best in an overall perspective?

As in paper I, the comparisons indicate that we cannot rule out the use of mbSTSI. By means of results from a Monte Carlo study, we note that among the studied GREG estimators it does not seem to matter much which estimator we choose. Using auxiliary information in the estimator only is usually better than using auxiliary information in the design only. However, using auxiliary information in both is the best strategy, yielding the smallest variances. PoMix works better than Pareto $\pi$ps when the heteroscedastic pattern is stronger. Nevertheless, for most strategies both these sampling designs yield somewhat higher variances than mbSTSI.

In paper II as well as in the first paper, we see that we obtain small variances by using the auxiliary information and choosing a $\pi$ps($\sigma$) design. However, if we specify the variance structure badly and, as a result, choose a non-optimal design, the variance will increase and a good estimator choice is not enough to fully compensate. Another concern appears when we have several study variables with different variance structures. Then, a $\pi$ps($\sigma$) design based on a single study variable, $y$, might work well for that variable, but for other variables the variances may become larger than acceptable. In the third and fourth papers there is a discussion of designs adapted to the multivariate situation.
2.3 A multiparameter perspective on the choice of sampling design in surveys
When we choose a strategy in the multivariate situation, we can always choose different estimators for different study variable totals, $t_q$ ($q = 1, \ldots, Q$). However, we can only choose one design and that choice will affect all the parameter estimates of the survey. One goal might be to try to find a design such that the estimator variances deviate as little as possible from the estimator variances resulting from single-variable optimal designs. A good design in a multivariate situation should not yield very good precision for some estimators at the expense of unacceptably high variances for others. In other words, we should search for a ‘best’ compromise.

Three approaches based on different criteria are presented in order to achieve a compromise design for a multivariate situation. In principle, all methods presented are generalizations of the single variable solution outlined in section 1.4.4. Hence, they are based on approximations of the anticipated variance of the GREG estimator and they are optimal under the given criterion. For each study variable, auxiliary information is used to specify a linear model with a specific variance structure, as in equation (1). Thereafter, we derive the anticipated variances, $\mathrm{ANV}(\hat{t}_q)$ ($q = 1, \ldots, Q$), and present three approaches based on functions of $\mathrm{ANV}(\hat{t}_q)$, to attain a compromise design.
The approach that seems most promising is a design that minimizes a loss function, the anticipated overall relative efficiency loss (ANOREL), defined as

$$\mathrm{ANOREL} = \sum_{q=1}^{Q} H_q \, \frac{\mathrm{ANV}(\hat{t}_q) - \mathrm{ANV}_{\min}(\hat{t}_q)}{\mathrm{ANV}_{\min}(\hat{t}_q)} \tag{10}$$

where $H_q$ ($q = 1, \ldots, Q$) are weights (summing up to unity) that reflect the relative importance of the parameters to be estimated. The measure in equation (10) is similar to the measure discussed by Dalenius (1957, chapter 9) to find a compromise allocation for stratified sampling in the multiparameter case. Furthermore, we show that equation (10) is minimized if the design is such that the first-order inclusion probabilities are given by
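
$$\pi_k = n \, \frac{\sqrt{\sum_{q=1}^{Q} H_q \, \sigma_{qk}^2 \Big/ \sum_{U} \big(\pi_{k(q)}^{-1} - 1\big)\, \sigma_{qk}^2}}{\sum_{U} \sqrt{\sum_{q=1}^{Q} H_q \, \sigma_{qk}^2 \Big/ \sum_{U} \big(\pi_{k(q)}^{-1} - 1\big)\, \sigma_{qk}^2}}. \tag{11}$$
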
This expression is not as cumbersome as it may appear. It involves the single-variable optimal inclusion probabilities $\pi_{k(q)}$ and the variance structure $\sigma_{qk}^2$. In a carefully planned multiparameter survey, guesstimates, i.e. $\tilde{\pi}_{k(q)}$ and $\tilde{\sigma}_{qk}^2$, of both these quantities should be computed in any case.

We study the suggested compromise approaches for a practical case. Subsequently, auxiliary information is used to make computations and illustrate suitable planning phase analysis. A diagnostic table is used for comparisons between the designs at our disposal and for a final choice of design.
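To illustrate how little computation the ANOREL compromise requires, the sketch below evaluates inclusion probabilities proportional to $\sqrt{\sum_q H_q \sigma_{qk}^2 / \mathrm{ANV}_{\min}(\hat{t}_q)}$ and the resulting loss for two hypothetical study variables. It assumes the approximation $\mathrm{ANV}(\hat{t}_q) \approx \sum_U (1/\pi_k - 1)\sigma_{qk}^2$ and that no unit receives a probability of one or more; the planning values, weights and function names are invented for the illustration and are not taken from paper III.

```python
import math


def proportional_probs(size, n):
    """Inclusion probabilities proportional to `size`, scaled to sum to n."""
    total = sum(size)
    probs = [n * s / total for s in size]
    if max(probs) >= 1.0:
        raise ValueError("a unit would get probability >= 1")
    return probs


def anv(probs, sigma2):
    """Approximate anticipated variance: sum over U of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(probs, sigma2))


def minimum_anvs(sigma2_by_var, n):
    """ANV of each variable under its own single-variable optimal design."""
    result = []
    for sigma2 in sigma2_by_var:
        single = proportional_probs([math.sqrt(s2) for s2 in sigma2], n)
        result.append(anv(single, sigma2))
    return result


def compromise_probs(sigma2_by_var, weights, n):
    """pi_k proportional to sqrt(sum_q H_q * sigma_qk^2 / ANVmin_q)."""
    anv_min = minimum_anvs(sigma2_by_var, n)
    size = [
        math.sqrt(sum(h * s2[k] / a for h, s2, a in zip(weights, sigma2_by_var, anv_min)))
        for k in range(len(sigma2_by_var[0]))
    ]
    return proportional_probs(size, n)


def anorel(probs, sigma2_by_var, weights, n):
    """Weighted sum of relative efficiency losses over the study variables."""
    anv_min = minimum_anvs(sigma2_by_var, n)
    return sum(
        h * (anv(probs, s2) - a) / a
        for h, s2, a in zip(weights, sigma2_by_var, anv_min)
    )


# Hypothetical guesstimates: sigma_qk^2 = z_k ** gamma_q with gamma = 0.5 and 2.
z = [4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
sigma2_by_var = [[zk ** 0.5 for zk in z], [zk ** 2.0 for zk in z]]
weights = [0.5, 0.5]

pi_compromise = compromise_probs(sigma2_by_var, weights, n=3)
pi_single_1 = proportional_probs([math.sqrt(s2) for s2 in sigma2_by_var[0]], 3)
pi_single_2 = proportional_probs([math.sqrt(s2) for s2 in sigma2_by_var[1]], 3)
loss_compromise = anorel(pi_compromise, sigma2_by_var, weights, n=3)
loss_single_1 = anorel(pi_single_1, sigma2_by_var, weights, n=3)
loss_single_2 = anorel(pi_single_2, sigma2_by_var, weights, n=3)
```

Comparing `loss_compromise` with the losses of the two single-variable optimal designs is a small-scale version of the diagnostic comparison described above.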
2.4 On the choice of optimal sampling design in business surveys with several important study variables
A general aim of a multiparameter survey is to find a design that meets the desired levels of precision for all important target parameters simultaneously. If we choose our design according to one of the criteria presented in the third paper, we will achieve a minimization of a summary measure without taking into account that precision requirements may be different for different estimates. In paper IV, the approaches of paper III are generalized (and slightly extended) by applying non-linear programming algorithms. Non-linear programming methods have previously been used in the case of similar problems related to optimal allocation in stratified sampling. Cochran (1977, section 5A.4) provides a list of early references; Chromy (1987) and Bethel (1989) are references of later date. For Poisson $\pi$ps sampling and the $\pi$ estimator, Sigman and Monsour (1995) sketched a procedure using non-linear programming similar to that of paper IV. Saavedra (1999) applied these ideas using the algorithm proposed by Chromy to determine probabilities to be used for Pareto $\pi$ps sampling in a price and volume petroleum product survey.

The general problem studied in paper IV can be described as follows. A sampling design is to be determined to minimize a function $f$ of all $Q$ estimator variances and fulfil specified restrictions, $v_q$, made on a function $g$ of each estimator variance (or variance approximation), i.e. minimize

$$f\big(g(V(\hat{t}_1)),\, g(V(\hat{t}_2)),\, \ldots,\, g(V(\hat{t}_Q))\big)$$

with restrictions

$$g(V(\hat{t}_q)) \le v_q \quad (q = 1, \ldots, Q). \tag{12}$$

For the general problem above, the paper suggests solutions for the cases when $f$ is either the weighted arithmetic or geometric mean and $g$ is one of the functions given by the three approaches in paper III, i.e. (i) a variance, (ii) a relative variance, or (iii) the (recommended) relative efficiency loss. Moreover, since the methods are planning stage means, the unknown estimator variances in (12) are replaced by $\mathrm{ANV}(\hat{t}_q)$ ($q = 1, \ldots, Q$).
The theoretical background of the applied non-linear programming model is given as well as an application using a Swedish business population. The studies indicate that the methods work both theoretically and practically, hence giving us the best compromise design in a multivariate situation, at least for not too large populations. The two foremost points of the paper are to suggest a flexible solution regarding how to use auxiliary information exhaustively in the design planning, and to provide diagnostic support for the final design choice.
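The structure of problem (12) can be illustrated without a non-linear programming solver. The sketch below is not the algorithm of paper IV: it simply screens a small, hypothetical set of candidate designs, taking $g$ as the relative efficiency loss and $f$ as the weighted arithmetic mean, and keeps the feasible candidate with the smallest value of $f$. The planning values, the limits $v_q$, the candidate designs and all function names are assumptions made for the illustration, as is the approximation $\mathrm{ANV}(\hat{t}_q) \approx \sum_U (1/\pi_k - 1)\sigma_{qk}^2$.

```python
import math


def proportional_probs(size, n):
    """Inclusion probabilities proportional to `size`, scaled to sum to n."""
    total = sum(size)
    probs = [n * s / total for s in size]
    if max(probs) >= 1.0:
        raise ValueError("a unit would get probability >= 1")
    return probs


def anv(probs, sigma2):
    """Approximate anticipated variance: sum over U of (1/pi_k - 1) * sigma_k^2."""
    return sum((1.0 / p - 1.0) * s2 for p, s2 in zip(probs, sigma2))


def relative_losses(probs, sigma2_by_var, n):
    """g for each variable: (ANV - ANVmin) / ANVmin."""
    losses = []
    for sigma2 in sigma2_by_var:
        best = proportional_probs([math.sqrt(s2) for s2 in sigma2], n)
        a_min = anv(best, sigma2)
        losses.append((anv(probs, sigma2) - a_min) / a_min)
    return losses


def choose_design(candidates, sigma2_by_var, weights, limits, n):
    """Feasible candidate with the smallest weighted mean loss, or None."""
    chosen = None
    for name, probs in candidates.items():
        losses = relative_losses(probs, sigma2_by_var, n)
        if any(g > v for g, v in zip(losses, limits)):
            continue  # violates at least one restriction g_q <= v_q
        f = sum(h * g for h, g in zip(weights, losses))
        if chosen is None or f < chosen[1]:
            chosen = (name, f, losses)
    return chosen


# Hypothetical planning values: sigma_qk^2 = z_k ** gamma_q, gamma = 0.5 and 2.
z = [4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
sigma2_by_var = [[zk ** 0.5 for zk in z], [zk ** 2.0 for zk in z]]
n = 3
candidates = {
    "equal": [n / len(z)] * len(z),
    "pips_sqrt_z": proportional_probs([math.sqrt(zk) for zk in z], n),
    "pips_z": proportional_probs(z, n),
}
result = choose_design(candidates, sigma2_by_var, [0.5, 0.5], [0.25, 0.25], n)
```

Replacing the finite candidate set by a search over all inclusion probabilities, subject to the same restrictions, gives the kind of optimisation problem that paper IV addresses with non-linear programming.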
2.5 Minor comments and concluding remarks
Finally, some minor but interesting observations made during the work on this thesis are offered. The practical applicability of the studied methods has been an underlying guide in the work on these four papers. The results and methods of the last two papers are general in the sense that they allow a flexible use of the auxiliary information, estimators and sampling schemes. On the issue of finding a ‘best’ strategy for multiparameter surveys, the design should be in focus, primarily because the overall loss in precision due to a poorly chosen design seems to be relatively larger than the loss obtained by using a less efficient member (or members) of the GREG estimator family. We can implement the presented compromise designs by taking advantage of the recent developments in $\pi$ps sampling, which provide us with practically feasible sampling schemes, for instance the Pareto $\pi$ps sampling scheme. The first and the second paper give us experience of advantages and drawbacks of such schemes.

As far as the estimators are concerned, the results have been presented on the basis of the class of GREG estimators. Different GREG estimators are explicitly compared in paper II, but in the other papers the specific expression of the GREG estimator is of secondary importance, and in the third and fourth papers it has been tacitly understood that the models are well specified and that the survey designer can make a good estimator choice. In our setting, i.e. with well-chosen models and reasonably large, direct element samples, the designs presented will provide efficient estimates when combined with a GREG estimator. It is not likely that there is much to gain in variance reduction by using estimators outside the GREG family, for instance calibration estimators (Deville and Särndal, 1992; Lundström and Särndal, 1999; Estevao and Särndal, 2000) or the (less practical) optimal regression estimator (see Rao, 1992, 1994, 1997, and Montanari, 1987, 1998).
Although the anticipated variance and the $\mathrm{ANV}(\hat{t})$ are conceptually different from the design variance of an estimator (e.g. $V(\hat{t})$ in equation (4) in section 1.4.2), it is of interest to note the similarities between table 2 and table 3 given in paper III (as well as similarities between the tables in paper IV). They suggest that planning stage computations such as those illustrated could be a helpful tool to anticipate how different design choices affect estimator variances.
References
Bethel, J. (1989). Sample Allocation in Multivariate Surveys. Survey Methodology, 47–57.
Brewer, K.R.W. (1963). Ratio estimation and finite populations: Some results deducible from the assumption of an underlying stochastic process. Australian Journal of Statistics, 93–105.
Brewer, K.R.W. (2002). Combined Survey Sampling Inference: Weighing Basu’s Elephants. Arnold, London.
Brewer, K.R.W. and Hanif, M. (1983). Sampling with unequal probabilities. Springer-Verlag, New York.
Cassel, C.M., Särndal, C-E. and Wretman, J. (1976). Some results on generalized difference estimators and generalized regression estimators for finite populations. Biometrika, 615–620.
Cassel, C.M., Särndal, C-E. and Wretman, J. (1977). Foundations of Inference in Survey Sampling. Wiley & Sons, New York.
Chatterjee, S. (1968). Multivariate stratified surveys. Journal of the American Statistical Association, 530–534.
Chatterjee, S. (1972). A study of optimum allocation in multivariate stratified surveys. Skandinavisk aktuarietidskrift, 73–80.
Chromy, J. (1987). Design Optimization with Multiple Objectives. Proceedings of the Section on Survey Research Methods, American Statistical Association, 194–199.
Cochran, W.G. (1942). Sampling theory when the sampling units are of unequal sizes. Journal of the American Statistical Association, 199–212.
Cochran, W.G. (1977). Sampling Techniques. Wiley, New York.
Dalenius, T. (1950). The problem of optimum stratification. Skandinavisk aktuarietidskrift, 203–213.
Dalenius, T. (1957). Sampling in Sweden. Almqvist & Wiksell, Stockholm.
Dalenius, T. (1992). Introduction to Neyman (1934) On the two different aspects of the representative method: The method of stratification and the method of purposive selection. In Kotz, S. and Johnson, N.L. (eds.) Breakthroughs in Statistics, volume II: Methodology and Distribution. Springer, New York, 114–122.
Danielsson, S. (1975). Optimal allokering vid vissa klasser av urvalsförfaranden. Ph.D. Thesis, Department of Mathematics, University of Linköping, Sweden.
Deville, J-C. and Särndal, C-E. (1992). Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 376–382.
Estevao, V. and Särndal, C-E. (2000). A Functional Form Approach to Calibration. Journal of Official Statistics, No. 4, 379–399.
Estevao, V. and Särndal, C-E. (2002). The ten cases of auxiliary information for calibration in two-phase sampling. Journal of Official Statistics, No. 2, 233–257.
Godambe, V.P. (1955). A unified theory of sampling from finite populations. Journal of the Royal Statistical Society, Series B, 269–278.
Hajek, J. (1959). Optimum strategy and other problems in probability sampling. Casopis pro Pestovani Matematiky, 387–421.
Hansen, M.H., Madow, W.G. and Tepping, B.J. (1983). An Evaluation of model-dependent and probability-sampling inferences in sample surveys. Journal of the American Statistical Association, 776–793.
Holmberg, A. and Swensson, B. (2001). On Pareto $\pi$ps sampling: Reflections on unequal probability sampling. Theory of Stochastic Processes, no. 1–2, 142–155.
Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 663–685.
Huddleston, H.F., Claypool, P.L. and Hocking, R.R. (1970). Optimum allocation to strata using convex programming. Applied Statistics, 273–278.
Hughes, E. and Rao, J.N.K. (1979). Some problems of optimal allocation in sample surveys involving inequality constraints. Communications in Statistics A, 1551–1574.
Isaki, C.T. (1970). Survey designs utilizing prior information. Doctoral thesis, Iowa State University, Ames, Iowa.
Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 89–96.
Kott, P.S. and Bailey, J.T. (2000). The Theory and Practice of Maximal Brewer Selection with Poisson PRN Sampling. Proceedings of the second International Conference on Establishment Surveys, June 17–21, 2000, Buffalo, 269–279.
Kröger, H., Särndal, C-E. and Teikari, I. (1999). Poisson Mixture Sampling: A family of designs for Coordinated Selection Using Permanent Random Numbers. Survey Methodology, No. 1, 3–11.
Kröger, H., Särndal, C-E. and Teikari, I. (2000). Poisson Mixture Sampling Combined with Order Sampling: a Novel use of the Permanent Random Number Technique. Manuscript submitted for publication (date 00/08/30). (Forthcoming in Journal of Official Statistics with the title Poisson Mixture Sampling Combined with Order Sampling.)
Laaksonen, S. (2002). Need for high level auxiliary data service for improving the quality of editing and imputation. Contributed paper of the UN/ECE working session in Helsinki, 27–29 May.
Lundström, S. and Särndal, C-E. (1999). Calibration as a standard method for treatment of nonresponse. Journal of Official Statistics, 305–327.
Montanari, G.E. (1987). Post-sampling efficient QR-prediction in large-scale surveys. International Statistical Review, 191–202.
Montanari, G.E. (1998). On Regression Estimation of Finite Population Means. Survey Methodology, No. 1, 69–77.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 558–625.
Ohlsson, E. (1990). Sequential Poisson Sampling from a Business Register and its Application to the Swedish Consumer Price Index. Statistics Sweden, R&D Report 1990:6.
Ohlsson, E. (1998). Sequential Poisson Sampling. Journal of Official Statistics, 135–158.
Rao, J.N.K. (1992). Estimating Totals and Distribution Functions Using Auxiliary Information at the Estimation Stage. Proceedings of Workshop on Uses of Auxiliary Information in Surveys, Statistics Sweden.
Rao, J.N.K. (1994). Estimating Totals and Distribution Functions Using Auxiliary Information at the Estimation Stage. Journal of Official Statistics, 153–165.
Rao, J.N.K. (1997). Developments in sample survey theory: an appraisal. Canadian Journal of Statistics, 1–21.
Rosén, B. (1997a). Asymptotic Theory for Order Sampling. Journal of Statistical Planning and Inference, 135–158.
Rosén, B. (1997b). On sampling with Probability Proportional to Size. Journal of Statistical Planning and Inference, 159–191.
Rosén, B. (2000). Generalized Regression Estimation and Pareto $\pi$ps. R&D Report 2000:5, Statistics Sweden.
Saavedra, P.J. (1995). Fixed Sample Size PPS Approximations with a Permanent Random Number. Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, American Statistical Association, 697–700.
Saavedra, P.J. (1999). Application of the Chromy Algorithm with Pareto Sampling. Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, American Statistical Association 1999, 355–358.
Särndal, C-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer, New York.
Sigman, R.S. and Monsour, N.J. (1995). Selecting Samples from List Frames of Businesses. In Cox, B.G., Binder, D.A., Chinnappa, N., Christianson, A., Colledge, M.J. and Kott, P.S. (eds.) Business Survey Methods. Wiley, New York, 153–169.
Wright, R.L. (1983). Finite Population Sampling with Multivariate Auxiliary Information. Journal of the American Statistical Association, 879–884.