This document is part of the Demographic and Health Survey’s DHS Toolkit of methodology
for the MEASURE DHS Phase III project, implemented from 2008-2013.
This publication was produced for review by the United States Agency for International
Development (USAID). It was prepared by MEASURE DHS/ICF International.
SAMPLING AND HOUSEHOLD
LISTING MANUAL
Demographic and Health
Surveys Methodology
Demographic and Health Survey
Sampling and Household Listing Manual
ICF International
Calverton, Maryland USA
September 2012
MEASURE DHS is a five-year project to assist institutions in collecting and analyzing data needed to
plan, monitor, and evaluate population, health, and nutrition programs. MEASURE DHS is funded by the
U.S. Agency for International Development (USAID). The project is implemented by ICF International in
Calverton, Maryland, in partnership with the Johns Hopkins Bloomberg School of Public Health/Center
for Communication Programs, the Program for Appropriate Technology in Health (PATH), Futures
Institute, Camris International, and Blue Raster.
The main objectives of the MEASURE DHS program are to: 1) provide improved information through
appropriate data collection, analysis, and evaluation; 2) improve coordination and partnerships in data
collection at the international and country levels; 3) increase host-country institutionalization of data
collection capacity; 4) improve data collection and analysis tools and methodologies; and 5) improve the
dissemination and utilization of data.
For information about the Demographic and Health Surveys (DHS) program, write to DHS, ICF
International, 11785 Beltsville Drive, Suite 300, Calverton, MD 20705, U.S.A. (Telephone:
301-572-0200; fax: 301-572-0999; e-mail: info@measuredhs.com; Internet: http://www.measuredhs.com).
Recommended citation:
ICF International. 2012. Demographic and Health Survey Sampling and Household Listing Manual.
MEASURE DHS, Calverton, Maryland, U.S.A.: ICF International
TABLE OF CONTENTS
TABLES AND FIGURES
1 DEMOGRAPHIC AND HEALTH SURVEYS SAMPLING POLICY
1.1 General principles
1.1.1 Existing sampling frame
1.1.2 Full coverage
1.1.3 Probability sampling
1.1.4 Suitable sample size
1.1.5 Simple design
1.1.6 Household listing and pre-selection of households
1.1.7 Good sample documentation
1.1.8 Confidentiality
1.1.9 Exactness of survey implementation
1.2 Survey objectives and target population
1.3 Survey domain
1.4 Sampling frame
1.4.1 Conventional sampling frame
1.4.2 Alternative sampling frames
1.4.3 Evaluation of the sampling frame
1.5 Stratification
1.6 Sample size
1.6.1 Sample size and sampling errors
1.6.2 Sample size determination
1.7 Sample allocation
1.8 Two-stage cluster sampling procedure
1.9 Sample “take” per cluster
1.9.1 Optimum sample take
1.9.2 Variable sample take for self-weighting
1.10 Household listing
1.11 Household selection in the central office
1.12 Household interviews
1.13 Sampling weight calculation
1.13.1 Why we need to weight the survey data
1.13.2 Design weights and sampling weights
1.13.3 How to calculate the design weights
1.13.4 Correction of unit non-response and calculation of sampling weights
1.13.5 Normalization of sampling weights
1.13.6 Standard weights for HIV testing
1.13.7 De-normalization of standard weights for pooled data
1.14 Calibration of sampling weights in case of bias
1.15 Data quality and sampling error reporting
1.16 Sample documentation
1.17 Confidentiality
2 HOUSEHOLD LISTING OPERATION
2.1 Introduction
2.2 Definition of terms
2.3 Responsibilities of the listing staff
2.4 Locating the cluster
2.5 Preparing location and sketch maps
2.6 Collecting a GPS waypoint for each cluster
2.7 Listing of households
2.8 Segmentation of large clusters
2.9 Quality control
2.10 Prepare the household listing forms for household selection
Appendix 2.1 Example listing forms
Appendix 2.2 Symbols for mapping and listing
Appendix 2.3 Examples of completed mapping and listing forms
3 SELECTED SAMPLING TECHNIQUES
3.1 Simple random sampling
3.2 Equal probability systematic sampling
3.2.1 Sampling theory
3.2.2 Excel templates for systematic sampling
3.3 Probability proportional to size sampling
3.3.1 Sampling theory
3.3.2 Operational description and examples
3.4 Complex sampling procedures
4 SURVEY ERRORS
4.1 Errors of coverage and non-response
4.1.1 Coverage errors
4.1.2 Deliberate restrictions of coverage
4.1.3 Non-response
4.1.4 Response rates
4.2 Sampling errors
5 SAMPLE DOCUMENTATION
5.1 Introduction
5.2 Sample design document
5.2.1 Introduction
5.2.2 Sampling frame
5.2.3 Structure of the sample and the sampling procedure
5.2.4 Selection probability and sampling weight
5.3 Sample file
5.4 Results of Survey implementation
5.5 Sampling errors
5.6 Sampling parameters in DHS data files
Glossary of terms
References
TABLES AND FIGURES
Table 1.1 Sample size determination for estimating current use of a modern contraceptive method among currently married women
Table 1.2 Sample size determination for estimating the prevalence of full vaccination coverage among children aged 12-23 months
Table 1.3 Sample allocation: Proportional allocation
Table 1.4 Sample allocation: Power allocation
Table 1.5 Optimal sample take for currently married women 15-49 currently using any contraceptive method based on intracluster correlation ρ and survey cost ratio c1/c2 from past surveys
Table 5.1 Distribution of EAs and average size of EA by region and by type of residence
Table 5.2 Distribution of households by region and by type of residence
Table 5.3 Sample allocation of clusters and households by region and by type of residence
Table 5.4 Expected number of interviews by region and by type of residence
Table 5.5 An example sample file
Table 5.6 Example table for the results of survey implementation
Table 5.7 Example appendix table for the results of the women’s survey implementation
Table 5.8 Example appendix table for the results of the men’s survey implementation
Table 5.9 Example table for sampling errors
Figure 3.1 Simple household selection with a sub-sample
Figure 3.2 Selection of runs with a sub-sample
Figure 3.3 Simple self-weighting selection without sample size control
Figure 3.4 Self-weighting selection with runs and without sample size control
Figure 3.5 Self-weighting selection with sample size control
Figure 3.6 Self-weighting selection with runs and with sample size control
Figure 3.7 Manual household selection in the field
Figure 3.8 Part of an Excel template for stratified sampling
Figure 3.9 Part of an example for a province crossed urban-rural stratified PPS sampling
Figure 3.10 Part of an example sample file from a stratified PPS sampling
1 DEMOGRAPHIC AND HEALTH SURVEYS SAMPLING POLICY
1.1 General principles
Scientific sample surveys are cost-efficient and reliable ways to collect population-level
information such as social, demographic and health data. The MEASURE DHS project is a worldwide
project implemented across various countries and at multiple points in time within a country. In order
to achieve comparability, consistency and the best quality in survey results, sampling activities in
the Demographic and Health Surveys (DHS) should be guided by a number of general principles. This
manual presents general guidelines on sampling for DHS surveys, although modifications may be
required for country-specific situations. The key principles of DHS sampling include:
Use of an existing sampling frame
Full coverage of the target population
Probability sampling
Using a suitable sample size
Using the simplest design possible
Conducting a household listing and pre-selection of households
Providing good sample documentation
Maintaining confidentiality of individuals’ information
Implementing the sample exactly as designed
1.1.1 Existing sampling frame
A probability sample can only be drawn from an existing sampling frame, that is, a complete
list of statistical units covering the target population. Since the construction of a new sampling frame
is likely to be too expensive, DHS surveys should use an adequate pre-existing sampling frame which
is officially recognized. This is possible for most of the countries where there has been a population
census in recent years. Census frames are generally the best available sampling frame in terms of
coverage, cartographic materials and organization. However, an evaluation of the quality and the
accessibility of the frame should be considered during the development of the survey design, and a
detailed study of the sampling frame is necessary before drawing the sample. In the absence of a
census frame, a DHS survey can use an alternative sampling frame, such as a complete list of villages
or communities in the country with all necessary identification information including a measure of
population size (e.g. number of households), or a master sample which is large enough to support the
DHS design.
1.1.2 Full coverage
A DHS survey should cover 100 percent of the target population in the country. The target
population for the DHS survey is all women age 15-49 and children under five years of age living in
residential households. Most surveys also include all men age 15-59 [1]. The target population may vary
from country to country or from survey to survey, but the general sampling principles are the same.
In some cases, exclusion of some areas may be necessary because of extreme inaccessibility, violence
or instability, but these issues need to be considered at the very beginning of the survey, before the
sample is drawn.
[1] The age range varies from survey to survey and may be 15-49, 15-54, 15-59 or 15-64.
1.1.3 Probability sampling
A scientific probability sampling methodology must be used in DHS surveys. A probability
sample is defined as one in which the units are selected randomly with known and nonzero
probabilities. This is the only way to obtain unbiased estimation and to be able to evaluate the
sampling errors. The term probability sampling excludes purposive sampling, quota sampling, and
other uncontrolled non-probability methods because they cannot provide evaluation of precision
and/or confidence of survey findings.
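As a minimal sketch of what "known and nonzero probabilities" means in practice, the Python example below draws a simple random sample from a hypothetical list of households and attaches to each selected unit its selection probability and the inverse of that probability as a weight. The function name, the toy population and the choice of simple random sampling are assumptions made for illustration; they are not the DHS selection procedure.

```python
import random

def srs_with_weights(units, n, seed=None):
    """Draw a simple random sample of n units without replacement.

    Every unit has the same known, nonzero selection probability n/N,
    so each selected unit carries a weight of N/n.
    Returns a list of (unit, probability, weight) tuples.
    """
    N = len(units)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the number of units")
    rng = random.Random(seed)
    prob = n / N
    sample = rng.sample(units, n)
    return [(unit, prob, 1 / prob) for unit in sample]

# Hypothetical population: number of members in each of 1,000 households.
population_rng = random.Random(1)
household_sizes = [population_rng.randint(1, 9) for _ in range(1000)]

selected = srs_with_weights(household_sizes, n=100, seed=42)

# Weighted estimate of the total number of people (value times weight).
estimated_total = sum(size * weight for size, _, weight in selected)
```

Because the probabilities are known, the weighted sum estimates the population total; with purposive or quota selection no such probabilities exist, so neither the weights nor the sampling error can be computed.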
1.1.4 Suitable sample size
Sample size is a key parameter for DHS surveys because it is directly related to survey
budget, data quality and survey precision. Theoretically, the larger the sample size, the better the
survey precision, but this is not always true in practice. Survey budget is not the only important factor
in determining the sample size. Desired precision, the number of domains, capability of the
implementing organization, data quality concerns and cost effectiveness are essential constraints in
determining the total sample size. Thus a suitable sample size is also a key parameter to guarantee
data quality.
1.1.5 Simple design
In large-scale surveys, non-sampling errors (coverage errors, errors committed in survey
implementation and data processing, etc.) are usually the most important sources of error and are
expensive to control and difficult to evaluate quantitatively. It is therefore important to minimize them
in survey implementation. In order to facilitate accurate implementation of the survey, the sampling
design for DHS should be as simple and straightforward as possible. Macro’s experience from 25 years
of DHS surveys shows that a two-stage household-based sample design is relatively easy to
implement and that quality can be maintained.
1.1.6 Household listing and pre-selection of households
The DHS standard procedure recommends that households be pre-selected in the central office
prior to the start of fieldwork rather than by teams in the field who may have pressures to bias the
selection. The interviewers are asked to interview only the pre-selected households. In order to
prevent bias, no changes or replacements are allowed in the field. To perform pre-selection of
households, a complete list of all residential households in each of the selected sample clusters is
necessary. This list is usually obtained from a household listing operation conducted before the main
survey.
In some surveys, the household listing operation may be combined with the main survey to
form a single field operation, and households can be selected in the field from a complete listing.
Combining the household listing and survey data collection in one field operation is less expensive;
however, it provides incentive to leave households off the household list to reduce workload, thus
reducing the representativeness of the survey results. Close supervision is needed during the field
work to prevent this problem. For this reason, separate listing and data collection operations are
required. Selection of households in the field by interviewers without a complete listing is not
acceptable for DHS surveys.
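The pre-selection step can be sketched in Python as follows, assuming that households are drawn from the complete cluster listing by equal probability systematic sampling with a fractional interval and a random start. The function name, the listing identifiers and the seed are hypothetical, and this is an illustrative sketch rather than the DHS selection template.

```python
import math
import random

def systematic_select(listing, n, seed=None):
    """Select n households from a complete listing by systematic sampling.

    Uses the fractional interval N/n and a random start in [0, interval),
    so every listed household has selection probability n/N.
    """
    N = len(listing)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the number of listed households")
    rng = random.Random(seed)
    interval = N / n
    start = rng.random() * interval  # random start in [0, interval)
    # The k-th selection point is start + k * interval; its floor is a
    # 0-based index into the listing (capped to guard against rounding).
    indexes = [min(math.floor(start + k * interval), N - 1) for k in range(n)]
    return [listing[i] for i in indexes]

# Hypothetical listing of 120 households in one cluster, in listing order.
listing = [f"HH{i:03d}" for i in range(1, 121)]
selected = systematic_select(listing, n=25, seed=2012)
```

Running the selection once in the central office and handing the resulting list to the field team is what removes the interviewers' discretion over which households to visit.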
1.1.7 Good sample documentation
DHS surveys are usually year-long projects conducted by different people specialized in
different aspects of survey implementation, so good sample documentation is necessary to guarantee
the exact implementation of the project. The sample documentation should include a sample design
document and the list of primary sampling units. The sample design document should explain in detail
the methodology, the sampling procedure, the sample size, the sample allocation, the survey domains
and the stratification. This should also form the basis for an appendix to the DHS final report
describing the sample design. The sample list should include all identification information for all of the
selected sample points, along with their probability of selection.
1.1.8 Confidentiality
Confidentiality is a major concern in DHS, especially when human bio-markers are collected
such as blood samples for HIV testing. The DHS surveys are anonymous surveys which do not allow
any potential identification of any single household or individual in the data file. Confidentiality is also
a key factor affecting the response rate to sensitive questions regarding sexual activity and partners.
In particular, in surveys that include HIV testing, DHS policy requires that PSU and household
codes be scrambled in the final data to further anonymize the data, and that the original sample list
be destroyed.
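This section does not specify how the codes are scrambled. As one hypothetical approach, the Python sketch below replaces each original PSU code with a code drawn from a random permutation; household codes could be handled in the same way within each PSU. The function name, the record layout and the example codes are assumptions for illustration only.

```python
import random

def scramble_codes(original_codes, seed=None):
    """Map each distinct original code to a new code from a random permutation.

    Returns a dict {original_code: scrambled_code}. For the scrambling to
    protect anonymity, this mapping must not be kept with the data.
    The seed argument exists only to make tests repeatable.
    """
    unique_codes = sorted(set(original_codes))
    new_codes = list(range(1, len(unique_codes) + 1))
    rng = random.SystemRandom() if seed is None else random.Random(seed)
    rng.shuffle(new_codes)
    return dict(zip(unique_codes, new_codes))

# Hypothetical records from a survey that includes HIV testing.
records = [
    {"psu": 1042, "hh": 7, "hiv_tested": True},
    {"psu": 1042, "hh": 15, "hiv_tested": False},
    {"psu": 2310, "hh": 3, "hiv_tested": True},
    {"psu": 3177, "hh": 21, "hiv_tested": True},
]

psu_map = scramble_codes(r["psu"] for r in records)
anonymized = [{**r, "psu": psu_map[r["psu"]]} for r in records]
```

Households from the same PSU still share a code after scrambling, so the cluster structure remains usable for analysis while the link to the original sample list is broken.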
1.1.9 Exactness of survey implementation
Exactness of sample implementation is the last element in achieving good sampling precision.
No matter how carefully a survey is designed and how complete the materials for conducting sampling
activities are, if the implementation of the sampling activities by sampling staff (office staff
responsible for selecting sample units, field workers responsible for the mapping and household listing
and interviewers responsible for data collection) is not performed exactly as designed, serious bias
and misleading results may occur.
In the sections that follow, DHS policies related to sample design and implementation are
described.
1.2 Survey objectives and target population
The main objective of DHS surveys is to collect up-to-date information on basic demographic
and health indicators, including housing characteristics, fertility, childhood mortality, contraceptive
knowledge and use, maternal and child health, nutritional status of mothers and children, knowledge,
attitudes and behavior toward HIV/AIDS and other sexually transmitted infections (STIs), and women’s
status. The target population for DHS is defined as all women of reproductive age (15-49 years old)
and their young children under five years of age living in ordinary residential households. However, in
some countries, the coverage may be restricted to ever-married women.
The main indicator topics include:
Total fertility and age specific fertility rates
Age at first sex, first birth, and first marriage
Knowledge and use of contraception
Unmet need for family planning
Birth spacing
Antenatal care
Place of delivery
Assistance from skilled personnel during delivery
Knowledge of HIV/AIDS and other STIs
Higher-risk sexual behavior
Condom use
Childhood vaccination coverage
Treatment of diarrhea, fever, and cough
Infant and under-five mortality rates
Nutritional status
Since the target population can be easily found in residential households, DHS is a household-
based survey.
1.3 Survey domain
In DHS surveys, an important objective is to compare the survey results for different
characteristics such as urban and rural residence, different administrative or geographic regions, or
different educational levels of respondents. A survey domain or study domain is a sub-population for
which separate estimation of the main indicators is required. There are two kinds of survey domains:
design domains and analysis domains. A design domain consists of a sub-population which can be
identified in the sampling frame and therefore can be handled independently in the sample size and
sampling procedures, usually consisting of geographic areas or administrative units. For example,
urban and rural differences are very frequently requested; therefore, urban and rural areas are usually
separate design domains for Demographic and Health Surveys. An analysis domain is a sub-population
which cannot be identified in the sampling frame, such as domains specified by individual
characteristics. These may include women with secondary or higher education, pregnant women,
children 12-23 months, and children having diarrhea in the two weeks preceding the survey.
In order for survey estimates to be reliable at the domain level, it is necessary to ensure that
the number of cases in each survey domain is sufficient, especially when desired levels of precision
are required for particular domains. For a design domain, adequate sample size is achieved by
allocating the target population at the survey design stage into the requested design domains, and
then calculating the sample size for the specific design domains by taking the precision required into
account. On the other hand, for an analysis domain, it is difficult to guarantee a specified precision
because it is difficult to control the sample size at the design stage. However, if prior estimates of the
average number of target individuals per household are available, then it is possible to control the
precision for an analysis domain. For example, if survey estimates are required for the nutritional
status of children under age 5 and estimates of the number of children under age 5 per
household are available, it is then possible to calculate a sample size to give a certain level of
precision.
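The arithmetic behind this example can be sketched as follows. Assuming the number of children required for the desired precision has already been worked out, dividing it by the expected number of eligible children per household gives the number of households to select. The function name, the optional response-rate adjustment and the figures used are illustrative assumptions, not values taken from this manual.

```python
import math

def households_needed(target_individuals, avg_per_household, response_rate=1.0):
    """Number of households to select to reach a target number of individuals.

    target_individuals: individuals required in the analysis domain
    avg_per_household:  prior estimate of eligible individuals per household
    response_rate:      expected overall response rate (assumed; 1.0 = none lost)
    """
    if avg_per_household <= 0:
        raise ValueError("avg_per_household must be positive")
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return math.ceil(target_individuals / (avg_per_household * response_rate))

# Hypothetical figures: 1,500 children under age 5 are required, households
# contain 0.8 such children on average, and 95 percent are expected to respond.
n_households = households_needed(1500, 0.8, 0.95)
```

With these hypothetical inputs the calculation gives 1,974 households; a lower average number of children per household raises the required number of households proportionally.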
DHS reports also produce some indicators for second level domains such as vaccination
coverage of children age 12-23 months within a region, where region is the first level domain, and
children 12-23 months is the second level domain. Attention must be paid to the precision required for
a second level domain because the second level domain usually includes a very small sub-population.
If domain-level estimates are required, it is better to avoid a large number of domains
because otherwise a very large sample size will be needed. The number of domains and the desired
level of precision for each must be taken into account in the budget calculation and assessment of the
implementation capabilities of the implementing organization. The total sample size needed is the sum
of sample sizes needed in all exclusive (first level) domains.
1.4 Sampling frame
A sampling frame is a complete list of all sampling units that entirely covers the target
population. The existence of a sampling frame allows a probability selection of sampling units. For a
multi-stage survey, a sampling frame should exist for each stage of selection. The sampling unit for
the first stage of selection is called the Primary Sampling Unit (PSU); the sampling unit for the second
stage of selection is called the Secondary Sampling Unit (SSU), and so on. In most cases, DHS
surveys are two-stage surveys. Note that each stage of sample selection will involve sampling errors,
so it is better to avoid more than two stages if additional stages of selection are not necessary.
The availability of a suitable sampling frame is a major determinant of the feasibility of
conducting a DHS survey. This issue should be addressed in the earliest stages of planning for a
survey. A sampling frame for a DHS survey could be an existing sampling frame, an existing master
sample, or a sample of a previously executed survey of sufficiently large sample size, which allows for
the selection of subsamples of desired size for the DHS survey.
1.4.1 Conventional sampling frame
The best frame is the list of Enumeration Areas (EAs) from a recently completed population
census. An EA is usually a geographic area which groups a number of households together for
convenient counting purposes for the census. A complete list of EAs which covers the survey area
entirely is the ideal frame for DHS surveys.
In most cases, a list of EAs from a recent census is available. This list should be thoroughly
evaluated before it is used. The sampling frame used for DHS should be as up-to-date as possible. It
should cover the whole survey area, without omission or overlap. Basic cartographic materials should
exist for each area unit or at least for groups of units with clearly defined boundaries. Each area unit
should have a unique identification code or a series of codes that, when combined, can serve as a
unique identification code. Each unit should have at least one measure of size estimate (population
and/or number of households). If other characteristics of the area units (e.g., socioeconomic level)
exist, they should be evaluated and retained as they may be used for stratification.
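Two of these requirements, a unique identification code (possibly formed by combining several codes) and a measure of size for every unit, lend themselves to an automated check. The Python sketch below assumes the frame is held as a list of records; the function name, field names and example values are hypothetical.

```python
from collections import Counter

def check_frame(frame, id_fields, size_field):
    """Report basic problems in a sampling frame held as a list of dicts.

    id_fields:  codes that together should identify each area unit uniquely
    size_field: the measure of size (e.g. number of households)
    """
    ids = [tuple(unit[f] for f in id_fields) for unit in frame]
    # Combined codes that occur more than once point to overlap or coding errors.
    duplicates = [code for code, count in Counter(ids).items() if count > 1]
    # Units whose measure of size is missing, non-numeric, or not positive.
    bad_size = [
        ids[k]
        for k, unit in enumerate(frame)
        if not isinstance(unit.get(size_field), (int, float)) or unit[size_field] <= 0
    ]
    return {"duplicate_ids": duplicates, "missing_or_invalid_size": bad_size}

# Hypothetical frame of enumeration areas.
frame = [
    {"region": 1, "district": 1, "ea": 1, "households": 120},
    {"region": 1, "district": 1, "ea": 2, "households": 95},
    {"region": 1, "district": 2, "ea": 1, "households": None},
    {"region": 1, "district": 1, "ea": 2, "households": 110},
]
report = check_frame(frame, ("region", "district", "ea"), "households")
```

Here the report flags the repeated code (1, 1, 2) and the unit (1, 2, 1) that lacks a measure of size. Omissions from the frame cannot be detected this way and need comparison with external sources.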
A pre-existing master sample (which is a random sample from the census frame) can be
accepted only where there is confidence in the master sample design, including detailed sampling
design parameters such as sampling method, stratification, and inclusion probability for the selected
primary sampling units. The task for the DHS survey is then to design a sub-sampling procedure,
which produces a sample in line with DHS requirements. This will not always be possible. However,
the larger the master sample is in relation to the desired DHS sub-sample, the more flexibility there
will be for developing a sub-sampling design. A key question with a pre-existing sample is whether the
listing of dwellings/households is still current or whether it needs to be updated. If updating is
required, use of a pre-existing sample may not be economical. The potential advantages of using a
pre-existing sample are: 1) economy, and 2) increased analytic power through comparative analysis
of two or more surveys. The disadvantages are: 1) the problem of adapting the sample to DHS
requirements, and 2) the problem of repeated interviews with the same household or person in
different surveys, resulting in respondent fatigue or contamination. One way to avoid this last problem
is to keep just the primary sampling units from the pre-existing sample and reselect the households
for the DHS survey.
1.4.2 Alternative sampling frames
When neither a census frame nor a master sample is available then alternative frames should
be considered. Examples of such frames are:
A list of electoral zones with estimated number of qualified voters for each zone
A gridded high resolution satellite map with estimated number of structures for each grid
A list of administrative units such as villages with estimated population for each unit
A main concern when using alternative frames is coverage: does the frame
completely cover the target population? Checking the quality of an alternative frame is usually more
difficult because of a lack of information either from the frame itself or from administrative sources.
Another problem is the size of the primary sampling unit. Since the alternative frame is not specifically
created for a population census or household based survey, the size of the PSUs of such frames may
be too large or too small for a DHS survey. A third problem is identifying the boundaries of the
sampling units due to the lack of cartographic materials.
In the first two examples of alternative sampling frames, the standard DHS two-stage
sampling procedure can be applied by treating the electoral zones or the grids of the satellite map as
the PSUs. The third case needs more care when only a list of administrative units larger than villages
(e.g., sub-districts, wards or communes) is available; for example, a complete list of all communes in
a country may be easier to get than a complete list of villages. It is then necessary to use a
selection procedure that includes more than two stages. In the first stage, select a number of
communes; in each of the selected communes, construct a complete list of all villages located in the
commune; select one village per commune as a DHS cluster, then proceed with the subsequent household
listing and selection as in a standard DHS. This procedure works best when the number of communes is
large and the commune size is small. A list of administrative units that are small in number but large
in size is not suitable for a DHS sampling frame because this situation will result in large sampling
errors, as explained later in Section 1.9.
1.4.3 Evaluation of the sampling frame
No matter what kind of sampling frame will be used, it is always necessary to check the
quality of the frame before selecting the sample. Following are several things that need to be checked
when using a conventional sampling frame:
• Coverage
• Distribution
• Identification and coding
• Measure of size
• Consistency
There are several easy but useful ways to check the quality of a sampling frame. For example,
for a census frame, compare the total population of the sampling frame and the population distribution
among urban and rural areas and among different regions/administrative units obtained from the
frame with those from the census report. Any important difference may indicate
coverage problems. If the frame provides information on population and households for each EA, then
the average number of household members can be calculated, and a check for extreme values can
help to find incorrect measures of size of the PSUs. If information on population by sex is available for
each EA, then a sex ratio can be calculated for each EA, and a check for extreme values can help to
identify non-residential EAs. If the EAs are associated with an identification (ID) code, then check the
ID codes to identify miscoded or misplaced EAs. A sampling frame with full coverage and of good
quality is the first element for a DHS survey; therefore, efforts should be made to guarantee a good
start for the project.
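The EA-level checks described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the record layout (`ea_id`, `population`, `households`, `males`, `females`), the example values, and the cutoffs used to define "extreme" are all hypothetical and would have to be chosen for the frame at hand.

```python
from collections import Counter

def check_frame(eas, hh_size_range=(2.0, 10.0), sex_ratio_range=(0.7, 1.4)):
    """Return (ea_id, problems) for every EA that fails a check.

    The cutoff ranges are illustrative assumptions, not DHS standards.
    """
    id_counts = Counter(ea["ea_id"] for ea in eas)
    flagged = []
    for ea in eas:
        problems = []
        # Identification and coding: the same ID code should not appear twice.
        if id_counts[ea["ea_id"]] > 1:
            problems.append("duplicate ID code")
        # Measure of size: average household size far from typical values.
        if ea["households"] <= 0:
            problems.append("no households")
        else:
            avg_size = ea["population"] / ea["households"]
            if not hh_size_range[0] <= avg_size <= hh_size_range[1]:
                problems.append("extreme average household size")
        # Sex ratio (males per female): extreme values suggest non-residential EAs.
        if ea["females"] <= 0:
            problems.append("extreme sex ratio")
        else:
            ratio = ea["males"] / ea["females"]
            if not sex_ratio_range[0] <= ratio <= sex_ratio_range[1]:
                problems.append("extreme sex ratio")
        if problems:
            flagged.append((ea["ea_id"], problems))
    return flagged

# Hypothetical frame extract.
eas = [
    {"ea_id": "010101", "population": 520, "households": 110, "males": 255, "females": 265},
    {"ea_id": "010102", "population": 900, "households": 30, "males": 450, "females": 450},
    {"ea_id": "010103", "population": 400, "households": 90, "males": 380, "females": 20},
    {"ea_id": "010102", "population": 480, "households": 100, "males": 236, "females": 244},
]
flagged = check_frame(eas)
for ea_id, problems in flagged:
    print(ea_id, problems)
```

Flagged EAs are candidates for review against cartographic or census material, not automatic exclusions.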
For a nationally representative survey, geographic coverage of the survey should include the
entire national territory unless there are strong reasons for excluding certain areas. If areas must be
excluded, they should constitute a coherent domain. A survey from which a number of scattered zones
have been excluded is difficult to interpret and to use.
1.5 Stratification
Stratification is the process by which the survey population is divided into subgroups or strata
that are as homogeneous as possible using certain criteria. Explicit stratification is the actual sorting
and separating of the units into specified strata. Within each stratum, the sample is designed and
selected independently. It is also possible to systematically sample units from an ordered list (with a
fixed sampling interval between selected units) to achieve the effect of stratification. For example, in
DHS surveys, it is not unusual for the PSUs within the explicit strata to be sorted geographically. This is
called implicit stratification.
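A minimal sketch of implicit stratification, assuming a hypothetical list of PSUs labelled by region within one explicit stratum: the list is sorted geographically and then sampled systematically with a fixed interval and a random start. The function name and the toy frame are illustrative assumptions only.

```python
import random
from collections import Counter

def systematic_sample(units, n, rng):
    """Equal-probability systematic sample of n units from an ordered list."""
    interval = len(units) / n            # fixed sampling interval (may be fractional)
    start = rng.random() * interval      # random start in [0, interval)
    last = len(units) - 1
    return [units[min(int(start + i * interval), last)] for i in range(n)]

# Hypothetical frame for one explicit stratum: (region, PSU number), in arbitrary order.
frame = ([("A", i) for i in range(6)]
         + [("B", i) for i in range(3)]
         + [("C", i) for i in range(3)])
rng = random.Random(2012)
rng.shuffle(frame)

# Implicit stratification: sort by geography first, then sample systematically.
ordered = sorted(frame)
sample = systematic_sample(ordered, 4, rng)
per_region = Counter(region for region, _ in sample)
print(per_region)
```

With 12 PSUs and a sample of 4, the interval is 3, so whatever the random start the sorted list yields two PSUs from region A and one each from B and C, that is, a sample spread across regions in proportion to their number of PSUs.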
The principal objective of stratification is to reduce sampling errors. In a stratified sample, the
sampling errors depend on the population variance existing within the strata but not between the
strata. For this reason, it pays to create strata with low internal variability (or high homogeneity).
Another major reason for stratification is that, where marked differences exist between subgroups of
the population (e.g., urban vs. rural areas), stratification allows for a flexible sample design that can
be different for each subgroup.
Stratification should be introduced only at the first stage of sampling. At the
dwelling/household selection stage, systematic sampling is used for convenience; however, no
attempt should be made to reorder the dwelling/household list before selection in the hope of
increasing the implicit stratification effect. Such efforts generally have a negligible effect.
Stratification can be single-level or multi-level. In single-level stratification, the population is
divided into strata according to certain criteria. In multi-level stratification, the population is divided
into first-level strata according to certain criteria, and then the first-level strata are subdivided into
second-level strata, and so on. A typical two-level stratification involves first stratifying the population
by region at the first level and then by urban-rural within each region. A DHS survey usually employs
multi-level stratification.
Strata should not be confused with survey domains. A survey domain is a population subgroup
for which separate survey estimates are desired (e.g., urban areas/rural areas). A stratum is a
subgroup of homogeneous units (e.g., subdivisions of an administrative region) in which the sample
may be designed differently and is selected separately. Survey domains and strata can be the same
but they need not be. For example, survey domains could be the first-level stratum in a multi-level
stratification. On the other hand, a survey domain could consist of one or several lower-level strata.
DHS surveys typically use explicit stratification by separating urban and rural residence within
each region. Where data are available, explicit stratification could also be done on the basis of socio-
economic zones or more directly relevant characteristics such as the level of female literacy or the
presence of health facilities in the areas. These kinds of information could be obtained from
administrative sources. Within each explicit stratum, the units can then be ordered according to
location, thus providing further implicit geographic stratification.
1.6 Sample size
1.6.1 Sample size and sampling errors
The estimates from a sample survey are affected by two types of errors: sampling errors and
non-sampling errors. Sampling errors are the representative errors due to sampling of a small number
of eligible units from the target population instead of including every eligible unit in the survey.
Sampling errors are related to the sample size and the variability among the sampling units. Sampling
errors can be statistically evaluated after the survey. Non-sampling errors result from problems during
data collection and data processing, such as failure to locate and interview the correct household,
misunderstanding of the questions on the part of either the interviewer or the respondent, and data
entry errors. Non-sampling errors are related to the capacity of the implementing organization, and
experience shows that (1) non-sampling errors are always the most important source of error in a
survey, and (2) it is difficult to evaluate the magnitude of non-sampling errors once a survey is
complete. Theoretically, with the same survey methodology and under the same survey conditions,
the larger the sample size, the better the survey precision. However, this relationship does not always
hold true in practice, because non-sampling errors tend to increase with survey scale and sample size.
The challenge in deciding on the sample size for a survey is to balance the demands of analysis and
precision with the capacity of the implementing organization and the constraints of funding.
A common measure of precision for estimating an indicator is its relative standard error (RSE)
which is defined as its standard error (SE) divided by the estimated value of the indicator. The
standard error of an estimator is the representative error due to sampling. The relative standard error
describes the amount of sampling error relative to the indicator level and is independent of the scale
of the indicator to be estimated; therefore, a unique RSE can be applied to a reference indicator for all
domains. If a unique RSE is desired for all domains, the domain sample size depends on the variability
and the size of the domain. The total sample size is the sum of the sample sizes over all domains for
which a desired precision is required. The following are some concepts related to sample size
calculation.
1. The standard error of an estimator when estimating a proportion with a simple random
sampling without replacement² is given by:
$$SE = \sqrt{\frac{1-f}{n} \times P(1-P) \times \frac{N}{N-1}}$$
where n is the sample size (number of completed interviews),
P is the proportion,
N is the target population size, and
f=n/N is the sampling fraction.
When N is large and n is relatively small, the above quantity can be approximated by:
$$SE \approx \sqrt{\frac{P(1-P)}{n}}$$
Therefore the RSE of the estimator is given by:
$$RSE(P) = \sqrt{\frac{P(1-P)}{n}} \Big/ P = \sqrt{\frac{1/P - 1}{n}}$$
2. For a required precision with a relative standard error $\alpha$, the net sample size (number of
completed interviews) needed for a simple random sampling is given by:
$$n = \frac{1/P - 1}{\alpha^2}$$
3. Since simple random sampling is not feasible for a DHS, the sample size for a complex
survey with clustering such as the DHS can be calculated by inflating the above calculated
sample size by using a design effect (Deft). Deft is a measure of efficiency of cluster sampling
compared to a direct simple random sampling of individuals, defined as the ratio between the
standard error using the given sample design and the standard error that would result if a
simple random sample had been used. A Deft value of 1.0 indicates that the sample design is
as efficient as a simple random sample, while a value greater than 1.0 indicates the increase
in the sampling error due to the use of a more complex and less statistically efficient design.
The net sample size needed for a cluster sampling with the same relative standard error is given
by:
$$n = Deft^2 \times \frac{1/P - 1}{\alpha^2}$$
² A simple random sample would be a random selection of individuals or households directly from the
target population. This is not feasible for DHS surveys because a list of all eligible individuals or
households is not available.
4. The formula for calculating the final sample size in terms of the number of households while
taking non-response into account (the formula used in the templates for sample size
calculation as shown in Table 1.1) is given by:
$$n = Deft^2 \times \frac{(1/P - 1)/\alpha^2}{R_i \times R_h \times d}$$
where n is the sample size in households;
Deft is the design effect (a default value of 1.5 is used for Deft if not specified);
P is the estimated proportion;
$\alpha$ is the desired relative standard error;
$R_i$ is the individual response rate;
$R_h$ is the household gross response rate; and
d is the number of eligible individuals per household.
The household gross response rate is the number of households interviewed over the number
selected. DHS reports typically report the net household response rate, which is the number of
households interviewed over the number of valid households found in the field (i.e., excluding
vacant and destroyed dwellings).
5. If the target population is small (such as in a sub-national survey), a finite population
correction of the above calculated sample size should be applied. The final sample size n is
calculated by
$$n = \frac{n_0}{1 + n_0/N}$$
where $n_0$ is the initial sample size calculated in point number 4, and N is the target population
size.
6. The relationship between the RSE and the sample size shows that, if one reduces a desired
RSE to half, then the sample size needed will increase 4 times. For example, the sample size
for a RSE of 5% is 4 times larger than the sample size for a RSE of 10% (see Tables 1.1 and
1.2 in the next section). This means that it is very expensive to reduce the RSE by increasing
the sample size. Therefore, when designing the sample size, the efficiency of the design must
be considered, that is, the balance between the gain in precision and the increase in sample
size (or survey cost).
7. The width of the confidence interval is determined by the RSE. With a confidence level of 95%,
2*P*RSE is the half-length of the confidence interval for P. For example, for RSE=0.10 and
P=0.20, the half-length of the confidence interval is 0.04, which means the confidence interval
for P is (0.16, 0.24). (DHS reports +/-2*SE instead of +/-1.96*SE as 95% confidence interval
for conservative purposes).
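The formulas in points 2 to 5 and point 7 can be put together in a short Python sketch. The function names are hypothetical, and the rounding rule (net sample size rounded to the nearest integer, household sample size rounded up) is an assumption chosen because it appears to match the rounded figures in Tables 1.1 and 1.2; it is not stated in the text.

```python
import math

def net_sample_size(p, rse, deft=1.5):
    """Completed interviews needed: n = Deft^2 * (1/P - 1) / alpha^2."""
    return deft ** 2 * (1.0 / p - 1.0) / rse ** 2

def household_sample_size(p, rse, deft=1.5, r_i=1.0, r_h=1.0, d=1.0):
    """Households to select, allowing for non-response and eligibles per household."""
    net = round(net_sample_size(p, rse, deft))   # assumed rounding rule
    return math.ceil(net / (r_i * r_h * d))      # assumed rounding rule

def finite_population_correction(n0, target_population):
    """Corrected sample size n = n0 / (1 + n0/N) for a small target population."""
    return n0 / (1.0 + n0 / target_population)

def confidence_limits(p, rse):
    """Approximate 95% limits, P +/- 2*SE with SE = P*RSE."""
    half_length = 2.0 * p * rse
    return p - half_length, p + half_length

# Inputs taken from Table 1.1: P=0.20, Deft=1.40, R_i=0.96, R_h=0.92, d=1.05.
net = round(net_sample_size(0.20, 0.10, deft=1.4))
households = household_sample_size(0.20, 0.10, deft=1.4, r_i=0.96, r_h=0.92, d=1.05)
lower, upper = confidence_limits(0.20, 0.10)
# Hypothetical small domain of 10,000 households, to illustrate point 5.
corrected = finite_population_correction(households, 10000)
print(net, households, round(lower, 3), round(upper, 3), round(corrected))
```

Halving `rse` in these calls multiplies the net sample size by four, which is the relationship described in point 6.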
1.6.2 Sample size determination
The total sample size for a DHS survey with a number of survey domains (design domain) is
the sum of the sample sizes over all domains. An appropriate sample size for a survey domain is the
minimum number of persons (e.g., women age 15-49, currently married women 15-49, children under
age five) that achieves the desired survey precision for core indicators at the domain level. If funding
is tight and fixed, the sample size is the maximum number of persons that the funding can cover.
Precision at the national level is usually not a problem. In almost all cases, sample size is decided to
guarantee precision at domain level with appropriate allocation of the sample. So apart from survey
costs, the total sample size depends on the desired precision at domain level and the number of
domains. If a reasonable precision is required at domain level, experience from the MEASURE DHS
program shows that a minimum number of 800 completed interviews with women is necessary for
some of the woman-based indicators for high fertility countries (e.g. total fertility rate, contraceptive
prevalence rate, childhood mortality rates); for low fertility countries, the minimum domain sample
size can reach 1,000 completed interviews or more. Table 1.1 below illustrates the calculation of
sample size for a domain according to different levels of desired RSE for estimating the indicator “the
proportion of currently married women who are current users of a modern contraceptive method”.
Table 1.1 Sample size determination for estimating current use of a modern contraceptive method
among currently married women

Estimated proportion p           0.20    Total target population
Estimated design effect (Deft)   1.40    # of target individuals/HH   1.05
Individual response rate         0.96    HH gross response rate       0.92

Desired   Net sample size   Sample size    Expected   95% confidence limits
RSE       (individuals)     (households)   SE         Lower      Upper
0.20      196               212            0.040      0.120      0.280
0.19      217               234            0.038      0.124      0.276
0.18      242               261            0.036      0.128      0.272
0.17      271               293            0.034      0.132      0.268
0.16      306               330            0.032      0.136      0.264
0.15      348               376            0.030      0.140      0.260
0.14      400               432            0.028      0.144      0.256
0.13      464               501            0.026      0.148      0.252
0.12      544               587            0.024      0.152      0.248
0.11      648               699            0.022      0.156      0.244
0.10      784               846            0.020      0.160      0.240
0.05      3136              3382           0.010      0.180      0.220

Note: The confidence limits are calculated as P±2*SE.
Assuming the domain size is large enough such that the finite population correction is
negligible, Table 1.1 gives the required gross sample size in terms of number of households with
estimated parameters from a DHS survey. The target population is currently married women age 15-
49; the estimated parameters are:
• the proportion of currently married women who are current users of any modern
  contraceptive method,
• the design effect (Deft),
• the number of target individuals (number of currently married women 15-49) per
  household,
• the individual and the household response rates.
For example, with an estimated prevalence of 20%, if we require a RSE of 10%, we should
select 846 households in this particular domain. With a gross household response rate (the number of
households completed over the total number selected) of 92% and an individual response rate of
96%, we expect to obtain 784 completed interviews of currently married women age 15-49.
The estimated quantities at the top of the table used as input to the calculation can usually be
obtained from previous surveys or from administrative records. The total sample size for a survey with
several domains is the sum of the sample sizes obtained in the above table for each domain. If the
same precision required and the same indicator level apply to all domains, then the total sample size
is the sample size calculated for one domain multiplied by the number of domains. With this example,
the total sample size for a survey having six domains with approximately the same level of modern
contraceptive use among currently married women and the same precision request for each domain
would be 5076 households. The “Sample size determination” template located in the Appendix can be
used to determine required sample sizes.
Table 1.2 Sample size determination for estimating the prevalence of full vaccination coverage
among children aged 12-23 months

Estimated proportion p           0.29    Total target population
Estimated design effect (Deft)   1.22    # of target individuals/HH   0.11
Individual response rate         0.96    HH gross response rate       0.92

Desired   Net sample size   Sample size    Expected   95% confidence limits
RSE       (individuals)     (households)   SE         Lower      Upper
0.20      91                937            0.058      0.174      0.406
0.19      101               1040           0.055      0.180      0.400
0.18      112               1153           0.052      0.185      0.395
0.17      126               1297           0.049      0.191      0.389
0.16      142               1462           0.046      0.197      0.383
0.15      162               1668           0.043      0.203      0.377
0.14      186               1915           0.041      0.209      0.371
0.13      216               2224           0.038      0.215      0.365
0.12      253               2605           0.035      0.220      0.360
0.11      301               3099           0.032      0.226      0.354
0.10      364               3747           0.029      0.232      0.348
0.05      1458              15008          0.014      0.261      0.319

Note: The default value of Deft is set to be 1.5. Specify if different.
The confidence limits are calculated as P±2*SE.
If response rate is not provided, the sample size calculated is net sample size.
Table 1.2 shows a similar example for the indicator “proportion of children aged 12-23 months
who are fully immunized”. In this case, the target population is children aged 12-23 months. The
estimated number of target individuals per household is much smaller than the number of currently
married women per household given in Table 1.1. So for the same sample size calculated in Table 1.1,
we can only get an RSE of above 20% at domain level. With an RSE of 10%, we need to select 3747
households in this particular domain, which seems unrealistic if we have several domains for the
survey.
This example shows that for a multi-indicator survey, the sample size required can be very
different from indicator to indicator. So the choice of the reference indicator upon which the sample
size is calculated is an important issue. The reference indicator which is used for sample size
determination should have demographic importance, moderate value and moderate population
coverage, i.e. apply to a sizable proportion of the population. With the same sample size calculated in
Table 1.1 for a survey having six domains, the RSE for the whole sample for estimating full
immunization among children 12-23 months is between 8% and 9%.
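The figures quoted above can be checked by inverting the household sample size formula of Section 1.6.1 to obtain the RSE achieved by a given number of households. The sketch below uses the inputs of Table 1.2 and the household sample sizes of Table 1.1; the function name is hypothetical, and applying the same Deft to the whole sample as to a single domain is a simplifying assumption.

```python
import math

def achieved_rse(households, p, deft, r_i, r_h, d):
    """RSE obtained from a given number of selected households."""
    net = households * r_i * r_h * d            # expected completed interviews
    return deft * math.sqrt((1.0 / p - 1.0) / net)

# Table 1.2 inputs for full immunization among children 12-23 months.
inputs = dict(p=0.29, deft=1.22, r_i=0.96, r_h=0.92, d=0.11)

rse_one_domain = achieved_rse(846, **inputs)        # one domain of 846 households
rse_six_domains = achieved_rse(6 * 846, **inputs)   # whole sample of 5076 households
print(round(rse_one_domain, 3), round(rse_six_domains, 3))
```

Under these assumptions the domain-level RSE comes out slightly above 20% and the RSE for the whole sample between 8% and 9%, in line with the statements in the text.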
The domain sample sizes often need to be balanced between domains due to budget
constraints. In practice it is often the case that the total sample size is fixed according to funding
available and implementation capacity, and then the sample is allocated to each domain and to each
stratum within the domain. In the case of very tight budget constraints, we may equally allocate the
total sample to the domains. In some cases, we may want to oversample a specific domain to conduct
some in-depth analysis for a certain rare phenomenon. The method (and the tables) presented in the
following section may be used to allocate the sample at the domain level because the domains are
usually first-level strata. Regardless of the method used for allocation, the calculation of domain
sample size can give us an idea about the precision we may achieve in each domain with a given
sample size.
1.7 Sample allocation
In cases where the total sample size or domain sample size has been fixed, we need to
appropriately allocate the sample to different domains (or different strata within a domain). This
allocation is aimed at strengthening the sampling efficiency at the national level or domain level and
reducing sampling errors. Assuming a constant cost across domains/strata, the optimum allocation of
the sample depends on the size of the domain/stratum $N_h$ and the variability of the indicator to be
estimated $S_{xh}$:
$$n_h \propto N_h S_{xh}$$
For a given total sample size n the optimum allocation for variable x is given by:
$$n_h = n \times \frac{N_h S_{xh}}{\sum_{h=1}^{H} N_h S_{xh}}$$
The optimum allocation is only optimal for the indicator on which the allocation is based; that
allocation may not be appropriate for other indicators. For a multipurpose survey, if the
domains/strata are not too different in size, a safe allocation that is good for all indicators is a
proportional allocation, with sample size proportional to the domain/stratum size.
$$n_h = n \times \frac{N_h}{\sum_{h=1}^{H} N_h} = n \times \frac{N_h}{N}$$
This allocation introduces a constant sampling fraction across domains/strata with:
$$f_h = \frac{n_h}{N_h} = \frac{n}{N}$$
Because DHS surveys are multipurpose surveys, a proportional allocation of sample is
recommended if the domains/strata are not too different in size. However, if the domains/strata sizes
are very different, the smaller domains/strata may receive a very small sample size.
If a desired precision is required at domain/stratum level, by assuming equal relative
variations across strata, a power allocation (Bankier, 1988) with an appropriate power value
$\alpha$ ($0 \le \alpha \le 1$) may be used to guarantee sufficient sample size in small domains/strata.
$$n_h = n \times \frac{M_h^{\alpha}}{\sum_{h=1}^{H} M_h^{\alpha}}$$
A power allocation is an allocation proportional to the power of a size measure M. A power
value of 1 gives proportional allocation; a power value of 0 gives equal size allocation; a power value
between 0 and 1 gives an allocation between proportional allocation and equal size allocation.
Proportional allocation is good for national level indicators, but may not meet the precision request at
domain level; while an equal size allocation is good for comparison across domains, but may affect the
precision at national level. A power allocation with power values between 0 and 1 is a tradeoff
between the national level precision and the domain level precision. Since the sample size is usually
large at the national level, the national level precision is not a concern.
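The allocation formulas of this section can be sketched as follows. The function names are hypothetical; the domain sizes are the rounded proportions shown in Tables 1.3 and 1.4, so the unrounded results printed here may differ slightly from the published tables, which appear to be based on more precise inputs.

```python
def optimum_allocation(n, sizes, std_devs):
    """Allocate n proportionally to N_h * S_xh (constant cost across strata)."""
    weights = [size * sd for size, sd in zip(sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]

def power_allocation(n, sizes, power):
    """Allocate n proportionally to M_h ** power, with 0 <= power <= 1."""
    weights = [size ** power for size in sizes]
    total = sum(weights)
    return [n * w / total for w in weights]

# Domain sizes (share of the target population) as shown in Tables 1.3 and 1.4.
sizes = [0.072, 0.026, 0.070, 0.142, 0.060, 0.046, 0.048, 0.094, 0.164, 0.091, 0.187]
n = 15000

proportional = power_allocation(n, sizes, 1.0)   # power 1: proportional allocation
equal = power_allocation(n, sizes, 0.0)          # power 0: equal size allocation
compromise = power_allocation(n, sizes, 0.25)    # between the two extremes

for name, alloc in [("proportional", proportional), ("equal", equal),
                    ("power 0.25", compromise)]:
    print(name, round(min(alloc)), round(max(alloc)))
```

With equal standard deviations in every stratum, `optimum_allocation` reduces to the proportional allocation, which is one way to see why proportional allocation is a reasonable default for a multipurpose survey.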
In Table 1.3 below, we give an example of a proportional sample allocation of 15,000
individuals to 11 domains and to their urban-rural areas. The minimum domain sample size is 384 for
domain 2, which is too small for estimating the total fertility rate (TFR) and childhood mortality rates.
The largest sample size is for domain 11 which may be unnecessarily large. The actual total sample
size given in the total row may be slightly different from the desired sample size because of rounding.
Table 1.3 Sample allocation: Proportional allocation

Total sample size => 15000    Power value domain =>    Power value urban =>

Serial   Domain/Stratum   Domain/stratum   Proportion   Sample allocation          Specific allocation
num      name/ID          size             urban        Urban    Rural    Domain   Urban    Rural
1        Domain 1         0.072            0.352        382      701      1083
2        Domain 2         0.026            0.317        122      262      384
3        Domain 3         0.070            0.568        597      454      1051
4        Domain 4         0.142            0.275        586      1544     2130
5        Domain 5         0.060            0.323        292      611      903
6        Domain 6         0.046            0.135        92       593      685
7        Domain 7         0.048            0.194        141      586      727
8        Domain 8         0.094            0.251        354      1055     1409
9        Domain 9         0.164            0.288        709      1749     2458
10       Domain 10        0.091            0.191        262      1104     1366
11       Domain 11        0.187            1.000        2803     0        2803
         Total            1.000            0.423        6339     8660     14999

If we impose a condition such that the sample size should not be smaller than 1000 in each
domain, after trying various power values, we find that a power value of 0.25 is appropriate, as shown
in Table 1.4. In this case, we would have a minimum sample size of 1,022 for domain 2. Since domain
11 has only urban areas, the power allocation among the domains brought down the urban percentage
in the sample. In order for urban areas to be properly represented, oversampling is applied in the
urban areas of the other domains. With a power value of 0.65, the urban proportion in the sample is
close to the proportion of the target population.
Table 1.4 Sample allocation: Power allocation

Total sample size => 15000    Power value domain => 0.25    Power value urban => 0.65

Serial   Domain/Stratum   Domain/stratum   Proportion   Sample allocation          Specific allocation
num      name/ID          size             urban        Urban    Rural    Domain   Urban    Rural
1        Domain 1         0.072            0.352        533      791      1324
2        Domain 2         0.026            0.317        386      636      1022
3        Domain 3         0.070            0.568        716      599      1315
4        Domain 4         0.142            0.275        546      1023     1569
5        Domain 5         0.060            0.323        484      782      1266
6        Domain 6         0.046            0.135        271      910      1181
7        Domain 7         0.048            0.194        341      858      1199
8        Domain 8         0.094            0.251        466      949      1415
9        Domain 9         0.164            0.288        581      1045     1626
10       Domain 10        0.091            0.191        395      1009     1404
11       Domain 11        0.187            1.000        1680     0        1680
         Total            1.000            0.423        6399     8602     15001

In Table 1.4, the small domains are oversampled compared with a proportional allocation.
Oversampling some small domains is frequently practiced if domain level precision is required.
However, oversampling a small domain too much will harm the precision at national level. To prevent
this, it is recommended to regroup the small domains to form domains of moderate size, especially
when there is a very unequal population distribution among geographic domains. However, this is
sometimes not possible due to political considerations.
The above discussion also applies to sample size allocation to strata within a domain where
the domain sample size is fixed. A proportional allocation with sample size proportional to stratum size
is good for all indicators and provides the best precision for the domain as a whole.
1.8 Two-stage cluster sampling procedure
The MEASURE DHS program utilizes a convenient and practical sample selection procedure for
household based surveys developed on the basis of experience from past surveys—a two-stage cluster
sampling procedure. A cluster is a group of adjacent households which serves as the PSU for field
work efficiency. Interviewing a certain number of households in the same cluster can reduce greatly
the amount of travel and time needed during data collection. In most cases, a cluster is an EA with a
measure of size equal to the number of households or the population in the EA, provided by the
population census.
At the first stage, a stratified sample of EAs is selected with probability proportional to size
(PPS): in each stratum, a sample of a predetermined number of EAs is selected independently with
probability proportional to the EA’s measure of size. In the selected EAs, a listing procedure is
performed such that all dwellings/households are listed. This procedure is important for correcting
errors existing in the sampling frame, and it provides a sampling frame for household selection.
At the second stage, after a complete household listing is conducted in each of the selected
EAs, a fixed (or variable) number of households is selected by equal probability systematic sampling in
the selected EAs. In each selected household, a household questionnaire is completed to identify
women age 15-49, men age 15-59 (15-54 or 15-49 in some surveys) and children under age five.
Every eligible woman will be interviewed with an individual questionnaire, and every eligible man will
be interviewed with an individual men’s questionnaire in those households selected for the men’s
interview.
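The two stages described above can be sketched in Python as follows. This is a simplified illustration, not the DHS selection software: the function names, the EA sizes, and the household listings are hypothetical, systematic PPS selection is assumed for the first stage, and the listing is assumed to find exactly the number of households given in the frame.

```python
import random

def pps_systematic(sizes, n, rng):
    """First stage: select n units with probability proportional to size.

    Returns unit indices; a unit larger than the interval can be hit twice.
    """
    interval = sum(sizes) / n
    start = rng.random() * interval
    selected, cumulated, idx = [], 0.0, 0
    for i in range(n):
        target = start + i * interval
        # Move forward until the cumulated size range of unit idx contains target.
        while idx < len(sizes) - 1 and cumulated + sizes[idx] <= target:
            cumulated += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected

def systematic(items, m, rng):
    """Second stage: equal-probability systematic sample of m listed households."""
    interval = len(items) / m
    start = rng.random() * interval
    last = len(items) - 1
    return [items[min(int(start + i * interval), last)] for i in range(m)]

# Hypothetical stratum of 10 EAs; measure of size = households in the frame.
ea_sizes = [120, 250, 90, 310, 180, 220, 150, 275, 200, 205]
rng = random.Random(2012)

selected_eas = pps_systematic(ea_sizes, 3, rng)
sample = {}
for ea in selected_eas:
    listing = list(range(1, ea_sizes[ea] + 1))   # household serial numbers from listing
    sample[ea] = systematic(listing, 25, rng)    # fixed take of 25 households
print(selected_eas, {ea: len(hh) for ea, hh in sample.items()})
```

In practice the number of households found during listing usually differs from the frame count, which is one reason the listing operation is carried out before household selection.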
The advantages of this two-stage cluster sampling procedure can be summarized as follows:
1) It guarantees a representative sample of the target population when a list of all target
individuals is not available which prohibits a direct sampling of target individuals;
2) A household listing procedure after the selection of the first stage and before the main
survey provides a sampling frame for household selection in the central office;
3) The use of residential households as the second-stage sampling unit guarantees the best
coverage of the target population; and
4) It reduces unnecessary sampling errors by avoiding more than two stages of selection
(which usually uses a large PSU in the first stage of selection).
See more details in Sections 1.10 and 1.11 on household listing and selection, Chapter 2 on
household listing, and Sections 3.2 and 3.3 of Chapter 3 on systematic sampling and sampling with
probability proportional to size (PPS).
1.9 Sample “take” per cluster
Once the total sample size is determined and allocated to different survey domains/strata, the
next step is to decide how many individuals (the sample take) should be interviewed per sample cluster
and then to convert the domain/stratum sample size into a number of clusters. Since the survey cost can be very
different across the survey domains/strata, the sample take can have a big influence on the total
survey budget. With a fixed sample size, a small sample take is good for survey precision because of
the reduction of the design effect, but is expensive because more clusters are needed. The number of
clusters affects the survey budget more than the overall sample size due to the travel between
clusters during data collection, which represents an important part of field costs in rural areas. The
MEASURE DHS program proposes a sample “take” of about 25-30 women per rural cluster. In urban
areas, the cost advantage of a large “take” is generally smaller, and MEASURE DHS recommends a
“take” of about 20-25 women per urban cluster. Since in most DHS surveys, the number of eligible
women age 15-49 is very close to one per household, the sample take of individuals is equivalent to
the sample take of households; therefore, in the following sections we refer to the sample take (or
cluster take) as the number of sample households per cluster.
1.9.1 Optimum sample take
The optimum number of households to be selected per cluster depends on the variable under
consideration, the intracluster correlation $\rho$, and the survey cost ratio $c_1/c_2$, where $c_1$
represents the cost per cluster including mainly the cost associated with travelling between the
clusters for survey implementation (household listing and interview); while $c_2$ represents the cost
per individual interview (the interviewing cost) and other costs of doing fieldwork within a cluster. A
larger sample take per cluster and fewer clusters reduces survey field costs if the cost ratio is high,
but it could also reduce the survey precision if the intracluster correlation is strong.
The MEASURE DHS Program has accumulated information on sampling errors for selected
variables for many surveys throughout the world. Using this information, Aliaga and Ren (2006)
conducted a research study to determine the optimum sample take per cluster. The results of the
study have informed current practice in DHS surveys. If the average cluster size is around 250
households, a sample take of 20-30 households per cluster is within the acceptable range in most
surveys. The research also supports the practice of setting a larger sample take in rural clusters than
in urban clusters. Usually, the cost ratio in urban areas is smaller than that in rural areas. This would
lead to a smaller sample take in an urban cluster than in a rural cluster. In sum, this research
indicates that for the most important survey indicators, a sample take of 20 to 25 households is
appropriate in urban clusters and a sample take of 25 to 30 households is appropriate in rural
clusters.
Based on values of $c_1/c_2$ and $\rho$ obtained from eight surveys, Table 1.5 below shows optimal
sample takes for the indicator “proportion of currently married women 15-49 currently using any
contraceptive method.” This indicator has a moderate intracluster correlation relative to other
important survey indicators.
Table 1.5 Optimal sample take for currently married women 15-49 currently using any
contraceptive method based on intracluster correlation $\rho$ and survey cost ratio $c_1/c_2$ from
past surveys
Country      Survey cost ratio     Intracluster          Optimal
             $c_1/c_2$             correlation $\rho$    sample take
Country 1    10                    0.025                 20
Country 2    10                    0.037                 16
Country 3    12                    0.067                 13
Country 4    12                    0.052                 15
Country 5    15                    0.084                 13
Country 6    27                    0.031                 29
Country 7    48                    0.058                 28
Country 8    52                    0.023                 47
Average      23                    0.047                 23
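The country rows of Table 1.5 can be reproduced with a short calculation. The sketch below assumes the usual cost-model optimum, $m_{opt} = \sqrt{(c_1/c_2)(1-\rho)/\rho}$; this excerpt does not state the formula, so it is an assumption here, although it matches every country row after rounding. The function name `optimal_sample_take` is hypothetical.

```python
import math

def optimal_sample_take(cost_ratio, rho):
    """Optimum number of households per cluster.

    Assumes the standard cost-model optimum sqrt((c1/c2) * (1 - rho) / rho);
    cost_ratio is c1/c2 and rho is the intracluster correlation.
    """
    if cost_ratio <= 0 or not 0 < rho < 1:
        raise ValueError("cost_ratio must be positive and rho must lie in (0, 1)")
    return math.sqrt(cost_ratio * (1.0 - rho) / rho)

# (country, cost ratio c1/c2, intracluster correlation) taken from Table 1.5
TABLE_1_5 = [
    ("Country 1", 10, 0.025),
    ("Country 2", 10, 0.037),
    ("Country 3", 12, 0.067),
    ("Country 4", 12, 0.052),
    ("Country 5", 15, 0.084),
    ("Country 6", 27, 0.031),
    ("Country 7", 48, 0.058),
    ("Country 8", 52, 0.023),
]

for country, ratio, rho in TABLE_1_5:
    print(country, round(optimal_sample_take(ratio, rho)))
```

The "Average" row of the table appears to be the column-wise mean of the eight country rows rather than the formula applied to the average cost ratio and average correlation.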
1.9.2 Variable sample take for self-weighting
A fixed sample take per cluster is easy for survey management and implementation, but it
requires sampling weights that vary within a stratum. Different sampling weights result in larger
sampling errors compared with a similar sample of constant weight within a sampling stratum, i.e., a
self-weighting sample. A self-weighting sample consists of a sample of individuals in which each
individual has the same probability of being selected, and therefore a constant sampling weight is
used. In some cases a self-weighting sample is preferred for various reasons:
1) it is equally representative for every individual of the target population;
2) it reduces sampling errors.
Since the sample for DHS surveys is usually the result of a two-stage cluster sampling design,
it is necessary to coordinate the sample take for each of the selected clusters. In an overall self-
weighting sample, every individual in the target population has an equal probability of selection, which
results in a proportional allocation. However, proportional allocation is not feasible when sampling
domains are very different in size. Self-weighting at domain/stratum level, by contrast, is easy to
achieve.
Let $n$ be the total number of clusters selected for a DHS survey, let $n_h$ be the number of
clusters allocated to the $h$-th stratum; let $X_h$ be the total number of households in the stratum $h$, let
$x_{hk}$ be the number of households in cluster $k$ of stratum $h$, given by the sampling frame; then the
selection probability of cluster $k$ in stratum $h$ is given by:
$$\pi_{hk} = \frac{n_h x_{hk}}{X_h}$$
Let $x^*_{hk}$ be the number of households listed in the cluster in the household listing operation, let
$m_h$ be the number of households to be selected from the cluster for a fixed sample take, then the
overall selection probability of a household in the cluster is given by:
$$f_{hk} = \pi_{hk} \times \frac{m_h}{x^*_{hk}} = \frac{n_h x_{hk}}{X_h} \times \frac{m_h}{x^*_{hk}}$$
If $x^*_{hk} = x_{hk}$ exactly for all $k$ in stratum $h$, then it is easy to see that self-weighting is achieved
in stratum $h$ by a constant sample take $m_h$ in all clusters since
$$f_h = \frac{n_h m_h}{X_h}$$
is a constant in stratum $h$.
In practice, it is not possible that $x^*_{hk} = x_{hk}$ for all $h$ and $k$, especially when the last population
census is no longer new. Therefore there is a need for sample coordination in order to achieve
self-weighting. Let $f_h$ and $m_h$ be the calculated sampling fraction and average sample take in stratum $h$
according to the sample allocation with $m_h = f_h X_h / n_h$; the number of households needed to achieve
self-weighting in cluster $k$ of stratum $h$ is given by
$$m_{hk} = \frac{f_h X_h}{n_h} \times \frac{x^*_{hk}}{x_{hk}} = m_h \times \frac{x^*_{hk}}{x_{hk}}$$
which is a function of the ratio of the number of households listed over the number of households
given in the sampling frame for every cluster: take more if more are listed or take fewer if fewer are
listed. The above formula also shows that the sampling fraction is not a necessary parameter for
sample take calculation. Using the designed average sample take is a more direct method because the
sampling fraction is an abstract number. This formula is used in the self-weighting household selection
templates presented in Chapter 3, Section 3.2. The relationship between the sample take and the
cluster selection probability is given by
$$m_{hk} = \frac{f_h x^*_{hk}}{\pi_{hk}}$$
For practical considerations, the sample take calculated above needs to be adjusted if it is too
small or too large. Usually, we apply a cut-off to control the sample take within the range of a
minimum of 10 households and a maximum of 50 households per cluster. For the clusters where the
cut-off is applied, the sample is no longer self-weighting.
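The self-weighting sample take and the cut-off rule can be sketched as follows. This is a minimal illustration, not the DHS Excel template: the function name and argument names are hypothetical, the limits of 10 and 50 come from the text above, and rounding to the nearest integer is an assumption.

```python
def self_weighting_take(avg_take, listed, frame, min_take=10, max_take=50):
    """Return (sample take, cut_off_applied) for one cluster.

    avg_take : designed average sample take in the stratum (m_h)
    listed   : households found in the listing operation (x*_hk)
    frame    : households recorded in the sampling frame (x_hk)
    """
    if frame <= 0 or listed < 0:
        raise ValueError("frame must be positive and listed non-negative")
    # m_hk = m_h * x*_hk / x_hk, rounded to a whole number of households
    take = round(avg_take * listed / frame)
    # Apply the cut-off; clusters where it bites are no longer self-weighting
    clipped = min(max(take, min_take), max_take)
    return clipped, clipped != take

print(self_weighting_take(25, listed=240, frame=200))  # cluster grew since the census
print(self_weighting_take(25, listed=60, frame=200))   # cluster shrank: lower cut-off
print(self_weighting_take(25, listed=500, frame=200))  # cluster grew a lot: upper cut-off
```

The second value in each result flags the clusters where the cut-off was applied.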
The advantages and disadvantages of a self-weighting sample can be summarized as:
Advantages:
1) Equally representative for every individual within a sampling stratum.
2) Reduced sampling errors.
Disadvantages:
1) Difficult for survey management (for example, to distribute the work-load) because the
sample take varies by cluster.
2) Difficult to control the expected sample size because of possible cut-offs, especially when
the upper limit cut-offs are employed.
3) The self-weighting is not exact because the sample takes are rounded, which
introduces bias into the survey estimates.
4) Self-weighting at the national level will break down the specific sample allocation at the
domain/stratum level and bring the sample allocation back to a proportional allocation.
It is possible to overcome the second and the third disadvantages through a recursive
calculation of sample take by re-distributing the cut-offs to the rest of the clusters in the stratum or
control area, and by using a randomized sample take which allows non-integer numbers as sample
size. Excel templates for both the traditional procedure and revised procedure are available.
1.10 Household listing
The household listing operation is a fundamental operation in DHS surveys. After the EAs are
selected for the survey, a complete listing of dwelling units/households in the selected EAs is
conducted prior to the selection of households. The listing operation consists of visiting each of the
selected clusters, collecting geographic coordinates of the cluster, drawing a location map of the
cluster as well as a sketch map of the structures in the cluster, recording on listing forms a description
of every structure together with the names of the heads of the households in the structures and other
characteristics. Mapping and listing of households represents a significant field cost, but it is essential
to guarantee the exactness of sample implementation.
The listing operation is an important procedure for reducing non-sampling errors in the survey,
especially when the sampling frame is outdated. The listing operation provides a complete list of
occupied residential households in the EA. This information is necessary for an equal probability
random selection of households in the second stage. With the household listing prior to the main
survey, it is possible to pre-select the sample households in advance and the interviewers are asked to
interview only the pre-selected households without replacement of non-responding households. With
the sketch map and the household listing of the cluster produced in the household listing operation,
the sampled households can be easily relocated by interviewers later. The fieldwork procedure for DHS
surveys is designed to be replicable and therefore allows easy supervision; all these elements are
designed to prevent serious bias during data collection.
It is sometimes suggested that listing could be avoided by making segments so small that
they are equal to the required sample “take” per cluster. One could then use a “take-all” rule at the
last stage of sampling. Such small segments, however, will generally be difficult to delineate. In
planned urban areas, this difficulty may be reduced—one could adopt blocks, or even single buildings,
as segments—but urban units of this kind are likely to be homogeneous, containing similar
households, and therefore less than ideal as sampling clusters.
It is also not acceptable to attempt to avoid listing altogether by having interviewers create
clusters as they go along, or by selecting the sample households at fixed intervals during a random
walk up to a predetermined quota. Such methods are not acceptable because first, they do not
guarantee a nonzero probability to every potential respondent; second, the procedure is not
replicable, which complicates the field work supervision; and third, it can end up with a sample of easy
units because of the lack of effort to make call backs to households or individuals who were not
available at the first attempt to interview.
Listing costs can be reduced by using segmentation to decrease the size of the area which has
to be listed; however, segmentation generates its own costs, and skill in map making and map
interpretation is required. Segmentation becomes progressively more difficult as segments become
smaller because there are not enough natural boundaries to delineate very small segments. Moreover,
concentration of the sample into smaller segments increases the sampling error. Since neighbors’
characteristics are correlated, a smaller segment captures less of the variety existing in the
population; this leads to less efficient sampling. There is a point beyond which it is not useful to
attempt further segmentation. As a general rule the average segment size should not be less than 500
in population (approximately 100 households) in both urban and rural areas. However, segmentation
yields smaller cost savings in urban areas because urban EAs are generally small geographic
areas.
It is quite probable that some traditional tools in the household listing process will be modified
in the future by using more sophisticated technology such as the Global Positioning System
(GPS) in order to collect more precise location information for the selected EAs. With this new tool we
can produce more precise distribution maps of the structures with less supervision than in the
traditional approach. The main feature is that every selected EA and every selected structure/dwelling
can be located with high precision and thus relocated later, if desirable. In addition, GPS information is
used more and more in DHS data analysis and presentation. At present, though, the recommended
protocol for collecting GIS information in DHS surveys is to collect one coordinate for every selected
cluster. See Chapter 2 for more details of the household listing operation.
1.11 Household selection in the central office
After the household listing operation, once the central office receives the completed listing
materials for a cluster, they must first create a serial number for each of the occupied residential
households, beginning with 1 and continuing to the total number of occupied residential households
listed in the cluster. Occupied residential households are those that were occupied at the
time of the listing, even if the occupant refused to cooperate at the time of listing, and those
households where the occupants were absent at the time of listing but neighbors confirmed that they
would not be absent for a long period and would be at home during the period of the main survey.
Only occupied residential households should be numbered. This serial number is an ID number for the
households. The household selection procedure will be performed based on this serial number.
Whether or not a household is considered occupied at the time of the listing is very important because
this fact will be related to the proportion of vacant households in the main survey.
The MEASURE DHS program has used several methods [footnote 3] for selecting households within
clusters, including:
1) Systematic selection: From a random starting point select every nth household (see
Chapter 3 Section 3.2 for more details).
2) Systematic selection with runs: From a random starting point, select a group of sequential
households called a “run”. Several runs may be used within a cluster. Runs are selected
with systematic selection. Selecting households in runs can greatly reduce the amount of
travel within cluster during data collection, especially in rural clusters where households
can be far apart.
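Systematic selection from the serial numbers can be sketched as below. This is an illustrative sketch rather than the DHS template logic described in Chapter 3: it assumes a fractional sampling interval (households listed divided by households to select) and a random start within the first interval. The function name is hypothetical.

```python
import math
import random

def systematic_selection(n_listed, n_select, rng=None):
    """Select n_select serial numbers from 1..n_listed with equal probability.

    Assumes a fractional interval n_listed / n_select and a random start in
    [0, interval); the i-th selection is the household whose serial number
    covers the point start + i * interval.
    """
    if not 0 < n_select <= n_listed:
        raise ValueError("need 0 < n_select <= n_listed")
    rng = rng or random.Random()
    interval = n_listed / n_select
    start = rng.random() * interval
    # min() guards against a floating-point edge case at the upper boundary
    return [min(n_listed, math.floor(start + i * interval) + 1)
            for i in range(n_select)]

# Example: 237 occupied households listed, 25 to be interviewed
selected = systematic_selection(237, 25, random.Random(2012))
print(selected)
```

Because the interval is at least 1, the selected serial numbers are distinct and increasing, and every listed household has the same chance of selection.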
The advantages of household selection in the central office can be summarized as:
1) It allows for a check of coverage of the household listing results before the main survey
and for the review and possible relisting of problematic clusters in advance.
2) Sampled households are pre-determined which prevents potential bias introduced by
allowing the interviewers to select in the field which households are to be interviewed.
3) The field work procedure is exactly replicable which provides the possibility of easy and
close supervision of the field work.
4) It is easier to control the work load for each interviewing team.
[Footnote 3] The MEASURE DHS program has developed various Excel templates for household selection in the
central office: systematic selection, systematic selection with runs, self-weighting selection with and without
control of sample size and with or without runs. Once the household listing is completed, it is possible to just
copy the number of households listed in a cluster into the spreadsheet and the spreadsheet will show the selected
household numbers automatically. See Chapter 3 Section 3.2.2 for details.
However, in cases when travelling between clusters represents a substantial cost, it is possible
to forego the step of selecting households in the central office. In such cases, the household listing
operation and the main survey can be combined into a single field operation. No essential changes are
needed in the household listing procedure or household numbering, but making a detailed sketch map
for the cluster may not be necessary because the listing team and the interviewing team are the
same, and the household interview will begin immediately after the listing, so identifying the exact
selected households during a separate visit is no longer a problem. The household selection must be
done in the field manually if portable computers are not available. Some manual selection procedures
have been developed for this purpose. Household listing and interviewing are two very different jobs,
so in surveys where listing, selection and interviewing take place in the same visit by the same staff,
it may be necessary to conduct more extensive training of field teams before the field work begins and
to supervise the teams more closely during the fieldwork. See Chapter 3 Section 3.2.2 for more details
on manual household selection.
1.12 Household interviews
The household interview procedure is outside the scope of this manual since it is explained in
detail in the interviewer’s manual. This section will briefly discuss the main statistical points of the
household interview. After the household selection, interviewers will be recruited and trained for the
household and individual interviews. Interviewer training is intensive, lasting at
least four weeks for a standard DHS survey, and longer if the survey includes many biomarkers. Prior
to the training, a pretest of the questionnaire will be conducted in a small number of clusters not
selected for the main survey to assess the quality of the questionnaires and the understanding of the
translations by interviewers and respondents. Problems and potential errors observed in the pretest
will be addressed and resolved prior to fieldwork training. Finally, the interviewing team will be sent to
selected clusters with a certain work load per team.
Once training is complete, teams of interviewers will be assigned a list of clusters and
deployed to the field. Upon arrival in a new area, the interviewer team must first contact the local
authorities for help to identify the correct cluster and to solicit cooperation during the field work. A
team leader or supervisor is assigned for each interviewing team. The supervisor is responsible for
cluster identification and should guarantee that the correct cluster will be interviewed. After checking
the listing materials and verifying with the local authorities, the supervisor will distribute the sampled
households among the interviewers. After locating a selected household, the interviewer will begin
with a brief household interview, listing household members and visitors, and identifying among them
all eligible women and men for the individual interview. Eligible individuals are defined as those who
are in the specified age group (15-49), and are either usual members of the selected household or
who slept in the household the night before the interviewer’s visit.
Conscious omission of eligible individuals on the part of an interviewer by mis-reporting their
age outside of the eligible age group is a real concern. Measures to eliminate this problem should be
undertaken. For example, the field editor should check the consistency of each completed
questionnaire and, if suspicious things are identified, should return to the household for further
verification of key items such as the number of household members, number of eligible individuals and
number of children under age five.
In the event of failure to contact a household or an eligible person in the first visit, the
interviewer is required to make at least two repeat visits, or call backs, on different days and at
different times of the day before the interview is abandoned. The process of making call backs
requires the teams to stay in a cluster for at least two to three days. Some countries propose large
interviewing teams in order to try to cover an entire cluster in one day. This process is not acceptable
for a DHS survey, even when the designed sample size can bear a large non-response rate, because
non-response biases the survey results. A quick survey usually ends up with poor data quality. Both
theory and practice prove that call backs and efforts to get difficult units to respond to the survey are
the best way to remove bias and reduce the non-sampling errors to a minimum. For more details,
refer to the DHS Survey Organization Manual and the Interviewer’s Manual.
1.13 Sampling weight calculation
1.13.1 Why we need to weight the survey data
A DHS sample is a representative sample randomly selected from the target population. Each
interviewed unit (household and individual) represents a certain number of similar units in the target
population. In order for any statistical inferences drawn from the survey data to be valid, this
representativeness of the sample must be taken into account. In general terms, sampling weights are
used to make the sample more like the target population. All analyses should use the sampling
weights calculated for each interviewed household and for each interviewed individual.
A sampling weight is an inflation factor which extrapolates the sample to the target
population. For example, if equal probability sampling (or a self-weighting sample) is applied in a
domain with a sampling fraction 1/500, this means that each sampled individual represents 500
similar individuals in the target population. Therefore, if we observed one particular individual having
secondary education, we would conclude that there are 500 individuals in the target population having
secondary education, corresponding to this particular individual. The total number of individuals with
secondary education in the target population would be 500 times the total number of interviewed
individuals having secondary education observed in the sample. This explanation also applies to
unequal probability sampling. It is very important that sampling weights are properly calculated and
applied in data analysis; otherwise, serious bias may be introduced, leading to incorrect conclusions.
All DHS indicators are means, proportions, rates or ratios; nevertheless, because a nationwide
self-weighting sample is usually not feasible owing to study domains, as explained in Section 1.9,
sampling weights are always necessary. Even when a survey is designed to be nationally
self-weighting, it is necessary to correct for the different response patterns across domains/strata (see
Section 1.13.4 for more details). Therefore, even surveys with self-weighting sample designs require
the use of sampling weights.
Though the effect of sampling weights on survey indicators may be small, it is necessary to
use sampling weights for the following reasons:
1) For valid statistical inference.
2) For correcting or reducing bias; weighting can reduce bias introduced by non-response or
other non-sampling errors.
3) For keeping the weighted sample distribution close to the target population distribution,
especially when oversampling is applied in certain domains/strata.
1.13.2 Design weights and sampling weights
The MEASURE DHS program calculates both design weights and sampling weights (or survey
weights) for both households and individuals. The design weight of a sampling unit (household or
individual) is the inverse of the overall probability with which the unit was selected in the sample. The
sampling weight of a sampling unit is the design weight corrected for non-response or other
calibrations.
Since the DHS protocol involves no selection of eligible individuals within a sampled
household (except for the domestic violence module, in which one eligible woman is selected from a
sampled household), all eligible individuals from the same household share the same design weight,
which is the same as the household’s design weight. Therefore, the design weight is the basic weight
for DHS surveys. All other weights are calculated based on the design weight. In calculating the
sampling weight, it is possible to correct for both unit non-response (a sampling unit is not
interviewed at all) and item non-response (the sampling unit does not provide an answer to a specific
question). The policy of the MEASURE DHS program is to correct for unit non-response at the stratum
level (see Section 1.13.4) and leave the correction of item non-response to data users because it is
variable specific. Correction of unit non-response at cluster level will increase the variability of
sampling weights and therefore increase sampling errors. Because the correction for unit non-
response is the same for an entire cluster and because household selection within a cluster is an equal
probability selection, all the households in the same cluster share the same design weight and
sampling weight, and the same is true for all individuals in the same cluster. This means that the DHS
weights (both design weights and sampling weights) are cluster weights.
1.13.3 How to calculate the design weights
Assuming that a DHS survey sample is drawn with two-stage, stratified cluster sampling,
design weights will be calculated based on the separate sampling probabilities for each sampling stage
and for each cluster. We use the following notations:
$P_{1hi}$: first-stage sampling probability of the $i$-th cluster in stratum $h$
$P_{2hi}$: second-stage sampling probability within the $i$-th cluster (household selection)
Let $n_h$ be the number of clusters selected in stratum $h$; let $M_{hi}$ be the measure of size of the
cluster used in the first stage’s selection, usually the measure of size is the number of households
residing in the cluster according to the sampling frame; let $\sum_i M_{hi}$ be the total measure of size in the
stratum $h$. The probability of selecting the $i$-th cluster in the sample is calculated as follows:
$$P_{1hi} = \frac{n_h M_{hi}}{\sum_i M_{hi}}$$
Let $b_{hi}$ be the proportion of households in the selected cluster compared to the total number
of households in EA $i$ in stratum $h$ if the EA is segmented, otherwise $b_{hi} = 1$. Then the probability of
selecting cluster $i$ in the sample is:
$$P_{1hi} = \frac{n_h M_{hi}}{\sum_i M_{hi}} \times b_{hi}$$
Let $L_{hi}$ be the number of households listed in the household listing operation in cluster $i$ in
stratum $h$; let $t_{hi}$ be the number of households selected in the cluster. The second stage selection
probability for each household in the cluster is calculated as follows:
$$P_{2hi} = \frac{t_{hi}}{L_{hi}}$$
The overall selection probability of each household in cluster i of stratum h is therefore the
product of the selection probabilities of the two stages:
$$P_{hi} = P_{1hi} \times P_{2hi}$$
The design weight for each household in cluster i of stratum h is the inverse of its overall
selection probability:
$$d_{hi} = 1 / P_{hi}$$
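The design weight calculation can be illustrated with a short sketch. The function and argument names are hypothetical, and the input figures in the example are invented for illustration only.

```python
def design_weight(n_h, M_hi, M_h_total, L_hi, t_hi, b_hi=1.0):
    """Household design weight for cluster i of stratum h.

    n_h       : number of clusters selected in the stratum
    M_hi      : measure of size of the cluster in the sampling frame
    M_h_total : total measure of size of the stratum (sum of M_hi)
    L_hi      : households listed in the cluster
    t_hi      : households selected in the cluster
    b_hi      : segment share of the EA (1.0 if the EA was not segmented)
    """
    p1 = n_h * M_hi / M_h_total * b_hi  # first-stage probability P_1hi
    p2 = t_hi / L_hi                    # second-stage probability P_2hi
    return 1.0 / (p1 * p2)              # d_hi = 1 / (P_1hi * P_2hi)

# Invented example: 20 clusters drawn from a stratum of 50,000 frame households;
# the cluster had 250 households in the frame, 260 were listed, 25 were selected.
print(design_weight(n_h=20, M_hi=250, M_h_total=50_000, L_hi=260, t_hi=25))
```

In this invented example each selected household represents about 104 households in the stratum. If only half of the EA had been retained as a segment (`b_hi=0.5`), the weight would double.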
The calculation of the design weight is not complicated; however, difficulties often result from
not having all the design parameters involved in the above calculation because they are not well
documented, especially when the sampling frame is a master sample. See Chapter 5 for more details
on sample documentation.
1.13.4 Correction of unit non-response and calculation of sampling weights
The design weight calculated above is based on sample design parameters. If there is no non-
response at the cluster level, at the household level, or at the individual level, the design weight is
enough for all analyses, for both household indicators and individual indicators. However, non-
response is inevitable in all surveys, and different units have different response behaviors. The
experience of the MEASURE DHS program shows that urban households are less likely to respond to
the survey than their counterparts in rural areas, households in developed regions are less likely to
respond to the survey than their counterparts in less-developed regions, rich households are less likely
to respond to the survey than poor households, individuals with higher levels of education are less
likely to respond to the survey than those with lower levels of education, men are less likely to
respond to the survey than women, and so forth.
The idea of correcting for unit non-response is to calculate a response rate for each
homogeneous response group, then inflate the design weight by dividing it by the response rate for
each response group. The construction of homogeneous response groups depends on the knowledge
of the response behavior of the sampling units. DHS surveys always use the sampling stratum as the
response group because the stratification is usually achieved by regrouping homogeneous sampling
units in a single stratum. It is possible to use a cluster as a response group, but the disadvantage is
that the response rates may vary too much at the cluster level, which will increase the variability of
the sampling weight, which in turn increases the sampling variance. Furthermore, correction of
non-response at the cluster level will interfere with self-weighting if a self-weighting sample has been
designed.
By assuming that the response groups coincide with the sampling strata, the following steps
explain how to calculate the sampling weight by first calculating the various response rates for unit
non-response. Please note that the response rates calculated here are different from the response
rates calculated in Appendix A of DHS survey final reports. In Appendix A, household and individual
response rates are calculated as ratios of the number of interviewed units over the number of eligible
units because the aim is just to show the results of survey implementation. Here we use weighted
ratios because the aim is to correct the design weight to compensate for non-response, therefore the
design weight should be involved. Because a non-responding unit with a large design weight will
have a larger impact on survey estimates than a non-responding unit with a small design weight, a
weighted response rate for correction of non-response is better than an un-weighted response rate.
1. Cluster level response rate
Let $n_h$ be the number of clusters selected in stratum $h$; let $n^*_h$ be the number of clusters
interviewed. The cluster level response rate in stratum $h$ is therefore
$$R_{ch} = n^*_h / n_h$$
2. Household level response rate
Let $m_{hi}$ be the number of households found (see Chapter 2, Section 2.10 for definition) in
cluster $i$ of stratum $h$; let $m^*_{hi}$ be the number of households interviewed in the cluster. The household
response rate in stratum $h$ is calculated by
$$R_{hh} = \frac{\sum d_{hi} m^*_{hi}}{\sum d_{hi} m_{hi}}$$
where $d_{hi}$ is the design weight of cluster $i$ in stratum $h$; the summation is over all clusters in the
stratum $h$.
3. Individual response rate
Let $k_{hi}$ be the number of eligible individuals found in cluster $i$ of stratum $h$; let $k^*_{hi}$ be the
number of individuals interviewed. The individual response rate in stratum $h$ is calculated as
$$R_{ph} = \frac{\sum d_{hi} k^*_{hi}}{\sum d_{hi} k_{hi}}$$
where $d_{hi}$ is the design weight of cluster $i$ in stratum $h$; the summation is over all clusters in the
stratum $h$.
The household sampling weight of cluster $i$ in stratum $h$ is calculated by dividing the household
design weight by the product of the cluster response rate and the household response rate, for each
sampling stratum:
$$D_{hi} = d_{hi} / (R_{ch} \times R_{hh}),$$
for cluster $i$ of stratum $h$.
The individual sampling weight of cluster i in stratum h is calculated by dividing the household
sampling weight by the individual response rate, or equivalently, by dividing the household design
weight by the product of the cluster response rate, the household response rate and the individual
response rate, for each of the sampling strata:
$$W_{hi} = D_{hi} / R_{ph} = d_{hi} / (R_{ch} \times R_{hh} \times R_{ph}),$$
for cluster $i$ of stratum $h$.
It is easy to see that the difference between the household sampling weights and the
individual sampling weights is introduced by individual non-response.
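The three response rates and the resulting sampling weights for one stratum can be sketched as follows. The data layout (a list of dictionaries, one per interviewed cluster) and all names are hypothetical, and the figures are invented for illustration.

```python
def stratum_sampling_weights(clusters, n_selected):
    """Response rates and sampling weights for one stratum.

    clusters   : interviewed clusters, each a dict with the design weight "d",
                 households found/interviewed and eligible individuals
                 found/interviewed
    n_selected : number of clusters selected in the stratum (n_h)
    Returns (R_ch, R_hh, R_ph, [(D_hi, W_hi), ...]).
    """
    r_c = len(clusters) / n_selected  # cluster level response rate
    r_h = (sum(c["d"] * c["hh_interviewed"] for c in clusters)
           / sum(c["d"] * c["hh_found"] for c in clusters))
    r_p = (sum(c["d"] * c["ind_interviewed"] for c in clusters)
           / sum(c["d"] * c["ind_found"] for c in clusters))
    weights = []
    for c in clusters:
        hh_weight = c["d"] / (r_c * r_h)  # household sampling weight D_hi
        ind_weight = hh_weight / r_p      # individual sampling weight W_hi
        weights.append((hh_weight, ind_weight))
    return r_c, r_h, r_p, weights

# Invented stratum: 3 clusters selected, 2 of them interviewed
stratum = [
    {"d": 100.0, "hh_found": 25, "hh_interviewed": 24,
     "ind_found": 30, "ind_interviewed": 27},
    {"d": 200.0, "hh_found": 25, "hh_interviewed": 20,
     "ind_found": 28, "ind_interviewed": 28},
]
r_c, r_h, r_p, weights = stratum_sampling_weights(stratum, n_selected=3)
print(r_c, r_h, r_p)
print(weights)
```

The response rates are weighted by the design weight, so the low household response in the cluster with the larger design weight pulls the stratum rate down more than the same shortfall would in the other cluster.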
The sampling weights for households selected for the men’s survey and for men can be
calculated similarly. We need a separate household sampling weight for the men’s survey in cases
where the men’s survey is conducted in a sub-sample of households selected for the women’s survey,
and we suppose that the response behavior of households in the men’s survey sub-sample may be
different from the overall household response rate.
If no normalization is requested, we can stop here. The above calculated household sampling
weight and individual sampling weight can be used to produce any indicators at the household level
and the individual level, respectively. As we mentioned earlier in Section 1.13.1, a sampling weight is
an inflation or extrapolation factor. The weighted sum of households interviewed
$$T = \sum D_{hi} m^*_{hi}$$
is an unbiased estimate of the total number of ordinary residential households of the country; where
$m^*_{hi}$ is the number of households interviewed in the $i$-th cluster of stratum $h$, and the summation is over
all clusters and strata in the total sample. Similarly, the weighted sum of all interviewed women
$$W = \sum W_{hi} k^*_{hi}$$
is an unbiased estimate of the total women in the target population (women age 15-49) of the
country; where $k^*_{hi}$ is the number of women interviewed in the $i$-th cluster of stratum $h$, and the
summation is over all clusters and strata in the total sample.
1.13.5 Normalization of sampling weights
Normalization of sampling weights is not necessary for survey data analysis. In order to
prevent large numbers for the number of weighted cases in the tables in DHS survey final reports, it is
the MEASURE DHS tradition to calculate normalized standard weights for both households and
individuals. With the normalized standard weight, the number of unweighted cases coincides with the
number of weighted cases at the national level for both total households and total individuals. The
normalized standard weight of a sampling unit is calculated based on its sampling weight, by
multiplying the sampling weight with a unique constant at the national level. The constant or the
normalization factor is the total number of completed cases divided by the total number of weighted
cases (based on the sampling weight). This number is equal to the estimated total sampling fraction
because the total number of weighted cases with the sampling weight is an estimation of the total
target population. Therefore the standard weights in the DHS data files are relative weights. Relative
weights can be used to estimate means, proportions, rates and ratios because the normalization factor
is cancelled out when used in both numerator and denominator, so it has no effect on the calculated
indicator values. This point also explains why the normalization must be done at the national level and
not the regional level: at the regional level, the normalization factor cannot be cancelled out, and bias
will be introduced in the calculated indicator values. Because the normalized standard weights have no
scale, they are not valid for estimating totals. Also the normalized weight is not valid for pooled data,
even for data pooled for women and men in the same survey, because the normalization factor is
country and sex specific.
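The cancellation argument can be written out explicitly. As a sketch, let $w_j$ be the sampling weight and $y_j$ the observed value for interviewed unit $j$, and let $F$ be the national normalization factor (the symbols $w_j$, $y_j$, $F$ and $F_r$ are introduced here for illustration only). For a mean or proportion,
$$\frac{\sum_j (F w_j) y_j}{\sum_j F w_j} = \frac{F \sum_j w_j y_j}{F \sum_j w_j} = \frac{\sum_j w_j y_j}{\sum_j w_j}$$
so the normalized and non-normalized weights give the same estimate. If instead each region $r$ had its own factor $F_r$, a national estimate would become
$$\frac{\sum_r F_r \sum_{j \in r} w_j y_j}{\sum_r F_r \sum_{j \in r} w_j}$$
which equals the correct estimate only when all $F_r$ are equal. For a total, $\sum_j (F w_j) y_j = F \sum_j w_j y_j$, so the factor does not cancel and the normalized weights cannot be used.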
1. Normalized household standard weight [footnote 4]
The normalization factor for calculating household standard weight is calculated as
$$\mathit{FH} = \frac{\sum m^*_{hi}}{\sum D_{hi} m^*_{hi}}$$
The household standard weight for cluster $i$ in stratum $h$ is calculated by
$$\mathit{HV005}_{hi} = D_{hi} \times \mathit{FH} = D_{hi} \times \frac{\sum m^*_{hi}}{\sum D_{hi} m^*_{hi}}$$
where HV005 is the household standard weight variable in the DHS Recode data files.
[Footnote 4] The MEASURE DHS program has developed Excel templates for facilitating standard weight
calculations. If all design parameters and the survey results (number of households found and interviewed,
number of eligible women found and interviewed, number of eligible men found and interviewed, number of
eligible women and men found and tested, by cluster) are provided in the input page, the standard weights will
be calculated automatically in different pages.
It is easy to see that the weighted sum of households interviewed by using the standard
weight equals the unweighted sum of households interviewed for the total sample. This condition will
not be met at the domain level or for sub-populations. At the domain level, the weighted sum of
households interviewed may be larger or smaller than the unweighted sum of households interviewed,
depending on whether the domain is undersampled or oversampled.
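As a rough illustration of the point about domains, the sketch below applies the normalization factor and the household standard weight to four made-up clusters. The dictionary keys `D` and `m_star` stand for the household sampling weight and the number of households interviewed in a cluster; the stratum labels and all numbers are hypothetical.

```python
# Hypothetical clusters: D is the household sampling weight of the cluster,
# m_star the number of households interviewed in it.
clusters = [
    {"stratum": "urban", "D": 150.0, "m_star": 20},
    {"stratum": "urban", "D": 170.0, "m_star": 18},
    {"stratum": "rural", "D": 600.0, "m_star": 22},
    {"stratum": "rural", "D": 640.0, "m_star": 20},
]

# Normalization factor FH: interviewed households / weighted households.
total_interviewed = sum(c["m_star"] for c in clusters)
total_weighted = sum(c["D"] * c["m_star"] for c in clusters)
FH = total_interviewed / total_weighted

# Household standard weight of each cluster.
for c in clusters:
    c["HV005"] = c["D"] * FH


def weighted_households(rows):
    """Weighted number of interviewed households using the standard weight."""
    return sum(r["HV005"] * r["m_star"] for r in rows)


national_weighted = weighted_households(clusters)

# (weighted count, unweighted count) for each stratum.
by_stratum = {}
for name in ("urban", "rural"):
    rows = [c for c in clusters if c["stratum"] == name]
    by_stratum[name] = (weighted_households(rows), sum(r["m_star"] for r in rows))

print(national_weighted, total_interviewed)
for name, (weighted, unweighted) in by_stratum.items():
    print(name, round(weighted, 1), unweighted)
```

In this made-up example the urban stratum has the smaller sampling weights (it is oversampled), so its weighted count falls below its unweighted count; the reverse holds for the rural stratum, while the national totals agree.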
2. Normalized women’s standard weight
The normalization factor for calculating the women’s standard weight is calculated as
$$FW = \frac{\sum\sum k^{*}_{hi}}{\sum\sum W_{hi}\, k^{*}_{hi}}$$
The women’s standard weight for cluster i in stratum h is calculated by
$$V005_{hi} = W_{hi} \times FW = W_{hi} \times \frac{\sum\sum k^{*}_{hi}}{\sum\sum W_{hi}\, k^{*}_{hi}}$$
where V005 is the women’s standard weight variable in the DHS Recode data files.
The standard weights for households selected for the men’s survey and for men can be
calculated in a similar way.
1.13.6 Standard weights for HIV testing
The sampling weights for HIV testing are calculated separately for women and men, but they
are calculated using the same methodology. The only difference is in the calculation of the
normalization factors, if a normalized weight is requested. In order to calculate the weighted HIV
prevalence for women and men together using a normalized weight, the standard weight for HIV
testing must be normalized for women and men together. In most DHS surveys, HIV testing is
conducted in the same subsample of households selected for the men’s survey, and every woman or man
in the household who is eligible for the individual interview is eligible for HIV testing. Once the
household sampling weight for the men’s survey is calculated using the procedures stated in Section
1.13.5, the sampling weights for HIV testing for women and men may be calculated separately by
correcting the household sampling weight for the non-response rates of women and men for HIV
testing, respectively. For simplicity, let $MD_{hi}$ be the household sampling weight in cluster i of
stratum h for the men’s survey sub-sample. The response rates to HIV testing for women and men are
calculated respectively by
$$WR_{hi} = \frac{\sum MD_{hi}\, WHIV^{*}_{hi}}{\sum MD_{hi}\, WHIV_{hi}}$$
$$MR_{hi} = \frac{\sum MD_{hi}\, MHIV^{*}_{hi}}{\sum MD_{hi}\, MHIV_{hi}}$$
where $WHIV_{hi}$ is the number of women eligible for HIV testing, and $WHIV^{*}_{hi}$ is the number of
women tested with a valid test result, in cluster i of stratum h; $MHIV_{hi}$ and $MHIV^{*}_{hi}$ are the
number of men eligible and the number of men tested with a valid test result, respectively, in cluster i
of stratum h.
The sampling weights for HIV testing for women and men, respectively, are calculated by
$$HIV^{W}_{hi} = MD_{hi} / WR_{hi}, \qquad HIV^{M}_{hi} = MD_{hi} / MR_{hi}$$
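In cluster i of stratum h, the normalized standard weights for HIV testing for women and men,
respectively, are calculated by
$$HIV05^{W}_{hi} = HIV^{W}_{hi} \times \frac{\sum\sum \left( WHIV^{*}_{hi} + MHIV^{*}_{hi} \right)}{\sum\sum \left( HIV^{W}_{hi} \times WHIV^{*}_{hi} + HIV^{M}_{hi} \times MHIV^{*}_{hi} \right)}$$
$$HIV05^{M}_{hi} = HIV^{M}_{hi} \times \frac{\sum\sum \left( WHIV^{*}_{hi} + MHIV^{*}_{hi} \right)}{\sum\sum \left( HIV^{W}_{hi} \times WHIV^{*}_{hi} + HIV^{M}_{hi} \times MHIV^{*}_{hi} \right)}$$
where the double summations are over all clusters and strata in the total sample.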
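The chain from household weight to normalized HIV weight can be sketched in Python as follows. This is an illustration under stated assumptions, not the MEASURE DHS implementation: the cluster values are invented, the key names mirror the symbols used above, and the response rates are assumed to be weighted ratios whose sums run over the clusters of a stratum.

```python
from collections import defaultdict

# Hypothetical cluster-level inputs. MD: household sampling weight for the
# men's survey sub-sample; WHIV/MHIV: women/men eligible for testing;
# WHIV_star/MHIV_star: women/men tested with a valid result.
clusters = [
    {"stratum": 1, "MD": 300.0, "WHIV": 12, "WHIV_star": 10, "MHIV": 10, "MHIV_star": 7},
    {"stratum": 1, "MD": 340.0, "WHIV": 15, "WHIV_star": 14, "MHIV": 11, "MHIV_star": 9},
    {"stratum": 2, "MD": 900.0, "WHIV": 14, "WHIV_star": 12, "MHIV": 12, "MHIV_star": 10},
    {"stratum": 2, "MD": 950.0, "WHIV": 10, "WHIV_star": 9, "MHIV": 9, "MHIV_star": 6},
]


def stratum_response_rates(rows, eligible, tested):
    """Weighted response rate per stratum (assumption: sums run within stratum)."""
    num, den = defaultdict(float), defaultdict(float)
    for r in rows:
        num[r["stratum"]] += r["MD"] * r[tested]
        den[r["stratum"]] += r["MD"] * r[eligible]
    return {h: num[h] / den[h] for h in num}


WR = stratum_response_rates(clusters, "WHIV", "WHIV_star")
MR = stratum_response_rates(clusters, "MHIV", "MHIV_star")

# Sampling weights for HIV testing: household weight / response rate.
for r in clusters:
    r["HIV_W"] = r["MD"] / WR[r["stratum"]]
    r["HIV_M"] = r["MD"] / MR[r["stratum"]]

# One normalization factor for women and men together.
tested_total = sum(r["WHIV_star"] + r["MHIV_star"] for r in clusters)
weighted_total = sum(
    r["HIV_W"] * r["WHIV_star"] + r["HIV_M"] * r["MHIV_star"] for r in clusters
)
factor = tested_total / weighted_total

for r in clusters:
    r["HIV05_W"] = r["HIV_W"] * factor
    r["HIV05_M"] = r["HIV_M"] * factor

# Weighted number of tested women and men with the normalized weights.
normalized_sum = sum(
    r["HIV05_W"] * r["WHIV_star"] + r["HIV05_M"] * r["MHIV_star"] for r in clusters
)
print(tested_total, round(normalized_sum, 6))
```

Because a single factor is applied to both sexes, the weighted number of tested women and men together equals the unweighted number tested, which is what makes the normalized weight usable for a combined prevalence estimate.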
1.13.7 De-normalization of standard weights for pooled data
For all of the DHS data, the weight variables HV005 (household standard weight), V005
(women’s standard weight) and MV005 (men’s standard weight) are relative weights which are
normalized so that the total number of weighted cases is equal to the total number of unweighted
cases, for the three kinds of units. In some situations, such as analyses involving data from more than
one survey, data users may need the un-normalized sampling weight for analyzing pooled data. As
mentioned in Section 1.13.5, since normalization is country specific and sex specific, it is necessary to
de-normalize the standard weights provided in the DHS Recode data files for analyzing pooled data.
The normalization procedure consists of multiplying the sampling weight by a normalization
factor for the total sample. The normalization factor is the estimated total sampling fraction: the
number of completed cases divided by the number of weighted cases calculated with the sampling weight,
for each kind of sampling unit. The weighted number of cases based on the sampling weight is an estimate
of the total target population. Therefore, to de-normalize a normalized weight, simply divide the
normalized weight by the total sampling fraction. The estimated total sampling fraction is usually not
provided in the DHS data file or in the final report. To calculate the total sampling fraction, it
is necessary to know the total target population at the time of the survey, which can usually be
obtained from various sources. The country’s statistical office, the
United Nations Population Division’s (UNPD) World Population Prospects [5], and the United Nations
Population Fund (UNFPA) are three sources that may be easy to access.
As mentioned above, if pooled data analysis is required, the standard weight variables HV005,
V005 and MV005 must be rescaled or de-normalized. The de-normalization procedure is the inverse of
the normalization procedure: that is, multiply the standard weight by the target population and divide
by the number of completed cases, for each survey. The de-normalized weights for households,
women and men (HV005*, V005*, and MV005*, respectively) can be calculated using the following
formulas:
HV005* = HV005 × (total number of residential households in the country)/
(total number of households interviewed in the survey)
V005* = V005 × (total female population 15-49 in the country)/
(total number of women 15-49 interviewed in the survey)
MV005* = MV005 × (total male population 15-49 (15-59) in the country)/
(total number of men 15-49 (15-59) interviewed in the survey)
[5] http://esa.un.org/unpd/wpp/index.htm
If normalized weights are preferred, the above re-scaled weights can be re-normalized for the
pooled data: multiply each re-scaled weight by the total number of completed interviews in the pooled
data (for example, women’s and men’s interviews combined) and divide by the total number of weighted
cases in the pooled data, calculated with the re-scaled weights.
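A minimal Python sketch of the de-normalization and optional re-normalization steps is given below. The two surveys, their weights and the population figures are invented for illustration, and the sketch assumes that the standard weights are already on their relative scale (summing to the number of interviews).

```python
# Hypothetical inputs for two surveys. v005 holds women's standard weights on
# their relative scale (they sum to the number of interviews); the population
# figures are invented and would in practice come from an external source.
surveys = {
    "survey_A": {"v005": [0.8, 1.2, 1.0, 1.0], "women_15_49": 2_000_000},
    "survey_B": {"v005": [1.5, 0.5, 1.0], "women_15_49": 600_000},
}


def denormalize(standard_weights, target_population):
    """V005* = V005 x (target population) / (number of completed interviews)."""
    n_interviewed = len(standard_weights)
    return [w * target_population / n_interviewed for w in standard_weights]


denormalized = {
    name: denormalize(s["v005"], s["women_15_49"]) for name, s in surveys.items()
}

# Pool the surveys using the de-normalized weights.
pooled = [w for weights in denormalized.values() for w in weights]

# Optional re-normalization of the pooled file: one factor for all cases.
renorm_factor = len(pooled) / sum(pooled)
pooled_normalized = [w * renorm_factor for w in pooled]

for name, weights in denormalized.items():
    print(name, round(sum(weights)))
print(round(sum(pooled_normalized), 6))
```

After de-normalization the weights of each survey sum to that survey's target population, so the surveys contribute to pooled estimates in proportion to their populations rather than their sample sizes.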
Note that the normalization of sampling weights is done for the total sample for households,
women and men separately. If the aim is to tabulate indicators for a certain sub-population from
pooled data, for example, vaccination coverage for children 12-23 months, the de-normalization has
nothing to do with the total population of children 12-23 months because there is no standard weight
calculated for children 12-23 months in DHS surveys. If the indicator is tabulated at the household
level using the household weight, the household standard weights must be de-normalized for all of the
surveys included in the analysis as explained above; likewise, if the indicator is tabulated at the
individual level using the women’s (or child’s mother’s) weight, the women’s standard weights must
be de-normalized for each of the surveys.
1.14 Calibration of sampling weights in case of bias
Generalized calibration (Deville and Särndal, 1992; Deville et al., 1993) has become a
popular and powerful framework in survey data analysis for statistical offices in many countries. It
allows different sources of auxiliary information to be used to improve estimates from sample
surveys. Calibration can reduce sampling errors, correct bias caused by non-response and other
non-sampling errors, and reduce the influence of extreme values. Calibration is a “weight tuning”
procedure such that the tuned sampling weights reproduce known population characteristics without
error. The precision of an estimator using a calibrated weight is equivalent to that of a
regression estimator, but the calibrated estimator is much easier to calculate with the help of
calibration software such as CALMAR, a SAS macro developed by the French National Institute of
Statistics and Economic Studies (INSEE), and the SPSS procedure developed by Statistics Belgium. DHS
surveys employ calibration of sampling weights only in cases where serious bias is observed in the
collected data, and there is reliable auxiliary information available for the calibration.
Let X be a multivariate auxiliary variable with p components such that the population totals of
each of the component variables are known beforehand from the recent population census, that is,
$$t_x = \sum_{i \in U} X_i = (t_{x_1}, t_{x_2}, \ldots, t_{x_p})^{\tau}$$
is known. Let $x_i$ be the observations of the auxiliary variables from the survey,
$$x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})^{\tau}$$
for the respondent sampling unit i. Let $D_i$ be the sampling weight for unit i. The calibration
procedure consists of modifying the sampling weight slightly from $D_i$ to $W_i$ such that a given
distance measure between the sampling weights $D_i$ and the calibrated weights $W_i$,
$$\sum_{i \in s} g(W_i, D_i)$$
is minimized under the constraints
$$\sum_{i \in s} W_i x_i = t_x$$
where g is a distance function which measures the distance between $D_i$ and $W_i$. The constraints
imposed are that the known auxiliary variable totals are estimated without error with the calibrated
weights. If the variable of interest is well correlated with the auxiliary variables, then we expect that
the precision can be greatly improved for estimating the variable of interest. The calibration theory
states that the calibrated weights have the following formula
$$W_i = D_i\, F\!\left( q_i\, x_i^{\tau} \lambda(s) \right)$$
where $F(\cdot)$ is called the calibration function, which is the inverse function of the derivative of
the distance function g; $q_i$ is a calibration weight which is usually set to 1 in the absence of prior
knowledge; and $\lambda(s)$ is a vector, depending on the particular sample s, which is to be solved for.
When
$$F\!\left( q_i\, x_i^{\tau} \lambda(s) \right) = 1 + q_i\, x_i^{\tau} \lambda(s),$$
which corresponds to one of the five calibration functions proposed in Deville et al., 1993, the
solution is easy to obtain: $\lambda(s)$ is given by
$$\lambda(s) = T_s^{-1} \left( t_x - \hat{t}_{x\pi} \right)$$
with
$$T_s = \sum_{i \in s} D_i\, q_i\, x_i x_i^{\tau}$$
For a given variable of interest y, the calibrated estimator of the population total is equivalent
to the generalized regression estimator
$$\hat{t}_{y} = \sum_{i \in s} W_i y_i = \hat{t}_{y\pi} + \hat{B}_s^{\tau} \left( t_x - \hat{t}_{x\pi} \right)$$
where
$$\hat{B}_s = T_s^{-1} \sum_{i \in s} q_i D_i x_i y_i$$
is the sample estimate of the regression coefficient; $\hat{t}_{y\pi}$ and $\hat{t}_{x\pi}$ are the
simple estimators using the sampling weight,
$$\hat{t}_{y\pi} = \sum_{i \in s} D_i y_i, \qquad \hat{t}_{x\pi} = \sum_{i \in s} D_i x_i$$
An estimate of the mean of the variable of interest y can be calculated by
$$\hat{Y} = \frac{\sum_{i \in s} W_i y_i}{\sum_{i \in s} W_i}$$
The calibration estimator can be equivalently formulated with known proportions of one or
more auxiliary variables. The calibration can be conducted at the individual level, which will result in
an individual specific weight, or it can be conducted at the cluster level with aggregated data, which
will result in a cluster weight. For more details see the related references given at the end of this
document.
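For the linear calibration function, the whole procedure reduces to solving one small linear system. The sketch below is a self-contained Python illustration under stated assumptions, not a replacement for CALMAR or similar software: the five households, their auxiliary variables (a constant and household size), the "known" totals and the function names are all hypothetical, and the calibration weights $q_i$ default to 1.

```python
def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def linear_calibration(D, X, t_x, q=None):
    """Return calibrated weights W_i = D_i * (1 + q_i * x_i' lambda)."""
    n, p = len(D), len(t_x)
    q = q if q is not None else [1.0] * n
    # T_s = sum over the sample of D_i * q_i * x_i x_i'
    T = [[sum(D[i] * q[i] * X[i][a] * X[i][b] for i in range(n))
          for b in range(p)] for a in range(p)]
    # Simple estimates of the auxiliary totals using the sampling weights.
    t_hat = [sum(D[i] * X[i][a] for i in range(n)) for a in range(p)]
    lam = solve(T, [t_x[a] - t_hat[a] for a in range(p)])
    return [D[i] * (1.0 + q[i] * sum(X[i][a] * lam[a] for a in range(p)))
            for i in range(n)]


# Hypothetical sample of five households: x = (1, household size).
D = [100.0, 100.0, 150.0, 150.0, 200.0]
X = [[1, 3], [1, 5], [1, 4], [1, 6], [1, 2]]
t_x = [750.0, 3100.0]  # assumed known totals: households, persons
y = [1, 0, 1, 1, 0]    # made-up 0/1 variable of interest

W = linear_calibration(D, X, t_x)
calibrated_totals = [sum(W[i] * X[i][a] for i in range(len(W))) for a in range(2)]
calibrated_y_total = sum(w * yi for w, yi in zip(W, y))
calibrated_y_mean = calibrated_y_total / sum(W)
print([round(t, 6) for t in calibrated_totals])
print(round(calibrated_y_mean, 4))
```

The calibrated weights reproduce the assumed totals exactly, which is the constraint the method imposes. The linear function can produce negative weights in unfavourable samples, which is one reason other calibration functions exist; this sketch does not guard against that.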
1.15 Data quality and sampling error reporting
Data quality is a major concern for all MEASURE DHS projects. Although numerous
efforts are made in implementing DHS surveys to maximize the quality of the data collected,
non-sampling errors remain the main threat to data quality. The data quality of a survey directly
affects the reliability of the statistics produced. Many countries have laws that require reports of
survey findings to include an evaluation of data quality and reliability. Data quality can be measured
by total survey error, which includes the bias introduced by various sampling and non-sampling errors.
DHS survey final reports usually include tables in an appendix for data quality evaluation
purposes, including: age distributions of household population by sex; age distributions of eligible and
interviewed women and men; completeness of reporting on date of birth, age at death, age/date at
first union, education and anthropometric measures, etc. The MEASURE DHS program also conducts
some in-depth studies on data quality for specific topics, which are provided in published reports.
Apart from the data quality tables, DHS survey final reports provide sampling errors for
selected indicators in Appendix B. Sampling errors are important reliability measures which tell the
user the degree of error associated with a particular estimated indicator value, the number of cases
involved in the calculation of the indicator, the efficiency or clustering effects of the sample design
compared with simple random sampling, and the range for the true value of an indicator at a certain
confidence level. The reader is referred to Chapter 4, Section 4.2 for more details on sampling errors
and their calculation.
DHS survey final reports also provide an appendix on the sample design of the survey. The
sample design document reports the survey methodology used for the survey, including the aim of the
survey, the target population, the sample size, the reporting domains, the stratification and sample
allocation, sample selection procedure, sampling weight calculation, correction for non-response,
calibration of sampling weights, and the results of survey implementation. See Chapter 5, Section 5.2
for more details on sample design.
1.16 Sample documentation
The task of a sampling statistician does not end with the selection of the sample. The
preservation of sampling documentation is an essential requisite for sampling weight calculation, for
sampling error computation, for data quality evaluation, for linkage with other data sources, and for
various kinds of checks and supplementary studies. Special efforts are needed at the time of the
sample design, at the end of the fieldwork, and at the completion of the data file if the task of sample
documentation is to be carried out effectively. If preservation of documentation is delayed,
considerable effort will be required to reconstitute the missing information when it is needed.
The sample documentation must comply with the survey confidentiality requirements. When
HIV testing is conducted in a DHS or AIS (AIDS Indicator Survey), the confidentiality guidelines
require the complete destruction of all intermediate documents which can potentially be used to
identify any single household or individual who participated in the testing. This requirement reinforces
the importance of timely sample documentation. See Chapter 5 for detailed requirements in sample
documentation.
1.17 Confidentiality
The final data files for DHS surveys are made available to interested researchers. Therefore,
the confidentiality of private information collected from individual respondents is a major concern,
especially when sensitive information such as sexual activity and HIV status is collected. Protecting
the confidentiality of the individual respondent is not only an ethical obligation, but it also promotes
more accurate data because respondents are more likely to provide truthful responses if they feel
confident their information will be kept private.
DHS surveys follow strict rules imposed at various steps during the survey implementation to
prevent the direct or indirect disclosure of the identity of individual respondents. The principal pieces
of information that can indirectly identify an individual respondent are cluster number, household
number, the cluster selection probability and the sampling weights. The cluster number is an
important identifier for sampling error calculations; the household number is important for household
level and individual level data management and tabulation; the cluster selection probability is useful
for cluster level modeling; and sampling weights are necessary for all analyses. These variables
must therefore be present in the final data file. The household number in the final DHS data file is not
informative, and sampling weights are not informative after correction of non-response and
normalization. The cluster selection probability is potentially informative only if lower level
identification information such as district and locality is present, and DHS survey final data files do
not provide geographic information below the level of region or survey domain, especially when HIV
testing is conducted. Thus the only concern is the disclosure of the cluster. For DHS or AIS surveys
with HIV testing, the final data files provide scrambled cluster and household numbers as a further
safeguard against disclosure.
2 HOUSEHOLD LISTING OPERATION
2.1 Introduction
DHS surveys are nationwide sample surveys designed to provide information on the levels of
fertility, infant and child mortality, use of family planning, knowledge and attitudes toward HIV/AIDS
and other sexually transmitted infections (STI), and on other family welfare and health indicators. The
surveys generally interview women age 15-49 and men age 15-59 (15-49 or 15-54 in some surveys).
The women and men to be interviewed live in ordinary residential households which are randomly
selected from a set of sample points consisting of clusters of households. Prior to interviewing, all
households located in the selected clusters will be listed. The listing of households for each cluster will
be used in selecting the final sample of households to be included in the DHS survey.
The listing operation consists of visiting each cluster, recording on listing forms a description
of every structure together with the names of the heads of the households found in the structure, and
drawing a location map of the cluster as well as a detailed sketch map of all structures located in the
cluster. These materials will guide the interviewers to find the pre-selected households for interviewing
and will allow field work supervisors to perform quality control during data collection.
The following sections present the general guidelines for conducting a household listing
operation. Modifications may be needed to adapt to country specific situations.
2.2 Definition of terms
Following are brief definitions of the terms used in this document.
A census Enumeration Area (EA) is a geographical statistical unit created for a census and
containing a certain number of households. An EA is usually a city block in urban areas and a village, a
part of a village or a group of small villages in the rural areas with its location and boundaries well
defined and recorded on census maps.
A cluster is the smallest geographical survey statistical unit for DHS surveys. It consists of a
number of adjacent households in a geographical area. For DHS surveys, a cluster corresponds either
to an EA or a segment of a large EA.
A base map is a reference map that describes the geographical location and boundaries of an
EA.
A structure is a free-standing building or other construction that can have one or more
dwelling units for residential or commercial use. Residential structures can have one or more dwelling
units (for example: single house, apartment structure).
A dwelling unit is a room or a group of rooms normally intended as a residence for one
household (for example: a single house, an apartment, a group of rooms in a house); a dwelling unit
can also have more than one household.
A household consists of a person or a group of related or unrelated persons, who live together
in the same dwelling unit, who acknowledge one adult male or female 15 years old or older as the
head of the household, who share the same housekeeping arrangements, and are considered as one
unit. In some cases one may find a group of people living together in the same house, but each
person has separate eating arrangements; they should be counted as separate one-person
households. Collective living arrangements such as army camps, boarding schools, or prisons will not
be considered as households. Examples of households are:
• a man with his wife or his wives, with or without children
• a man with his wife or his wives, his children and his parents
• a man with his wife or his wives and his married children, living together for some social or
economic reasons (the group recognizes one person as household head)
• a widowed or divorced man or woman, with or without children
The head of household is the person who is acknowledged as such by members of the
household and who is usually responsible for the upkeep and maintenance of the household.
A location map is a map produced in the household listing operation which indicates the main
access to a cluster, including main roads and main landmarks in the cluster. Sometimes it may be
useful even to include some important landmarks in the neighboring cluster.
A sketch map is a map produced in the household listing operation that shows the location of all
structures found in the listing operation and helps the interviewer to relocate the selected
households. A sketch map also contains the cluster identification information, location information,
access information, and principal physical features and landmarks such as mountains, rivers, roads and
electric poles.
2.3 Responsibilities of the listing staff
Persons recruited to participate in the household listing operation will work in teams consisting
of two enumerators. A coordinator will monitor the entire operation.
The responsibilities of the coordinator are to:
1) obtain base maps for all the clusters included in the survey;
2) arrange for the reproduction of all listing materials (listing manuals, mapping and listing
forms); the map information forms and the household listing forms must be prepared in
sufficient numbers to cover all of the clusters to be visited;
3) assign teams to clusters;
4) monitor the receipt of the completed listing forms at the central office; and
5) verify that the quality of work is acceptable.
If GPS coordinates are being collected during the listing operation, the coordinator must also:
6) obtain one GPS receiver per listing team, plus two backup receivers, and tag each GPS
receiver with a number;
7) ensure that all GPS receivers have the correct settings (see Section 2.6 below) and
distribute a receiver to each field team;
8) obtain and copy all GPS training materials for listing staff; and
9) train all listing staff to record GPS waypoints in the GPS units as well as on Form DHS/1.
The responsibilities of the enumerators are to:
1) identify the boundaries of the cluster;
2) draw a location map showing the location of the cluster;
3) draw a detailed sketch map of the cluster showing the locations of all structures in
the cluster;
4) list all the households in the cluster in a systematic manner;
5) communicate to the coordinator problems encountered in the field and follow the coordinator’s
instructions; and
6) transfer the completed listing forms to the coordinator or to the central office.
If GPS coordinates are being collected during the listing operation, enumerators must also:
7) capture and record the GPS waypoint of the center of the cluster; and
8) complete the portion of form DHS/1 designated for GPS information for each cluster.
The two enumerators in each team should work together at the same time in the same area.
They will first identify the cluster boundaries together. Then one enumerator prepares the location and
the sketch map while the other does the household listing. The materials needed for the household
listing operation are:
• Manual for Household Listing
• Base map of the area containing the cluster
• Map Information Form (Form DHS/1)
• Household Listing Form (Form DHS/2)
• Segmentation Form (Form DHS/3)
If GPS coordinates are to be recorded during the listing operation, the following additional
materials are needed:
• GPS receivers, batteries and cables
• GPS training manuals and handouts
2.4 Locating the cluster
The coordinator will provide the listing team with a base map containing the cluster assigned
to the team. The listing team will typically make two tours of the cluster: the first to identify the
cluster boundaries and to create the location map, and the second to create the listing and draw the
sketch map. Upon arrival in a cluster, the team should first contact the local authorities for help in
identifying the boundaries and get general information on the cluster, for example, the rough number
of residential households in the cluster. In most cases, the cluster boundaries follow easily
recognizable natural features such as streams or rivers, and construction features such as roads or
railroads. Where the boundaries are not marked by visible features (especially in rural
areas), care should be taken to locate the cluster boundaries as precisely as possible according to
the detailed description of the cluster and its base map.
Before doing the listing, the team should tour the cluster to determine an efficient route of
travel for listing all of the structures. The cluster should be divided into parts if possible. A part can be
a block of structures. The listing team will make a location map of the cluster indicating the
boundaries of the parts, as well as the relative location of landmarks, public structures (e.g., schools,
religious structures, public offices and markets) and main roads. This location map will serve as a
guide for the interviewing team when they begin data collection.
2.5 Preparing location and sketch maps
The coordinator will designate one enumerator of the team as the mapper. The second
enumerator will be the lister. Although the two have separate tasks to perform, they must move
together and work in close cooperation; the mapper prepares the maps, and the lister collects
information on the structures (and corresponding households) indicated on the sketch map.
The mapping of the cluster and the listing of the households should be done in a systematic
manner so that there are no omissions or duplications. If the cluster consists of a number of blocks,
then the team should finish each block before going to the next adjacent block. Within each block,
start at one corner of the block and move clockwise around it. In rural areas where structures are
frequently found in small groups, the team should work in one group of structures at a time and in
each group they can start at the centre (choosing any landmark, such as a school, to be the centre)
and move around it clockwise.
In the first tour of the cluster, the mapper will prepare a location map of the cluster on the
Map Information Form (Form DHS/1). First, fill in the identification box for the cluster on the first
page. All information needed for filling in the identification box is provided by the coordinator. In the
space provided on the second page, draw a map showing the location of the cluster and include
instructions on how to get to the cluster. Include all useful information to find the cluster and its
boundaries directly on the map and in the space reserved for observations if necessary.
In the second tour of the cluster, using the third page of the Map Information Form, the
mapper will draw a sketch map of all structures found in the cluster, including vacant structures and
structures under construction. It is important that the mapper and lister work together and coordinate
their activities, since the structure numbers that the mapper indicates on the sketch map must
correspond to the serial numbers assigned by the lister on the listing form for the same structures.
On the sketch map, mark the starting point with a large X. Place a small square at the spot
where each structure in the cluster is located. For any non-residential structure, identify its use (for
example, a store or factory). Number all structures in sequential order beginning with "1". Whenever
there is a break in the numbering of structures (for example, when moving from one block to
another), use an arrow to indicate how the numbers proceed from one set of structures to another.
Although it may be difficult to pinpoint the exact location of the structure on the map, even an
approximate location is useful for finding the structure in the future. Add to the sketch map all
landmarks (such as a park), public structures (such as a school or church), and streets or roads.
Sometimes it is useful to add to the sketch map landmarks that are found outside the cluster
boundaries, if they are helpful in identifying other structures inside the cluster.
Use the marker or chalk provided to write on the entrance to the structure the number that
has been assigned to the structure. Remember that this is the serial number of the structure as
assigned on the household listing form, which is the same as the number indicated on the sketch map.
In order to distinguish the number from other numbers that may exist already on the door of the
structure, write “DHS” in front of the number, for example, for the structure number 5, write “DHS/5,”
similarly on the door of structure number 44 write “DHS/44.”
A structure is called a multi-unit structure if it contains more than one household.
Otherwise it is called a single-unit structure. All households found in a single-unit or multi-unit
structure must be numbered from 1 to m within the structure [6]. The structure number plus the
household number form a unique identification number for each household in the cluster. For example,
household number 3 in structure number 44 would be uniquely identified
with ID number DHS/44-3. It is very useful to write the household ID number at the entrance of the
household to help the interviewer later identify the household for interview.
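The numbering rule can be expressed in a few lines of Python. This is only an illustrative sketch: the function name and the example listing are hypothetical, and the ID format follows the DHS/44-3 example given in the text.

```python
def household_ids(households_per_structure):
    """Build IDs of the form 'DHS/<structure>-<household>' for a cluster.

    households_per_structure maps a structure number to the number of
    households (m) found in that structure.
    """
    ids = []
    for structure, m in sorted(households_per_structure.items()):
        # Households are numbered from 1 to m within each structure.
        for household in range(1, m + 1):
            ids.append(f"DHS/{structure}-{household}")
    return ids


# Hypothetical listing: structure 5 has one household, structure 44 has three.
listing = {5: 1, 44: 3}
ids = household_ids(listing)
print(ids)
```

Because the household number restarts at 1 in every structure, only the combination of structure number and household number is unique within the cluster.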
2.6 Collecting a GPS waypoint for each cluster
A GPS waypoint is a latitude and longitude reading that represents a location. For some
surveys, GPS data for EAs are available from the census. However, if the data are not available, or are
of questionable quality, one GPS waypoint for each cluster should be recorded during the listing phase
of the survey. These waypoints are recorded using a GPS unit (a Garmin ETREX unit is used in this
guide) and data collection forms. If GPS units other than the Garmin ETREX are used, this guide will
still be useful; however, some of the instructions may not apply due to differences in design and
menus. The Garmin ETREX owner’s manual may be useful to consult on the basics of the GPS unit.
Take one reading for each cluster. The GPS waypoints will be captured by the mapper
while mapping the clusters. One GPS waypoint must be taken for each cluster, and in the case of
large clusters which are being segmented, one point should be taken for each segment selected for
listing. In DHS surveys, clusters are usually census EAs, sometimes villages in rural areas or city
blocks in urban areas. Collecting only one waypoint for the cluster greatly reduces the chance of
compromising confidentiality of the respondents and at the same time is sufficient to allow for the
integration of multiple datasets for further analysis. The DHS cluster waypoint should always be taken
at the geographic center of the cluster or segment. If the cluster is segmented, the point should be
taken for the segment chosen by the Mapping and Listing Coordinator to be included in the survey.
Save the waypoint and record the latitude, longitude, and altitude. The latitude,
longitude, and altitude readings for a location are stored in two places: in the GPS unit’s memory and
on the DHS/1 paper form. GPS units can be broken or lost, and experience has shown that a hardcopy
backup is essential. In addition, the paper form provides a backup should the data in the GPS unit be
changed, deleted, or misidentified (i.e., the operator names the cluster incorrectly in the unit). Each
position saved in the GPS unit is called a waypoint, and each waypoint has a unique name. If possible,
the waypoint ID should be the same as the DHS cluster number. If it is not possible, the waypoint ID
should be unique to the cluster and recorded on Form DHS/1 (do not record the same waypoint ID for
two different clusters). When a waypoint is saved, the GPS unit assigns it a default name. The mapper
must edit the default name and change it to the 6-digit DHS cluster ID number. For example, the
waypoint for DHS cluster 101 would be named “000101”. Cluster 1101 would be named “001101”.
After saving the waypoint, the mapper will use the identification box of the Map Information Form
(Form DHS/1) to record the latitude, longitude, and altitude for the cluster and segment on paper.
First, the mapper will write down the latitude and longitude coordinates in decimal degree format and
altitude in meters in the Identification Box on the “Location Map Cluster” Form (DHS/1). Second, the
mapper will draw a circle, in the middle of the cluster/segment, at the location where he/she captured
the waypoint.
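The naming rule above (a 6-digit, zero-padded cluster number) can be sketched in a few lines of Python. This is only an illustration for office staff who check downloaded waypoint files; the function name `waypoint_name` is hypothetical and is not part of any DHS tool.

```python
def waypoint_name(cluster_number: int) -> str:
    """Return the 6-digit waypoint ID for a DHS cluster number.

    Hypothetical helper: pads the cluster number with leading zeros,
    as in the examples "000101" and "001101" in the text.
    """
    if not 0 < cluster_number <= 999999:
        raise ValueError("cluster number must fit in 6 digits")
    return f"{cluster_number:06d}"


if __name__ == "__main__":
    print(waypoint_name(101))   # cluster 101
    print(waypoint_name(1101))  # cluster 1101
```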
After the listing is complete, the GPS units must be collected as soon as possible and returned
to the sampling office by the Mapping and Listing Coordinator. The waypoints will then be downloaded
and examined for problems by the designated sampling staff. The Sampling Coordinator should
designate one member of the Data Processing Team to receive and process the GPS waypoint file and
then give the file to the survey manager.
6 This number is different from the household number later given to all of the households listed in the whole cluster
just prior to household selection.
In most situations, the Mapping and Listing Coordinator will be responsible for providing the
listing teams with a GPS unit prior to the listing. Before these units are distributed they should be set
up for use by the listers. For DHS surveys, the only format which is acceptable is Decimal Degrees,
regardless of what geographic standards may be in use for other purposes. To set the format, enter
the SETUP menu and in the UNITS sub-menu, select the item POSITION FRMT and press the ENTER
button. Select “hddd.ddddd” Decimal Degrees, which is the first item. Once “hddd.ddddd” is
highlighted, press the ENTER button. It is important that all the GPS units be set up in the same way
so that the waypoints returned at the end of the survey are all in the same format. For more details
on how to properly prepare the GPS units for waypoint collection, please refer to the DHS Manual for
GPS Data Collection.
2.7 Listing of households
The lister will use the Household Listing Form (Form DHS/2) to record all households found in
the cluster. Begin by entering the identification information for the cluster. The first two columns are
reserved for office use only—leave them blank.
Complete the rest of the form as follows:
Column (1) [Serial Number of Structure]: For each structure, record the same structure serial
number that the mapper enters on the sketch map. All the structures recorded on the sketch
map (except the landmarks) must be recorded on the listing form and numbered.
Column (2) [Address/description of Structure]: Record the street address of the structure.
Where structures do not have visible street addresses (especially in rural areas), give a
description of the structure and any details that help in locating it (for example, in front of the
school, next to the store, etc.).
Column (3) [Residence Y/N]: Indicate whether the structure is used for residential purposes
(eating and sleeping) by writing Y for “Yes”. In cases where a structure is used for commercial
or other purposes, write N for “No”. Structures used both for residential and commercial
purposes (for example, a combination of store and home) should be classified as residential
(i.e. mark Y in column 3). Make sure to list any household unit found in a nonresidential
structure (for example, a guard living inside a factory or in a church). Also do not forget to list
vacant structures and structures under construction, and in Column (6) give some explanation
(for example: vacant, under construction, etc.). All structures seen in the cluster should be
recorded on the sketch map of the cluster and in the listing.
Column (4) [Serial Number of Household in Structure]: This is the serial number assigned to
each household found in the structure; there can be more than one household in a structure.
The first household in the structure will always have number “1”. If there is a second
household in the structure, then this household should be recorded on the next line, a “2” is
recorded in Column (4), and Columns (1) to (3) repeat the structure number and address or
are left blank.
Column (5) [Name of Head of Household]: Write the name of the head of the household.
There can only be one head per household. If no one is home or the household refuses to
cooperate, ask neighbors for the name of the head of the household. If a name cannot be
determined, leave this column blank. Note that it is not the name of the landlord or owner of
the structure that is needed, but the name of the head of the household that lives there.
Column (6) [Observations/Occupied or not]: This space is provided for any special remarks
that might help the coordinator decide whether to include a household in the household
selection or not, and might also help the interviewing team locate the structure or identify the
household during the main survey fieldwork.
If the structure is an apartment block or block of flats, assign one serial number to the entire
structure (only one square with one number appears on the sketch map), but complete Columns (2)
through (6) for each apartment in the structure individually. Each apartment should have its own
address, which is the apartment number within the structure.
The listing team should be careful to locate hidden structures. In some areas, structures may
have been built so haphazardly that they are easily missed. In rural areas, structures may be hidden
by tall grasses and trees. If there is a pathway leading from the listed structure, check to see if the
pathway goes to another structure. Talking with people living in the area may help in identifying the
hidden structures.
2.8 Segmentation of large clusters
A certain number of the selected EAs may be very large in population size. A complete listing
of EAs that are very large may not be feasible for the survey. These EAs should be subdivided into
several smaller segments, only one of which will be included in the survey and listed. In this case, the
DHS cluster corresponds to a segment of an EA. When the team arrives in a large EA that may need
segmentation, it should first tour the EA and make a quick count to get the estimated number of
households residing in the EA. There is no standard threshold for the size of an EA that needs to be
segmented, or for segment size. But for efficiency and accuracy considerations, DHS recommends that
if the EA size is bigger than 300 households, then the team should communicate to the coordinator
the cluster number, the estimated number of households and the suggested number of segments to
be created. The final decision to segment an EA, and the number of segments to be created, can only
be taken by the coordinator. Ideally, for ease of operation, an EA would only need to be divided into
2 segments, with an ideal segment size of 150-200 households in each segment. To minimize errors,
dividing an EA into a large number of segments (more than 3) should be avoided unless it is really
necessary.
In dividing an EA into segments, the ideal would be to have segments of approximately equal
size, but it is also important to adopt segment boundaries that are easily identifiable. In the first tour
of the cluster draw a location map of the entire cluster. Using identifiable boundaries such as roads,
streams, and electric power lines, divide the EA into the designated number of roughly equal-sized
segments. On the location map of the EA, show clearly the boundaries of the segments created.
Number the segments sequentially. Estimate the relative size of each segment in the following
manner: quickly count the number of dwellings in each segment, add up the total number of dwellings
in the EA and calculate the proportion of the dwellings in the whole EA that are located in each
segment.
Example 2.1: A cluster of 620 dwellings has been divided into 3 segments and the results are as
follows:
Segment 1: 220 dwellings, or 220/620 = 35 percent
Segment 2: 190 dwellings, or 190/620 = 31 percent
Segment 3: 210 dwellings, or 210/620 = 34 percent
Total: 620 dwellings, or 620/620 = 100 percent
On Form DHS/3 (Segmentation Form) write the size of the segments in the appropriate
columns (number and percent) and calculate the cumulative size of all of the segments in terms of a
percentage. The cumulative size of the last segment on the list must be equal to 100.
Segment number    Number of dwellings    Percent    Cumulative percent
1                 220                    35         35
2                 190                    31         66
3                 210                    34         100
For each large EA to be segmented, a random number between 0 and 100 will be selected in
the central office and included in the file. Compare this random number with the cumulative size.
Select the first segment for which the cumulative size is greater than or equal to the random number.
Random number: 67
Segment selected: Segment number 3
Proceed with the household listing operation in segment number 3 as described in the above
sections (see Appendix 2.3 for an example of how to complete the segmentation form). Draw a
detailed sketch map of the selected segment and list all the households found in the selected
segment.
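The segment selection rule can be sketched in Python as follows. This is a minimal illustration, not the DHS office procedure itself: the function names are hypothetical, and the cumulative percentages are obtained by rounding the cumulative dwelling counts (rather than by adding rounded percentages), an assumption made here so that the last cumulative value is always exactly 100.

```python
def cumulative_percents(dwellings):
    """Cumulative size of each segment as a whole-number percentage.

    Assumption: percentages are rounded from cumulative counts, so the
    last value is always 100 as required on Form DHS/3.
    """
    total = sum(dwellings)
    running = 0
    result = []
    for count in dwellings:
        running += count
        result.append(round(100 * running / total))
    return result


def select_segment(dwellings, random_number):
    """Return the number (1-based) of the first segment whose cumulative
    percentage is greater than or equal to the random number."""
    for segment, cumulative in enumerate(cumulative_percents(dwellings), start=1):
        if cumulative >= random_number:
            return segment
    raise ValueError("random number must be between 0 and 100")


if __name__ == "__main__":
    sizes = [220, 190, 210]            # Example 2.1
    print(cumulative_percents(sizes))  # cumulative percent column
    print(select_segment(sizes, 67))   # random number from the example
```

With the dwelling counts of Example 2.1 and the random number 67, the sketch selects segment 3, in agreement with the worked example.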
2.9 Quality control
To ensure that the work done by each listing team is acceptable, quality checks should be
performed. The coordinator should tour the regions during the household listing operation, and assess
the quality of the finished clusters. The coordinator should select a finished cluster and do an
independent listing of 10 percent of the cluster. If important errors are found, the whole cluster should
be relisted. If the problem is related to systematic errors, and it is not possible to do corrections on
the listing forms, then all of the listed clusters should be relisted.
2.10 Prepare the household listing forms for household selection
Once the central office receives the completed listing materials for a cluster, they must first
assign a serial number to all of the households in the cluster in the second column of the form DHS/2.
Only occupied residential households (including households that refused to cooperate at the time of
listing and households where the occupants were absent at the time of listing but would return shortly
and would be at home during the period of household interview) will be numbered. This is a
continuous serial number from 1 to the total number of occupied residential households listed in the
cluster. Leave the cell in the second column blank if the household is not occupied, or if the structure
is not a residential structure. Fill in the second column only if the structure on that row is an occupied
household. Make sure that the numbering of all occupied households follows sequentially from the
previous occupied household on the list, with no gaps or repetitions in the numbering. See the
example of a completed listing form in Appendix 2.3.
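Where the listing has been keyed into a computer file, the numbering rule can be checked with a short script. The sketch below is illustrative only: the row layout and the field name `occupied` (true when the row holds an occupied residential household, including refusals and temporarily absent households) are assumptions and do not correspond to an actual DHS file format.

```python
def number_households(rows):
    """Assign continuous serial numbers to occupied households.

    rows: list of dicts with a boolean key "occupied" (hypothetical field).
    Returns (numbered_rows, total); rows that are not occupied households
    get serial None, which corresponds to leaving the cell blank.
    """
    serial = 0
    numbered = []
    for row in rows:
        if row["occupied"]:
            serial += 1
            numbered.append({**row, "household_serial": serial})
        else:
            numbered.append({**row, "household_serial": None})
    return numbered, serial


if __name__ == "__main__":
    listing = [
        {"structure": 1, "occupied": True},
        {"structure": 2, "occupied": False},  # vacant structure
        {"structure": 3, "occupied": True},
        {"structure": 3, "occupied": True},   # second household in structure 3
    ]
    numbered, total = number_households(listing)
    print([r["household_serial"] for r in numbered], total)
```

The returned total is the figure to copy into the "Number of households listed" column.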
After assigning the serial numbers to all households listed in the cluster, copy the total
number of households listed to the column “Number of households listed” in the Excel file prepared for
household selection. Make sure this number is recorded in the correct row for the cluster number. In
the column “Segmentation information” record the percentage of the entire EA population that is
included in the selected segment. The segmentation information is important for correctly calculating
the sampling weights. After the total number of households listed in the cluster has been entered in
the Excel file, the spreadsheet automatically generates the household numbers of those households
selected to be interviewed. Copy the numbers of the selected households to the first column of the
form DHS/2, corresponding to the serial number of the households in the listing form. These are the
households that must be interviewed. It is recommended to use a different colored pen on the listing
forms to indicate the households selected for interviewing. It is also very helpful to use color on the
cluster’s sketch map to mark the structures where the selected households are located.
In many surveys, a sub-sample of households will be selected for the men’s survey. The
household selection spreadsheet uses shaded columns to indicate which households are selected for
the men’s survey. Put a mark in the first column on the form DHS/2 next to the number of the
selected household to indicate the households selected for the men’s survey, or use a different colored
pen for the households selected for both men’s and women’s surveys. Make a copy of the whole
package of files (sketch maps and the listing forms with household selection). Give the original to the
interviewing team for the household interview and keep the other copy in the central office.
Appendix 2.1 Example listing forms
Form DHS/1 PAGE 1 of 3
Map Information Form
Identification Label Code
Locality ___________________________________________
DHS Cluster Number ..........................................................
Urban/Rural (Urban=1/Rural=2) .........................................
EA Number ......................................................................
District ____________________________________________
Region ____________________________________________
Name of Mapper ____________________________________
Name of Lister ______________________________________
GPS Unit Tracking Number .................................................
Waypoint name (entered in GPS unit) ..................................
Latitude (North/South) .............................................. N / S
Longitude (East/West) .............................................. E / W
Altitude / Elevation (Meters) ...............................................
Observations:
Road access _____________________________________________________________
________________________________________________________________________
________________________________________________________________________
Other useful information ____________________________________________________
________________________________________________________________________
________________________________________________________________________
Form DHS/1 Map Information Form PAGE 2 of 3
Locality __________________________ District ___________________________
Location map DHS Cluster:
Form DHS/1 Map Information Form PAGE 3 of 3
Locality __________________________ District ___________________________
Sketch map of cluster DHS Cluster:
Form DHS/3 Segmentation Form
Identification Label Code
Locality ___________________________________________
DHS Cluster Number ..........................................................
Urban/Rural (Urban=1/Rural=2) .........................................
EA Number ......................................................................
District ____________________________________________
Region ____________________________________________
Name of Mapper ____________________________________
Name of Lister ______________________________________
Number of segments:
Segment number Number of households Percent Cumulative percent
1
2
3
4
5
Random number: ____________________
Segment selected: ___________________
Appendix 2.2 Symbols for mapping and listing
Orientation to the North
Boundaries of the cluster
Paved road
Unpaved (dirt) road
Footpath
River, creek, etc.
Bridge
Lake, pond, etc.
Mountains, hills
Water point (wells, fountain, etc.)
Market
School
Administrative structure
Church, temple
Mosque
Cemetery
Residential structure
Non-residential structure
Vacant structure
Hospital, clinic, etc.
Electric pole
Tree or bush
Appendix 2.3 Examples of completed mapping and listing forms
3 SELECTED SAMPLING TECHNIQUES
In this section, some of the most commonly used sampling techniques and their application
are presented. The presentation will focus mainly on practical rather than theoretical aspects.
However, the chapter does touch on some basic theoretical properties of the techniques used in the
DHS surveys.
We focus on sampling without replacement rather than sampling with replacement, since the
latter is less efficient for samples of a fixed size: a sampling unit may be selected more than once,
and each repeated selection reduces the amount of information carried by the sample. Readers who are
interested in the theoretical aspects of the selected sampling techniques should refer to the textbooks
dealing with survey sampling theory listed in the references.
3.1 Simple random sampling
We begin with simple random sampling without replacement (SRSWOR), since it is the
fundamental sampling procedure used as the standard against which the efficiency of other sampling
procedures is compared. Simple random sampling without replacement is a selection procedure where
every unit has an equal chance of being selected. Selection can be performed through successive
draws without replacement from a well-mixed container containing all sampling units, or using certain
computerized algorithms to select from a list of all sampling units.
Let N be the total number of sampling units, let n be the total sample size, n<N. The
probability of selection for every i-th unit is given by:

$$P_i = \frac{n}{N}$$

The design weight (assuming no non-response) is given by:

$$D_i = 1/P_i = \frac{N}{n}$$

The probability for any particular n different units selected together in a sample s is given by:

$$P_s = 1 \Big/ \binom{N}{n}$$

where $\binom{N}{n}$ is the total number of combinations of n elements out of N. Let
$y_1, y_2, \ldots, y_n$ be the observations made from the selected units on a variable of interest,
then the weighted sample mean which is the same as the unweighted sample mean,

$$\bar{y} = \sum_{i=1}^{n} D_i y_i \Big/ \sum_{i=1}^{n} D_i = \frac{1}{n} \sum_{i=1}^{n} y_i,$$

is an unbiased estimator of the population mean, $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} y_i$, with its
sampling variance given by

$$V_{srs}(\bar{y}) = \frac{1-f}{n} S_y^2$$

where $S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2$ is the finite population variance of the
variable y and f=n/N is the sampling fraction. An unbiased estimation of this variance can be made using

$$\upsilon_{srs}(\bar{y}) = \frac{1-f}{n} s_y^2$$

where $s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the sample variance. When n and N are
large, the standardized variable $(\bar{y} - \bar{Y}) / SE(\bar{y})$ follows a student-t distribution
with n-1 degrees of freedom and $SE(\bar{y})$ is the square root of $\upsilon_{srs}(\bar{y})$.
Therefore the confidence limits of the population mean $\bar{Y}$ can be constructed based on sample
observations allowing for 95% confidence that the true value of $\bar{Y}$ will lie within the range of
$\bar{y} - 1.96 * SE(\bar{y})$ and $\bar{y} + 1.96 * SE(\bar{y})$. DHS reports use
$\bar{y} \pm 2 * SE(\bar{y})$ for a conservative estimate of 95% confidence limits.
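As an illustration of the formulas above, the following Python sketch computes the sample mean, its estimated variance with the finite population correction, the standard error and the confidence limits. The function name and the returned dictionary are hypothetical choices made for this example; the multiplier of 2 follows the convention used in DHS reports.

```python
import math


def srs_estimate(sample, N, multiplier=2.0):
    """Estimate the population mean from a SRSWOR sample.

    sample: observed values y_1, ..., y_n; N: number of units in the frame.
    multiplier: 2 as in DHS reports (use 1.96 for the usual 95% limits).
    """
    n = len(sample)
    if not 1 < n <= N:
        raise ValueError("need 1 < n <= N")
    f = n / N                                             # sampling fraction
    y_bar = sum(sample) / n                               # sample mean
    s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)  # sample variance
    variance = (1 - f) / n * s2                           # estimated variance of the mean
    se = math.sqrt(variance)
    return {
        "mean": y_bar,
        "variance": variance,
        "se": se,
        "lower": y_bar - multiplier * se,
        "upper": y_bar + multiplier * se,
    }


if __name__ == "__main__":
    print(srs_estimate([2, 4, 6, 8], N=20))
```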
Given a complete list of all sampling units in a computerized file, the easiest way to draw a
simple random sample of size n is to first generate a uniformly distributed random number between 0
and 1 for each sampling unit. Next, sort the file in ascending order of the generated random
numbers; the first n units, which are those with the n smallest random numbers, are the selected
units. This procedure provides a SRSWOR sample of size n. It is easy to implement, but requires
sorting the sampling frame. Since sorting is time consuming, the following algorithm (Tillé, 2001)
may be used to select from the sampling frame without sorting:
Definition of terms and the initial step
k: the k-th unit of the frame file; j: the j-th selected unit; u: a random number
k = 0
j = 0
repeat while j < n
    generate a uniformly distributed random number u in [0,1)
    if u < (n - j)/(N - k) then
        unit k + 1 is selected; j = j + 1
    else
        unit k + 1 is not selected
    k = k + 1
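Both selection procedures described above can be sketched in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, units are identified by their position 1 to N in the frame, and a new random number is drawn for every unit examined in the sequential version.

```python
import random


def srswor_by_sorting(N, n, rng=random):
    """Attach a random number to each unit, sort, and keep the first n."""
    keyed = sorted((rng.random(), unit) for unit in range(1, N + 1))
    return sorted(unit for _, unit in keyed[:n])


def srswor_sequential(N, n, rng=random):
    """Select n of N units in one pass through the frame, without sorting."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    selected = []
    k = 0  # number of units examined so far
    j = 0  # number of units selected so far
    while j < n:
        u = rng.random()  # uniform on [0, 1)
        # Select unit k + 1 with probability (n - j) / (N - k).
        if u < (n - j) / (N - k):
            selected.append(k + 1)
            j += 1
        k += 1
    return selected


if __name__ == "__main__":
    print(srswor_by_sorting(100, 14))
    print(srswor_sequential(100, 14))
```

The sequential version always ends with exactly n units: once the number of units still needed equals the number of units remaining, the selection probability is 1.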
3.2 Equal probability systematic sampling
3.2.1 Sampling theory
Systematic sampling (SYS) is the selection of sampling units at a fixed interval from a list,
starting from a randomly determined point. Selection is systematic because selection of the first
sampling unit determines the selection of the remaining sampling units. Compared with SRSWOR,
systematic sampling has the following advantages:
1) It is easier to perform;
2) It allows easy verification of the selection;
3) If the sampling frame is in some order, it provides a stratification effect with respect to the
variables on which the frame is sorted, and with a proportional allocation. This
stratification is called implicit stratification.
4) Implicit stratification prevents unexpected concentration of sample points in certain areas
such as is possible with SRSWOR.
Because of these advantages, especially (3) and (4), systematic selection is more often used
than simple random sampling.
Systematic sampling is normally carried out as follows. Assume a whole number interval
I=N/n, where N is the number of units in the frame list and n is the number of units to be selected.
The procedure begins with a random integer S between 1 and I. The units to be
selected are S, S+I, S+2*I, ..., S+(n-1)*I. When I is not a whole number, rounding it to the nearest
whole number may introduce appreciable errors, so it is suggested that the decimal interval method be
used. Selection with a decimal interval may be carried out as follows:
1) Calculate the interval I rounded to two decimal places.
2) Generate a random number R between 0 and 1 with two decimal places.
3) Compute the sequence of sampling numbers: R*I, R*I + I, R*I + 2*I, ..., R*I + (n - 1)*I
4) Round up the above calculated sampling numbers to the next highest whole numbers;
these are the selected units’ numbers.
Example 3.2.1:
Let N=100, n=14, so that I=7.14; let the generated random number be R=0.96. The sampling
numbers and the corresponding selected unit numbers are as follows:
Sampling numbers: 6.85 13.99 21.13 28.27 35.41 42.55 49.69 56.83 63.97 71.11 78.25 85.39 92.53 99.67
Selected units:   7 14 22 29 36 43 50 57 64 72 79 86 93 100
In this example, the decimal interval method gives a selection interval which is sometimes 7
or sometimes 8. The household selection templates are all programmed with decimal sampling
intervals.
Often a sample design requires many systematic samples, as is the case when a systematic
sample of households is needed within each selected cluster. In this situation a separate random start
R should be determined independently for each cluster.
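The decimal interval method can be sketched in Python as follows. This is an illustration only and not the code of the DHS household selection templates. Several choices are assumptions: the function name is hypothetical, exact decimal arithmetic is used to avoid floating-point surprises when rounding up, a random R is drawn from 0.01 to 1.00 so that the first unit number is at least 1, and an error is raised if rounding the interval pushes a unit number beyond N.

```python
import random
from decimal import Decimal, ROUND_CEILING, ROUND_HALF_UP


def decimal_interval_sample(N, n, R=None, rng=random):
    """Systematic selection of n units out of N with a decimal interval.

    R: random start as a string or Decimal with two decimal places
       (for example "0.96"); drawn at random when omitted.
    Returns (interval, list of selected unit numbers).
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    # Step 1: interval rounded to two decimal places.
    interval = (Decimal(N) / Decimal(n)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    # Step 2: random start (assumption: 0.00 is excluded).
    if R is None:
        R = Decimal(rng.randint(1, 100)) / 100
    start = Decimal(R) * interval
    units = []
    for k in range(n):
        # Step 3: sampling number; step 4: round up to a whole number.
        number = start + k * interval
        unit = int(number.to_integral_value(rounding=ROUND_CEILING))
        if unit > N:
            # Assumption: treated as an error rather than silently adjusted.
            raise ValueError("sampling number exceeds the frame size")
        units.append(unit)
    return interval, units


if __name__ == "__main__":
    interval, units = decimal_interval_sample(100, 14, "0.96")
    print(interval, units)  # Example 3.2.1
```

With N=100, n=14 and R=0.96 the sketch gives the interval 7.14 and the same selected unit numbers as Example 3.2.1.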
With SYS, the probability of selection for any unit i is given by

$$P_i = \frac{1}{I} = \frac{n}{N}$$

The design weight (assuming no non-response) is given by

$$D_i = 1/P_i = \frac{N}{n}$$

Let $y_1, y_2, \ldots, y_n$ be the observations made from the selected units on a variable of interest,
then the weighted sample mean which is the same as the unweighted sample mean

$$\bar{y} = \sum_{i=1}^{n} D_i y_i \Big/ \sum_{i=1}^{n} D_i = \frac{1}{n} \sum_{i=1}^{n} y_i$$