This document is part of the Demographic and Health Surveys (DHS) Toolkit of methodology

for the MEASURE DHS Phase III project, implemented from 2008-2013.

This publication was produced for review by the United States Agency for International

Development (USAID). It was prepared by MEASURE DHS/ICF International.

SAMPLING AND HOUSEHOLD

LISTING MANUAL

Demographic and Health

Surveys Methodology

Demographic and Health Survey

Sampling and Household Listing Manual

ICF International

Calverton, Maryland USA

September 2012

MEASURE DHS is a five-year project to assist institutions in collecting and analyzing data needed to

plan, monitor, and evaluate population, health, and nutrition programs. MEASURE DHS is funded by the

U.S. Agency for International Development (USAID). The project is implemented by ICF International in

Calverton, Maryland, in partnership with the Johns Hopkins Bloomberg School of Public Health/Center

for Communication Programs, the Program for Appropriate Technology in Health (PATH), Futures

Institute, Camris International, and Blue Raster.

The main objectives of the MEASURE DHS program are to: 1) provide improved information through

appropriate data collection, analysis, and evaluation; 2) improve coordination and partnerships in data

collection at the international and country levels; 3) increase host-country institutionalization of data

collection capacity; 4) improve data collection and analysis tools and methodologies; and 5) improve the

dissemination and utilization of data.

For information about the Demographic and Health Surveys (DHS) program, write to DHS, ICF

International, 11785 Beltsville Drive, Suite 300, Calverton, MD 20705, U.S.A. (Telephone: 301-572-

0200; fax: 301-572-0999; e-mail: info@measuredhs.com; Internet: http://www.measuredhs.com).

Recommended citation:

ICF International. 2012. Demographic and Health Survey Sampling and Household Listing Manual.

MEASURE DHS, Calverton, Maryland, U.S.A.: ICF International

TABLE OF CONTENTS

TABLES AND FIGURES
1 DEMOGRAPHIC AND HEALTH SURVEYS SAMPLING POLICY
  1.1 General principles
    1.1.1 Existing sampling frame
    1.1.2 Full coverage
    1.1.3 Probability sampling
    1.1.4 Suitable sample size
    1.1.5 Simple design
    1.1.6 Household listing and pre-selection of households
    1.1.7 Good sample documentation
    1.1.8 Confidentiality
    1.1.9 Exactness of survey implementation
  1.2 Survey objectives and target population
  1.3 Survey domain
  1.4 Sampling frame
    1.4.1 Conventional sampling frame
    1.4.2 Alternative sampling frames
    1.4.3 Evaluation of the sampling frame
  1.5 Stratification
  1.6 Sample size
    1.6.1 Sample size and sampling errors
    1.6.2 Sample size determination
  1.7 Sample allocation
  1.8 Two-stage cluster sampling procedure
  1.9 Sample “take” per cluster
    1.9.1 Optimum sample take
    1.9.2 Variable sample take for self-weighting
  1.10 Household listing
  1.11 Household selection in the central office
  1.12 Household interviews
  1.13 Sampling weight calculation
    1.13.1 Why we need to weight the survey data
    1.13.2 Design weights and sampling weights
    1.13.3 How to calculate the design weights
    1.13.4 Correction of unit non-response and calculation of sampling weights
    1.13.5 Normalization of sampling weights
    1.13.6 Standard weights for HIV testing
    1.13.7 De-normalization of standard weights for pooled data
  1.14 Calibration of sampling weights in case of bias
  1.15 Data quality and sampling error reporting
  1.16 Sample documentation
  1.17 Confidentiality
2 HOUSEHOLD LISTING OPERATION
  2.1 Introduction
  2.2 Definition of terms
  2.3 Responsibilities of the listing staff
  2.4 Locating the cluster
  2.5 Preparing location and sketch maps
  2.6 Collecting a GPS waypoint for each cluster
  2.7 Listing of households
  2.8 Segmentation of large clusters
  2.9 Quality control
  2.10 Prepare the household listing forms for household selection
  Appendix 2.1 Example listing forms
  Appendix 2.2 Symbols for mapping and listing
  Appendix 2.3 Examples of completed mapping and listing forms
3 SELECTED SAMPLING TECHNIQUES
  3.1 Simple random sampling
  3.2 Equal probability systematic sampling
    3.2.1 Sampling theory
    3.2.2 Excel templates for systematic sampling
  3.3 Probability proportional to size sampling
    3.3.1 Sampling theory
    3.3.2 Operational description and examples
  3.4 Complex sampling procedures
4 SURVEY ERRORS
  4.1 Errors of coverage and non-response
    4.1.1 Coverage errors
    4.1.2 Deliberate restrictions of coverage
    4.1.3 Non-response
    4.1.4 Response rates
  4.2 Sampling errors
5 SAMPLE DOCUMENTATION
  5.1 Introduction
  5.2 Sample design document
    5.2.1 Introduction
    5.2.2 Sampling frame
    5.2.3 Structure of the sample and the sampling procedure
    5.2.4 Selection probability and sampling weight
  5.3 Sample file
  5.4 Results of survey implementation
  5.5 Sampling errors
  5.6 Sampling parameters in DHS data files
Glossary of terms
References

TABLES AND FIGURES

Table 1.1 Sample size determination for estimating current use of a modern contraceptive method among currently married women
Table 1.2 Sample size determination for estimating the prevalence of full vaccination coverage among children aged 12-23 months
Table 1.3 Sample allocation: Proportional allocation
Table 1.4 Sample allocation: Power allocation
Table 1.5 Optimal sample take for currently married women 15-49 currently using any contraceptive method based on intracluster correlation ρ and survey cost ratio c1/c2 from past surveys
Table 5.1 Distribution of EAs and average size of EA by region and by type of residence
Table 5.2 Distribution of households by region and by type of residence
Table 5.3 Sample allocation of clusters and households by region and by type of residence
Table 5.4 Expected number of interviews by region and by type of residence
Table 5.5 An example sample file
Table 5.6 Example table for the results of survey implementation
Table 5.7 Example appendix table for the results of the women’s survey implementation
Table 5.8 Example appendix table for the results of the men’s survey implementation
Table 5.9 Example table for sampling errors

Figure 3.1 Simple household selection with a sub-sample
Figure 3.2 Selection of runs with a sub-sample
Figure 3.3 Simple self-weighting selection without sample size control
Figure 3.4 Self-weighting selection with runs and without sample size control
Figure 3.5 Self-weighting selection with sample size control
Figure 3.6 Self-weighting selection with runs and with sample size control
Figure 3.7 Manual household selection in the field
Figure 3.8 Part of an Excel template for stratified sampling
Figure 3.9 Part of an example for a province crossed urban-rural stratified PPS sampling
Figure 3.10 Part of an example sample file from a stratified PPS sampling

1 DEMOGRAPHIC AND HEALTH SURVEYS SAMPLING POLICY

1.1 General principles

Scientific sample surveys are cost-efficient and reliable ways to collect population-level

information such as social, demographic and health data. The MEASURE DHS project is a worldwide

project implemented across various countries and at multiple points in time within a country. In order

to achieve comparability, consistency and the best quality in survey results, sampling activities in

the Demographic and Health Surveys (DHS) should be guided by a number of general principles. This

manual presents general guidelines on sampling for DHS surveys, although modifications may be

required for country-specific situations. The key principles of DHS sampling include:

• Use of an existing sampling frame

• Full coverage of the target population

• Probability sampling

• Using a suitable sample size

• Using the simplest design possible

• Conducting a household listing and pre-selection of households

• Providing good sample documentation

• Maintaining confidentiality of individuals’ information

• Implementing the sample exactly as designed

1.1.1 Existing sampling frame

A probability sample can only be drawn from an existing sampling frame, which is a complete

list of statistical units covering the target population. Since the construction of a new sampling frame

is likely to be too expensive, DHS surveys should use an adequate pre-existing sampling frame which

is officially recognized. This is possible for most of the countries where there has been a population

census in recent years. Census frames are generally the best available sampling frame in terms of

coverage, cartographic materials and organization. However, an evaluation of the quality and the

accessibility of the frame should be considered during the development of the survey design, and a

detailed study of the sampling frame is necessary before drawing the sample. In the absence of a

census frame, a DHS survey can use an alternative sampling frame, such as a complete list of villages

or communities in the country with all necessary identification information including a measure of

population size (e.g. number of households), or a master sample which is large enough to support the

DHS design.

1.1.2 Full coverage

A DHS survey should cover 100 percent of the target population in the country. The target

population for the DHS survey is all women age 15-49 and children under five years of age living in

residential households. Most surveys also include all men age 15-59 (the age range varies from survey to survey and may be 15-49, 15-54, 15-59 or 15-64). The target population may vary from country to country or from survey to survey, but the general sampling principles are the same. In some cases, exclusion of some areas may be necessary because of extreme inaccessibility, violence or instability, but these issues need to be considered at the very beginning of the survey, before the sample is drawn.

1.1.3 Probability sampling

A scientific probability sampling methodology must be used in DHS surveys. A probability

sample is defined as one in which the units are selected randomly with known and nonzero

probabilities. This is the only way to obtain unbiased estimation and to be able to evaluate the

sampling errors. The term probability sampling excludes purposive sampling, quota sampling, and

other uncontrolled non-probability methods because they cannot provide evaluation of precision

and/or confidence of survey findings.
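The role of known, nonzero selection probabilities can be illustrated with a small simulation. The sketch below is not part of the DHS procedure: the unit values, the probabilities, and the independent (Poisson-type) draw are assumptions chosen only for illustration. Each selected unit is weighted by the inverse of its selection probability, and the average of the weighted estimates over many repeated samples comes close to the true total.

```python
import random

def draw_sample(values, probs, rng):
    """Include each unit independently with its own known, nonzero probability."""
    return [(y, p) for y, p in zip(values, probs) if rng.random() < p]

def weighted_total(sample):
    """Estimate the population total by weighting each unit by 1 / probability."""
    return sum(y / p for y, p in sample)

# Hypothetical frame of 8 units with unequal selection probabilities.
values = [12, 7, 30, 18, 25, 9, 40, 15]
probs = [0.3, 0.2, 0.6, 0.4, 0.5, 0.2, 0.8, 0.3]

rng = random.Random(2012)
estimates = [weighted_total(draw_sample(values, probs, rng)) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
true_total = sum(values)

print(f"true total: {true_total}, mean of estimates: {mean_estimate:.1f}")
```

With purposive or quota selection the probabilities are unknown, so no such weighting is available and neither the bias nor the sampling error can be evaluated.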

1.1.4 Suitable sample size

Sample size is a key parameter for DHS surveys because it is directly related to survey

budget, data quality and survey precision. Theoretically, the larger the sample size, the better the

survey precision, but this is not always true in practice. Survey budget is not the only important factor

in determining the sample size. Desired precision, the number of domains, capability of the

implementing organization, data quality concerns and cost effectiveness are essential constraints in

determining the total sample size. Thus a suitable sample size is also a key parameter to guarantee

data quality.

1.1.5 Simple design

In large-scale surveys, non-sampling errors (coverage errors, errors committed in survey

implementation and data processing, etc.) are usually the most important sources of error and are

expensive to control and difficult to evaluate quantitatively. It is therefore important to minimize them

in survey implementation. In order to facilitate accurate implementation of the survey, the sampling

design for DHS should be as simple and straightforward as possible. Macro’s experience from 25 years

of DHS surveys shows that a two-stage household-based sample design is relatively easy to

implement and that quality can be maintained.

1.1.6 Household listing and pre-selection of households

The DHS standard procedure recommends that households be pre-selected in the central office

prior to the start of fieldwork rather than by teams in the field, who may face pressure to bias the

selection. The interviewers are asked to interview only the pre-selected households. In order to

prevent bias, no changes or replacements are allowed in the field. To perform pre-selection of

households, a complete list of all residential households in each of the selected sample clusters is

necessary. This list is usually obtained from a household listing operation conducted before the main

survey.

In some surveys, the household listing operation may be combined with the main survey to

form a single field operation, and households can be selected in the field from a complete listing.

Combining the household listing and survey data collection in one field operation is less expensive;

however, it provides incentive to leave households off the household list to reduce workload, thus

reducing the representativeness of the survey results. Close supervision is needed during the field

work to prevent this problem. For this reason, separate listing and data collection operations are required. Selection of households by interviewers in the field without a complete listing is not acceptable for DHS surveys.
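As an illustration of central-office pre-selection, the sketch below draws households from a completed listing by equal probability systematic sampling, a technique this manual covers in Section 3.2. The function name, the listing size of 123 households, and the sample take of 25 are hypothetical; this is a minimal sketch under those assumptions, not the DHS Excel template.

```python
import random

def systematic_select(n_listed, n_select, rng):
    """Return 1-based serial numbers of the households selected from a listing.

    Uses a fractional sampling interval and a random start, so every listed
    household has the same selection probability n_select / n_listed.
    """
    if not 0 < n_select <= n_listed:
        raise ValueError("n_select must be between 1 and n_listed")
    interval = n_listed / n_select
    start = rng.random() * interval  # random start in [0, interval)
    # The min() only guards against floating-point rounding at the upper end.
    return [min(int(start + i * interval) + 1, n_listed) for i in range(n_select)]

# Hypothetical cluster: 123 listed households, 25 to be interviewed.
selected = systematic_select(123, 25, random.Random(2012))
print(selected)
```

Because the serial numbers are fixed before fieldwork begins, interviewers receive a closed list and have no opportunity to substitute households.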

1.1.7 Good sample documentation

DHS surveys are usually year-long projects conducted by different people specialized in

different aspects of survey implementation, so good sample documentation is necessary to guarantee

the exact implementation of the project. The sample documentation should include a sample design

document and the list of primary sampling units. The sample design document should explain in detail

the methodology, the sampling procedure, the sample size, the sample allocation, the survey domains

and the stratification. This should also form the basis for an appendix to the DHS final report

describing the sample design. The sample list should include all identification information for all of the

selected sample points, along with their probability of selection.

1.1.8 Confidentiality

Confidentiality is a major concern in DHS, especially when human bio-markers are collected

such as blood samples for HIV testing. The DHS surveys are anonymous surveys which do not allow

any potential identification of any single household or individual in the data file. Confidentiality is also

a key factor affecting the response rate to sensitive questions regarding sexual activity and partners.

In particular, in surveys that include HIV testing, DHS policy requires that PSU and household codes be scrambled in the final data to further anonymize the data, and that the original sample list be destroyed.

1.1.9 Exactness of survey implementation

Exactness of sample implementation is the last element in achieving good sampling precision.

No matter how carefully a survey is designed and how complete the materials for conducting sampling

activities are, if the implementation of the sampling activities by sampling staff (office staff

responsible for selecting sample units, field workers responsible for the mapping and household listing

and interviewers responsible for data collection) is not performed exactly as designed, serious bias

and misleading results may occur.

In the sections that follow, DHS policies related to sample design and implementation are

described.

1.2 Survey objectives and target population

The main objective of DHS surveys is to collect up-to-date information on basic demographic

and health indicators, including housing characteristics, fertility, childhood mortality, contraceptive

knowledge and use, maternal and child health, nutritional status of mothers and children, knowledge,

attitudes and behavior toward HIV/AIDS and other sexually transmitted infections (STIs), and women’s status. The target population for DHS is defined as all women of reproductive age (15-49 years old)

and their young children under five years of age living in ordinary residential households. However, in

some countries, the coverage may be restricted to ever-married women.

The main indicator topics include:

• Total fertility and age specific fertility rates

• Age at first sex, first birth, and first marriage

• Knowledge and use of contraception

• Unmet need for family planning

• Birth spacing

• Antenatal care

• Place of delivery

• Assistance from skilled personnel during delivery

• Knowledge of HIV/AIDS and other STIs

• Higher-risk sexual behavior

• Condom use

• Childhood vaccination coverage

• Treatment of diarrhea, fever, and cough

• Infant and under-five mortality rates

• Nutritional status

Since the target population can be easily found in residential households, DHS is a household-

based survey.

1.3 Survey domain

In DHS surveys, an important objective is to compare the survey results for different

characteristics such as urban and rural residence, different administrative or geographic regions, or

different educational levels of respondents. A survey domain or study domain is a sub-population for

which separate estimation of the main indicators is required. There are two kinds of survey domains:

design domains and analysis domains. A design domain consists of a sub-population which can be

identified in the sampling frame and therefore can be handled independently in the sample size and

sampling procedures, usually consisting of geographic areas or administrative units. For example,

urban and rural differences are very frequently requested; therefore, urban and rural areas are usually

separate design domains for Demographic and Health Surveys. An analysis domain is a sub-population

which cannot be identified in the sampling frame, such as domains specified by individual

characteristics. These may include women with secondary or higher education, pregnant women,

children 12-23 months, and children having diarrhea in the two weeks preceding the survey.

In order for survey estimates to be reliable at the domain level, it is necessary to ensure that

the number of cases in each survey domain is sufficient, especially when desired levels of precision

are required for particular domains. For a design domain, adequate sample size is achieved by

allocating the target population at the survey design stage into the requested design domains, and

then calculating the sample size for the specific design domains by taking the precision required into

account. On the other hand, for an analysis domain, it is difficult to guarantee a specified precision

because it is difficult to control the sample size at the design stage. However, if prior estimates of the

average number of target individuals per household are available, then it is possible to control the

precision for an analysis domain. For example, if survey estimates are required for the nutritional

status of children under age 5 and estimates of the number of children under age 5 per

household are available, it is then possible to calculate a sample size to give a certain level of

precision.
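The calculation described above can be sketched as follows. The formula used here is a generic one for a proportion (simple random sampling variance inflated by a design effect) and is an assumption for illustration; it is not necessarily the formula this manual gives in Section 1.6.2. The function name and all input values (prevalence, relative standard error, design effect, children per household, response rate) are hypothetical.

```python
import math

def households_needed(p, rel_se, deff, per_household, response_rate=1.0):
    """Households to select so that a proportion p is estimated with a
    relative standard error of rel_se in an analysis domain.

    per_household is the prior estimate of the average number of target
    individuals (e.g. children under age 5) per household.
    """
    se = rel_se * p                                 # absolute standard error
    n_individuals = deff * p * (1 - p) / se ** 2    # individuals required
    return math.ceil(n_individuals / (per_household * response_rate))

# Hypothetical inputs: 30% prevalence, 10% relative standard error,
# design effect 1.5, 0.8 children under 5 per household, 95% response.
n_households = households_needed(0.30, 0.10, 1.5, 0.8, 0.95)
print(n_households)
```

The fewer target individuals there are per household, the more households must be selected to reach the same precision, which is why analysis domains covering small sub-populations are costly.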

DHS reports also produce some indicators for second level domains such as vaccination

coverage of children age 12-23 months within a region, where region is the first level domain, and

children 12-23 months is the second level domain. Attention must be paid to the precision required for

a second level domain because the second level domain usually includes a very small sub-population.

If domain-level estimates are required, it is better to avoid a large number of domains

because otherwise a very large sample size will be needed. The number of domains and the desired

level of precision for each must be taken into account in the budget calculation and assessment of the

implementation capabilities of the implementing organization. The total sample size needed is the sum

of sample sizes needed in all exclusive (first level) domains.

1.4 Sampling frame

A sampling frame is a complete list of all sampling units that entirely covers the target

population. The existence of a sampling frame allows a probability selection of sampling units. For a

multi-stage survey, a sampling frame should exist for each stage of selection. The sampling unit for

the first stage of selection is called the Primary Sampling Unit (PSU); the sampling unit for the second

stage of selection is called the Secondary Sampling Unit (SSU), and so on. In most cases, DHS

surveys are two-stage surveys. Note that each stage of sample selection will involve sampling errors,

so it is better to avoid more than two stages if additional stages of selection are not necessary.

The availability of a suitable sampling frame is a major determinant of the feasibility of

conducting a DHS survey. This issue should be addressed in the earliest stages of planning for a

survey. A sampling frame for a DHS survey could be an existing sampling frame, an existing master

sample, or a sample of a previously executed survey of sufficiently large sample size, which allows for

the selection of subsamples of desired size for the DHS survey.

1.4.1 Conventional sampling frame

The best frame is the list of Enumeration Areas (EAs) from a recently completed population

census. An EA is usually a geographic area which groups a number of households together for

convenient counting purposes for the census. A complete list of EAs which covers the survey area

entirely is the ideal frame for DHS surveys.

In most cases, a list of EAs from a recent census is available. This list should be thoroughly

evaluated before it is used. The sampling frame used for DHS should be as up-to-date as possible. It

should cover the whole survey area, without omission or overlap. Basic cartographic materials should

exist for each area unit or at least for groups of units with clearly defined boundaries. Each area unit

should have a unique identification code or a series of codes that, when combined, can serve as a

unique identification code. Each unit should have at least one measure of size estimate (population

and/or number of households). If other characteristics of the area units (e.g., socioeconomic level)

exist, they should be evaluated and retained as they may be used for stratification.
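Some of these requirements can be checked mechanically once the frame is available as a data file. The sketch below assumes a hypothetical frame layout (a list of records with the field names `ea_id`, `residence` and `households`); it flags duplicated identification codes and units lacking a usable measure of size. Omissions and overlaps of area units cannot be detected this way and still require comparison against cartographic materials.

```python
from collections import Counter

def check_frame(frame):
    """Report duplicate identification codes and invalid measures of size."""
    counts = Counter(unit["ea_id"] for unit in frame)
    duplicate_ids = sorted(ea for ea, n in counts.items() if n > 1)

    def valid_size(unit):
        size = unit.get("households")
        return isinstance(size, int) and size > 0

    invalid_size = sorted(unit["ea_id"] for unit in frame if not valid_size(unit))
    total_households = sum(unit["households"] for unit in frame if valid_size(unit))
    return {
        "duplicate_ids": duplicate_ids,
        "invalid_size": invalid_size,
        "total_households": total_households,
    }

# Hypothetical four-record frame containing two deliberate problems.
frame = [
    {"ea_id": "01-001", "residence": "urban", "households": 120},
    {"ea_id": "01-002", "residence": "rural", "households": 95},
    {"ea_id": "01-002", "residence": "rural", "households": 88},    # duplicate code
    {"ea_id": "02-001", "residence": "rural", "households": None},  # no measure of size
]
report = check_frame(frame)
print(report)
```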

A pre-existing master sample (which is a random sample from the census frame) can be

accepted only where there is confidence in the master sample design, including detailed sampling

design parameters such as sampling method, stratification, and inclusion probability for the selected

primary sampling units. The task for the DHS survey is then to design a sub-sampling procedure,

which produces a sample in line with DHS requirements. This will not always be possible. However,

the larger the master sample is in relation to the desired DHS sub-sample, the more flexibility there

will be for developing a sub-sampling design. A key question with a pre-existing sample is whether the

listing of dwellings/households is still current or whether it needs to be updated. If updating is

required, use of a pre-existing sample may not be economical. The potential advantages of using a

pre-existing sample are: 1) economy, and 2) increased analytic power through comparative analysis

of two or more surveys. The disadvantages are: 1) the problem of adapting the sample to DHS

requirements, and 2) the problem of repeated interviews with the same household or person in

different surveys, resulting in respondent fatigue or contamination. One way to avoid this last problem

is to keep just the primary sampling units from the pre-existing sample and reselect the households

for the DHS survey.

1.4.2 Alternative sampling frames

When neither a census frame nor a master sample is available, alternative frames should

be considered. Examples of such frames are:

• A list of electoral zones with estimated number of qualified voters for each zone

• A gridded high resolution satellite map with estimated number of structures for each grid

• A list of administrative units such as villages with estimated population for each unit

A main concern when using alternative frames is coverage: does the frame completely cover the target population? Usually checking the quality of an alternative frame is more

difficult because of a lack of information either from the frame itself or from administrative sources.

Another problem is the size of the primary sampling unit. Since the alternative frame is not specifically

created for a population census or household based survey, the size of the PSUs of such frames may

be too large or too small for a DHS survey. A third problem is identifying the boundaries of the

sampling units due to the lack of cartographic materials.

In the first two examples of alternative sampling frames, the standard DHS two-stage

sampling procedure can be applied by treating the electoral zones or the grids of satellite map as the

PSUs. In the third case, the only list available may be one of administrative units larger than

villages (e.g., sub-districts, wards, or communes); for example, a complete list of all communes in a

country may be easier to obtain than a complete list of villages. It is then necessary to use a selection

procedure that includes more than two stages. In the first stage, select a number of communes; in each

of the selected communes, construct a complete list of all villages located in the commune and select

one village per commune as a DHS cluster; then proceed with the subsequent household listing and

selection as in a standard DHS. This procedure works best when the number of communes is large and

the commune size is small. A list of administrative units that are small in number but large in size is

not suitable for a DHS sampling frame because this situation will result in large sampling errors, as

explained later in Section 1.9.

1.4.3 Evaluation of the sampling frame

No matter what kind of sampling frame will be used, it is always necessary to check the

quality of the frame before selecting the sample. Following are several things that need to be checked

when using a conventional sampling frame:

• Coverage

• Distribution

• Identification and coding

• Measure of size

• Consistency

There are several easy but useful ways to check the quality of a sampling frame. For example,

for a census frame, compare the total population in the sampling frame, and its distribution between

urban and rural areas and among regions/administrative units, with the corresponding figures from the

census report. Any important differences may indicate coverage problems. If the frame provides

information on population and households for each EA, then

the average number of household members can be calculated, and a check for extreme values can

help to find incorrect measures of size of the PSUs. If information on population by sex is available for

each EA, then a sex ratio can be calculated for each EA, and a check for extreme values can help to

identify non-residential EAs. If the EAs are associated with an identification (ID) code, then check the

ID codes to identify miscoded or misplaced EAs. A sampling frame with full coverage and of good

quality is the first element for a DHS survey; therefore, efforts should be made to guarantee a good

start for the project.
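
The checks described above can be sketched in a few lines of code. The example below is only an illustration: the record layout, the field names, the sample values, and the plausibility ranges for household size and sex ratio are all assumptions, not DHS standards, and should be adapted to the frame at hand.

```python
# Hypothetical frame extract: one record per enumeration area (EA).
# Field names, values, and plausibility ranges are illustrative assumptions.
frame = [
    {"ea_id": "010101", "households": 120, "population": 610, "males": 300, "females": 310},
    {"ea_id": "010102", "households": 95, "population": 2400, "males": 1200, "females": 1200},
    {"ea_id": "010103", "households": 40, "population": 200, "males": 180, "females": 20},
    {"ea_id": "010102", "households": 110, "population": 540, "males": 265, "females": 275},
]


def flag_extreme_eas(frame, hh_size_range=(2.0, 10.0), sex_ratio_range=(0.7, 1.4)):
    """Return (ea_id, reasons) pairs for EAs whose derived ratios fall outside the ranges."""
    flagged = []
    for ea in frame:
        reasons = []
        if ea["households"] > 0:
            hh_size = ea["population"] / ea["households"]
            if not hh_size_range[0] <= hh_size <= hh_size_range[1]:
                reasons.append(f"average household size {hh_size:.1f}")
        else:
            reasons.append("no households recorded")
        if ea["females"] > 0:
            sex_ratio = ea["males"] / ea["females"]
            if not sex_ratio_range[0] <= sex_ratio <= sex_ratio_range[1]:
                reasons.append(f"sex ratio {sex_ratio:.2f}")
        else:
            reasons.append("no female population recorded")
        if reasons:
            flagged.append((ea["ea_id"], reasons))
    return flagged


def duplicate_ids(frame):
    """Return the set of EA identification codes that appear more than once."""
    seen, duplicates = set(), set()
    for ea in frame:
        if ea["ea_id"] in seen:
            duplicates.add(ea["ea_id"])
        seen.add(ea["ea_id"])
    return duplicates


for ea_id, reasons in flag_extreme_eas(frame):
    print(ea_id, "->", "; ".join(reasons))
print("duplicate IDs:", sorted(duplicate_ids(frame)))
```

A flagged EA is not necessarily wrong; an extreme sex ratio, for instance, may simply point to a non-residential EA that needs a decision before sample selection.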

For a nationally representative survey, geographic coverage of the survey should include the

entire national territory unless there are strong reasons for excluding certain areas. If areas must be

excluded, they should constitute a coherent domain. A survey from which a number of scattered zones

have been excluded is difficult to interpret and to use.

1.5 Stratification

Stratification is the process by which the survey population is divided into subgroups or strata

that are as homogeneous as possible using certain criteria. Explicit stratification is the actual sorting

and separating of the units into specified strata. Within each stratum, the sample is designed and

selected independently. It is also possible to systematically sample units from an ordered list (with a

fixed sampling interval between selected units) to achieve the effect of stratification. For example, in

DHS surveys, it is not unusual for the PSUs within the explicit strata to be sorted geographically. This is

called implicit stratification.
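
The effect of implicit stratification can be illustrated with a small sketch. The code below assumes a hypothetical stratum of 100 PSUs sorted by three invented regions and draws an equal-probability systematic sample; the function name and the data are illustrative only. Because the list is ordered, each region receives a share of the sample in line with its share of the list, whatever the random start.

```python
import random
from collections import Counter


def systematic_sample(units, n, rng):
    """Equal-probability systematic sample of n units from an ordered list."""
    interval = len(units) / n            # fixed sampling interval (may be fractional)
    start = rng.random() * interval      # random start in [0, interval)
    picks = (int(start + k * interval) for k in range(n))
    # min() only guards against floating-point rounding at the end of the list
    return [units[min(i, len(units) - 1)] for i in picks]


# Hypothetical stratum: 100 PSUs sorted geographically by region.
psus = (
    [("North", i) for i in range(60)]
    + [("Centre", i) for i in range(30)]
    + [("South", i) for i in range(10)]
)

sample = systematic_sample(psus, 10, random.Random(2012))
print(Counter(region for region, _ in sample))
```

With 10 selections from 100 ordered PSUs the interval is 10, so the sample always contains 6 PSUs from the first region, 3 from the second, and 1 from the third.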

The principal objective of stratification is to reduce sampling errors. In a stratified sample, the

sampling errors depend on the population variance existing within the strata but not between the

strata. For this reason, it pays to create strata with low internal variability (or high homogeneity).

Another major reason for stratification is that, where marked differences exist between subgroups of

the population (e.g., urban vs. rural areas), stratification allows for a flexible sample design that can

be different for each subgroup.

Stratification should be introduced only at the first stage of sampling. At the

dwelling/household selection stage, systematic sampling is used for convenience; however, no

attempt should be made to reorder the dwelling/household list before selection in the hope of

increasing the implicit stratification effect. Such efforts generally have a negligible effect.

Stratification can be single-level or multi-level. In single-level stratification, the population is

divided into strata according to certain criteria. In multi-level stratification, the population is divided

into first-level strata according to certain criteria, and then the first-level strata are subdivided into

second-level strata, and so on. A typical two-level stratification involves first stratifying the population

by region at the first level and then by urban-rural within each region. A DHS survey usually employs

multi-level stratification.

Strata should not be confused with survey domains. A survey domain is a population subgroup

for which separate survey estimates are desired (e.g., urban areas/rural areas). A stratum is a

subgroup of homogeneous units (e.g., subdivisions of an administrative region) in which the sample

may be designed differently and is selected separately. Survey domains and strata can be the same

but they need not be. For example, survey domains could be the first-level strata in a multi-level

stratification. On the other hand, a survey domain could consist of one or several lower-level strata.

DHS surveys typically use explicit stratification by separating urban and rural residence within

each region. Where data are available, explicit stratification could also be done on the basis of socio-

economic zones or more directly relevant characteristics such as the level of female literacy or the

presence of health facilities in the areas. These kinds of information could be obtained from

administrative sources. Within each explicit stratum, the units can then be ordered according to

location, thus providing further implicit geographic stratification.

1.6 Sample size

1.6.1 Sample size and sampling errors

The estimates from a sample survey are affected by two types of errors: sampling errors and

non-sampling errors. Sampling errors are the representative errors due to sampling of a small number

of eligible units from the target population instead of including every eligible unit in the survey.

Sampling errors are related to the sample size and the variability among the sampling units. Sampling

errors can be statistically evaluated after the survey. Non-sampling errors result from problems during

data collection and data processing, such as failure to locate and interview the correct household,

misunderstanding of the questions on the part of either the interviewer or the respondent, and data

entry errors. Non-sampling errors are related to the capacity of the implementing organization, and

experience shows that (1) non-sampling errors are always the most important source of error in a

survey, and (2) it is difficult to evaluate the magnitude of non-sampling errors once a survey is

complete. Theoretically, with the same survey methodology and under the same survey conditions,

the larger the sample size, the better the survey precision. However, this relationship does not always

hold true in practice, because non-sampling errors tend to increase with survey scale and sample size.

The challenge in deciding on the sample size for a survey is to balance the demands of analysis and

precision with the capacity of the implementing organization and the constraints of funding.

A common measure of precision for estimating an indicator is its relative standard error (RSE)

which is defined as its standard error (SE) divided by the estimated value of the indicator. The

standard error of an estimator is the representative error due to sampling. The relative standard error

describes the amount of sampling error relative to the indicator level and is independent of the scale

of the indicator to be estimated; therefore, a unique RSE can be applied to a reference indicator for all

domains. If a unique RSE is desired for all domains, the domain sample size depends on the variability

and the size of the domain. The total sample size is the sum of the sample sizes over all domains for

which a desired precision is required. The following are some concepts related to sample size

calculation.

1. The standard error of an estimator when estimating a proportion with a simple random

sampling without replacement (see footnote 2) is given by:

$$SE = \sqrt{\frac{1-f}{n} \times \frac{N}{N-1} \times P(1-P)}$$

where n is the sample size (number of completed interviews),

P is the proportion,

N is the target population size, and

f=n/N is the sampling fraction.

When N is large and n is relatively small, the above quantity can be approximated by:

$$SE \cong \sqrt{\frac{P(1-P)}{n}}$$

Therefore the RSE of the estimator is given by:

$$RSE(P) \cong \sqrt{\frac{P(1-P)}{n}} \Big/ P = \sqrt{\frac{1/P - 1}{n}}$$

2. For a required precision with a relative standard error $\alpha$, the net sample size (number of

completed interviews) needed for a simple random sampling is given by:

$$n = \frac{1/P - 1}{\alpha^2}$$

3. Since a simple random sampling is not feasible for a DHS, the sample size for a complex

survey with clustering such as the DHS can be calculated by inflating the above calculated

sample size by using a design effect (Deft). Deft is a measure of efficiency of cluster sampling

compared to a direct simple random sampling of individuals, defined as the ratio between the

standard error using the given sample design and the standard error that would result if a

simple random sample had been used. A Deft value of 1.0 indicates that the sample design is

as efficient as a simple random sample, while a value greater than 1.0 indicates the increase

in the sampling error due to the use of a more complex and less statistically efficient design.

The net sample size needed for a cluster sampling with the same relative standard error is given

by:

$$n = Deft^2 \times \frac{1/P - 1}{\alpha^2}$$

Footnote 2: A simple random sample would be a random selection of individuals or households directly from the target

population. This is not feasible for DHS surveys because a list of all eligible individuals or households is not

available.

4. The formula for calculating the final sample size in terms of the number of households while

taking non-response into account (the formula used in the templates for sample size

calculation as shown in Table 1.1) is given by:

$$n = Deft^2 \times \frac{1/P - 1}{\alpha^2} \Big/ (R_i \times R_h \times d)$$

where n is the sample size in households;

Deft is the design effect (a default value of 1.5 is used for Deft if not specified);

P is the estimated proportion;

$\alpha$ is the desired relative standard error;

$R_i$ is the individual response rate;

$R_h$ is the household gross response rate; and

d is the number of eligible individuals per household.

The household gross response rate is the number of households interviewed divided by the number

selected. DHS reports typically give the net household response rate, which is the number of

households interviewed divided by the number of valid households found in the field (i.e., excluding

vacant and destroyed dwellings).

5. If the target population is small (such as in a sub-national survey), a finite population

correction of the above calculated sample size should be applied. The final sample size n is

calculated by

$$n = \frac{n_0}{1 + n_0/N}$$

where $n_0$ is the initial sample size calculated in point number 4, and N is the target population

size.

6. The relationship between the RSE and the sample size shows that, if one reduces a desired

RSE to half, then the sample size needed will increase 4 times. For example, the sample size

for a RSE of 5% is 4 times larger than the sample size for a RSE of 10% (see Tables 1.1 and

1.2 in the next section). This means that it is very expensive to reduce the RSE by increasing

the sample size. Therefore, when designing the sample size, the efficiency of the design must

be considered, that is, the balance between the gain in precision and the increase in sample

size (or survey cost).

7. The width of the confidence interval is determined by the RSE. With a confidence level of 95%,

2*P*RSE is the half-length of the confidence interval for P. For example, for RSE=0.10 and

P=0.20, the half-length of the confidence interval is 0.04, which means the confidence interval

for P is (0.16, 0.24). (DHS reports +/-2*SE instead of +/-1.96*SE as 95% confidence interval

for conservative purposes).
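
The relationships in points 1, 5, 6 and 7 above can be checked numerically. The following is a minimal sketch, not an official DHS tool; the function names are illustrative assumptions, while the formulas are those given in the list.

```python
import math


def se_srs(p, n, N=None):
    """Standard error of a proportion under simple random sampling without replacement.

    If N is None, the large-population approximation sqrt(P(1-P)/n) is used.
    """
    if N is None:
        return math.sqrt(p * (1 - p) / n)
    f = n / N  # sampling fraction
    return math.sqrt((1 - f) / n * N / (N - 1) * p * (1 - p))


def rse_srs(p, n):
    """Relative standard error, SE / P, using the large-population approximation."""
    return math.sqrt((1 / p - 1) / n)


def fpc_adjust(n0, N):
    """Finite population correction of an initial sample size n0 (point 5)."""
    return n0 / (1 + n0 / N)


def confidence_interval(p, rse):
    """Approximate 95% confidence interval, P +/- 2 * SE, with SE = P * RSE (point 7)."""
    half_length = 2 * p * rse
    return p - half_length, p + half_length


p = 0.20
print(rse_srs(p, 400), rse_srs(p, 1600))   # quadrupling n halves the RSE (point 6)
print(confidence_interval(p, 0.10))        # approximately (0.16, 0.24)
print(fpc_adjust(1000, 10000))             # the correction reduces the initial size
```

The exact and approximate standard errors differ noticeably only when the sampling fraction n/N is not small, which is the situation point 5 addresses.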

1.6.2 Sample size determination

The total sample size for a DHS survey with a number of survey domains (design domain) is

the sum of the sample sizes over all domains. An appropriate sample size for a survey domain is the

minimum number of persons (e.g., women age 15-49, currently married women 15-49, children under

age five) that achieves the desired survey precision for core indicators at the domain level. If funding

is tight and fixed, the sample size is the maximum number of persons that the funding can cover.

Precision at the national level is usually not a problem. In almost all cases, sample size is decided to

guarantee precision at domain level with appropriate allocation of the sample. So apart from survey

costs, the total sample size depends on the desired precision at domain level and the number of

domains. If a reasonable precision is required at domain level, experience from the MEASURE DHS

program shows that a minimum number of 800 completed interviews with women is necessary for

some of the woman-based indicators for high fertility countries (e.g. total fertility rate, contraceptive

prevalence rate, childhood mortality rates); for low fertility countries, the minimum domain sample

size can reach 1,000 completed interviews or more. Table 1.1 below illustrates the calculation of

sample size for a domain according to different levels of desired RSE for estimating the indicator “the

proportion of currently married women who are current users of a modern contraceptive method”.

Table 1.1 Sample size determination for estimating current use of a modern contraceptive method among currently married women

Estimated proportion p            0.20    Total target population
Estimated design effect (Deft)    1.40    # of target individuals/HH    1.05
Individual response rate          0.96    HH gross response rate        0.92

Desired   Net sample size   Sample size    Expected   95% confidence limits
RSE       (individual)      (household)    SE         Lower      Upper
0.20       196               212           0.040      0.120      0.280
0.19       217               234           0.038      0.124      0.276
0.18       242               261           0.036      0.128      0.272
0.17       271               293           0.034      0.132      0.268
0.16       306               330           0.032      0.136      0.264
0.15       348               376           0.030      0.140      0.260
0.14       400               432           0.028      0.144      0.256
0.13       464               501           0.026      0.148      0.252
0.12       544               587           0.024      0.152      0.248
0.11       648               699           0.022      0.156      0.244
0.10       784               846           0.020      0.160      0.240
0.05      3136              3382           0.010      0.180      0.220

Note: The confidence limits are calculated as P±2*SE.

Assuming the domain size is large enough such that the finite population correction is

negligible, Table 1.1 gives the required gross sample size in terms of number of households with

estimated parameters from a DHS survey. The target population is currently married women age 15-

49; the estimated parameters are:

• the proportion of currently married women who are current users of any modern

contraceptive method,

• the design effect (Deft),

• the number of target individuals (number of currently married women 15-49) per

household,

• the individual and the household response rates.

For example, with an estimated prevalence of 20%, if we require a RSE of 10%, we should

select 846 households in this particular domain. With a gross household response rate (the number of

households completed over the total number selected) of 92% and an individual response rate of

96%, we expect to obtain 784 completed interviews of currently married women age 15-49.

The estimated quantities at the top of the table used as input to the calculation can usually be

obtained from previous surveys or from administrative records. The total sample size for a survey with

several domains is the sum of the sample sizes obtained in the above table for each domain. If the

same required precision and the same indicator level apply to all domains, then the total sample size

is the sample size calculated for one domain multiplied by the number of domains. With this example,

the total sample size for a survey having six domains with approximately the same level of modern

contraceptive use among currently married women and the same precision request for each domain

would be 5076 households. The “Sample size determination” template located in the Appendix can be

used to determine required sample sizes.
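
As an illustration of how the figures in Table 1.1 follow from the formula in point 4 of Section 1.6.1, the sketch below recomputes a few rows. It is not the DHS template itself: the function name is hypothetical, and the rounding convention (net size rounded to the nearest integer, households rounded up) is an assumption chosen because it reproduces the published values.

```python
import math


def sample_sizes(p, rse, deft=1.5, r_i=1.0, r_h=1.0, d=1.0):
    """Return (net completed interviews, households to select) for one domain.

    p    : estimated proportion
    rse  : desired relative standard error
    deft : design effect (1.5 is the default named in the manual)
    r_i  : individual response rate
    r_h  : household gross response rate
    d    : eligible individuals per household
    """
    net = round(deft**2 * (1 / p - 1) / rse**2)   # rounding rule is an assumption
    households = math.ceil(net / (r_i * r_h * d))  # rounding up is an assumption
    return net, households


# Inputs taken from the top of Table 1.1
table_1_1 = dict(deft=1.40, r_i=0.96, r_h=0.92, d=1.05)
for rse in (0.20, 0.10, 0.05):
    print(rse, sample_sizes(0.20, rse, **table_1_1))

# Six domains with the same indicator level and the same RSE of 10%
print(6 * sample_sizes(0.20, 0.10, **table_1_1)[1])
```

Changing only the inputs (for example p, Deft, and the number of target individuals per household) gives the corresponding figures for another indicator.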

Table 1.2 Sample size determination for estimating the prevalence of full vaccination coverage among children aged 12-23 months

Estimated proportion p            0.29    Total target population
Estimated design effect (Deft)    1.22    # of target individuals/HH    0.11
Individual response rate          0.96    HH gross response rate        0.92

Desired   Net sample size   Sample size    Expected   95% confidence limits
RSE       (individual)      (household)    SE         Lower      Upper
0.20        91               937           0.058      0.174      0.406
0.19       101              1040           0.055      0.180      0.400
0.18       112              1153           0.052      0.185      0.395
0.17       126              1297           0.049      0.191      0.389
0.16       142              1462           0.046      0.197      0.383
0.15       162              1668           0.043      0.203      0.377
0.14       186              1915           0.041      0.209      0.371
0.13       216              2224           0.038      0.215      0.365
0.12       253              2605           0.035      0.220      0.360
0.11       301              3099           0.032      0.226      0.354
0.10       364              3747           0.029      0.232      0.348
0.05      1458             15008           0.014      0.261      0.319

Note: The default value of Deft is set to be 1.5. Specify if different.

The confidence limits are calculated as P±2*SE.

If response rate is not provided, the sample size calculated is net sample size.

Table 1.2 shows a similar example for the indicator “proportion of children aged 12-23 months

who are fully immunized”. In this case, the target population is children aged 12-23 months. The

estimated number of target individuals per household is much smaller than the number of currently

married women per household given in Table 1.1. So for the same sample size calculated in Table 1.1,

we can only get a RSE above 20% at domain level. With a RSE of 10%, we need to select 3,747

households in this particular domain (Table 1.2), which seems unrealistic if we have several domains for the

survey.

This example shows that for a multi-indicator survey, the sample size required can be very

different from indicator to indicator. So the choice of the reference indicator upon which the sample

size is calculated is an important issue. The reference indicator which is used for sample size

determination should have demographic importance, moderate value and moderate population

coverage, i.e. apply to a sizable proportion of the population. With the same sample size calculated in

Table 1.1 for a survey having six domains, the RSE for the whole sample for estimating full

immunization among children 12-23 months is between 8% and 9%.
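
The two RSE figures quoted for the immunization indicator can be recovered by inverting the household sample size formula from Section 1.6.1. The sketch below is a worked check under that assumption; the function name is illustrative, and the inputs are those shown at the top of Table 1.2.

```python
import math


def expected_rse(households, p, deft, r_i, r_h, d):
    """RSE expected from a given number of selected households.

    Obtained by solving n = Deft^2 * (1/P - 1) / alpha^2 / (R_i * R_h * d) for alpha.
    """
    completed = households * r_h * r_i * d   # expected completed interviews
    return deft * math.sqrt((1 / p - 1) / completed)


# Inputs from the top of Table 1.2 (children aged 12-23 months)
table_1_2 = dict(p=0.29, deft=1.22, r_i=0.96, r_h=0.92, d=0.11)

one_domain = expected_rse(846, **table_1_2)     # household sample of one domain (Table 1.1)
six_domains = expected_rse(5076, **table_1_2)   # whole sample of six such domains

print(f"one domain:  RSE = {one_domain:.3f}")
print(f"six domains: RSE = {six_domains:.3f}")
```

The first value lies above 0.20 and the second between 0.08 and 0.09, in line with the statements in the text.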

The domain sample sizes often need to be balanced between domains due to budget

constraints. In practice it is often the case that the total sample size is fixed according to funding

available and implementation capacity, and then the sample is allocated to each domain and to each

stratum within the domain. In the case of very tight budget constraints, we may equally allocate the

total sample to the domains. In some cases, we may want to oversample a specific domain to conduct

some in-depth analysis for a certain rare phenomenon. The method (and the tables) presented in the

following section may be used to allocate the sample at the domain level because the domains are

usually first-level strata. Regardless of the method used for allocation, the calculation of domain

sample size can give us an idea about the precision we may achieve in each domain with a given

sample size.

1.7 Sample allocation

In cases where the total sample size or domain sample size has been fixed, we need to

appropriately allocate the sample to different domains (or different strata within a domain). This

allocation is aimed at strengthening the sampling efficiency at the national level or domain level and

reducing sampling errors. Assuming a constant cost across domains/strata, the optimum allocation of

the sample depends on the size of the domain/stratum $N_h$ and the variability of the indicator to be

estimated $S_{xh}$:

$$n_h \propto N_h S_{xh}$$

For a given total sample size n the optimum allocation for variable x is given by:

$$n_h = n \times \frac{N_h S_{xh}}{\sum_{h=1}^{H} N_h S_{xh}}$$

The optimum allocation is only optimal for the indicator on which the allocation is based; that

allocation may not be appropriate for other indicators. For a multipurpose survey, if the

domains/strata are not too different in size, a safe allocation that is good for all indicators is a

proportional allocation, with sample size proportional to the domain/stratum size.

$$n_h = n \times \frac{N_h}{\sum_{h=1}^{H} N_h} = n \times \frac{N_h}{N}$$

This allocation introduces a constant sampling fraction across domains/strata with:

$$f_h = \frac{n_h}{N_h} = \frac{n}{N}$$

Because DHS surveys are multipurpose surveys, a proportional allocation of sample is

recommended if the domains/strata are not too different in size. However, if the domains/strata sizes

are very different, the smaller domains/strata may receive a very small sample size.
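
The optimum and proportional allocations described above can be compared with a short sketch. The stratum sizes and standard deviations below are invented for illustration, and the function names are assumptions; only the two allocation formulas come from the text.

```python
def optimum_allocation(n, sizes, std_devs):
    """Allocate n proportionally to N_h * S_xh (constant cost across strata assumed)."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]


def proportional_allocation(n, sizes):
    """Allocate n proportionally to the stratum size N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]


# Hypothetical strata: sizes N_h and standard deviations S_xh of one indicator
sizes = [5000, 3000, 2000]
std_devs = [0.5, 0.4, 0.3]

prop = proportional_allocation(1000, sizes)
opt = optimum_allocation(1000, sizes, std_devs)

print("proportional:", prop)
print("optimum:     ", [round(x, 1) for x in opt])
print("sampling fractions:", [n_h / N_h for n_h, N_h in zip(prop, sizes)])
```

When the variability is the same in every stratum the two allocations coincide, which is one way to see why proportional allocation is a safe choice for a multipurpose survey.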

If a desired precision is required at domain/stratum level, by assuming equal relative

variations across strata, a power allocation (Bankier, 1988) with an appropriate power value $\alpha$

($0 \le \alpha \le 1$) may be used to guarantee sufficient sample size in small domains/strata.

$$n_h = n \times \frac{M_h^{\alpha}}{\sum_{h=1}^{H} M_h^{\alpha}}$$

A power allocation is an allocation proportional to the power of a size measure M. A power

value of 1 gives proportional allocation; a power value of 0 gives equal size allocation; a power value

between 0 and 1 gives an allocation between proportional allocation and equal size allocation.

Proportional allocation is good for national level indicators, but may not meet the precision request at

domain level; while an equal size allocation is good for comparison across domains, but may affect the

precision at national level. A power allocation with power values between 0 and 1 is a tradeoff

between the national level precision and the domain level precision. Since the sample size is usually

large at the national level, the national level precision is not a concern.
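
A minimal sketch of the power allocation formula is given below. The function name is an assumption; the size measures are the rounded domain proportions printed in Tables 1.3 and 1.4. Because those proportions are rounded to three decimals, the results are close to, but not identical with, the published allocations.

```python
def power_allocation(n, sizes, alpha):
    """Allocate n proportionally to M_h ** alpha, with 0 <= alpha <= 1."""
    powered = [m**alpha for m in sizes]
    total = sum(powered)
    return [n * w / total for w in powered]


# Domain size proportions as printed in Tables 1.3 and 1.4 (domains 1 to 11)
domain_sizes = [0.072, 0.026, 0.070, 0.142, 0.060, 0.046,
                0.048, 0.094, 0.164, 0.091, 0.187]

for alpha in (1.0, 0.25, 0.0):
    alloc = power_allocation(15000, domain_sizes, alpha)
    print(f"alpha={alpha}: min={min(alloc):.0f}, max={max(alloc):.0f}")
```

Trying several power values in this way is how a value such as 0.25 can be found when a minimum domain sample size is imposed.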

In Table 1.3 below, we give an example of a proportional sample allocation of 15,000

individuals to 11 domains and to their urban-rural areas. The minimum domain sample size is 384 for

domain 2, which is too small for estimating the total fertility rate (TFR) and childhood mortality rates.

The largest sample size is for domain 11 which may be unnecessarily large. The actual total sample

size given in the total row may be slightly different from the desired sample size because of rounding.

Table 1.3 Sample allocation: Proportional allocation

Total sample size => 15000    Power value domain =>    Power value urban =>

Serial  Domain/stratum  Domain/stratum  Proportion  Sample allocation           Specific allocation
num     name/ID         size            urban       Urban    Rural    Domain    Urban    Rural
1       Domain 1        0.072           0.352        382      701      1083
2       Domain 2        0.026           0.317        122      262       384
3       Domain 3        0.070           0.568        597      454      1051
4       Domain 4        0.142           0.275        586     1544      2130
5       Domain 5        0.060           0.323        292      611       903
6       Domain 6        0.046           0.135         92      593       685
7       Domain 7        0.048           0.194        141      586       727
8       Domain 8        0.094           0.251        354     1055      1409
9       Domain 9        0.164           0.288        709     1749      2458
10      Domain 10       0.091           0.191        262     1104      1366
11      Domain 11       0.187           1.000       2803        0      2803
        Total           1.000           0.423       6339     8660     14999

If we impose a condition such that the sample size should not be smaller than 1000 in each

domain, after trying various power values, we find that a power value of 0.25 is appropriate, as shown

in Table 1.4. In this case, we would have a minimum sample size of 1,022 for domain 2. Since domain

11 has only urban areas, the power allocation among the domains brought down the urban percentage

in the sample. In order for urban areas to be properly represented, oversampling is applied in the

urban areas of the other domains. With a power value of 0.65, the urban proportion in the sample is

close to the proportion of the target population.

Table 1.4 Sample allocation: Power allocation

Total sample size => 15000    Power value domain => 0.25    Power value urban => 0.65

Serial  Domain/stratum  Domain/stratum  Proportion  Sample allocation           Specific allocation
num     name/ID         size            urban       Urban    Rural    Domain    Urban    Rural
1       Domain 1        0.072           0.352        533      791      1324
2       Domain 2        0.026           0.317        386      636      1022
3       Domain 3        0.070           0.568        716      599      1315
4       Domain 4        0.142           0.275        546     1023      1569
5       Domain 5        0.060           0.323        484      782      1266
6       Domain 6        0.046           0.135        271      910      1181
7       Domain 7        0.048           0.194        341      858      1199
8       Domain 8        0.094           0.251        466      949      1415
9       Domain 9        0.164           0.288        581     1045      1626
10      Domain 10       0.091           0.191        395     1009      1404
11      Domain 11       0.187           1.000       1680        0      1680
        Total           1.000           0.423       6399     8602     15001

In Table 1.4, the small domains are oversampled compared with a proportional allocation.

Oversampling some small domains is frequently practiced if domain level precision is required.

However, oversampling a small domain too much will harm the precision at national level. To prevent

this, it is recommended to regroup the small domains to form domains of moderate size, especially

when there is a very unequal population distribution among geographic domains. However, this is

sometimes not possible due to political considerations.

The above discussion also applies to sample size allocation to strata within a domain where

the domain sample size is fixed. A proportional allocation with sample size proportional to stratum size

is good for all indicators and provides the best precision for the domain as a whole.

1.8 Two-stage cluster sampling procedure

The MEASURE DHS program utilizes a convenient and practical sample selection procedure for

household based surveys developed on the basis of experience from past surveys—a two-stage cluster

sampling procedure. A cluster is a group of adjacent households which serves as the PSU for field

work efficiency. Interviewing a certain number of households in the same cluster can greatly reduce

the amount of travel and time needed during data collection. In most cases, a cluster is an EA with a

measure of size equal to the number of households or the population in the EA, provided by the

population census.

At the first stage, a stratified sample of EAs is selected with probability proportional to size

(PPS): in each stratum, a sample of a predetermined number of EAs is selected independently with

probability proportional to the EA’s measure of size. In the selected EAs, a listing procedure is

performed such that all dwellings/households are listed. This procedure is important for correcting

errors existing in the sampling frame, and it provides a sampling frame for household selection.

At the second stage, after a complete household listing is conducted in each of the selected

EAs, a fixed (or variable) number of households is selected by equal probability systematic sampling in

the selected EAs. In each selected household, a household questionnaire is completed to identify

women age 15-49, men age 15-59 (15-54 or 15-49 in some surveys) and children under age five.

Every eligible woman will be interviewed with an individual questionnaire, and every eligible man will

be interviewed with an individual men’s questionnaire in those households selected for the men’s

interview.
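
The two selection stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the DHS selection program: the function names and the data are hypothetical, the first stage uses a common form of systematic PPS selection on cumulated sizes, and an EA whose size exceeds the sampling interval could be hit more than once, a case this sketch does not treat specially. The procedures actually used are described in Chapter 3.

```python
import bisect
import itertools
import random


def pps_systematic(sizes, n, rng):
    """First stage: indices of n units selected with probability proportional to size."""
    cumulative = list(itertools.accumulate(sizes))
    interval = cumulative[-1] / n
    start = rng.random() * interval                  # random start in [0, interval)
    points = [start + k * interval for k in range(n)]
    # bisect_right returns the first unit whose cumulated size exceeds the point
    return [min(bisect.bisect_right(cumulative, pt), len(sizes) - 1) for pt in points]


def systematic_households(listed, take, rng):
    """Second stage: equal-probability systematic selection from a household listing."""
    take = min(take, len(listed))                    # cannot select more than were listed
    interval = len(listed) / take
    start = rng.random() * interval
    return [listed[min(int(start + k * interval), len(listed) - 1)] for k in range(take)]


rng = random.Random(2012)

# Hypothetical stratum of 8 EAs with their measure of size (households in the frame)
ea_sizes = [120, 250, 90, 310, 180, 200, 150, 270]
selected_eas = pps_systematic(ea_sizes, 3, rng)
print("selected EAs:", selected_eas)

# Hypothetical listing result for one selected EA: households numbered 1 to 200
households = systematic_households(list(range(1, 201)), 25, rng)
print("selected households:", households)
```

In practice the second stage uses the household counts obtained from the listing operation rather than the frame's measure of size, which is one reason the listing step matters.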

The advantages of this two-stage cluster sampling procedure can be summarized as follows:

1) It guarantees a representative sample of the target population when a list of all target

individuals is not available which prohibits a direct sampling of target individuals;

2) A household listing procedure after the selection of the first stage and before the main

survey provides a sampling frame for household selection in the central office;

3) The use of residential households as the second-stage sampling unit guarantees the best

coverage of the target population; and

4) It reduces unnecessary sampling errors by avoiding more than two stages of selection

(which usually uses a large PSU in the first stage of selection).

See more details in Sections 1.10 and 1.11 on household listing and selection, Chapter 2 on

household listing, and Sections 3.2 and 3.3 of Chapter 3 on systematic sampling and sampling with

probability proportional to size (PPS).

1.9 Sample “take” per cluster

Once the total sample size is determined and allocated to the different survey domains/strata, the

next step is to decide how many individuals (the sample take) should be interviewed per sample cluster

and then to convert the domain/stratum sample size into a number of clusters. Since the survey cost can be very

different across the survey domains/strata, the sample take can have a big influence on the total

survey budget. With a fixed sample size, a small sample take is good for survey precision because of

the reduction of the design effect, but is expensive because more clusters are needed. The number of

clusters affects the survey budget more than the overall sample size due to the travel between

clusters during data collection, which represents an important part of field costs in rural areas. The

MEASURE DHS program proposes a sample “take” of about 25-30 women per rural cluster. In urban

areas, the cost advantage of a large “take” is generally smaller, and MEASURE DHS recommends a

“take” of about 20-25 women per urban cluster. Since in most DHS surveys, the number of eligible

women age 15-49 is very close to one per household, the sample take of individuals is equivalent to

the sample take of households; therefore, in the following sections we refer to the sample take (or

cluster take) as the number of sample households per cluster.
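
Converting a domain or stratum sample size into a number of clusters is simple arithmetic, sketched below. The function name is illustrative, rounding up to a whole cluster is an assumption, and the 846 households are the domain sample size from Table 1.1 used purely as an example.

```python
import math


def clusters_needed(households, take):
    """Number of clusters required to reach a household sample size with a given take."""
    return math.ceil(households / take)   # rounding up is an assumption


# Example: 846 households (Table 1.1, RSE of 10%) under different sample takes
for take in (20, 25, 30):
    print(f"take of {take} households: {clusters_needed(846, take)} clusters")
```

The example makes the trade-off visible: a smaller take means more clusters to visit for the same number of households.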

1.9.1 Optimum sample take

The optimum number of households to be selected per cluster depends on the variable under

consideration, the intracluster correlation $\rho$, and the survey cost ratio $c_1/c_2$, where $c_1$ represents

the cost per cluster, including mainly the cost associated with travelling between the clusters for

survey implementation (household listing and interview), while $c_2$ represents the cost per individual

interview (the interviewing cost) and other costs of doing fieldwork within a cluster. A larger sample

take per cluster with fewer clusters reduces survey field costs if the cost ratio is high, but it could also

reduce the survey precision if the intracluster correlation is strong.

The MEASURE DHS Program has accumulated information on sampling errors for selected

variables for many surveys throughout the world. Using this information, Aliaga and Ren (2006)

conducted a research study to determine the optimum sample take per cluster. The results of the

study have informed current practice in DHS surveys. If the average cluster size is around 250

households, a sample take of 20-30 households per cluster is within the acceptable range in most

surveys. The research also supports the practice of setting a larger sample take in rural clusters than

in urban clusters. Usually, the cost ratio in urban areas is smaller than that in rural areas. This would

lead to a smaller sample take in an urban cluster than in a rural cluster. In sum, this research indicates that for the most important survey indicators, a sample take of 20 to 25 households is appropriate in urban clusters and a sample take of 25 to 30 households is appropriate in rural clusters.

Based on values of $c_1/c_2$ and $\rho$ obtained from eight surveys, Table 1.5 below shows optimal sample takes for the indicator “proportion of currently married women 15-49 currently using any contraceptive method.” This indicator has a moderate intracluster correlation relative to other important survey indicators.

Table 1.5 Optimal sample take for currently married women 15-49 currently using any contraceptive method, based on intracluster correlation $\rho$ and survey cost ratio $c_1/c_2$ from past surveys

Country      Survey cost ratio $c_1/c_2$   Intracluster correlation $\rho$   Optimal sample take
Country 1    10                            0.025                             20
Country 2    10                            0.037                             16
Country 3    12                            0.067                             13
Country 4    12                            0.052                             15
Country 5    15                            0.084                             13
Country 6    27                            0.031                             29
Country 7    48                            0.058                             28
Country 8    52                            0.023                             47
Average      23                            0.047                             23
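The manual does not state the formula behind Table 1.5. As an illustration only, the sketch below assumes the classical optimum from cluster sampling theory, $m_{opt} = \sqrt{(c_1/c_2)(1-\rho)/\rho}$; the function name `optimal_take` and the list `TABLE_1_5` are hypothetical names introduced here. Under that assumption, the rounded results agree with the takes listed in the table.

```python
import math

def optimal_take(cost_ratio: float, rho: float) -> float:
    """Classical optimum cluster take: sqrt((c1/c2) * (1 - rho) / rho).

    Assumption: this textbook formula is used for illustration; the manual
    itself only reports the resulting values.
    """
    if cost_ratio <= 0 or not 0 < rho < 1:
        raise ValueError("cost_ratio must be positive and rho must lie in (0, 1)")
    return math.sqrt(cost_ratio * (1.0 - rho) / rho)

# (label, cost ratio c1/c2, intracluster correlation rho) as listed in Table 1.5
TABLE_1_5 = [
    ("Country 1", 10, 0.025),
    ("Country 2", 10, 0.037),
    ("Country 3", 12, 0.067),
    ("Country 4", 12, 0.052),
    ("Country 5", 15, 0.084),
    ("Country 6", 27, 0.031),
    ("Country 7", 48, 0.058),
    ("Country 8", 52, 0.023),
]

# Round each optimum to the nearest whole household
takes = [round(optimal_take(ratio, rho)) for _, ratio, rho in TABLE_1_5]
print(takes)
```

The sketch also makes the trade-off visible: the take grows with the cost ratio and shrinks as the intracluster correlation increases.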

1.9.2 Variable sample take for self-weighting

A fixed sample take per cluster is easy for survey management and implementation, but it

requires sampling weights that vary within a stratum. Different sampling weights result in larger

sampling errors compared with a similar sample of constant weight within a sampling stratum, i.e., a

self-weighting sample. A self-weighting sample consists of a sample of individuals in which each

individual has the same probability of being selected, and therefore a constant sampling weight is

used. In some cases a self-weighting sample is preferred for various reasons:

• it is equally representative for every individual of the target population;

• it reduces sampling errors.

Since the sample for DHS surveys is usually the result of a two-stage cluster sampling design,

it is necessary to coordinate the sample take for each of the selected clusters. In an overall self-

weighting sample, every individual in the target population has an equal probability of selection, which

results in a proportional allocation. However, proportional allocation is not feasible when sampling

domains are very different in size. Self-weighting at domain/stratum level, by contrast, is easy to

achieve.

Let $n$ be the total number of clusters selected for a DHS survey; let $n_h$ be the number of clusters allocated to the $h$th stratum; let $X_h$ be the total number of households in stratum $h$; and let $x_{hk}$ be the number of households in cluster $k$ of stratum $h$, as given by the sampling frame. Then the selection probability of cluster $k$ in stratum $h$ is given by:

$$\pi_{hk} = \frac{n_h x_{hk}}{X_h}$$

Let $x^*_{hk}$ be the number of households listed in the cluster in the household listing operation, and let $m_h$ be the number of households to be selected from the cluster for a fixed sample take. Then the overall selection probability of a household in the cluster is given by:

$$f_{hk} = \pi_{hk} \times \frac{m_h}{x^*_{hk}} = \frac{n_h x_{hk}}{X_h} \times \frac{m_h}{x^*_{hk}}$$

If $x^*_{hk} = x_{hk}$ exactly for all $k$ in stratum $h$, then it is easy to see that self-weighting is achieved in stratum $h$ by a constant sample take $m_h$ in all clusters, since

$$f_h = \frac{n_h m_h}{X_h}$$

is a constant in stratum $h$.

In practice, it is not possible that $x^*_{hk} = x_{hk}$ for all $h$ and $k$, especially when the last population census is no longer new. Therefore there is a need for sample coordination in order to achieve self-weighting. Let $f_h$ and $m_h$ be the calculated sampling fraction and average sample take in stratum $h$ according to the sample allocation, with $m_h = f_h X_h / n_h$; the number of households needed to achieve self-weighting in cluster $k$ of stratum $h$ is given by

$$m_{hk} = \frac{f_h X_h}{n_h} \times \frac{x^*_{hk}}{x_{hk}} = m_h \times \frac{x^*_{hk}}{x_{hk}}$$

which is a function of the ratio of the number of households listed over the number of households

given in the sampling frame for every cluster: take more if more are listed or take fewer if fewer are

listed. The above formula also shows that the sampling fraction is not a necessary parameter for

sample take calculation. Using the designed average sample take is a more direct method because the

sampling fraction is an abstract number. This formula is used in the self-weighting household selection

templates presented in Chapter 3, Section 3.2. The relationship between the sample take and the

cluster selection probability is given by

$$m_{hk} = \frac{f_h \, x^*_{hk}}{\pi_{hk}}$$

For practical considerations, the sample take calculated above needs to be adjusted if it is too

small or too large. Usually, we apply a cut-off to control the sample take within the range of a

minimum of 10 households and a maximum of 50 households per cluster. For the clusters where the

cut-off is applied, the sample is no longer self-weighting.
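The calculation above can be sketched as follows. This is a minimal illustration, not the logic of the DHS Excel templates: the function name `self_weighting_take` is hypothetical, rounding to the nearest integer is an assumption, and so is the extra cap that prevents selecting more households than were listed.

```python
def self_weighting_take(avg_take, frame_hh, listed_hh, min_take=10, max_take=50):
    """Return (take, cutoff_applied) for one cluster.

    avg_take  : designed average sample take m_h for the stratum
    frame_hh  : households in the cluster according to the frame, x_hk
    listed_hh : households found in the listing operation, x*_hk
    """
    if frame_hh <= 0 or listed_hh <= 0:
        raise ValueError("household counts must be positive")
    raw = avg_take * listed_hh / frame_hh            # m_hk = m_h * x*_hk / x_hk
    rounded = round(raw)                             # assumption: nearest integer
    take = max(min_take, min(max_take, rounded))     # cut-off to [min_take, max_take]
    take = min(take, listed_hh)                      # assumption: cannot exceed the listing
    return take, take != rounded

# Frame count 100, listing found 120 households: take more than the average of 25
print(self_weighting_take(25, 100, 120))
# Listing found far fewer households than the frame: the minimum cut-off applies
print(self_weighting_take(25, 200, 56))
```

Clusters flagged with `cutoff_applied` are the ones for which the stratum is no longer exactly self-weighting.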

The advantages and disadvantages of a self-weighting sample can be summarized as:

Advantages:

1) Equally representative for every individual within a sampling stratum.

2) Reduced sampling errors.

Disadvantages:

1) Difficult for survey management (for example, to distribute the work-load) because the sample take varies by cluster.

2) Difficult to control the expected sample size because of possible cut-offs, especially when

the upper limit cut-offs are employed.

3) The self-weighting is not exact because of the rounding of the sample takes, which introduces bias into the survey estimates.

4) Self-weighting at the national level will break down the specific sample allocation at the

domain/stratum level and bring the sample allocation back to a proportional allocation.

It is possible to overcome the second and the third disadvantages through a recursive

calculation of sample take by re-distributing the cut-offs to the rest of the clusters in the stratum or

control area, and by using a randomized sample take which allows non-integer numbers as sample

size. Excel templates for both the traditional procedure and revised procedure are available.

1.10 Household listing

The household listing operation is a fundamental operation in DHS surveys. After the EAs are

selected for the survey, a complete listing of dwelling units/households in the selected EAs is

conducted prior to the selection of households. The listing operation consists of visiting each of the

selected clusters, collecting geographic coordinates of the cluster, drawing a location map of the

cluster as well as a sketch map of the structures in the cluster, recording on listing forms a description

of every structure together with the names of the heads of the households in the structures and other

characteristics. Mapping and listing of households represents a significant field cost, but it is essential

to guarantee the exactness of sample implementation.

The listing operation is an important procedure for reducing non-sampling errors in the survey,

especially when the sampling frame is outdated. The listing operation provides a complete list of

occupied residential households in the EA. This information is necessary for an equal probability

random selection of households in the second stage. With the household listing prior to the main

survey, it is possible to pre-select the sample households in advance and the interviewers are asked to

interview only the pre-selected households without replacement of non-responding households. With

the sketch map and the household listing of the cluster produced in the household listing operation,

the sampled households can be easily relocated by interviewers later. The fieldwork procedure for DHS

surveys is designed to be replicable and therefore allows easy supervision; all these elements are

designed to prevent serious bias during data collection.

It is sometimes suggested that listing could be avoided by making segments so small that

they are equal to the required sample “take” per cluster. One could then use a “take-all” rule at the

last stage of sampling. Such small segments, however, will generally be difficult to delineate. In

planned urban areas, this difficulty may be reduced—one could adopt blocks, or even single buildings,

as segments—but urban units of this kind are likely to be homogeneous, containing similar

households, and therefore less than ideal as sampling clusters.

It is also not acceptable to attempt to avoid listing altogether by having interviewers create

clusters as they go along, or by selecting the sample households at fixed intervals during a random

walk up to a predetermined quota. Such methods are not acceptable because first, they do not

guarantee a nonzero probability to every potential respondent; second, the procedure is not

replicable, which complicates the field work supervision; and third, it can end up with a sample of easy

units because of the lack of effort to make call backs to households or individuals who were not

available at the first attempt to interview.

Listing costs can be reduced by using segmentation to decrease the size of the area which has

to be listed; however, segmentation generates its own costs, and skill in map making and map

interpretation is required. Segmentation becomes progressively more difficult as segments become

smaller because there are not enough natural boundaries to delineate very small segments. Moreover,

concentration of the sample into smaller segments increases the sampling error. Since neighbors’

characteristics are correlated, a smaller segment captures less of the variety existing in the

population; this leads to less efficient sampling. There is a point beyond which it is not useful to

attempt further segmentation. As a general rule the average segment size should not be less than 500

in population (approximately 100 households) in both urban and rural areas. However, segmentation yields smaller cost savings in urban areas because urban EAs are generally small geographic areas.

It is quite probable that some traditional tools in the household listing process will be modified

in the future by using more sophisticated technology such as the Global Positioning System (GPS) in order to collect more precise location information for the selected EAs. With this new tool we

can produce more precise distribution maps of the structures with less supervision than in the

traditional approach. The main feature is that every selected EA and every selected structure/dwelling

can be located with high precision and thus relocated later, if desirable. In addition, GPS information is

used more and more in DHS data analysis and presentation. At present, though, the recommended

protocol for collecting GIS information in DHS surveys is to collect one coordinate for every selected

cluster. See Chapter 2 for more details of the household listing operation.

1.11 Household selection in the central office

After the household listing operation, once the central office receives the completed listing

materials for a cluster, they must first create a serial number for each of the occupied residential

households, beginning with 1 and continuing to the total number of occupied residential households

listed in the cluster. An occupied residential household designates those households occupied at the

time of the listing, even if the occupant refused to cooperate at the time of listing, and those

households where the occupants were absent at the time of listing but neighbors confirmed that they

would not be absent for a long period and would be at home during the period of the main survey.

Only occupied residential households should be numbered. This serial number is an ID number for the

households. The household selection procedure will be performed based on this serial number.

Whether or not a household is considered occupied at the time of the listing is very important because

this fact will be related to the proportion of vacant households in the main survey.

The MEASURE DHS program has used several methods³ for selecting households within clusters, including:

1) Systematic selection: From a random starting point select every nth household (see

Chapter 3 Section 3.2 for more details).

2) Systematic selection with runs: From a random starting point, select a group of sequential

households called a “run”. Several runs may be used within a cluster. Runs are selected

with systematic selection. Selecting households in runs can greatly reduce the amount of

travel within cluster during data collection, especially in rural clusters where households

can be far apart.
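As an illustration of the first method, the sketch below selects serial numbers systematically using a fractional sampling interval and a random start. It is one common variant and not necessarily the procedure implemented in the DHS templates (see Chapter 3, Section 3.2 for those); the function name `systematic_selection` is hypothetical.

```python
import math
import random

def systematic_selection(n_listed, n_select, rng=None):
    """Return n_select household serial numbers (1-based), chosen systematically.

    n_listed : number of occupied residential households listed in the cluster
    n_select : sample take for the cluster
    rng      : optional random.Random instance, for reproducibility
    """
    if not 0 < n_select <= n_listed:
        raise ValueError("need 0 < n_select <= n_listed")
    rng = rng or random.Random()
    interval = n_listed / n_select        # sampling interval, may be fractional
    start = rng.random() * interval       # random start in [0, interval)
    selected = []
    for j in range(n_select):
        serial = math.floor(start + j * interval) + 1   # convert to 1-based serial
        selected.append(min(serial, n_listed))          # guard against float round-off
    return selected

# Example: 120 households listed, 25 to be selected
print(systematic_selection(120, 25, random.Random(1)))
```

Because the interval is at least 1, the selected serial numbers are distinct and increasing, and every listed household has the same selection probability `n_select / n_listed`.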

The advantages of household selection in the central office can be summarized as:

1) It allows for a check of coverage of the household listing results before the main survey

and for the review and possible relisting of problematic clusters in advance.

2) Sampled households are pre-determined which prevents potential bias introduced by

allowing the interviewers to select in the field which households are to be interviewed.

³ The MEASURE DHS program has developed various Excel templates for household selection in the central office: systematic selection, systematic selection with runs, self-weighting selection with and without control of sample size and with or without runs. Once the household listing is completed, it is possible to just copy the number of households listed in a cluster into the spreadsheet and the spreadsheet will show the selected household numbers automatically. See Chapter 3 Section 3.2.2 for details.

3) The field work procedure is exactly replicable which provides the possibility of easy and

close supervision of the field work.

4) It is easier to control the work load for each interviewing team.

However, in cases when travelling between clusters represents a substantial cost, it is possible

to forego the step of selecting households in the central office. In such cases, the household listing

operation and the main survey can be combined into a single field operation. No essential changes are

needed in the household listing procedure or household numbering, but making a detailed sketch map

for the cluster may not be necessary because the listing team and the interviewing team are the

same, and the household interview will begin immediately after the listing, so identifying the exact

selected households during a separate visit is no longer a problem. The household selection must be

done in the field manually if portable computers are not available. Some manual selection procedures

have been developed for this purpose. Household listing and interviewing are two very different jobs,

so in surveys where listing, selection and interviewing take place in the same visit by the same staff,

it may be necessary to conduct more extensive training of field teams before the field work begins and

to supervise the teams more closely during the fieldwork. See Chapter 3 Section 3.2.2 for more details

for manual household selection.

1.12 Household interviews

The household interview procedure is out of the scope of this manual since it is explained in

detail in the interviewer’s manual. This section will briefly discuss the main statistical points of the

household interview. After the household selection, interviewers will be recruited and trained for the

household and individual interviews. The training of the interviewer is an intensive training lasting at

least four weeks for a standard DHS survey, and longer if the survey includes many biomarkers. Prior

to the training, a pretest of the questionnaire will be conducted in a small number of clusters not

selected for the main survey to assess the quality of the questionnaires and the understanding of the

translations by interviewers and respondents. Problems and potential errors observed in the pretest

will be addressed and resolved prior to fieldwork training. Finally, the interviewing team will be sent to

selected clusters with a certain work load per team.

Once training is complete, teams of interviewers will be assigned a list of clusters and

deployed to the field. Upon arrival in a new area, the interviewer team must first contact the local

authorities for help to identify the correct cluster and to solicit cooperation during the field work. A

team leader or supervisor is assigned for each interviewing team. The supervisor is responsible for

cluster identification and should guarantee that the correct cluster will be interviewed. After checking

the listing materials and verifying with the local authorities, the supervisor will distribute the sampled

households among the interviewers. After locating a selected household, the interviewer will begin

with a brief household interview, listing household members and visitors, and identifying among them

all eligible women and men for the individual interview. Eligible individuals are defined as those who

are in the specified age group (15-49), and are either usual members of the selected household or

who slept in the household the night before the interviewer’s visit.

Conscious omission of eligible individuals on the part of an interviewer by mis-reporting their

age outside of the eligible age group is a real concern. Measures to eliminate this problem should be

undertaken. For example, the field editor should check the consistency of each completed

questionnaire and, if suspicious things are identified, should return to the household for further

verification of key items such as the number of household members, number of eligible individuals and

number of children under age five.

In the event of failure to contact a household or an eligible person in the first visit, the

interviewer is required to make at least two repeat visits, or call backs, on different days and at

different times of the day before the interview is abandoned. The process of making call backs

requires the teams to stay in a cluster for at least two to three days. Some countries propose large

interviewing teams in order to try to cover an entire cluster in one day. This process is not acceptable

for a DHS survey, even when the designed sample size can bear a large non-response rate, because

non-response biases the survey results. A quick survey usually ends up with poor data quality. Both

theory and practice prove that call backs and efforts to get difficult units to respond to the survey are

the best way to remove bias and reduce the non-sampling errors to a minimum. For more details,

refer to the DHS Survey Organization Manual and the Interviewer’s Manual.

1.13 Sampling weight calculation

1.13.1 Why we need to weight the survey data

A DHS sample is a representative sample randomly selected from the target population. Each

interviewed unit (household and individual) represents a certain number of similar units in the target

population. In order for any statistical inferences drawn from the survey data to be valid, this

representativeness of the sample must be taken into account. In general terms, sampling weights are

used to make the sample more like the target population. All analyses should use the sampling

weights calculated for each interviewed household and for each interviewed individual.

A sampling weight is an inflation factor which extrapolates the sample to the target

population. For example, if equal probability sampling (or a self-weighting sample) is applied in a

domain with a sampling fraction 1/500, this means that each sampled individual represents 500

similar individuals in the target population. Therefore, if we observed one particular individual having

secondary education, we would conclude that there are 500 individuals in the target population having

secondary education, corresponding to this particular individual. The total number of individuals with

secondary education in the target population would be 500 times the total number of interviewed

individuals having secondary education observed in the sample. This explanation also applies to

unequal probability sampling. It is very important that sampling weights are properly calculated and

applied in data analysis; otherwise, serious bias may be introduced, leading to incorrect conclusions.

Although all of the DHS indicators are means, proportions, rates or ratios, since a nationwide

self-weighting sample is not usually feasible due to study domains as explained in Section 1.9,

sampling weights are always necessary. Even when a survey is designed to be nationally self-

weighting, it is necessary to correct for the different response patterns across domains/strata (see

Section 1.13.4 for more details). Therefore, even surveys with self-weighting sample designs require

the use of sampling weights.

Though the effect of sampling weights on survey indicators may be small, it is necessary to

use sampling weights for the following reasons:

1) For valid statistical inference.

2) For correcting or reducing bias; weighting can reduce bias introduced by non-response or

other non-sampling errors.

3) For keeping the weighted sample distribution close to the target population distribution,

especially when oversampling is applied in certain domains/strata.

1.13.2 Design weights and sampling weights

The MEASURE DHS program calculates both design weights and sampling weights (or survey

weights) for both households and individuals. The design weight of a sampling unit (household or

individual) is the inverse of the overall probability with which the unit was selected in the sample. The

sampling weight of a sampling unit is the design weight corrected for non-response or other

calibrations.

Since the DHS protocol involves no selection of eligible individuals within a sampled

household (except for the domestic violence module, in which one eligible woman is selected from a

sampled household), all eligible individuals from the same household share the same design weight,

which is the same as the household’s design weight. Therefore, the design weight is the basic weight

for DHS surveys. All other weights are calculated based on the design weight. In calculating the

sampling weight, it is possible to correct for both unit non-response (a sampling unit is not

interviewed at all) and item non-response (the sampling unit does not provide an answer to a specific

question). The policy of the MEASURE DHS program is to correct for unit non-response at the stratum

level (see Section 1.13.4) and leave the correction of item non-response to data users because it is

variable specific. Correction of unit non-response at cluster level will increase the variability of

sampling weights and therefore increase sampling errors. Because the correction for unit non-

response is the same for an entire cluster and because household selection within a cluster is an equal

probability selection, all the households in the same cluster share the same design weight and

sampling weight, and the same is true for all individuals in the same cluster. This means that the DHS

weights (both design weights and sampling weights) are cluster weights.

1.13.3 How to calculate the design weights

Assuming that a DHS survey sample is drawn with two-stage, stratified cluster sampling,

design weights will be calculated based on the separate sampling probabilities for each sampling stage

and for each cluster. We use the following notations:

$P_{1hi}$: first-stage sampling probability of the $i$th cluster in stratum $h$

$P_{2hi}$: second-stage sampling probability within the $i$th cluster (household selection)

Let $n_h$ be the number of clusters selected in stratum $h$; let $M_{hi}$ be the measure of size of the cluster used in the first-stage selection (usually the number of households residing in the cluster according to the sampling frame); and let $\sum M_{hi}$ be the total measure of size in stratum $h$. The probability of selecting the $i$th cluster in the sample is calculated as follows:

$$P_{1hi} = \frac{n_h M_{hi}}{\sum M_{hi}}$$

Let $b_{hi}$ be the proportion of households in the selected cluster compared to the total number of households in EA $i$ in stratum $h$ if the EA is segmented; otherwise $b_{hi} = 1$. Then the probability of selecting cluster $i$ in the sample is:

$$P_{1hi} = \frac{n_h M_{hi}}{\sum M_{hi}} \times b_{hi}$$

Let $L_{hi}$ be the number of households listed in the household listing operation in cluster $i$ in stratum $h$; let $t_{hi}$ be the number of households selected in the cluster. The second-stage selection probability for each household in the cluster is calculated as follows:

$$P_{2hi} = \frac{t_{hi}}{L_{hi}}$$

The overall selection probability of each household in cluster i of stratum h is therefore the

product of the selection probabilities of the two stages:

$$P_{hi} = P_{1hi} \times P_{2hi}$$

The design weight for each household in cluster i of stratum h is the inverse of its overall

selection probability:

$$d_{hi} = 1 / P_{hi}$$

The calculation of the design weight is not complicated; however, difficulties often result from

not having all the design parameters involved in the above calculation because they are not well

documented, especially when the sampling frame is a master sample. See Chapter 5 for more details

on sample documentation.
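The design weight calculation in this section can be sketched as follows. The function and parameter names are hypothetical and the numbers are invented for illustration; they are not taken from any survey.

```python
def design_weight(n_h, M_hi, M_h_total, L_hi, t_hi, b_hi=1.0):
    """Household design weight d_hi = 1 / (P_1hi * P_2hi) for one cluster.

    n_h       : number of clusters selected in stratum h
    M_hi      : measure of size of cluster i (frame households)
    M_h_total : total measure of size in stratum h (sum of M_hi)
    L_hi      : households listed in the cluster
    t_hi      : households selected in the cluster
    b_hi      : segmentation proportion (1.0 if the EA was not segmented)
    """
    p1 = n_h * M_hi / M_h_total * b_hi   # first-stage probability P_1hi
    p2 = t_hi / L_hi                     # second-stage probability P_2hi
    if not (0 < p1 <= 1 and 0 < p2 <= 1):
        raise ValueError("selection probabilities must lie in (0, 1]")
    return 1.0 / (p1 * p2)

# Invented example: 20 clusters drawn from a stratum of 30,000 frame households;
# the cluster has 150 frame households, 160 listed, and 25 selected.
d = design_weight(n_h=20, M_hi=150, M_h_total=30000, L_hi=160, t_hi=25)
print(d)   # each sampled household stands for about 64 households

# Same cluster, but only half of the EA was retained after segmentation
d_segmented = design_weight(20, 150, 30000, 160, 25, b_hi=0.5)
```

Every input to this function is a design parameter that has to be documented during sampling, which is why missing documentation makes weight calculation difficult.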

1.13.4 Correction of unit non-response and calculation of sampling weights

The design weight calculated above is based on sample design parameters. If there is no non-

response at the cluster level, at the household level, or at the individual level, the design weight is

enough for all analyses, for both household indicators and individual indicators. However, non-

response is inevitable in all surveys, and different units have different response behaviors. The

experience of the MEASURE DHS program shows that urban households are less likely to respond to

the survey than their counterparts in rural areas, households in developed regions are less likely to

respond to the survey than their counterparts in less-developed regions, rich households are less likely

to respond to the survey than poor households, individuals with higher levels of education are less

likely to respond to the survey than those with lower levels of education, men are less likely to

respond to the survey than women, and so forth.

The idea of correcting for unit non-response is to calculate a response rate for each

homogeneous response group, then inflate the design weight by dividing it by the response rate for

each response group. The construction of homogeneous response groups depends on the knowledge

of the response behavior of the sampling units. DHS surveys always use the sampling stratum as the

response group because the stratification is usually achieved by regrouping homogeneous sampling

units in a single stratum. It is possible to use a cluster as a response group, but the disadvantage is

that the response rates may vary too much at the cluster level, which will increase the variability of

the sampling weight, which in turn increases the sampling variance. Furthermore, correction of non-response at the cluster level will interfere with self-weighting if a self-weighting sample has been

designed.

By assuming that the response groups coincide with the sampling strata, the following steps

explain how to calculate the sampling weight by first calculating the various response rates for unit

non-response. Please note that the response rates calculated here are different from the response

rates calculated in Appendix A of DHS survey final reports. In Appendix A, household and individual

response rates are calculated as ratios of the number of interviewed units over the number of eligible

units because the aim is just to show the results of survey implementation. Here we use weighted

ratios because the aim is to correct the design weight to compensate for non-response, therefore the

design weight should be involved. Because a non-responding unit with a large design weight will

have a larger impact on survey estimates than a non-responding unit with a small design weight, a

weighted response rate for correction of non-response is better than an un-weighted response rate.

1. Cluster level response rate

Let $n_h$ be the number of clusters selected in stratum $h$; let $n^*_h$ be the number of clusters interviewed. The cluster level response rate in stratum $h$ is therefore

$$R_{ch} = n^*_h / n_h$$

2. Household level response rate

Let $m_{hi}$ be the number of households found (see Chapter 2, Section 2.10 for definition) in cluster $i$ of stratum $h$; let $m^*_{hi}$ be the number of households interviewed in the cluster. The household response rate in stratum $h$ is calculated by

$$R_{hh} = \frac{\sum d_{hi} m^*_{hi}}{\sum d_{hi} m_{hi}}$$

where $d_{hi}$ is the design weight of cluster $i$ in stratum $h$; the summation is over all clusters in the stratum $h$.

3. Individual response rate

Let $k_{hi}$ be the number of eligible individuals found in cluster $i$ of stratum $h$; let $k^*_{hi}$ be the number of individuals interviewed. The individual response rate in stratum $h$ is calculated as

$$R_{ph} = \frac{\sum d_{hi} k^*_{hi}}{\sum d_{hi} k_{hi}}$$

where $d_{hi}$ is the design weight of cluster $i$ in stratum $h$; the summation is over all clusters in the stratum $h$.

The household sampling weight of cluster $i$ in stratum $h$ is calculated by dividing the household design weight by the product of the cluster response rate and the household response rate, for each of the sampling strata:

$$D_{hi} = d_{hi} / (R_{ch} \times R_{hh}),$$

for cluster $i$ of stratum $h$.

The individual sampling weight of cluster i in stratum h is calculated by dividing the household

sampling weight by the individual response rate, or equivalently, by dividing the household design

weight by the product of the cluster response rate, the household response rate and the individual

response rate, for each of the sampling strata:

$$W_{hi} = D_{hi} / R_{ph} = d_{hi} / (R_{ch} \times R_{hh} \times R_{ph}),$$

for cluster $i$ of stratum $h$.

It is easy to see that the difference between the household sampling weights and the

individual sampling weights is introduced by individual non-response.
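The three response rates and the resulting sampling weights can be sketched for a single stratum as follows. The class `Cluster`, the function `stratum_sampling_weights` and all numbers are hypothetical and serve only to illustrate the formulas above.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    d: float         # design weight d_hi
    hh_found: int    # households found, m_hi
    hh_done: int     # households interviewed, m*_hi
    ind_found: int   # eligible individuals found, k_hi
    ind_done: int    # individuals interviewed, k*_hi

def stratum_sampling_weights(n_selected, clusters):
    """Return [(D_hi, W_hi), ...] for the interviewed clusters of one stratum.

    n_selected : number of clusters selected in the stratum, n_h
    clusters   : the interviewed clusters (their count is n*_h)
    """
    r_c = len(clusters) / n_selected                      # cluster response rate R_ch
    r_h = (sum(c.d * c.hh_done for c in clusters)
           / sum(c.d * c.hh_found for c in clusters))     # weighted household rate R_hh
    r_p = (sum(c.d * c.ind_done for c in clusters)
           / sum(c.d * c.ind_found for c in clusters))    # weighted individual rate R_ph
    return [(c.d / (r_c * r_h), c.d / (r_c * r_h * r_p)) for c in clusters]

# Invented stratum: both selected clusters were interviewed
stratum = [
    Cluster(d=50.0, hh_found=20, hh_done=20, ind_found=20, ind_done=18),
    Cluster(d=100.0, hh_found=20, hh_done=14, ind_found=20, ind_done=18),
]
weights = stratum_sampling_weights(n_selected=2, clusters=stratum)
for household_weight, individual_weight in weights:
    print(round(household_weight, 2), round(individual_weight, 2))
```

In this invented stratum the weighted household response rate is 2400/3000 = 0.80, whereas the unweighted rate would be 34/40 = 0.85, because the non-response is concentrated in the cluster with the larger design weight.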

The sampling weights for households selected for the men’s survey and for men can be

calculated similarly. We need a separate household sampling weight for the men’s survey in cases

where the men’s survey is conducted in a sub-sample of households selected for the women’s survey,

and we suppose that the response behavior of households in the men’s survey sub-sample may be

different from the overall household response rate.

If no normalization is requested, we can stop here. The above calculated household sampling

weight and individual sampling weight can be used to produce any indicators at the household level

and the individual level, respectively. As we mentioned earlier in Section 1.13.1, a sampling weight is

an inflation or extrapolation factor. The weighted sum of households interviewed

$$T = \sum D_{hi} m^*_{hi}$$

is an unbiased estimate of the total number of ordinary residential households of the country, where $m^*_{hi}$ is the number of households interviewed in the $i$th cluster of stratum $h$, and the summation is over all clusters and strata in the total sample. Similarly, the weighted sum of all interviewed women

$$W = \sum W_{hi} k^*_{hi}$$

is an unbiased estimate of the total number of women in the target population (women age 15-49) of the country, where $k^*_{hi}$ is the number of women interviewed in the $i$th cluster of stratum $h$, and the summation is over all clusters and strata in the total sample.

1.13.5 Normalization of sampling weights

Normalization of sampling weights is not necessary for survey data analysis. In order to

prevent large numbers for the number of weighted cases in the tables in DHS survey final reports, it is

the MEASURE DHS tradition to calculate normalized standard weights for both households and

individuals. With the normalized standard weight, the number of unweighted cases coincides with the

number of weighted cases at the national level for both total households and total individuals. The

normalized standard weight of a sampling unit is calculated from its sampling weight by

multiplying the sampling weight by a single constant defined at the national level. This constant, the

normalization factor, is the total number of completed cases divided by the total number of weighted

cases (based on the sampling weight). This number is equal to the estimated total sampling fraction

because the total number of weighted cases with the sampling weight is an estimate of the total

target population. Therefore the standard weights in the DHS data files are relative weights. Relative

weights can be used to estimate means, proportions, rates and ratios because the normalization factor

is cancelled out when used in both numerator and denominator, so it has no effect on the calculated

indicator values. This point also explains why the normalization must be done at the national level and

not the regional level: at the regional level, the normalization factor cannot be cancelled out, and bias

will be introduced in the calculated indicator values. Because the normalized standard weights have no

scale, they are not valid for estimating totals. Also the normalized weight is not valid for pooled data,

even for data pooled for women and men in the same survey, because the normalization factor is

country and sex specific.
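The cancellation argument above can be checked numerically. The following is a minimal sketch with two hypothetical regions and invented counts; it compares a proportion computed with the sampling weights, with nationally normalized weights, and with weights normalized separately in each region.

```python
import math

# Hypothetical regions: (sampling weight, completed cases, cases with the attribute)
REGIONS = {"A": (100.0, 500, 100), "B": (1000.0, 500, 250)}


def weighted_proportion(weights):
    """Weighted proportion with the attribute, given one weight per region."""
    num = sum(weights[r] * hits for r, (_, _, hits) in REGIONS.items())
    den = sum(weights[r] * n for r, (_, n, _) in REGIONS.items())
    return num / den


sampling = {r: w for r, (w, _, _) in REGIONS.items()}

# National normalization: a single factor for the whole sample.
n_total = sum(n for _, n, _ in REGIONS.values())
weighted_total = sum(w * n for w, n, _ in REGIONS.values())
national_factor = n_total / weighted_total
national = {r: w * national_factor for r, w in sampling.items()}

# Regional normalization (the case the text warns against): one factor per region.
regional = {r: w * (n / (w * n)) for r, (w, n, _) in REGIONS.items()}

p_sampling = weighted_proportion(sampling)
p_national = weighted_proportion(national)
p_regional = weighted_proportion(regional)
print(p_sampling, p_national, p_regional)
```

In this toy data the national factor leaves the proportion unchanged, while the per-region factors flatten both regions to a weight of 1 and pull the estimate toward the unweighted value.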

1. Normalized household standard weight [4]

The normalization factor for calculating household standard weight is calculated as

$$FH = \frac{\sum_h \sum_i m^*_{hi}}{\sum_h \sum_i D_{hi}\, m^*_{hi}}$$

The household standard weight for cluster i in stratum h is calculated by

$$HV005_{hi} = D_{hi} \times FH = D_{hi} \times \frac{\sum_h \sum_i m^*_{hi}}{\sum_h \sum_i D_{hi}\, m^*_{hi}}$$

where HV005 is the household standard weight variable in the DHS Recode data files.

[4] The MEASURE DHS program has developed Excel templates for facilitating standard weight calculations. If all

design parameters and the survey results (number of households found and interviewed, number of eligible women

found and interviewed, number of eligible men found and interviewed, number of eligible women and men found

and tested, by cluster) are provided in the input page, the standard weights will be calculated automatically in

different pages.

It is easy to see that the weighted sum of households interviewed by using the standard

weight equals the unweighted sum of households interviewed for the total sample. This condition will

not be met at the domain level or for sub-populations. At the domain level, the weighted sum of

households interviewed may be larger or smaller than the unweighted sum of households interviewed,

depending on whether the domain is undersampled or oversampled.
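The household standard weight calculation, and the behavior of the weighted counts at the domain level, can be sketched as follows. The cluster values, domain names, and function names are hypothetical and chosen only to make the over- and undersampling visible.

```python
# Hypothetical clusters: (domain, D_hi, m*_hi)
CLUSTERS = [
    ("north", 200.0, 25),   # oversampled domain: small sampling weights
    ("north", 250.0, 20),
    ("south", 1200.0, 24),  # undersampled domain: large sampling weights
    ("south", 1000.0, 21),
]


def household_standard_weights(records):
    """Return (FH, list of standard weights), one weight per cluster."""
    interviewed = sum(m for _, _, m in records)
    weighted = sum(d * m for _, d, m in records)
    fh = interviewed / weighted          # normalization factor FH
    return fh, [d * fh for _, d, _ in records]


FH, hv005 = household_standard_weights(CLUSTERS)


def weighted_cases(domain=None):
    """Weighted number of interviewed households, optionally for one domain."""
    return sum(w * m for (dom, _, m), w in zip(CLUSTERS, hv005)
               if domain is None or dom == domain)


print(weighted_cases(), weighted_cases("north"), weighted_cases("south"))
```

Each hypothetical domain has 45 interviewed households, yet the weighted count falls below 45 in the oversampled domain and rises above 45 in the undersampled one, while the national total stays at 90.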

2. Normalized women’s standard weight

The normalization factor for calculating the women’s standard weight is calculated as

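$$FW = \frac{\sum_h \sum_i k^*_{hi}}{\sum_h \sum_i W_{hi}\, k^*_{hi}}$$

The women’s standard weight for cluster i in stratum h is calculated by

$$V005_{hi} = W_{hi} \times FW = W_{hi} \times \frac{\sum_h \sum_i k^*_{hi}}{\sum_h \sum_i W_{hi}\, k^*_{hi}}$$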

where V005 is the women’s standard weight variable in the DHS Recode data files.

The standard weights for households selected for the men’s survey and for men can be

calculated in a similar way.

1.13.6 Standard weights for HIV testing

The sampling weights for HIV testing are calculated separately for women and men, but they

are calculated using the same methodology. The only difference is in the calculation of the

normalization factors, if a normalized weight is requested. In order to calculate the weighted HIV

prevalence for women and men together using a normalized weight, the standard weight for HIV

testing must be normalized for women and men together. In most DHS surveys, HIV testing is

conducted in the same subsample of households selected for men’s survey, and every woman or man

in the household who is eligible for the individual interview is eligible for HIV testing. Once the

household sampling weight for the men’s survey is calculated using the procedures stated in Section

1.13.5, the sampling weights for HIV testing for women and men may be calculated separately by

correcting the household sampling weight for the non-response rates of women and men for HIV

testing, respectively. For simplicity, let $MD_{hi}$ be the household sampling weight in cluster i of stratum h

for the men’s survey sub-sample. The response rates to HIV testing for women and men are calculated

respectively by

$$WR_{hi} = \frac{\sum MD_{hi}\, WHIV^*_{hi}}{\sum MD_{hi}\, WHIV_{hi}}$$

$$MR_{hi} = \frac{\sum MD_{hi}\, MHIV^*_{hi}}{\sum MD_{hi}\, MHIV_{hi}}$$

where $WHIV_{hi}$ is the number of women eligible for HIV testing, and $WHIV^*_{hi}$ is the number of women

tested with a valid test result, in cluster i of stratum h; $MHIV_{hi}$ and $MHIV^*_{hi}$ are the number of men

eligible and the number of men tested with a valid test result, respectively, in cluster i of stratum h.

The sampling weights for HIV testing for women and men, respectively, are calculated by

$$HIV^W_{hi} = MD_{hi} / WR_{hi}, \qquad HIV^M_{hi} = MD_{hi} / MR_{hi}$$

In cluster i of stratum h, the normalized standard weights for HIV testing for women and men,

respectively, are calculated by

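$$HIV05^W_{hi} = HIV^W_{hi} \times \frac{\sum_h \sum_i \left( WHIV^*_{hi} + MHIV^*_{hi} \right)}{\sum_h \sum_i \left( HIV^W_{hi} \times WHIV^*_{hi} + HIV^M_{hi} \times MHIV^*_{hi} \right)}$$

$$HIV05^M_{hi} = HIV^M_{hi} \times \frac{\sum_h \sum_i \left( WHIV^*_{hi} + MHIV^*_{hi} \right)}{\sum_h \sum_i \left( HIV^W_{hi} \times WHIV^*_{hi} + HIV^M_{hi} \times MHIV^*_{hi} \right)}$$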

where the double summations are over all clusters and strata in the total sample.
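The chain from response rates to jointly normalized HIV weights can be sketched as below. The cluster values and function name are hypothetical. The level at which the response-rate sums are taken is not restated in this section, so the sketch assumes they run over the clusters of each stratum; a different level would change only the grouping key.

```python
from collections import defaultdict

# Hypothetical clusters: (stratum, MD_hi, WHIV_hi, WHIV*_hi, MHIV_hi, MHIV*_hi)
CLUSTERS = [
    ("urban", 500.0, 20, 16, 18, 12),
    ("urban", 520.0, 22, 18, 20, 15),
    ("rural", 900.0, 25, 23, 21, 18),
    ("rural", 950.0, 24, 21, 22, 20),
]


def hiv_standard_weights(records):
    """Return per-cluster lists of normalized HIV weights (women, men)."""
    # Weighted response-rate components, pooled within stratum (an assumption).
    tested_w, eligible_w = defaultdict(float), defaultdict(float)
    tested_m, eligible_m = defaultdict(float), defaultdict(float)
    for h, md, w_elig, w_test, m_elig, m_test in records:
        tested_w[h] += md * w_test
        eligible_w[h] += md * w_elig
        tested_m[h] += md * m_test
        eligible_m[h] += md * m_elig

    # Sampling weights: household weight divided by the response rate.
    hiv_w = [md * eligible_w[h] / tested_w[h] for h, md, *_ in records]
    hiv_m = [md * eligible_m[h] / tested_m[h] for h, md, *_ in records]

    # One normalization factor for women and men together.
    n_tested = sum(r[3] + r[5] for r in records)
    n_weighted = sum(w * r[3] + m * r[5]
                     for w, m, r in zip(hiv_w, hiv_m, records))
    factor = n_tested / n_weighted
    return [w * factor for w in hiv_w], [m * factor for m in hiv_m]


hiv05_w, hiv05_m = hiv_standard_weights(CLUSTERS)
```

With the joint factor, the weighted number of tested women and men combined equals the unweighted number tested, which is what allows a combined prevalence to be computed from the normalized weights.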

1.13.7 De-normalization of standard weights for pooled data

For all of the DHS data, the weight variables HV005 (household standard weight), V005

(women’s standard weight) and MV005 (men’s standard weight) are relative weights which are

normalized so that the total number of weighted cases is equal to the total number of unweighted

cases, for the three kinds of units. In some situations, such as analyses involving data from more than

one survey, data users may need the un-normalized sampling weight for analyzing pooled data. As

mentioned in Section 1.13.5, since normalization is country specific and sex specific, it is necessary to

de-normalize the standard weights provided in the DHS Recode data files for analyzing pooled data.

The normalization procedure consists of multiplying the sampling weight by a normalization

factor for the total sample. The normalization factor is the estimated total sampling fraction: the

number of completed cases divided by the number of weighted cases by using the sampling weight,

for each kind of sampling unit. The weighted number of cases with sampling weight is an estimation of

the total target population. Therefore, in order to de-normalize a normalized weight, simply divide the

normalized weight by the total sampling fraction. The estimated total sampling fraction is usually not

provided in the DHS data file or in the final report. In order to calculate the total sampling fraction, it

is necessary to know the total target population at the time of the survey. The total target population

at the time of the survey is easy to get from various sources. The country’s statistical office, the

United Nations Population Division’s (UNPD) World Population Prospects [5], and the United Nations

Population Fund (UNFPA) are three sources that may be easy to access.

As mentioned above, if pooled data analysis is required, the standard weight variables HV005,

V005 and MV005 must be rescaled or de-normalized. The de-normalization procedure is the inverse of

the normalization procedure: that is, multiply the standard weight by the target population and divide

by the number of completed cases, for each survey. The de-normalized weights for households,

women and men (HV005*, V005*, and MV005*, respectively) can be calculated using the following

formulas:

HV005* = HV005 × (total number of residential households in the country)/

(total number of households interviewed in the survey)

V005* = V005 × (total female population 15-49 in the country)/

(total number of women 15-49 interviewed in the survey)

MV005* = MV005 × (total male population 15-49 (15-59) in the country)/

(total number of men 15-49 (15-59) interviewed in the survey)

[5] http://esa.un.org/unpd/wpp/index.htm

If normalized weights are preferred, the above re-scaled weights can be re-normalized by

multiplying by the total number of completed women’s and men’s interviews combined and dividing by

the total number of weighted cases combined, where the weighted cases are obtained by applying the

above re-scaled weights to the pooled data.

Note that the normalization of sampling weights is done for the total sample for households,

women and men separately. If the aim is to tabulate indicators for a certain sub-population from

pooled data, for example, vaccination coverage for children 12-23 months, the de-normalization has

nothing to do with the total population of children 12-23 months because there is no standard weight

calculated for children 12-23 months in DHS surveys. If the indicator is tabulated at the household

level using the household weight, the household standard weights must be de-normalized for all of the

surveys included in the analysis as explained above; likewise, if the indicator is tabulated at the

individual level using the women’s (or child’s mother’s) weight, the women’s standard weights must

be de-normalized for each of the surveys.
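The de-normalization and optional re-normalization steps of this section can be sketched as follows. The function names, the two toy surveys, and their population figures are hypothetical. The sketch also assumes the standard weights have already been converted to decimal form (the Recode weight variables are generally stored as integers with six implied decimals).

```python
def denormalize(standard_weights, target_population, completed_cases):
    """Rescale standard weights: multiply by population / completed cases."""
    scale = target_population / completed_cases
    return [w * scale for w in standard_weights]


def renormalize(weights):
    """Rescale pooled weights so that weighted cases equal unweighted cases."""
    factor = len(weights) / sum(weights)
    return [w * factor for w in weights]


# Two toy surveys with two respondents each; standard weights average to 1.
survey_a = denormalize([0.8, 1.2], target_population=400, completed_cases=2)
survey_b = denormalize([1.5, 0.5], target_population=200, completed_cases=2)

pooled = survey_a + survey_b              # weights now share a common scale
pooled_normalized = renormalize(pooled)   # optional, if relative weights are preferred
```

After de-normalization each survey's weights sum to its own target population, so the larger population contributes proportionally more to pooled estimates.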

1.14 Calibration of sampling weights in case of bias

Generalized calibration (Deville and Särndal, 1992; Deville et al, 1993) has now become a

popular and powerful framework in survey data analysis for statistical offices in many countries. It

allows for the utilization of different sources of auxiliary information to improve estimates from sample

surveys. Calibration can reduce sampling errors, can correct bias caused by non-response and other

non-sampling errors, and can reduce the influence of extreme values. Calibration is a “weight tuning”

procedure such that the tuned sampling weight can produce estimates without error for known

population characteristics. The precision of an estimator using a calibrated weight is equivalent to a

regression estimator but is much easier to calculate with the help of calibration software such as

CALMAR, a SAS Macro procedure developed by the French Institute of Statistics and Economic Studies

(INSEE), and the SPSS procedure developed by Statistics Belgium. DHS surveys employ calibration of

sampling weights only in cases where serious bias is observed in the collected data, and there is

reliable auxiliary information available for the calibration.

Let X be a multivariate auxiliary variable with p components such that the population totals of

each of the component variables are known beforehand from the recent population census, that is,

$$t_x = \sum_{i \in U} X_i = (t_{x_1}, t_{x_2}, \ldots, t_{x_p})^{\tau}$$

is known. Let $x_i$ be the observations of the auxiliary variables from the survey,

$$x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})^{\tau}$$

for the respondent sampling unit i. Let $D_i$ be the sampling weight for unit i. The calibration

procedure consists of modifying the sampling weight slightly from $D_i$ to $W_i$ such that a given

distance measure between the sampling weights $D_i$ and the calibrated weights $W_i$,

$$\sum_{i \in s} g(W_i, D_i)$$

is minimized under the constraints

$$\sum_{i \in s} W_i x_i = t_x$$

where g is a distance function which measures the distance between $D_i$ and $W_i$. The constraints

imposed are that the known auxiliary variable totals are estimated without error with the calibrated

weights. If the variable of interest is well correlated with the auxiliary variables, then we expect that

the precision can be greatly improved for estimating the variable of interest. The calibration theory

states that the calibrated weights have the following formula

$$W_i = D_i\, F\left( q_i x_i^{\tau} \lambda(s) \right)$$

where $F(\cdot)$ is called the calibration function, which is the inverse function of the derivative of the

distance function g; $q_i$ is a calibration weight which is usually set to 1 in the lack of prior knowledge;

$\lambda(s)$ is a vector of constants depending on the particular sample s, which is to be solved.

When

$$F\left( q_i x_i^{\tau} \lambda(s) \right) = 1 + q_i x_i^{\tau} \lambda(s),$$

which corresponds to one of the five calibration functions proposed in Deville et al, 1993, the problem

is easy to solve, and $\lambda(s)$ is given by

$$\lambda(s) = T_s^{-1} \left( t_x - \hat{t}_{x\pi} \right)$$

with

$$T_s = \sum_{i \in s} D_i q_i x_i x_i^{\tau}$$

For a given variable of interest y, the calibrated estimator of the population total is equivalent

to the generalized regression estimator

$$\hat{t}_y = \sum_{i \in s} W_i y_i = \hat{t}_{y\pi} + \hat{B}_s^{\tau} \left( t_x - \hat{t}_{x\pi} \right)$$

where

$$\hat{B}_s = T_s^{-1} \sum_{i \in s} q_i D_i x_i y_i$$

is the sample estimation of the regression coefficient; $\hat{t}_{y\pi}$ and $\hat{t}_{x\pi}$ are the

simple estimators using the sampling weight

$$\hat{t}_{y\pi} = \sum_{i \in s} D_i y_i, \qquad \hat{t}_{x\pi} = \sum_{i \in s} D_i x_i$$

A mean estimation of the variable of interest y can be calculated by

$$\hat{\bar{Y}} = \frac{\sum_{i \in s} W_i y_i}{\sum_{i \in s} W_i}$$

The calibration estimator can be equivalently formulated with known proportions of one or

more auxiliary variables. The calibration can be conducted at the individual level, which will result in

an individual specific weight, or it can be conducted at the cluster level with aggregated data, which

will result in a cluster weight. For more details see the related references given at the end of this

document.
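The linear calibration case described above can be sketched in plain Python. This is an illustration under stated assumptions, not the CALMAR or SPSS procedure: the function names, the five-unit sample, and the known totals are hypothetical, the auxiliary variables are two urban/rural indicators, and all q_i are set to 1.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination.

    Assumes a is square and non-singular (no check is made here).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def linear_calibration(d, x, t_x, q=None):
    """Return calibrated weights W_i = D_i * (1 + q_i * x_i' lambda)."""
    n, p = len(d), len(t_x)
    q = q or [1.0] * n
    # T_s = sum over the sample of D_i q_i x_i x_i'
    t_s = [[sum(d[i] * q[i] * x[i][a] * x[i][b] for i in range(n))
            for b in range(p)] for a in range(p)]
    # Simple estimator of the auxiliary totals using the sampling weights
    t_hat = [sum(d[i] * x[i][a] for i in range(n)) for a in range(p)]
    lam = solve(t_s, [t_x[a] - t_hat[a] for a in range(p)])
    return [d[i] * (1.0 + q[i] * sum(x[i][a] * lam[a] for a in range(p)))
            for i in range(n)]


# Hypothetical sample: sampling weights, (urban, rural) indicators, known totals.
D = [100.0, 100.0, 100.0, 200.0, 200.0]
X = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1)]
T_X = [360.0, 380.0]
Y = [1, 0, 1, 0, 1]

W = linear_calibration(D, X, T_X)
calibrated_totals = [sum(w * xi[a] for w, xi in zip(W, X)) for a in range(2)]
calibrated_mean = sum(w * y for w, y in zip(W, Y)) / sum(W)
```

With indicator variables as the only auxiliary information, linear calibration reduces to scaling the weights within each group so that the weighted group sizes match the known totals.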

1.15 Data quality and sampling error reporting

Data quality is always a major concern for all MEASURE DHS projects. Though numerous

efforts are made in implementing DHS surveys to maximize the quality of the data collected, non-

sampling errors are always the main concerns for data quality. Data quality of a survey directly affects

the reliability of the statistics produced. Many countries have laws that require reports of survey

findings to include an evaluation of data quality and reliability. Data quality can be measured by total

survey error including bias introduced by various sampling and non-sampling errors.

DHS survey final reports usually include tables in an appendix for data quality evaluation

purposes, including: age distributions of household population by sex; age distributions of eligible and

interviewed women and men; completeness of reporting on date of birth, age at death, age/date at

first union, education and anthropometric measures, etc. The MEASURE DHS program also conducts

some in-depth studies on data quality for specific topics, which are provided in published reports.

Apart from the data quality tables, DHS survey final reports provide sampling errors for

selected indicators in Appendix B. Sampling errors are important reliability measures which tell the

user the degree of error associated with a particular estimated indicator value, the number of cases

involved in the calculation of the indicator, the efficiency or clustering effects of the sample design

compared to simple random sampling, and the range for the true value of an indicator at a certain

confidence level. The reader is referred to Chapter 4, Section 4.2 for more details on sampling errors

and their calculation.
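The quantities listed above can be related to one another with a small sketch. The definitions used here are assumptions made for illustration only: the indicator is a proportion, the simple random sampling variance is taken as p(1-p)/n, and the range is the estimate plus or minus two standard errors. The manual's own method is the one described in Chapter 4, and the input values below are invented.

```python
import math


def sampling_error_summary(estimate, se, n_cases):
    """Summarize a proportion's sampling error under the stated assumptions."""
    se_srs = math.sqrt(estimate * (1.0 - estimate) / n_cases)
    return {
        "relative_error": se / estimate,   # standard error relative to the estimate
        "deft": se / se_srs,               # design effect on the standard error scale
        "ci_low": estimate - 2.0 * se,
        "ci_high": estimate + 2.0 * se,
    }


# Hypothetical indicator: 25 percent, standard error 1.2 points, 4,000 cases.
summary = sampling_error_summary(0.25, 0.012, 4000)
```

A design effect above 1 indicates that the clustered design is less efficient for this indicator than a simple random sample of the same size would be.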

DHS survey final reports also provide an appendix on the sample design of the survey. The

sample design document reports the survey methodology used for the survey, including the aim of the

survey, the target population, the sample size, the reporting domains, the stratification and sample

allocation, sample selection procedure, sampling weight calculation, correction for non-response,

calibration of sampling weights, and the results of survey implementation. See Chapter 5, Section 5.2

for more details on sample design.

1.16 Sample documentation

The task of a sampling statistician does not end with the selection of the sample. The

preservation of sampling documentation is an essential requisite for sampling weight calculation, for

sampling error computation, for data quality evaluation, for linkage with other data sources, and for

various kinds of checks and supplementary studies. Special efforts are needed at the time of the

sample design, at the end of the fieldwork, and at the completion of the data file if the task of sample

documentation is to be carried out effectively. If preservation of documentation is delayed,

considerable effort will be required to reconstitute the missing information when it is needed.

The sample documentation must comply with the survey confidentiality requirements. When

HIV testing is conducted in a DHS or AIS (AIDS Indicator Survey), the confidentiality guidelines

require the complete destruction of all intermediate documents which can potentially be used to

identify any single household or individual who participated in the testing. This requirement reinforces

the importance of timely sample documentation. See Chapter 5 for detailed requirements in sample

documentation.

1.17 Confidentiality

The final data files for DHS surveys are made available to interested researchers. Therefore,

the confidentiality of private information collected from individual respondents is a major concern,

especially when sensitive information such as sexual activity and HIV status is collected. Protecting

the confidentiality of the individual respondent is not only an ethical obligation, but it also promotes

more accurate data because respondents are more likely to provide truthful responses if they feel

confident their information will be kept private.

DHS surveys follow strict rules imposed at various steps during the survey implementation to

prevent the direct or indirect disclosure of the identity of individual respondents. The principal pieces

of information that can indirectly identify an individual respondent are cluster number, household

number, the cluster selection probability and the sampling weights. The cluster number is an

important identifier for sampling error calculations; the household number is important for household

level and individual level data management and tabulation; the cluster selection probability is useful

for cluster level modeling; and sampling weights are necessary for all analysis. So these variables

must be present in the final data file. The household number in the final DHS data file is not

informative, and sampling weights are not informative after correction of non-response and

normalization. The cluster selection probability is potentially informative only if lower level

identification information such as district and locality are present, and DHS survey final data files do

not provide geographic information below the level of region or survey domain, especially when HIV

testing is conducted. Thus the only concern is the disclosure of the cluster. For DHS or AIS surveys

with HIV testing, the final data files provide scrambled cluster and household numbers for further

insurance against disclosure.

2 HOUSEHOLD LISTING OPERATION

2.1 Introduction

DHS surveys are nationwide sample surveys designed to provide information on the levels of

fertility, infant and child mortality, use of family planning, knowledge and attitudes toward HIV/AIDS

and other sexually transmitted infections (STI), and on other family welfare and health indicators. The

surveys generally interview women age 15-49 and men age 15-59 (15-49 or 15-54 in some surveys).

The women and men to be interviewed live in ordinary residential households which are randomly

selected from a set of sample points consisting of clusters of households. Prior to interviewing, all

households located in the selected clusters will be listed. The listing of households for each cluster will

be used in selecting the final sample of households to be included in the DHS survey.

The listing operation consists of visiting each cluster, recording on listing forms a description

of every structure together with the names of the heads of the households found in the structure, and

drawing a location map of the cluster as well as a detailed sketch map of all structures residing in the

cluster. These materials will guide the interviewers to find the pre-selected households for interviewing

and will allow field work supervisors to perform quality control during data collection.

The following sections present the general guidelines for conducting a household listing

operation. Modifications may be needed to adapt to country specific situations.

2.2 Definition of terms

Following are brief definitions of the terms used in this document.

A census Enumeration Area (EA) is a geographical statistical unit created for a census and

containing a certain number of households. An EA is usually a city block in urban areas and a village, a

part of a village or a group of small villages in the rural areas with its location and boundaries well

defined and recorded on census maps.

A cluster is the smallest geographical survey statistical unit for DHS surveys. It consists of a

number of adjacent households in a geographical area. For DHS surveys, a cluster corresponds either

to an EA or a segment of a large EA.

A base map is a reference map that describes the geographical location and boundaries of an

EA.

A structure is a free-standing building or other construction that can have one or more

dwelling units for residential or commercial use. Residential structures can have one or more dwelling

units (for example: single house, apartment structure).

A dwelling unit is a room or a group of rooms normally intended as a residence for one

household (for example: a single house, an apartment, a group of rooms in a house); a dwelling unit

can also have more than one household.

A household consists of a person or a group of related or unrelated persons, who live together

in the same dwelling unit, who acknowledge one adult male or female 15 years old or older as the

head of the household, who share the same housekeeping arrangements, and are considered as one

unit. In some cases one may find a group of people living together in the same house, but each

person has separate eating arrangements; they should be counted as separate one-person

households. Collective living arrangements such as army camps, boarding schools, or prisons will not

be considered as households. Examples of households are:

• a man with his wife or his wives with or without children

• a man with his wife or his wives, his children and his parents

• a man with his wife or his wives, his married children living together for some social or

economic reasons (the group recognizes one person as household head)

• a widowed or divorced man or woman with or without children

The head of household is the person who is acknowledged as such by members of the

household and who is usually responsible for the upkeep and maintenance of the household.

A location map is a map produced in the household listing operation which indicates the main

access to a cluster, including main roads and main landmarks in the cluster. Sometimes it may be

useful even to include some important landmarks in the neighboring cluster.

A sketch map is a map produced in the household listing operation that marks the location of all

structures found in the listing operation, which helps the interviewer to relocate the selected

households. A sketch map also contains the cluster identification information, location information,

access information, principal physical features and landmarks such as mountains, rivers, roads and

electric poles.

2.3 Responsibilities of the listing staff

Persons recruited to participate in the household listing operation will work in teams consisting

of two enumerators. A coordinator will monitor the entire operation.

The responsibilities of the coordinator are to:

1) obtain base maps for all the clusters included in the survey;

2) arrange for the reproduction of all listing materials (listing manuals, mapping and listing

forms); the map information forms and the household listing forms must be prepared in

sufficient numbers to cover all of the clusters to be visited;

3) assign teams to clusters;

4) monitor the reception of the completed listing forms at the central office; and

5) verify that the quality of work is acceptable.

If GPS coordinates are being collected during the listing operation, the coordinator must also:

6) obtain one GPS receiver per listing team, plus two backup receivers, and tag each GPS

receiver with a number;

7) ensure that all GPS receivers have the correct settings (see Section 2.6 below) and

distribute a receiver to each field team;

8) obtain and copy all GPS training materials for listing staff; and

9) train all listing staff to record GPS waypoints in the GPS units as well as on Form DHS/1.

The responsibilities of the enumerators are to:

1) identify the boundaries of the cluster;

2) draw a location map showing the location of the cluster;

3) draw a detailed sketch map of the cluster showing the locations of all structures residing in

the cluster;

4) list all the households in the cluster in a systematic manner;

5) communicate to the coordinator problems encountered in the field and follow the coordinator’s

instructions; and

6) transfer the completed listing forms to the coordinator or to the central office.

If GPS coordinates are being collected during the listing operation, enumerators must also:

7) capture and record the GPS waypoint of the center of the cluster; and

8) complete the portion of form DHS/1 designated for GPS information for each cluster.

The two enumerators in each team should work together at the same time in the same area.

They will first identify the cluster boundaries together. Then one enumerator prepares the location and

the sketch map while the other does the household listing. The materials needed for the household

listing operation are:

• Manual for Household Listing

• Base map of the area containing the cluster

• Map Information Form (Form DHS/1)

• Household Listing Form (Form DHS/2)

• Segmentation form (Form DHS/3)

If GPS coordinates are to be recorded during the listing operation, the following additional

materials are needed:

• GPS receivers, batteries and cables

• GPS training manuals and handouts

2.4 Locating the cluster

The coordinator will provide the listing team with a base map containing the cluster assigned

to the team. The listing team will typically make two tours of the cluster: the first to identify the

cluster boundaries and to create the location map, and the second to create the listing and draw the

sketch map. Upon arrival in a cluster, the team should first contact the local authorities for help in

identifying the boundaries and get general information on the cluster, for example, the rough number

of residential households in the cluster. In most cases, the cluster boundaries follow easily

recognizable natural features such as streams or rivers, and construction features such as roads or

railroads. In some cases, the boundaries may not be marked with visible features (especially in rural

areas); care should then be taken to locate the cluster boundaries as precisely as possible according to

the detailed description of the cluster and its base map.

Before doing the listing, the team should tour the cluster to determine an efficient route of

travel for listing all of the structures. The cluster should be divided into parts if possible. A part can be

a block of structures. The listing team will make a location map of the cluster indicating the

boundaries of the parts, as well as the relative location of landmarks, public structures (e.g., schools,

religious structures, public offices and markets) and main roads. This location map will serve as a

guide for the interviewing team when they begin data collection.

2.5 Preparing location and sketch maps

The coordinator will designate one enumerator of the team as the mapper. The second

enumerator will be the lister. Although the two have separate tasks to perform, they must move

together and work in close cooperation; the mapper prepares the maps, and the lister collects

information on the structures (and corresponding households) indicated on the sketch map.

The mapping of the cluster and the listing of the households should be done in a systematic

manner so that there are no omissions or duplications. If the cluster consists of a number of blocks,

then the team should finish each block before going to the next adjacent block. Within each block,

start at one corner of the block and move clockwise around it. In rural areas where structures are

frequently found in small groups, the team should work in one group of structures at a time and in

each group they can start at the centre (choosing any landmark, such as a school, to be the centre)

and move around it clockwise.

In the first tour of the cluster, the mapper will prepare a location map of the cluster on the

Map Information Form (Form DHS/1). First, fill in the identification box for the cluster on the first

page. All information needed for filling in the identification box is provided by the coordinator. In the

space provided on the second page, draw a map showing the location of the cluster and include

instructions on how to get to the cluster. Include all useful information to find the cluster and its

boundaries directly on the map and in the space reserved for observations if necessary.

In the second tour of the cluster, using the third page of the Map Information Form, the

mapper will draw a sketch map of all structures found in the cluster, including vacant structures and

structures under construction. It is important that the mapper and lister work together and coordinate

their activities, since the structure numbers that the mapper indicates on the sketch map must

correspond to the serial numbers assigned by the lister on the listing form for the same structures.

On the sketch map, mark the starting point with a large X. Place a small square at the spot

where each structure in the cluster is located. For any non-residential structure, identify its use (for

example, a store or factory). Number all structures in sequential order beginning with "1". Whenever

there is a break in the numbering of structures (for example, when moving from one block to

another), use an arrow to indicate how the numbers proceed from one set of structures to another.

Although it may be difficult to pinpoint the exact location of the structure on the map, even an

approximate location is useful for finding the structure in the future. Add to the sketch map all

landmarks (such as a park), public structures (such as a school or church), and streets or roads.

Sometimes it is useful to add to the sketch map landmarks that are found outside the cluster

boundaries, if they are helpful in identifying other structures inside the cluster.

Use the marker or chalk provided to write on the entrance to the structure the number that

has been assigned to the structure. Remember that this is the serial number of the structure as

assigned on the household listing form, which is the same as the number indicated on the sketch map.

In order to distinguish the number from other numbers that may exist already on the door of the

structure, write “DHS” in front of the number, for example, for the structure number 5, write “DHS/5,”

similarly on the door of structure number 44 write “DHS/44.”

A structure is called a multi-unit structure if it contains more than one household in the

structure. Otherwise it is called a single-unit structure. All households found in a structure or

multi-unit structure must be numbered from 1 to m within the structure [6]. The structure number plus the

household number form a unique identification number for a household, and for all of the households

in the cluster. For example, household number 3 in structure number 44 would be uniquely identified

with ID number DHS/44-3. It is very useful to write the household ID number at the entrance of the

household to later assist the interviewer to identify the household for interview.
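The identification scheme described above can be sketched as a small helper. The function names and the example household counts are hypothetical; only the "DHS/structure-household" pattern comes from the text.

```python
def household_id(structure_number, household_number, prefix="DHS"):
    """Build an ID such as 'DHS/44-3' from structure and household numbers."""
    if structure_number < 1 or household_number < 1:
        raise ValueError("structure and household numbers start at 1")
    return f"{prefix}/{structure_number}-{household_number}"


def cluster_household_ids(households_per_structure):
    """List IDs for structures numbered 1..n, households numbered 1..m in each.

    households_per_structure[k] is the number of households in structure k + 1;
    a count of 0 stands for a vacant or non-residential structure.
    """
    return [household_id(s, h)
            for s, count in enumerate(households_per_structure, start=1)
            for h in range(1, count + 1)]


ids = cluster_household_ids([1, 3, 0, 2])
```

Because household numbers restart at 1 in every structure, it is the combination with the structure number that keeps each ID unique within the cluster.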

2.6 Collecting a GPS waypoint for each cluster

A GPS waypoint is a latitude and longitude reading that represents a location. For some

surveys, GPS data for EAs are available from the census. However, if the data are not available, or are

of questionable quality, one GPS waypoint for each cluster should be recorded during the listing phase

of the survey. These waypoints are recorded using a GPS unit (a Garmin ETREX unit is used in this

guide) and data collection forms. If GPS units other than the Garmin ETREX are used, this guide will

still be useful; however, some of the instructions may not apply due to differences in design and

menus. The Garmin ETREX owner’s manual may be useful to consult on the basics of the GPS unit.

Take one reading for each cluster. The GPS waypoints will be captured by the mapper

while mapping the clusters. One GPS waypoint must be taken for each cluster, and in the case of

large clusters which are being segmented, one point should be taken for each segment selected for

listing. In DHS surveys, clusters are usually census EAs, sometimes villages in rural areas or city

blocks in urban areas. Collecting only one waypoint for the cluster greatly reduces the chance of

compromising confidentiality of the respondents and at the same time is sufficient to allow for the

integration of multiple datasets for further analysis. The DHS cluster waypoint should always be taken

at the geographic center of the cluster or segment. If the cluster is segmented, the point should be

taken for the segment chosen by the Mapping and Listing Coordinator to be included in the survey.

Save the waypoint and record the latitude, longitude, and altitude. The latitude,

longitude, and altitude reading for a location are stored in two places: in the GPS unit’s memory and

on the DHS/1 paper form. GPS units can be broken or lost, and experience has shown that a hardcopy

backup is essential. In addition, the paper form provides a backup should the data in the GPS unit be

changed, deleted, or misidentified (i.e., the operator names the cluster incorrectly in the unit). Each

position saved in the GPS unit is called a waypoint, and each waypoint has a unique name. If possible,

the waypoint ID should be the same as the DHS cluster number. If it is not possible, the waypoint ID

should be unique to the cluster and recorded on Form DHS/1 (do not record the same waypoint ID for

two different clusters). When a waypoint is saved, the GPS unit assigns it a default name. The mapper

must edit the default name and change it to the 6-digit DHS cluster ID number. For example, the

waypoint for DHS cluster 101 would be named “000101”. Cluster 1101 would be named “001101”.

After saving the waypoint, the mapper will use the identification box of the Map Information Form

(Form DHS/1) to record the latitude, longitude, and altitude for the cluster and segment on paper.

First, the mapper will write down the latitude and longitude coordinates in decimal degree format and

altitude in meters in the Identification Box on the “Location Map Cluster” Form (DHS/1). Second, the

mapper will draw a circle, in the middle of the cluster/segment, at the location where he/she captured

the waypoint.
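
As a small illustration of the naming rule above, the following Python sketch pads a cluster number to the 6-digit waypoint name and flags waypoint names that were recorded for more than one cluster. The function names and the input layout (a mapping from cluster number to recorded waypoint name) are assumptions made for this example only.

```python
from collections import defaultdict


def waypoint_name(cluster_no: int) -> str:
    """Zero-pad a DHS cluster number to a 6-digit waypoint name."""
    if not 0 < cluster_no <= 999999:
        raise ValueError("cluster number must fit in 6 digits")
    return f"{cluster_no:06d}"


def duplicated_waypoint_names(names_by_cluster: dict) -> dict:
    """Return the waypoint names that were recorded for two or more clusters."""
    clusters_by_name = defaultdict(list)
    for cluster_no, name in names_by_cluster.items():
        clusters_by_name[name].append(cluster_no)
    return {name: c for name, c in clusters_by_name.items() if len(c) > 1}


print(waypoint_name(101))   # 000101
print(waypoint_name(1101))  # 001101
```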

After the listing is complete, the GPS units must be collected as soon as possible and returned

to the sampling office by the Mapping and Listing Coordinator. The waypoints will then be downloaded

and examined for problems by the designated sampling staff. The Sampling Coordinator should

designate one member of the Data Processing Team to receive and process the GPS waypoint file and

then give the file to the survey manager.

[6] This number is different from the household number later given to all of the households listed in the whole cluster just prior to household selection.

In most situations, the Mapping and Listing Coordinator will be responsible for providing the

listing teams with a GPS unit prior to the listing. Before these units are distributed they should be set

up for use by the listers. For DHS surveys, the only format which is acceptable is Decimal Degrees,

regardless of what geographic standards may be in use for other purposes. To set the format, enter

the SETUP menu and in the UNITS sub-menu, select the item POSITION FRMT and press the ENTER

button. Select “hddd.ddddd” Decimal Degrees, which is the first item. Once “hddd.ddddd” is

highlighted, press the ENTER button. It is important that all the GPS units be set up in the same way

so that the waypoints returned at the end of the survey are all in the same format. For more details

on how to properly prepare the GPS units for waypoint collection, please refer to the DHS Manual for

GPS Data Collection.

2.7 Listing of households

The lister will use the Household Listing Form (Form DHS/2) to record all households found in

the cluster. Begin by entering the identification information for the cluster. The first two columns are

reserved for office use only—leave them blank.

Complete the rest of the form as follows:

Column (1) [Serial Number of Structure]: For each structure, record the same structure serial

number that the mapper enters on the sketch map. All the structures recorded on the sketch

map (except the landmarks) must be recorded on the listing form and numbered.

Column (2) [Address/description of Structure]: Record the street address of the structure.

Where structures do not have visible street addresses (especially in rural areas), give a

description of the structure and any details that help in locating it (for example, in front of the

school, next to the store, etc.).

Column (3) [Residence Y/N]: Indicate whether the structure is used for residential purposes

(eating and sleeping) by writing Y for “Yes”. In cases where a structure is used for commercial

or other purposes, write N for “No”. Structures used both for residential and commercial

purposes (for example, a combination of store and home) should be classified as residential

(i.e. mark Y in column 3). Make sure to list any household unit found in a nonresidential

structure (for example, a guard living inside a factory or in a church). Also do not forget to list

vacant structures and structures under construction, and in Column (6) give some explanation

(for example: vacant, under construction, etc.) All structures seen in the cluster should be

recorded on the sketch map of the cluster and in the listing.

Column (4) [Serial Number of Household in Structure]: This is the serial number assigned to

each household found in the structure; there can be more than one household in a structure.

The first household in the structure will always have number “1”. If there is a second

household in the structure, then this household should be recorded on the next line, a “2” is

recorded in Column (4), and Columns (1) to (3) repeat the structure number and address or

are left blank.

Column (5) [Name of Head of Household]: Write the name of the head of the household.

There can only be one head per household. If no one is home or the household refuses to

cooperate, ask neighbors for the name of the head of the household. If a name cannot be

determined, leave this column blank. Note that it is not the name of the landlord or owner of

the structure that is needed, but the name of the head of the household that lives there.

Column (6) [Observations/Occupied or not]: This space is provided for any special remarks

that might help the coordinator decide whether to include a household in the household


selection or not, and might also help the interviewing team locate the structure or identify the

household during the main survey fieldwork.

If the structure is an apartment block or block of flats, assign one serial number to the entire

structure (only one square with one number appears on the sketch map), but complete Columns (2)

through (6) for each apartment in the structure individually. Each apartment should have its own

address, which is the apartment number within the structure.

The listing team should be careful to locate hidden structures. In some areas, structures may

have been built so haphazardly that they are easily missed. In rural areas, structures may be hidden

by tall grasses and trees. If there is a pathway leading from the listed structure, check to see if the

pathway goes to another structure. Talking with people living in the area may help in identifying the

hidden structures.

2.8 Segmentation of large clusters

A certain number of the selected EAs may be very large in population size. A complete listing

of EAs that are very large may not be feasible for the survey. These EAs should be subdivided into

several smaller segments, only one of which will be included in the survey and listed. In this case, the

DHS cluster corresponds to a segment of an EA. When the team arrives in a large EA that may need

segmentation, it should first tour the EA and make a quick count to get the estimated number of

households residing in the EA. There is no standard threshold for the size of an EA that needs to be

segmented, or for segment size. But for efficiency and accuracy considerations, DHS recommends that

if the EA size is bigger than 300 households, then the team should communicate to the coordinator

the cluster number, the estimated number of households and the suggested number of segments to

be created. The final decision to segment an EA, and the number of segments to be created, can only

be taken by the coordinator. Ideally, for ease of operation, an EA would only need to be segmented into

2 segments, with an ideal segment size of 150-200 households in each segment. Dividing an EA into a

large number of segments (more than 3) should be avoided if it is not really necessary in order to

minimize errors.

In dividing an EA into segments, the ideal would be to have segments of approximately equal

size, but it is also important to adopt segment boundaries that are easily identifiable. In the first tour

of the cluster draw a location map of the entire cluster. Using identifiable boundaries such as roads,

streams, and electric power lines, divide the EA into the designated number of roughly equal-sized

segments. On the location map of the EA, show clearly the boundaries of the segments created.

Number the segments sequentially. Estimate the relative size of each segment in the following

manner: quickly count the number of dwellings in each segment, add up the total number of dwellings

in the EA and calculate the proportion of the dwellings in the whole EA that are located in each

segment.

Example 2.1: A cluster of 620 dwellings has been divided into 3 segments and the results are as

follows:

Segment 1: 220 dwellings, or 220/620 = 35 percent

Segment 2: 190 dwellings, or 190/620 = 31 percent

Segment 3: 210 dwellings, or 210/620 = 34 percent

Total: 620 dwellings, or 620/620 = 100 percent

On Form DHS/3 (Segmentation Form) write the size of the segments in the appropriate

columns (number and percent) and calculate the cumulative size of all of the segments in terms of a

percentage. The cumulative size of the last segment on the list must be equal to 100.


Segment    Number of    Percent    Cumulative
number     dwellings               percent

1          220          35         35
2          190          31         66
3          210          34         100

For each large EA to be segmented, a random number between 0 and 100 will be selected in

the central office and included in the file. Compare this random number with the cumulative size.

Select the first segment for which the cumulative size is greater than or equal to the random number.

Random number: 67

Segment selected: Segment number 3

Proceed with the household listing operation in segment number 3 as described in the above

sections (see Appendix 2.3 for an example of how to complete the segmentation form.) Draw a

detailed sketch map of the selected segment and list all the households found in the selected

segment.
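
The selection rule of Example 2.1 can be sketched in Python as follows. This is a minimal illustration, not the official DHS template: the function names are hypothetical, and the cumulative percent is computed here from the cumulative dwelling counts (an assumption of this sketch) so that the last segment always reaches 100.

```python
from itertools import accumulate


def cumulative_percents(dwellings):
    """Cumulative percent of dwellings per segment, rounded to whole numbers."""
    total = sum(dwellings)
    return [round(100 * count / total) for count in accumulate(dwellings)]


def select_segment(dwellings, random_number):
    """Return the 1-based number of the first segment whose cumulative
    percent is greater than or equal to the random number."""
    for segment_no, cum in enumerate(cumulative_percents(dwellings), start=1):
        if cum >= random_number:
            return segment_no
    raise ValueError("random number must not exceed 100")


dwellings = [220, 190, 210]            # quick counts from Example 2.1
print(cumulative_percents(dwellings))  # [35, 66, 100]
print(select_segment(dwellings, 67))   # 3
```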

2.9 Quality control

To ensure that the work done by each listing team is acceptable, quality checks should be

performed. The coordinator should tour the regions during the household listing operation, and assess

the quality of the finished clusters. The coordinator should select a finished cluster and do an

independent listing of 10 percent of the cluster. If important errors are found, the whole cluster should

be relisted. If the problem is related to systematic errors, and it is not possible to do corrections on

the listing forms, then all of the listed clusters should be relisted.

2.10 Prepare the household listing forms for household selection

Once the central office receives the completed listing materials for a cluster, they must first

assign a serial number to all of the households in the cluster in the second column of the form DHS/2.

Only occupied residential households (including households that refused to cooperate at the time of

listing and households where the occupants were absent at the time of listing but would return shortly

and would be at home during the period of household interview) will be numbered. This is a

continuous serial number from 1 to the total number of occupied residential households listed in the

cluster. Leave the cell in the second column blank if the household is not occupied, or if the structure

is not a residential structure. Fill in the second column only if the structure on that row is an occupied

household. Make sure that the numbering of all occupied households follows sequentially from the

previous occupied household on the list, with no gaps or repetitions in the numbering. See the

example of a completed listing form in Appendix 2.3.
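
The numbering rule above can be sketched as follows. The row layout is an assumption made for illustration: each row of the listing form is represented by a dictionary with a hypothetical boolean key `occupied_household`, which is true for occupied residential households (including refusals and temporarily absent households, as described above) and false for every other row.

```python
def assign_household_numbers(rows):
    """Return one entry per listing row: a continuous serial number for each
    occupied household, and None (a blank cell) for every other row."""
    numbers = []
    next_no = 1
    for row in rows:
        if row["occupied_household"]:
            numbers.append(next_no)
            next_no += 1
        else:
            numbers.append(None)
    return numbers


listing = [
    {"occupied_household": True},
    {"occupied_household": False},  # e.g. a vacant structure
    {"occupied_household": True},
    {"occupied_household": False},  # e.g. a store
    {"occupied_household": True},
]
print(assign_household_numbers(listing))  # [1, None, 2, None, 3]
```

The largest number assigned is the total number of households listed, which is the value to copy into the household selection file.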

After assigning the serial numbers to all households listed in the cluster, copy the total

number of households listed to the column “Number of households listed” in the Excel file prepared for

household selection. Make sure this number is recorded in the correct row for the cluster number. In

the column “Segmentation information” record the percentage of the entire EA population that is

included in the selected segment. The segmentation information is important for correctly calculating

the sampling weights. After the total number of households listed in the cluster has been entered in

the Excel file, the spreadsheet automatically generates the household numbers of those households

selected to be interviewed. Copy the numbers of the selected households to the first column of the

form DHS/2, corresponding to the serial number of the households in the listing form. These are the

households that must be interviewed. It is recommended to use a different colored pen on the listing


forms to indicate the households selected for interviewing. It is also very helpful to use color on the

cluster’s sketch map to mark the structures where the selected households are located.

In many surveys, a sub-sample of households will be selected for the men’s survey. The

household selection spreadsheet uses shaded columns to indicate which households are selected for

the men’s survey. Put a mark in the first column on the form DHS/2 next to the number of the

selected household to indicate the households selected for the men’s survey, or use a different colored

pen for the households selected for both men’s and women’s surveys. Make a copy of the whole

package of files (sketch maps and the listing forms with household selection). Give the original to the

interviewing team for the household interview and keep the other copy in the central office.


Appendix 2.1 Example listing forms

Form DHS/1 PAGE 1 of 3

Map Information Form

Identification Label Code

Locality ___________________________________________

DHS Cluster Number ..........................................................

Urban/Rural (Urban=1/Rural=2) .........................................

EA Number ......................................................................

District ____________________________________________

Region ____________________________________________

Name of Mapper ____________________________________

Name of Lister ______________________________________

GPS Unit Tracking Number .................................................

Waypoint name (entered in GPS unit) ..................................

Latitude (North/South) .............................................. N / S

Longitude (East/West) .............................................. E / W

Altitude / Elevation (Meters) ...............................................


Observations:

Road access _____________________________________________________________

________________________________________________________________________

________________________________________________________________________

Other useful information ____________________________________________________

________________________________________________________________________

________________________________________________________________________


Form DHS/1 Map Information Form PAGE 2 of 3

Locality __________________________ District ___________________________

Location map DHS Cluster:


Form DHS/1 Map Information Form PAGE 3 of 3

Locality __________________________ District ___________________________

Sketch map of cluster DHS Cluster:


Form DHS/3 Segmentation Form

Identification Label Code

Locality ___________________________________________

DHS Cluster Number ..........................................................

Urban/Rural (Urban=1/Rural=2) .........................................

EA Number ......................................................................

District ____________________________________________

Region ____________________________________________

Name of Mapper ____________________________________

Name of Lister ______________________________________

Number of segments:

Segment number Number of households Percent Cumulative percent

1

2

3

4

5

Random number: ____________________

Segment selected: ___________________


Appendix 2.2 Symbols for mapping and listing

Orientation to the North

Boundaries of the cluster

Paved road

Unpaved (dirt) road

Footpath

River, creek, etc.

Bridge

Lake, pond, etc.

Mountains, hills

Water point (wells, fountain, etc.)

Market

School

Administrative structure

Church, temple

Mosque

Cemetery

Residential structure


Non-residential structure

Vacant structure

Hospital, clinic, etc.

Electric pole

Tree or bush


Appendix 2.3 Examples of completed mapping and listing forms


3 SELECTED SAMPLING TECHNIQUES

In this section, some of the most commonly used sampling techniques and their application

are presented. The presentation will focus mainly on practical rather than theoretical aspects.

However, the chapter does touch on some basic theoretical properties of the techniques used in the

DHS surveys.

We focus on sampling without replacement rather than sampling with replacement, since the latter

is less efficient for samples of a fixed size: when the same sampling unit is selected more than once,

the amount of information carried in the sample is reduced. For readers who are

interested in the theoretical aspects of the selected sampling techniques, please refer to the textbooks

dealing with survey sampling theory listed in the references.

3.1 Simple random sampling

We begin with simple random sampling without replacement (SRSWOR) since this is a

fundamental sampling procedure that is used as the standard against which the efficiency of other sampling

procedures is compared. Simple random sampling without replacement is a selection procedure where

every unit has an equal chance of being selected. Selection can be performed through successive

draws without replacement from a well-mixed container containing all sampling units, or using certain

computerized algorithms to select from a list of all sampling units.

Let N be the total number of sampling units and let n be the total sample size, with n < N. The probability of selection for every i-th unit is given by:

$$P_i = \frac{n}{N}$$

The design weight (assuming no non-response) is given by:

$$D_i = 1/P_i = \frac{N}{n}$$

The probability for any particular n different units selected together in a sample s is given by:

$$P_s = 1 \Big/ \binom{N}{n}$$

where $\binom{N}{n}$ is the total number of combinations of n elements out of N. Let $y_1, y_2, \ldots, y_n$ be the observations made from the selected units on a variable of interest, then the weighted sample mean, which is the same as the unweighted sample mean,

$$\bar{y} = \sum_{i=1}^{n} D_i y_i \Big/ \sum_{i=1}^{n} D_i = \frac{1}{n} \sum_{i=1}^{n} y_i ,$$

is an unbiased estimator of the population mean, $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} y_i$, with its sampling variance given by

$$V_{srs}(\bar{y}) = \frac{1-f}{n} S_y^2$$

where $S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2$ is the finite population variance of the variable y and f = n/N is the sampling fraction. An unbiased estimation of this variance can be made using

$$\upsilon_{srs}(\bar{y}) = \frac{1-f}{n} s_y^2$$

where $s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the sample variance. When n and N are large, the standardized variable $(\bar{y} - \bar{Y}) / SE(\bar{y})$ follows a student-t distribution with n-1 degrees of freedom and $SE(\bar{y})$ is the square root of $\upsilon_{srs}(\bar{y})$. Therefore the confidence limits of the population mean $\bar{Y}$ can be constructed based on sample observations allowing for 95% confidence that the true value of $\bar{Y}$ will lie within the range of $\bar{y} - 1.96 \cdot SE(\bar{y})$ and $\bar{y} + 1.96 \cdot SE(\bar{y})$. DHS reports use $\bar{y} \pm 2 \cdot SE(\bar{y})$ for a conservative estimate of 95% confidence limits.
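
As an illustration of the formulas in this section, the following Python sketch computes the sample mean, its estimated standard error with the finite population correction (1 - f), and the interval of plus or minus 2 standard errors used in DHS reports. The function name and the toy data are assumptions made for this example.

```python
import math


def srs_mean_estimate(y, N):
    """Estimate the population mean from an SRSWOR sample y drawn from N units.

    Returns (mean, standard error, (lower limit, upper limit)), where the
    limits are mean -/+ 2 * SE.
    """
    n = len(y)
    if not 2 <= n <= N:
        raise ValueError("need 2 <= n <= N")
    f = n / N                                        # sampling fraction
    mean = sum(y) / n                                # unweighted sample mean
    s2 = sum((v - mean) ** 2 for v in y) / (n - 1)   # sample variance
    se = math.sqrt((1 - f) / n * s2)                 # square root of the variance estimate
    return mean, se, (mean - 2 * se, mean + 2 * se)


mean, se, limits = srs_mean_estimate([2, 4, 6, 8], N=40)
print(mean, round(se, 4), limits)
```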

Given a complete list of all sampling units in a computerized file, the easiest way to draw a simple random sample of size n is to generate a uniformly distributed random number between 0 and 1 for each sampling unit. Next, sort the file in ascending order of the generated random numbers; the first n units, which carry the n smallest random numbers, are the selected units. This procedure provides a SRSWOR sample of size n. It is easy to implement, but requires sorting of the sampling frame. Since sorting is time consuming, the following algorithm (Tillé, 2001) may be used with the sampling frame without sorting:

Definition of terms and the initial step

k: the k-th unit of the frame file; j: the j-th selected unit

k = 0
j = 0
repeat while j < n
    generate a uniformly distributed random number u in [0,1)
    if u < (n - j)/(N - k) then
        unit k + 1 is selected; j = j + 1
    else unit k + 1 is not selected
    k = k + 1
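
Both selection procedures described in this section can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, units are numbered 1 to N, and a fresh random number is drawn for every unit examined in the sequential version.

```python
import random


def srswor_by_sorting(N, n, rng):
    """Attach a random number to every unit, sort, and keep the n smallest."""
    keyed = sorted((rng.random(), unit) for unit in range(1, N + 1))
    return sorted(unit for _, unit in keyed[:n])


def srswor_sequential(N, n, rng):
    """Single pass over the frame without sorting: unit k + 1 is selected
    with probability (n - j) / (N - k), where j units are already selected."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    selected = []
    k = 0  # number of units examined so far
    j = 0  # number of units selected so far
    while j < n:
        u = rng.random()  # uniform on [0, 1)
        if u < (n - j) / (N - k):
            selected.append(k + 1)
            j += 1
        k += 1
    return selected


rng = random.Random(2012)  # fixed seed only to make the run repeatable
print(srswor_sequential(100, 14, rng))
print(srswor_by_sorting(100, 14, rng))
```

When the number of units still needed equals the number of units left in the frame, the selection probability becomes 1, so the sequential pass always ends with exactly n selected units.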

3.2 Equal probability systematic sampling

3.2.1 Sampling theory

Systematic sampling (SYS) is the selection of sampling units at a fixed interval from a list,

starting from a randomly determined point. Selection is systematic because selection of the first

sampling unit determines the selection of the remaining sampling units. Compared with SRSWOR,

systematic sampling has the following advantages:

1) It is easier to perform;

2) It allows easy verification of the selection;

3) If the sampling frame is in some order, it provides a stratification effect with respect to the

variables on which the frame is sorted, and with a proportional allocation. This

stratification is called implicit stratification.


4) Implicit stratification prevents unexpected concentration of sample points in certain areas

such as is possible with SRSWOR.

Because of these advantages, especially (3) and (4), systematic selection is more often used

than simple random sampling.

Systematic sampling is normally carried out as follows. Assume a whole number interval I=N/n, where N is the number of units in the frame list and n is the number of units to be selected. The procedure begins with a random integer S between 1 and I. The units to be selected are S, S+I, S+2*I, ..., S+(n-1)*I. When I is not a whole number, rounding it to the nearest whole number may introduce appreciable errors, so it is suggested that the decimal interval method be used. Selection with a decimal interval may be carried out as follows:

1) Calculate the interval I rounded to two decimal places.

2) Generate a random number R between 0 and 1 with two decimal places.

3) Compute the sequence of sampling numbers: R*I, R*I + I, R*I + 2*I, ..., R*I + (n - 1)*I

4) Round up the above calculated sampling numbers to the next highest whole numbers;

these are the selected units’ numbers.

Example 3.2.1:

Let N=100, n=14, so that I=7.14; let the generated random number be R=0.96. The sampling

numbers and the corresponding selected unit numbers are as follows:

Sampling numbers: 6.85 13.99 21.13 28.27 35.41 42.55 49.69 56.83 63.97 71.11 78.25 85.39 92.53 99.67

Selected units:   7 14 22 29 36 43 50 57 64 72 79 86 93 100

In this example, the decimal interval method gives a selection interval which is sometimes 7

or sometimes 8. The household selection templates are all programmed with decimal sampling

intervals.
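
The decimal interval method of Example 3.2.1 can be sketched in Python as follows. This is an illustration only and not the DHS household selection template: the function name is hypothetical, and the guard on R and the cap at N are safeguards added in this sketch rather than steps stated in the text.

```python
import math


def systematic_sample(N, n, R):
    """Select n of N units with a decimal interval and random start R.

    R is assumed to lie in (0, 1]; R = 0 would produce unit number 0.
    """
    if not 0 < R <= 1:
        raise ValueError("R must be greater than 0 and at most 1")
    interval = round(N / n, 2)  # step 1: interval rounded to two decimals
    start = R * interval        # first sampling number
    units = []
    for k in range(n):
        sampling_number = start + k * interval
        # Round away tiny floating-point noise, then round up to the next
        # whole number; the cap at N is a safeguard added in this sketch.
        units.append(min(math.ceil(round(sampling_number, 8)), N))
    return units


print(systematic_sample(100, 14, 0.96))
# [7, 14, 22, 29, 36, 43, 50, 57, 64, 72, 79, 86, 93, 100]
```

When a systematic sample is needed within each cluster, the function would be called once per cluster with an independently generated R.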

Often a sample design requires numerous systematic samples, as is the case when a systematic

sample of households is needed within each selected cluster. In this situation a separate random start

R should be determined independently for each cluster.

With SYS, the probability of selection for any unit i is given by

$$P_i = \frac{1}{I} = \frac{n}{N}$$

The design weight (assuming no non-response) is given by

$$D_i = 1/P_i = \frac{N}{n}$$

Let $y_1, y_2, \ldots, y_n$ be the observations made from the selected units on a variable of interest,

then the weighted sample mean which is the same as the unweighted sample mean

==

n

i

n

i

n

ii

y

n

DyDy

111