1. Introduction
2. Question Text, Variable Names, & Responses
3. Editing Instructions
4. Net Worth Program
5. Public Data Set Variable List
For a general overview of the findings of the 1989 SCF, see Arthur B. Kennickell and Janice Shack-Marquez, "Changes in Family Finances from 1983 to 1989: Evidence from the Survey of Consumer Finances," Federal Reserve Bulletin, January 1992.
QUESTIONNAIRE
In this codebook, many variables have been grouped in a way different
from the way that they were originally asked (e.g., lines of credit
for homeowners and non-homeowners were originally asked separately --
as noted in the codebook below, these responses have been merged in a
single set of variables). With the exception of a few variables
(e.g., marital history), the original questions appear in the codebook
with the appropriate variable numbers. Because question ordering is
important in understanding the effective meaning of many questions,
users of the data are encouraged to consult the
questionnaire
(available separately) for a precise guide to where and how the
underlying questions were asked. Note that there are no summary
variables (e.g., net worth) in the public version of the dataset.
FILES INCLUDED
The full public dataset consists of three files in addition to this
codebook file. The first is the
main dataset, which contains most of
the survey variables (a 258 megabyte file in its fully expanded
form, and 3.7 megabytes in the zipped SAS transport version of the
file available here). The second is the
questionnaire. The third
is a file of replicate weights
for use in variance estimation as
described below (42 megabytes in its fully expanded form, and 19.9
megabytes in the zipped SAS transport version of the file available
here).
VARIABLE NAMES
The public use version of the 1989 SCF cross-section is a SAS dataset
of about 200 megabytes. Unlike the case of our earlier SCFs, there are
not separate files of "raw" and "cleaned" data. Rather, virtually
every missing variable in the dataset has been imputed and every
variable has a "shadow" variable that describes the original state of
the variable (i.e., whether it was missing for some reason, a range
response was given, etc.). An exception is reported values which have
been imputed or otherwise altered for purposes of disclosure
avoidance. Such values are not flagged in any systematic way. Users
who so desire may use the shadow variables to restore the data to
something very close to their original condition. The main data
values are stored using variable names corresponding to the numbers
given in the codebook below and having a prefix of "X." The shadow
variables have the same numbers, but with a prefix of "J." A list of
the values of the shadow variables is given in the section below
entitled RANGE DATA COLLECTION AND J-CODES.
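The X/J naming convention can be illustrated with a minimal Python sketch. It maps a codebook number to its data and shadow variable names and blanks out a value whose shadow code says the original response was missing. The record layout (a plain dict), the function names, and the choice of which J-codes count as "originally missing" (here only 50 = DK and 51 = NA) are assumptions made for illustration, not part of the dataset.

```python
def variable_names(number):
    """Return the (data, shadow) variable names for a codebook number."""
    return f"X{number}", f"J{number}"


# Assumption: treat only DK (50) and NA (51) as "originally missing".
# Users may want a broader or narrower set of J-codes for their purpose.
ASSUMED_MISSING_CODES = {50, 51}


def restore_original(record, number, missing_codes=ASSUMED_MISSING_CODES):
    """Return the X value, or None when the J code marks it as originally missing."""
    x_name, j_name = variable_names(number)
    if record[j_name] in missing_codes:
        return None
    return record[x_name]


# Hypothetical records: the dollar amounts are made up for illustration.
imputed_case = {"X5729": 32000, "J5729": 50}   # original response was DK
reported_case = {"X5729": 41000, "J5729": 0}   # value reported as given

print(restore_original(imputed_case, 5729))
print(restore_original(reported_case, 5729))
```

Values altered for disclosure avoidance are not flagged, so a routine of this kind cannot undo those changes.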
IMPUTATION
The imputations of missing values provided here are the result
of the sixth iteration of a large model described in a paper I gave at
the annual ASA meetings in 1991 ("Imputation of the 1989 Survey of
Consumer Finances: Multiple Imputation and Stochastic Relaxation").
A copy of the paper is available upon request. In this dataset, values
have been MULTIPLY-IMPUTED. The imputations are stored as complete
replicates of each case. In the release, there are 5 copies of each
survey observation. Thus, the sum of the weights will equal five
times the number of families in the U.S. in 1989. Users should be
careful to make appropriate degrees of freedom adjustments in any
calculations they make using these data. For an overview of the
theory of multiple imputation and the analysis of multiply-imputed
data, see MULTIPLE IMPUTATION FOR NONRESPONSE IN SURVEYS by
Donald B. Rubin, John Wiley and Sons, 1987.
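As a rough sketch of how the five implicates might be handled, the Python below computes a weighted mean separately within each implicate, averages the five results to get a point estimate, and measures how much the estimate varies across implicates. It also shows that the pooled weights must be divided by five to recover a household count. The `(implicate, value, weight)` tuple layout, the function names, and the toy numbers are all hypothetical; how implicates are identified in the actual file should be taken from the codebook.

```python
N_IMPLICATES = 5  # the release carries five copies of each observation


def weighted_mean(values, weights):
    """Weighted mean of values using the analysis weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def combine_implicates(rows):
    """rows: iterable of (implicate, value, weight) tuples.

    Returns (point estimate, between-implicate variance).
    """
    rows = list(rows)
    estimates = []
    for m in range(1, N_IMPLICATES + 1):
        values = [v for i, v, _ in rows if i == m]
        weights = [w for i, _, w in rows if i == m]
        estimates.append(weighted_mean(values, weights))
    point = sum(estimates) / N_IMPLICATES
    between = sum((e - point) ** 2 for e in estimates) / (N_IMPLICATES - 1)
    return point, between


def household_count(rows):
    """Pooled weights sum to five times the population, so divide by five."""
    return sum(w for _, _, w in rows) / N_IMPLICATES


# Toy data (made up): family A reported its value, family B was imputed.
toy = []
for m, imputed in zip(range(1, 6), [20, 22, 24, 26, 28]):
    toy.append((m, 10, 100))       # family A: same value in every implicate
    toy.append((m, imputed, 300))  # family B: value differs by implicate

point, between = combine_implicates(toy)
print(point, between, household_count(toy))
```

The between-implicate variance returned here reflects only the uncertainty due to imputation; it says nothing about sampling error.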
ANALYSIS WEIGHTS
The dataset contains several sets of nonresponse-adjusted sampling
weights. X42001 is the weight strongly recommended for the analysis
of the data for most purposes. This variable is a partially design-based
weight constructed at the Federal Reserve using original selection
probabilities and frame information along with aggregate control
totals estimated from the Current Population Survey. The population
defined by the weights for *each implicate* (see above) is 93.0
million households. This weight is a relatively minor revision of the
consistent weight series (X42000) maintained for the SCFs beginning
with 1989 (For a detailed discussion of these weights, see "Consistent
Weight Design for the 1989, 1992, and 1995 SCFs and the Distribution of
Wealth," by Arthur B. Kennickell and R. Louise Woodburn, Review of
Income and Wealth, Series 45, Number 2, June 1999, pp. 193-215 or the
longer version given on the SCF web site at
http://www.federalreserve.gov/pubs/oss/oss2/method.html). The nature
of the revisions to the consistent weights is described in "Revisions
to the SCF Weighting Methodology: Accounting for Race/Ethnicity and
Homeownership," by Arthur Kennickell (see SCF web site). A version of
the revised weight has been computed for all the surveys beginning
with 1989, and this variable has been added to the public versions of
the SCF datasets. Weights X40125 and X40131 have also been included
in this release for historical reasons. These weights are partially
design-based weights (see Heeringa, Conner and Woodburn [1994] "The
1989 Surveys of Consumer Finances: Sample Design and Weighting
Documentation") that were originally released with the data. The
weight X40125 is the preliminary SRC design-based weight used in the
report on the SCF by Janice Shack-Marquez and me in the January 1992
issue of the Federal Reserve Bulletin. This weight was superseded by
X40131. Users should be aware that the sum of each of the weights
over all sample cases and imputation replicates is equal to five times
the number of households in the sample universe.
SAMPLING ERROR
For a variety of reasons connected with disclosure limitation, it is
not possible to give users the details about the SCF sample design that
they would need to compute reasonable estimates of sampling variance
by standard means. During the construction of X40202, a small set of
replicate weights was created to use for variance estimation. Because
it was difficult at that time to compute such weights, only eleven
such replicates were computed. Until this release, these replicate
weights have not been included with the dataset. They are included
here largely for historical documentation. In two separate files,
available with the main dataset, we have included sets of replicate
weights computed using the same algorithm as was used for X42000 and
for X42001. Using detailed information about the original sample
design, we selected 999 sample replicates from the final set of
completed cases in a way intended to capture the important dimensions
of sample variation. For each survey case and each replicate, the
file contains a weight and the number of times the case was selected
in the replicate. We computed weights for each replicate using
exactly the same procedures we used for X42000. Replicate weights
were computed only for the first implicate of each case. For most
purposes, users will probably want to multiply the weight times the
multiplicity: in all cases the sum of the weights times the
multiplicities equals the total number of households. To estimate the
sampling variance of the mean of family income, for example, a user
would estimate the mean 999 times using the replicate weights and
compute the variance of those 999 estimates. Imputation adds a second
source of error, which can be gauged from the variance of the estimate
across the five implicates. An estimate of the standard error due to
these two sources is given by the square root of the sum of the
estimated sampling variance and 6/5 times the imputation variance. The
replicate weights associated with this
release of the data were recomputed along with the main weight to
ensure consistency with the 1992 and 1995 SCFs.
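The calculation described above can be sketched in Python as follows. The sketch assumes the user already holds one estimate per replicate (computed on the first implicate, weighting each case by weight times multiplicity) and one estimate per implicate (computed with the main weight). The function names, the use of the sample variance with an n-1 denominator, and the toy numbers are assumptions for illustration only.

```python
import math
import statistics


def replicate_mean(values, weights, multiplicities):
    """Weighted mean for one replicate, weighting by weight * multiplicity."""
    effective = [w * k for w, k in zip(weights, multiplicities)]
    return sum(v * e for v, e in zip(values, effective)) / sum(effective)


def total_standard_error(replicate_estimates, implicate_estimates):
    """Square root of (sampling variance + 6/5 * imputation variance)."""
    # Assumption: both variances use the usual n-1 denominator.
    sampling_var = statistics.variance(replicate_estimates)
    imputation_var = statistics.variance(implicate_estimates)
    # 6/5 is the factor given in the text; it equals 1 + 1/5 for five implicates.
    return math.sqrt(sampling_var + (6 / 5) * imputation_var)


# Toy inputs (made up): three replicate estimates instead of 999.
toy_replicates = [10.0, 12.0, 14.0]
toy_implicates = [17.5, 19.0, 20.5, 22.0, 23.5]
print(total_standard_error(toy_replicates, toy_implicates))
```

In practice the first list would hold 999 values, one for each replicate in the replicate weight file.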
DISCLOSURE REVIEW
Unlike earlier releases of the dataset, this one includes all
cross-section cases and most important dollar variables. However,
the data reported have been systematically altered by several
means to minimize the possibility of identifying any survey
respondent. For some discrete variables, small or unusual cells were
collapsed as noted in the variable descriptions below. Continuous
variables were rounded. Data were also blurred by other unspecified
means. In addition, 300 cases were identified for more extensive
treatment. Some of these cases were selected on the basis of extreme
or unusual data values. Others of the 300 cases were selected at
random. For each of the 300 cases, a selection of critical
variables was set to missing and statistically imputed subject to
constraints designed to ensure that any distortions induced in key
population statistics would be minimal. Aside from the cell
collapsing, there is no key in this codebook or in the dataset that would
allow users to identify directly either which data items have been
smoothed or otherwise altered, or which cases were selected for
imputation of critical values.
DATA REVIEW
We have spent many hours searching for errors in the data. Many
inconsistencies are actually in the raw data and seem to have no
obvious reconciliation (most prominently the fact that X5729 -- total
income -- is not always equal to the sum of the income components).
Other types of inconsistencies may have been induced as a byproduct of
imputation, even though elaborate checks are built into the imputation
routines. We ask our colleagues who use this dataset to help us find
the remaining inconsistencies.
CONTACT
It is likely that some users will have trouble dealing with the data
at first. If after having framed a focused question and exhausted all
of your local resources your problem persists, you may call Gerhard
Fries at (202) 452-2578 (e-mail: [email protected]) or me at
(202) 452-2247 (e-mail: [email protected]). We prefer
correspondence via e-mail. While we would like to be helpful to you,
please realize that we do not have extensive resources to devote to
user services. We hope that by persistence, you will almost always be
able to figure out what you need by consulting the questionnaire and
the codebook below.
RANGE DATA COLLECTION AND J-CODES
0 = value reported on original tape (includes values reported in the
questionnaire that were altered during editing).
1 = question is inapplicable (e.g., R has no checking account
so value of checking account is coded as zero -- NOTE: zero is a
legitimate value for X-variables only for this value of the
associated J-variable).
2 = evidence from problem sheets that data moved from another
location, but no evidence that value was altered,
or data moved.
3 = evidence from problem sheets that data moved from another
location, and evidence that value was somehow altered based on
margin notes or other information.
4 = evidence that data imputed from marginal notes.
5 = 83/86 value brought forward.
8 = recode of survey variables, no missing values in antecedents.
9 = recode of survey variables, insufficient data collected to
compute value, but not imputed.
12 = in case of regular installment loans where term is DK, non-missing
typical payment moved to monthly payment section.
13 = coded value overridden after editing completed.
14 = inapplicable given hard-code decision (15).
15 = hard-coded imputation determined during cleaning.
16 = override of reported 89 data with 86 data.
17 = override of reported 89 data with 83 data.
18 = other panel override of reported data (combination of 16 & 17, logical
consistency, misc. intuition).
19 = imputation of missing 89 data using 86 data.
20 = imputation of missing 89 data using 83 data.
24 = range card response A: $1 to $500.
25 = range card response B: $501 to $1,000.
26 = range card response C: $1,001 to $2,500.
27 = range card response D: $2,501 to $10,000.
28 = range card response E: $10,001 to $50,000.
29 = range card response F: $50,001 to $250,000.
30 = range card response G: $250,001 to $1,000,000.
31 = range card response H: $1,000,001 to $10,000,000.
32 = range card response I: $10,000,001 to $100,000,000.
33 = range card response J: more than $100,000,000.
34 = range card response < 0 A: -$1 to -$500.
35 = range card response < 0 B: -$501 to -$1,000.
36 = range card response < 0 C: -$1,001 to -$2,500.
37 = range card response < 0 D: -$2,501 to -$10,000.
38 = range card response < 0 E: -$10,001 to -$50,000.
39 = range card response < 0 F: -$50,001 to -$250,000.
40 = range card response < 0 G: -$250,001 to -$1,000,000.
41 = range card response < 0 H: -$1,000,001 to -$10,000,000.
42 = range card response < 0 I: -$10,000,001 to -$100,000,000.
43 = range card response < 0 J: less than -$100,000,000.
44 = value < 0, amount DK
45 = value < 0, amount NA
49 = variable imputed during editing from margin notes, reimputed.
50 = original response was DK.
51 = original response was NA (includes refusals, interviewer errors,
and missing data resulting from editing decisions). Does not
include data missing as a result of missing higher-order questions.
52 = original response missing as a result of missing information for
a higher-order question (typically a YES/NO cut question). In
this case, the higher-order question has been imputed in such
a way as to render response appropriate.
53 = refused (available only for aggregate income, income range
questions, whether filed tax return, and AGI: T3/4/7b/7d).
54 = some, DK how many (see B6).
79 = data missing because of questionnaire error, or data not collected.
80 = recode variable, missing because data not collected for a
sub-group, data to be imputed.
81 = recode variable, some, but not all components originally missing.
82 = recode variable, all components originally missing.
88 = for property value, only assessed value given.
98 = override of reported information (e.g., R says has 1 IRA, but
2 institution types reported) -- value set to missing.
99 = used for absent spouse for J104 or J105 when X104 OR X105 < 0.
100 = Value set to missing while problem with case is being resolved
(temporary code).
183 = demographic and employment recodes for panel and panel/cross-section
cases: only 83 data missing.
184 = employment recodes for panel and panel/cross-section cases: 83
and 86 data missing.
185 = demographic and employment recodes for panel and panel/cross-section
cases: 83 and 89 data missing.
186 = demographic, marital history, and employment recodes for panel
and panel/cross-section cases: only 86 data missing.
187 = demographic, marital history, and employment recodes for panel
and panel/cross-section cases: 86 and 89 data missing.
188 = marital history and employment recodes for panel
and panel/cross-section cases: 83,86 and 89 data missing.
189 = demographic, marital history, and employment recodes for panel
and panel/cross-section cases: only 89 data missing.
200 = marital history variable set to missing due to irreconcilable
inconsistencies between 1986 and 1989 data.
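For users who want to work with the range card codes programmatically, the following Python sketch turns a J-code into dollar bounds. It assumes whole-dollar amounts and assumes that the negative ranges (codes 34 to 43) mirror the positive ranges (codes 24 to 33), so each negative code is the positive code plus 10. The names `RANGE_CARD_BOUNDS` and `range_bounds` are hypothetical, and `None` marks an open-ended bound.

```python
# Positive range card responses A through J (J-codes 24 to 33).
RANGE_CARD_BOUNDS = {
    24: (1, 500),
    25: (501, 1_000),
    26: (1_001, 2_500),
    27: (2_501, 10_000),
    28: (10_001, 50_000),
    29: (50_001, 250_000),
    30: (250_001, 1_000_000),
    31: (1_000_001, 10_000_000),
    32: (10_000_001, 100_000_000),
    33: (100_000_001, None),  # more than $100,000,000: no upper bound
}


def range_bounds(j_code):
    """Return (low, high) dollar bounds for a range card J-code, else None."""
    if j_code in RANGE_CARD_BOUNDS:
        return RANGE_CARD_BOUNDS[j_code]
    # Assumption: negative ranges mirror the positive ones, offset by 10.
    if j_code - 10 in RANGE_CARD_BOUNDS:
        low, high = RANGE_CARD_BOUNDS[j_code - 10]
        return (None if high is None else -high, -low)
    return None  # not a range card code


print(range_bounds(26))
print(range_bounds(36))
print(range_bounds(50))
```

A lookup of this kind could be used, for example, to check that an imputed X value falls inside the range the respondent originally gave.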