1. Introduction
2. Question Text, Variable Names, & Responses
3. Editing Instructions
4. Net Worth Program
5. Public Data Variable List
For a general overview of the findings of the 1992 SCF, see Arthur Kennickell and Martha Starr-McCluer, "Changes in Family Finances from 1989 to 1992: Evidence from the Survey of Consumer Finances," Federal Reserve Bulletin, October 1994.
QUESTIONNAIRE
The ordering of the variables in the codebook often differs from that
in the original questionnaire. For example, all installment debt
questions are in one place in the codebook, but were asked throughout
the questionnaire. In some other cases, two sets of questions that
requested the same information of different populations have been
merged (e.g., lines of credit for homeowners and non-homeowners were
originally asked separately, but have been merged in a single set of
variables here). With the exception of a few variables (e.g., marital
history, where the complex underlying questions have been recoded into
a standard format), the original questions appear in the codebook with
the appropriate variable numbers. Because question ordering is
important in understanding the effective meaning of many questions,
users of the data are encouraged to consult the questionnaire for a precise guide
to where and how the underlying questions were asked.
FILES INCLUDED
In addition to this file, the full public dataset consists of three
other pieces. The main dataset, which contains most of the survey
variables, is a 374 megabyte file (stored as a SAS transport file).
A file of 32 megabytes contains 999 replicate weights and
multiplicity factors intended to be used for variance estimation.
Other documentation is available from http://www.federalreserve.gov/pubs/oss/oss2/92/scf92home.html.
VARIABLE NAMES
The main data values are stored using variable names corresponding to
the numbers given in the codebook below prefixed by an "X." We have
tried, insofar as it was possible, to retain the variable numbering
system used for the 1989 SCF. Each of these variables in the main
dataset has a "shadow" variable that describes--in almost all
cases--the original state of the variable (i.e., whether it was
missing for some reason, a range response was given, etc.). An
exception is reported values which have been imputed or otherwise
altered to protect the privacy of respondents (see below). Such
values are not flagged in any systematic way. Users who so desire may
use the shadow variables to restore the data to something very close
to their original condition. The shadow variables have the same
numbers as the main variable, but have a prefix of "J." A list of the
values taken by the shadow variables is given in the section below entitled RANGE DATA COLLECTION AND J-CODES.
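As a minimal illustration of the naming convention described above, the following Python sketch maps a main variable name to the name of its shadow variable. The helper name `shadow_name` is hypothetical and is not part of the dataset or of any SCF software.

```python
def shadow_name(main_name: str) -> str:
    """Return the shadow ("J") variable name for a main ("X") variable name.

    Hypothetical helper: it relies only on the stated convention that a
    shadow variable has the same number as the main variable with the
    prefix "J" instead of "X".
    """
    if not (main_name.startswith("X") and main_name[1:].isdigit()):
        raise ValueError(f"expected a name like 'X5729', got {main_name!r}")
    return "J" + main_name[1:]


print(shadow_name("X5729"))  # J5729
```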
UNIT OF ANALYSIS
Most of the data in the survey are for a subset of the household unit
referred to as the "primary economic unit" (PEU). In brief, the PEU
consists of an economically dominant single individual or couple
(married or living as partners) in a household and all other
individuals in the household who are financially dependent on that
individual or couple. For example, in the case of a household
composed of a married couple who own their home, a minor child, a
dependent adult child, and a financially independent parent of one of
the members of the couple, the PEU would be the couple and the two
children. Summary information is collected at the end of the
interview for all household members who are not included in the PEU.
Throughout the codebook, we refer to the "head" of the household. The
use of this term is euphemistic and merely reflects the systematic way
in which the dataset is organized. The head is taken to be the single
core individual in a PEU without a core couple. In a PEU with a
central couple, the head is taken to be either the male in a mixed-sex
couple or the older individual in the case of a same-sex couple. No
judgment about the internal organization of the households is implied
by this organization of the data. When the original respondent was
someone other than the person determined to be the head in this sense,
all data (including response codes) were systematically swapped with
that person's spouse or partner. The variable X8000 indicates which
cases have been subjected to such rearrangement.
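The rule for designating the head can be written out as a short Python sketch. The `CorePerson` record and the `choose_head` function are hypothetical names used only for illustration, and the handling of two same-sex partners of equal age (the first listed is returned) is an assumption, since the text does not address that case.

```python
from dataclasses import dataclass


@dataclass
class CorePerson:
    label: str  # illustrative identifier, not a dataset variable
    sex: str    # "M" or "F"
    age: int


def choose_head(core: list) -> CorePerson:
    """Apply the codebook's organizational rule for the "head".

    core holds the single core individual, or the two members of the
    core couple, of a primary economic unit (PEU).
    """
    if len(core) == 1:
        return core[0]
    if len(core) != 2:
        raise ValueError("a PEU core is one individual or one couple")
    first, second = core
    if first.sex != second.sex:
        # Mixed-sex couple: the male is taken to be the head.
        return first if first.sex == "M" else second
    # Same-sex couple: the older individual is taken to be the head.
    # Equal ages are not covered by the text; returning the first
    # person listed is an assumption.
    return first if first.age >= second.age else second


couple = [CorePerson("respondent", "F", 52), CorePerson("spouse", "M", 49)]
print(choose_head(couple).label)  # spouse
```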
IMPUTATION
The missing data in the survey have been imputed five times by drawing
repeatedly from an estimate of the conditional distribution of the
data. These imputations are stored in five replicates ("implicates")
of each data record. Thus, the number of observations in the dataset
(19,530) is five times the actual number of respondents (see below in
the weight section of the codebook for a discussion of the use of
these implicates). The imputation procedure is described in detail in
"Imputation of the 1989 Survey of Consumer Finances: Multiple
Imputation and Stochastic Relaxation", by Arthur Kennickell. For a
general discussion of multiple imputation and its uses, see MULTIPLE
IMPUTATION FOR NONRESPONSE IN SURVEYS by Donald B. Rubin, John Wiley
and Sons, 1987. The multiple imputations allow users to estimate the
amount of uncertainty in estimates that is due to imputation. To do
so, one could make a given estimate separately with each of the five
implicates and compute the standard error of that estimate across the
five results. For users
who want to estimate only simple statistics such as means and medians
ignoring imputation error, it will probably be sufficient to divide
the weights by 5. Users who want to estimate more complex statistics,
particularly regressions, should be cautious in their treatment of the
implicates. Many regression packages will treat each of the five
implicates as an independent observation and correspondingly inflate
the reported significance of results. Users who want to calculate
regression estimates, but who have no immediate use for proper
significance tests, could either average the dependent and independent
values across the implicates or multiply their standard errors by the
square root of five. Users who use the SAS procedure PROC UNIVARIATE
are warned that only the FREQ option will compute a weighted
median--the WEIGHT option will not do so.
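The use of the implicates to gauge imputation uncertainty can be sketched in Python as follows. The sketch assumes that each record carries an implicate number from 1 to 5, a value, and a weight. That layout, the function names, and the toy data are all illustrative assumptions rather than features of the dataset. The imputation variance is taken here to be the sample variance of the five implicate-level estimates, which is one common reading of the procedure described above.

```python
from statistics import mean, variance


def weighted_mean(values, weights):
    """Weighted mean of values using the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def implicate_estimates(records, n_implicates=5):
    """Estimate the weighted mean separately within each implicate.

    records is a list of (implicate, value, weight) tuples, with
    implicate numbered 1..n_implicates (an assumed layout).
    """
    estimates = []
    for m in range(1, n_implicates + 1):
        values = [v for i, v, w in records if i == m]
        weights = [w for i, v, w in records if i == m]
        estimates.append(weighted_mean(values, weights))
    return estimates


def estimate_and_imputation_variance(records, n_implicates=5):
    """Return (average of implicate estimates, variance across them)."""
    estimates = implicate_estimates(records, n_implicates)
    return mean(estimates), variance(estimates)


# Toy data: two cases, one fully reported and one imputed five times.
records = []
for m in range(1, 6):
    records.append((m, 10.0, 1.0))            # reported value, same in every implicate
    records.append((m, 18.0 + 2.0 * m, 1.0))  # imputed value varies by implicate

estimate, imputation_variance = estimate_and_imputation_variance(records)
print(estimate, imputation_variance)  # 17.0 2.5
```

Averaging the five implicate-level estimates, rather than treating the 5n records as independent observations, avoids the inflated significance discussed above.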
ANALYSIS WEIGHTS
Because the SCF sample is not an equal-probability design, weights
play a critical role in interpreting the survey data. The main
dataset also contains the final nonresponse-adjusted sampling weights.
These weights are intended to compensate for unequal probabilities of
selection in the original design and for unit nonresponse (failure to
obtain an interview). The weight (X42001) is a partially design-based
weight constructed at the Federal Reserve using original selection
probabilities and frame information along with aggregate control
totals estimated from the Current Population Survey. The population
defined by the weights for *each implicate* (see above) is 95.9
million households. This weight is a relatively minor revision of the
consistent weight series (X42000) maintained for the SCFs beginning
with 1989 (For a detailed discussion of these weights, see "Consistent
Weight Design for the 1989, 1992, and 1995 SCFs and the Distribution of
Wealth," by Arthur B. Kennickell and R. Louise Woodburn, Review of
Income and Wealth, Series 45, Number 2, June 1999, pp. 193-215 or the
longer version given on the SCF web site at
http://www.federalreserve.gov/pubs/oss/oss2/method.html). The nature
of the revisions to the consistent weights is described in "Revisions
to the SCF Weighting Methodology: Accounting for Race/Ethnicity and
Homeownership," by Arthur Kennickell (see SCF web site). A version of
the revised weight has been computed for all the surveys beginning
with 1989, and this variable has been added to the public versions of
the SCF datasets. A file of associated replicate weights has also
been added. Users should be aware that the sum of each of the weights
over all sample cases and imputation replicates is equal to five times
the number of households in the sample universe.
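Because each case appears five times, a weighted total computed over all records must use the weight divided by five. A minimal Python sketch of that adjustment follows. The function names and the toy numbers are illustrative assumptions and are not values from the survey.

```python
def households_represented(weights, n_implicates=5):
    """Number of households implied by weights summed over all implicates."""
    return sum(weights) / n_implicates


def weighted_total(values, weights, n_implicates=5):
    """Population total of a variable, pooling all implicates.

    Each record's weight is divided by the number of implicates so that
    every case counts once rather than five times.
    """
    return sum(v * w / n_implicates for v, w in zip(values, weights))


# Toy data: one case with weight 100 repeated across five implicates.
weights = [100.0] * 5
values = [10.0, 20.0, 30.0, 40.0, 50.0]
print(households_represented(weights))  # 100.0
print(weighted_total(values, weights))  # 3000.0
```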
Although the weights should produce reliable results at the level of broad aggregates (e.g., net worth and income), it is important to remember that many of the variables collected in the SCF are highly skewed in their distribution and that many such variables will apply to only a relatively small fraction of the sample. In the SCF group at the Federal Reserve, we routinely review our calculations for the presence of overly-influential outliers, and robust techniques are applied when appropriate. We encourage other users to exercise similar care in analyzing the data.
The original weight first released with the 1992 data (X41000) follows the design of the consistent weights exactly. However, because there have been some minor changes in the data since the original release of the final data, the weights changed slightly when the consistent weights were reestimated as a group. The original weights are retained in the dataset for historical reasons.
SAMPLING ERROR
Because we are unable to give users any sample information about cases
in the dataset, they will be unable on their own to compute
reasonable estimates of the sampling variances of their estimates. To
facilitate such estimation, we have included two files of replicate
weights and multiplicity factors--one corresponding to X42000 and the
other to X42001. Using detailed information about the original sample
design, we selected 999 sample replicates from the final set of
completed cases in a way intended to capture the important dimensions
of sample variation (see Arthur Kennickell,
Douglas McManus and Louise Woodburn, "Weighting design for the 1992
Survey of Consumer Finances" for details).
For each survey case and each replicate, the file
contains a weight and the number of times the case was selected in the
replicate. We computed weights for each replicate using exactly the
same procedures we used for the main weights. Replicate weights were
computed only for the first implicate of each case. For most
purposes, users will probably want to multiply the weight times the
multiplicity: in all cases the sum of the weights times the
multiplicities equals the total number of households. To estimate the
sampling variance of the mean of family income, for example, a user
would estimate the mean 999 times using the replicate weights and
compute the standard error of that estimate. An estimate of the
standard error due to both imputation and sampling is given by the
square root of the sum of the estimated sampling variance and 6/5 times
the imputation variance. The replicate weights associated with this
release of the data were recomputed along with the main weight to
ensure consistency with the 1989 and 1995 SCFs.
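The calculation described in this section can be sketched in Python as follows. The data layout (one list of weights and one list of multiplicities per replicate, aligned with the first-implicate values) and the function names are assumptions made for illustration. Taking the sampling variance to be the sample variance of the replicate estimates is one reading of the text, not a statement of the exact procedure used at the Federal Reserve.

```python
from math import sqrt
from statistics import variance


def replicate_means(values, replicate_weights, multiplicities):
    """Weighted mean of values under each replicate.

    values: first-implicate values, one per case.
    replicate_weights[r][i], multiplicities[r][i]: weight and number of
    times case i was selected in replicate r (assumed layout).
    """
    means = []
    for weights, mults in zip(replicate_weights, multiplicities):
        effective = [w * m for w, m in zip(weights, mults)]
        total = sum(v * e for v, e in zip(values, effective))
        means.append(total / sum(effective))
    return means


def total_standard_error(sampling_variance, imputation_variance, n_implicates=5):
    """Square root of sampling variance plus (1 + 1/5) = 6/5 times
    the imputation variance, as stated in the text."""
    factor = 1 + 1 / n_implicates
    return sqrt(sampling_variance + factor * imputation_variance)


# Toy example with three cases and two replicates (the real files hold 999).
values = [10.0, 20.0, 30.0]
replicate_weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
multiplicities = [[1, 1, 1], [2, 0, 1]]

means = replicate_means(values, replicate_weights, multiplicities)
sampling_variance = variance(means)
print(total_standard_error(sampling_variance, imputation_variance=2.5))
```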
SUMMARY VARIABLES
We have not made an effort to include summary variables (e.g., net
worth) in the dataset. Although it is complicated to construct such
variables, it is our belief that a substantial amount of judgment is
involved in selecting which variables to include, and that analysts
should make their own decisions. However, at the end of this file, we
have included code to compute net worth according to our routine
definitions.
DISCLOSURE REVIEW
To protect the privacy of individual respondents, the data in this
release have been systematically altered by several means to
minimize the possibility of identifying any survey respondent. For
some discrete variables, small or unusual cells were collapsed as
noted in the variable descriptions below. Continuous variables were
rounded. Data were also blurred by other unspecified means. In
addition, a number of other cases were identified for more extensive
treatment. Some of these cases were selected on the basis of extreme
or unusual data values. Other cases were selected at random. For
each of these cases, a selection of critical variables was set to
missing and statistically imputed subject to constraints designed to
ensure that any distortions induced in key population statistics would
be minimal. The geographic identifiers included here have been
systematically altered for a subset of respondents by swapping their
locations with those of otherwise similar respondents.
It is important to note that aside from the cell collapsing, there is no key in this codebook or in the dataset that would allow users to identify directly either which data items have been smoothed or otherwise altered, or which cases were selected for imputation of critical values (that is, the shadow variables in this dataset may not always reflect the true original status of every variable). Although this blurring of the data will have some effect on analysis, that effect should be negligible in almost every case. For further details on the procedures taken to protect the identity of respondents, see "Disclosure review and its implications for the 1992 Survey of Consumer Finances", by Gerhard Fries, Barry Johnson, and R. Louise Woodburn. Users who feel that the restrictions imposed on the public dataset are too constricting are encouraged to submit written proposals for expanded data release, and those requests will be given serious consideration in the release of data from future surveys.
CASE ID NUMBERS
Under the original numbering system (XX1), the sample design is
apparent from the identification numbers. Thus, each case included in
the public version of the dataset has been given an identification
number (YY1), which is intended to mask the knowledge of which cases
were drawn from the SCF list sample. It is not possible to know
with certainty from the information provided in the public version of
this dataset which cases derive from the list sample. Because we
routinely use the original numbers internally, users who direct
questions to us about specific cases might want to be sure to
emphasize that they are using the external ID number to avoid
confusion.
DATA REVIEW
We have spent many hours searching for errors in the data. Many
seeming inconsistencies are actually in the raw data and appear to have
no obvious reconciliation (most prominently the fact that X5729--
total income--is not always equal to the sum of the income components).
Other types of inconsistencies may have been induced as a byproduct of
imputation, even though elaborate checks are built into the imputation
routines. We ask our colleagues who use this dataset to help us find
the remaining resolvable inconsistencies. Our presumption is always
that the respondent understood each question and reported accurately,
and that the process of transcription and coding did not distort that
information. In the relatively small number of cases where other
information led us beyond a reasonable doubt of the validity of the
data, we have changed data, either by altering that value directly
or by setting it to missing and imputing it.
CONTACT INFORMATION
It is likely that some users will have trouble understanding the
organization of the data at first. If after having framed a focused
question and exhausted all of your local resources your problem
persists, you may call Gerhard Fries ((202) 452-2578 or e-mail
[email protected]) or me ((202) 452-2247 or e-mail [email protected]).
We prefer correspondence via e-mail. While we
would like to be helpful to you, please realize that we have very
limited resources to devote to user services. We hope that by
persistence, you will almost always be able to figure out what you
need by consulting the questionnaire and the codebook below.
Definitions of the "J" Variables (1992 version)

0 = value reported on original tape (possibly altered during editing, but no evidence on problem sheets to this effect -- NOTE: problem sheet information is not comprehensive).
1 = question is inapplicable for R (e.g., R has no checking account so value of checking account is coded as zero -- NOTE: there are no zeros in the dataset other than such values).
2 = data moved from another location (not including re-arranging columns in a grid); data moved from another location and added to data already at new location (e.g., wage income from spouse reported in independent adult part of section Y added to data reported for R in Section T).
3 = data provided for a question with a branch structure, but not known which branch data should be in (e.g., AGI given, but filing status unknown).
4 = evidence that data imputed from marginal notes.
8 = recode of survey variables, no missing values in antecedents.
9 = recode of survey variables, insufficient data collected to compute value, not imputed.
10 = part of reported value reported elsewhere and edited out here (e.g., wage income of NPEU member also reported at X5701 along with income of PEU resulting in J5702=10) or entire reported value reported elsewhere and edited out here (e.g., all of wage income of NPEU member reported at X5701 resulting in X5701=5, J5701=10, X5702=0 and J5702=14).
12 = in case of regular installment loans where term is DK, non-missing typical payment moved to monthly payment section.
13 = coded value overridden after editing completed
14 = value set to inap given hard-code decision (12, 13 or 15)
15 = hard-coded imputation determined during cleaning.
16 = other reassignment resulting from cleaning that overrides reported data (e.g., the cleaning of the institutions grid in Section A).
17 = value of originally missing data item implied by other variable(s).
24 = Range Card response: A. $1 to $100
25 = Range Card response: B. $101 to $500
26 = Range Card response: C. $501 to $750
27 = Range Card response: D. $751 to $1,000
28 = Range Card response: E. $1,001 to $2,500
29 = Range Card response: F. $2,500 to $5,000
30 = Range Card response: G. $5,001 to $7,500
31 = Range Card response: H. $7,501 to $10,000
32 = Range Card response: I. $10,001 to $25,000
33 = Range Card response: J. $25,001 to $50,000
34 = Range Card response: K. $50,001 to $75,001
35 = Range Card response: L. $75,001 to $100,000
36 = Range Card response: M. $100,001 to $250,000
37 = Range Card response: N. $250,001 to $1,000,000
38 = Range Card response: O. $1,000,001 to $5,000,000
39 = Range Card response: P. $5,000,001 to $10,000,000
40 = Range Card response: Q. $10,000,001 to $25,000,000
41 = Range Card response: R. $25,000,001 to $50,000,000
42 = Range Card response: S. $50,000,001 to $100,000,000
43 = Range Card response: T. More than $100,000,000
44 = Range Card response < 0: A. -$1 to -$100
45 = Range Card response < 0: B. -$101 to -$500
46 = Range Card response < 0: C. -$501 to -$750
47 = Range Card response < 0: D. -$751 to -$1,000
48 = Range Card response < 0: E. -$1,001 to -$2,500
49 = Range Card response < 0: F. -$2,500 to -$5,000
50 = Range Card response < 0: G. -$5,001 to -$7,500
51 = Range Card response < 0: H. -$7,501 to -$10,000
52 = Range Card response < 0: I. -$10,001 to -$25,000
53 = Range Card response < 0: J. -$25,001 to -$50,000
54 = Range Card response < 0: K. -$50,001 to -$75,001
55 = Range Card response < 0: L. -$75,001 to -$100,000
56 = Range Card response < 0: M. -$100,001 to -$250,000
57 = Range Card response < 0: N. -$250,001 to -$1,000,000
58 = Range Card response < 0: O. -$1,000,001 to -$5,000,000
59 = Range Card response < 0: P. -$5,000,001 to -$10,000,000
60 = Range Card response < 0: Q. -$10,000,001 to -$25,000,000
61 = Range Card response < 0: R. -$25,000,001 to -$50,000,000
62 = Range Card response < 0: S. -$50,000,001 to -$100,000,000
63 = Range Card response < 0: T. Less than -$100,000,000
150 = original response was DK.
151 = original response was NA (includes refusals, interviewer errors, and missing data resulting from editing decisions). Does not include data missing as a result of missing higher-order questions.
152 = original response missing as a result of missing information for a higher-order question (typically a YES/NO cut question). In this case, the higher-order question has been imputed in such a way as to render response appropriate.
153 = refused
154 = some, DK how many (see B6).
160 = unresolved data problem (none should remain in final dataset).
179 = data missing because of questionnaire error, or data not collected
180 = recode variable, missing because data not collected for sub-group, data to be imputed.
181 = recode variable, some, but not all components originally missing.
182 = recode variable, all components originally missing.
188 = for property value, only assessed value given.
197 = override of reported information with (at least partially) imputed data (e.g., number/type of institution in Section A is overridden after imputation of institutions to account for new institutions).
198 = override of reported/inap. information (e.g., R says has 1 IRA, but 2 institution types reported; institution reference refers to Section A column, but column inap) -- value set to missing.
199 = used for absent spouse for J104 or J105 when X104 OR X105 < 0.

General instructions for J variable coding for recoded variables:

When a recoded variable is taken directly from another single X variable, it should have the same J variable code. When a recoded variable may come from a single variable in the original X variables, or as the result of a calculation based on some number of X variables, it is important to distinguish the information content in the J variables. When the value is taken directly, the J variable should have exactly the same value as that for the X variable's shadow J variable. When some calculation is involved, this should be reflected in the J variable -- codes 8, 181, and 182. When a recode cannot be computed because some part of the underlying information was not collected for some subset of cases, the recode's J variable should be coded 9 or 180.
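Since the range card codes follow a regular pattern (24 through 43 for positive ranges A through T, and 44 through 63 for the corresponding negative ranges), a J code can be decoded with a few lines of Python. The function names below are hypothetical. Grouping codes 150 through 153 as "originally missing" is an interpretive assumption made for illustration and is not an official classification.

```python
def range_card_letter(j_code: int):
    """Return (sign, letter) for a range card J code, or None otherwise.

    Codes 24..43 are positive ranges A..T; codes 44..63 are the
    corresponding negative ranges A..T.
    """
    if 24 <= j_code <= 43:
        sign = "+"
    elif 44 <= j_code <= 63:
        sign = "-"
    else:
        return None
    letter = chr(ord("A") + (j_code - 24) % 20)
    return sign, letter


def was_originally_missing(j_code: int) -> bool:
    """True for DK (150), NA (151), missing higher-order question (152),
    and refused (153). This grouping is an assumption for illustration."""
    return j_code in (150, 151, 152, 153)


print(range_card_letter(32))        # ('+', 'I')
print(range_card_letter(54))        # ('-', 'K')
print(was_originally_missing(150))  # True
```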