FRB: 1992 SCF Codebook, part1

1992 Survey of Consumer Finances

CODEBOOK FOR 1992 SURVEY OF CONSUMER FINANCES

Arthur Kennickell

SCF Project Director

Table of Contents

	1.Introduction
	2.Question Text, Variable Names, & Responses
	3.Editing Instructions
	4.Net Worth Program
	5.Public Data Variable List

INTRODUCTION

This codebook serves as the authoritative guide to the variables included on the final public version of the 1992 SCF dataset. However, not every variable included in this codebook is actually in the public use dataset. Among other things, the dataset does NOT include most variables related to the sample design, details of geography, or the 3-digit industry and occupation codes. The authoritative list of the variables included is given in the section entitled Public Data Set Variable List. Please consult that list to determine whether a given variable is available to you.

For a general overview of the findings of the 1992 SCF, see Arthur Kennickell and Martha Starr-McCluer, "Changes in Family Finances from 1989 to 1992: Evidence from the Survey of Consumer Finances," Federal Reserve Bulletin, October 1994.

QUESTIONNAIRE
The ordering of the variables in the codebook often differs from that in the original questionnaire. For example, all installment debt questions are in one place in the codebook, but were asked throughout the questionnaire. In some other cases, two sets of questions that requested the same information of different populations have been merged (e.g., lines of credit for homeowners and non-homeowners were originally asked separately, but have been merged in a single set of variables here). With the exception of a few variables (e.g., marital history, where the complex underlying questions have been recoded into a standard format), the original questions appear in the codebook with the appropriate variable numbers. Because question ordering is important in understanding the effective meaning of many questions, users of the data are encouraged to consult the questionnaire for a precise guide to where and how the underlying questions were asked.

FILES INCLUDED
In addition to this file, the full public dataset consists of three other pieces. The main dataset, which contains most of the survey variables, is a 374 megabyte file (stored as a SAS transport file). A file of 32 megabytes contains 999 replicate weights and multiplicity factors intended to be used for variance estimation. Other documentation is available from http://www.federalreserve.gov/pubs/oss/oss2/92/scf92home.html.

VARIABLE NAMES
The main data values are stored using variable names corresponding to the numbers given in the codebook below prefixed by an "X." We have tried, insofar as it was possible, to retain the variable numbering system used for the 1989 SCF. Each of these variables in the main dataset has a "shadow" variable that describes--in almost all cases--the original state of the variable (i.e., whether it was missing for some reason, a range response was given, etc.). An exception is reported values which have been imputed or otherwise altered to protect the privacy of respondents (see below). Such values are not flagged in any systematic way. Users who so desire may use the shadow variables to restore the data to something very close to their original condition. The shadow variables have the same numbers as the main variable, but have a prefix of "J." A list of the values taken by the shadow variables is given in the section below entitled RANGE DATA COLLECTION AND J-CODES.
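The naming rule above can be sketched in a few lines of Python. This is only an illustration: it assumes a data record has already been read into a dict keyed by variable name, and the helper names shadow_name and pair_with_shadows are hypothetical, not part of any SCF software.

```python
def shadow_name(main_name: str) -> str:
    """Return the "J" shadow-variable name for an "X" main-variable name."""
    if not (main_name.startswith("X") and main_name[1:].isdigit()):
        raise ValueError(f"not an X variable name: {main_name!r}")
    return "J" + main_name[1:]


def pair_with_shadows(record: dict) -> dict:
    """Map each X variable number to (value, J-code) for one record.

    `record` is assumed to be a dict such as {"X5729": 42000, "J5729": 0}.
    Variables without an X prefix (e.g., the case ID) are skipped, and a
    missing shadow variable yields None for the J-code.
    """
    paired = {}
    for name, value in record.items():
        if name.startswith("X") and name[1:].isdigit():
            paired[int(name[1:])] = (value, record.get(shadow_name(name)))
    return paired
```

Pairing values with their J-codes in this way makes it straightforward to filter on the original state of a variable before using it.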

UNIT OF ANALYSIS
Most of the data in the survey are for a subset of the household unit referred to as the "primary economic unit" (PEU). In brief, the PEU consists of an economically dominant single individual or couple (married or living as partners) in a household and all other individuals in the household who are financially dependent on that individual or couple. For example, in the case of a household composed of a married couple who own their home, a minor child, a dependent adult child, and a financially independent parent of one of the members of the couple, the PEU would be the couple and the two children. Summary information is collected at the end of the interview for all household members who are not included in the PEU. Throughout the codebook, we refer to the "head" of the household. The use of this term is euphemistic and merely reflects the systematic way in which the dataset is organized. The head is taken to be the single core individual in a PEU without a core couple. In a PEU with a central couple, the head is taken to be either the male in a mixed-sex couple or the older individual in the case of a same-sex couple. No judgment about the internal organization of the households is implied by this organization of the data. When the original respondent was someone other than the person determined to be the head in this sense, all data (including response codes) were systematically swapped with that person's spouse or partner. The variable X8000 indicates which cases have been subjected to such rearrangement.

IMPUTATION
The missing data in the survey have been imputed five times by drawing repeatedly from an estimate of the conditional distribution of the data. These imputations are stored in five replicates ("implicates") of each data record. Thus, the number of observations in the dataset (19,530) is five times the actual number of respondents (see below in the weight section of the codebook for a discussion of the use of these implicates). The imputation procedure is described in detail in "Imputation of the 1989 Survey of Consumer Finances: Multiple Imputation and Stochastic Relaxation", by Arthur Kennickell. For a general discussion of multiple imputation and its uses, see MULTIPLE IMPUTATION FOR NONRESPONSE IN SURVEYS by Donald B. Rubin, John Wiley and Sons, 1987. The multiple imputations allow users to estimate the amount of uncertainty in estimates that is due to imputation. To do so, one could make a given estimate separately with each of the five implicates and compute the standard error of that estimate. For users who want to estimate only simple statistics such as means and medians ignoring imputation error, it will probably be sufficient to divide the weights by 5. Users who want to estimate more complex statistics, particularly regressions, should be cautious in their treatment of the implicates. Many regression packages will treat each of the five implicates as an independent observation and correspondingly inflate the reported significance of results. Users who want to calculate regression estimates, but who have no immediate use for proper significance tests, could either average the dependent and independent values across the implicates or multiply their standard errors by the square root of five. Users who use the SAS procedure PROC UNIVARIATE are warned that only the FREQ option will compute a weighted median--the WEIGHT option will not do so.
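As a rough sketch of the per-implicate calculation described above, the following Python computes a weighted mean separately within each implicate and then summarizes the spread of the five estimates. The row layout (a list of dicts with an "implicate" field) and all function names are assumptions made for illustration; how the implicate number is identified in the actual file is not specified here. The between-implicate variance uses the usual sample variance (divisor m - 1), which is also an assumption.

```python
from statistics import mean, variance


def weighted_mean(values, weights):
    """Weighted mean of `values` using `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def implicate_estimates(rows, value_key, weight_key, implicate_key="implicate"):
    """Weighted mean computed separately within each implicate.

    `rows` is assumed to be a list of dicts, one per data record, each
    carrying a hypothetical `implicate_key` field numbered 1 through 5.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[implicate_key], []).append(row)
    return [
        weighted_mean(
            [r[value_key] for r in groups[k]],
            [r[weight_key] for r in groups[k]],
        )
        for k in sorted(groups)
    ]


def imputation_summary(estimates):
    """Return (point estimate, between-implicate variance).

    The point estimate is the average of the per-implicate estimates;
    the variance is the sample variance of those estimates.
    """
    return mean(estimates), variance(estimates)
```

The square root of the second returned value is the imputation standard error of the estimate; the first is the estimate averaged over implicates.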

ANALYSIS WEIGHTS
Because the SCF sample is not an equal-probability design, weights play a critical role in interpreting the survey data. The main dataset also contains the final nonresponse-adjusted sampling weights. These weights are intended to compensate for unequal probabilities of selection in the original design and for unit nonresponse (failure to obtain an interview). The weight (X42001) is a partially design-based weight constructed at the Federal Reserve using original selection probabilities and frame information along with aggregate control totals estimated from the Current Population Survey. The population defined by the weights for *each implicate* (see above) is 95.9 million households. This weight is a relatively minor revision of the consistent weight series (X42000) maintained for the SCFs beginning with 1989 (For a detailed discussion of these weights, see "Consistent Weight Design for the 1989, 1992, and 1995 SCFs and the Distribution of Wealth," by Arthur B. Kennickell and R. Louise Woodburn, Review of Income and Wealth, Series 45, Number 2, June 1999, pp. 193-215 or the longer version given on the SCF web site at http://www.federalreserve.gov/pubs/oss/oss2/method.html). The nature of the revisions to the consistent weights is described in "Revisions to the SCF Weighting Methodology: Accounting for Race/Ethnicity and Homeownership," by Arthur Kennickell (see SCF web site). A version of the revised weight has been computed for all the surveys beginning with 1989, and this variable has been added to the public versions of the SCF datasets. A file of associated replicate weights has also been added. Users should be aware that the sum of each of the weights over all sample cases and imputation replicates is equal to five times the number of households in the sample universe.
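Because the weights sum to five times the number of households when all implicates are pooled, totals computed on the pooled file need to be divided by 5. The Python sketch below illustrates this, together with a simple weighted median; the function names are hypothetical and the inputs are assumed to be plain lists of values and weights covering all five implicates. Note that means and medians are unchanged by rescaling the weights, so the division matters mainly for totals and counts.

```python
N_IMPLICATES = 5


def pooled_population(weights):
    """Number of households implied by weights pooled over all implicates."""
    return sum(weights) / N_IMPLICATES


def pooled_total(values, weights):
    """Weighted aggregate (e.g., a dollar total) over the pooled implicates."""
    return sum(v * w for v, w in zip(values, weights)) / N_IMPLICATES


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    if not pairs:
        raise ValueError("empty input")
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]  # guard against floating-point shortfall
```

With the 1992 weights, pooled_population applied to the full file should come out near the 95.9 million households cited above.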

Although the weights should produce reliable results at the level of broad aggregates (e.g., net worth and income), it is important to remember that many of the variables collected in the SCF are highly skewed in their distribution and that many such variables will apply to only a relatively small fraction of the sample. In the SCF group at the Federal Reserve, we routinely review our calculations for the presence of overly-influential outliers, and robust techniques are applied when appropriate. We encourage other users to exercise similar care in analyzing the data.

The original weight first released with the 1992 data (X41000) follows the design of the consistent weights exactly. However, because there have been some minor changes in the data since the original release of the final data, the weights changed slightly when the consistent weights were reestimated as a group. The original weights are retained in the dataset for historical reasons.

SAMPLING ERROR
Because we are unable to give users any sample information about cases in the dataset, they will be unable on their own to compute reasonable estimates of the sampling variances of their estimates. To facilitate such estimation, we have included two files of replicate weights and multiplicity factors--one corresponding to X42000 and the other to X42001. Using detailed information about the original sample design, we selected 999 sample replicates from the final set of completed cases in a way intended to capture the important dimensions of sample variation (see Arthur Kennickell, Douglas McManus and Louise Woodburn, "Weighting design for the 1992 Survey of Consumer Finances" for details). For each survey case and each replicate, the file contains a weight and the number of times the case was selected in the replicate. We computed weights for each replicate using exactly the same procedures we used for the main weights. Replicate weights were computed only for the first implicate of each case. For most purposes, users will probably want to multiply the weight times the multiplicity: in all cases the sum of the weights times the multiplicities equals the total number of households. To estimate the sampling variance of the mean of family income, for example, a user would estimate the mean 999 times using the replicate weights and compute the standard error of that estimate. An estimate of the standard error due to both imputation and sampling is given by the square root of the sum of the estimated sampling variance and 6/5 times the imputation variance. The replicate weights associated with this release of the data were recomputed along with the main weight to ensure consistency with the 1989 and 1995 SCFs.
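The replicate-weight calculation might be sketched in Python as follows. The layout of the replicate file is not described here, so the code simply assumes that each replicate is available as a pair of lists (weights, multiplicities) aligned with the first-implicate values; all function names are hypothetical. The sampling variance is taken as the sample variance of the replicate estimates, which is one reasonable reading of the text rather than a documented formula.

```python
from math import sqrt
from statistics import variance


def replicate_mean(values, rep_weights, multiplicities):
    """Weighted mean for one replicate, using weight times multiplicity."""
    effective = [w * m for w, m in zip(rep_weights, multiplicities)]
    return sum(v * e for v, e in zip(values, effective)) / sum(effective)


def sampling_variance(values, replicates):
    """Variance of the estimate across replicates.

    `replicates` is assumed to be an iterable of (weights, multiplicities)
    pairs, one per replicate (999 in the full file). `values` should come
    from the first implicate only, since replicate weights exist only there.
    """
    estimates = [replicate_mean(values, w, m) for w, m in replicates]
    return variance(estimates)


def total_standard_error(sampling_var, imputation_var, n_implicates=5):
    """Combine both sources: sqrt(sampling + (1 + 1/m) * imputation)."""
    return sqrt(sampling_var + (1 + 1 / n_implicates) * imputation_var)
```

With five implicates the factor (1 + 1/m) is the 6/5 given in the text.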

SUMMARY VARIABLES
We have not made an effort to include summary variables (e.g., net worth) in the dataset. Although it is complicated to construct such variables, it is our belief that a substantial amount of judgment is involved in selecting which variables to include, and that analysts should make their own decisions. However, at the end of this file, we have included code to compute net worth according to our routine definitions.

DISCLOSURE REVIEW
To protect the privacy of individual respondents, the data in this release have been systematically altered by several means to minimize the possibility of identifying any survey respondent. For some discrete variables, small or unusual cells were collapsed as noted in the variable descriptions below. Continuous variables were rounded. Data were also blurred by other unspecified means. In addition, a number of other cases were identified for more extensive treatment. Some of these cases were selected on the basis of extreme or unusual data values. Other cases were selected at random. For each of these cases, a selection of critical variables was set to missing and statistically imputed subject to constraints designed to ensure that any distortions induced in key population statistics would be minimal. The geographic identifiers included here have been systematically altered for a subset of respondents by swapping their locations with those of otherwise similar respondents.

It is important to note that aside from the cell collapsing, there is no key in this codebook or in the dataset that would allow users to identify directly either which data items have been smoothed or otherwise altered, or which cases were selected for imputation of critical values (that is, the shadow variables in this dataset may not always reflect the true original status of every variable). Although this blurring of the data will have some effect on analysis, that effect should be negligible in almost every case. For further details on the procedures taken to protect the identity of respondents, see "Disclosure review and its implications for the 1992 Survey of Consumer Finances", by Gerhard Fries, Barry Johnson, and R. Louise Woodburn. Users who feel that the restrictions imposed on the public dataset are too constricting are encouraged to submit written proposals for expanded data release, and those requests will be given serious consideration in the release of data from future surveys.

CASE ID NUMBERS
Under the original numbering system (XX1), the sample design is apparent from the identification numbers. Thus, each case included in the public version of the dataset has been given an identification number (YY1), which is intended to mask the knowledge of which cases were drawn from the SCF list sample. It is not possible to know with certainty from the information provided in the public version of this dataset which cases derive from the list sample. Because we routinely use the original numbers internally, users who direct questions to us about specific cases might want to be sure to emphasize that they are using the external ID number to avoid confusion.

DATA REVIEW
We have spent many hours searching for errors in the data. Many seeming inconsistencies are actually in the raw data and appear to have no obvious reconciliation (most prominently the fact that X5729, total income, is not always equal to the sum of the income components). Other types of inconsistencies may have been induced as a byproduct of imputation, even though elaborate checks are built into the imputation routines. We ask our colleagues who use this dataset to help us find the remaining resolvable inconsistencies. Our presumption is always that the respondent understood each question and reported accurately, and that the process of transcription and coding did not distort that information. In the relatively small number of cases where other information gave us more than a reasonable doubt about the validity of the data, we have changed data, either by altering that value directly or by setting it to missing and imputing it.

CONTACT INFORMATION
It is likely that some users will have trouble understanding the organization of the data at first. If after having framed a focused question and exhausted all of your local resources your problem persists, you may contact Gerhard Fries ((202) 452-2578 or e-mail [email protected]) or me ((202) 452-2247 or e-mail [email protected]). We prefer correspondence via e-mail. While we would like to be helpful to you, please realize that we have very limited resources to devote to user services. We hope that by persistence, you will almost always be able to figure out what you need by consulting the questionnaire and the codebook below.

RANGE DATA COLLECTION AND J-CODES

Definitions of the "J" Variables (1992 version)


0  = value reported on original tape (possibly altered during editing,
     but no evidence on problem sheets to this effect -- NOTE: problem
     sheet information is not comprehensive).

1  = question is inapplicable for R (e.g., R has no checking account
     so value of checking account is coded as zero -- NOTE: there are
     no zeros in the dataset other than such values).

2  = data moved from another location (not including re-arranging
     columns in a grid); data moved from another location and added to
     data already at new location (e.g., wage income from spouse
     reported in independent adult part of section Y added to data
     reported for R in Section T).

3  = data provided for a question with a branch structure, but not
     known which branch data should be in (e.g., AGI given, but filing
     status unknown).

4  = evidence that data imputed from marginal notes.

8  = recode of survey variables, no missing values in antecedents.

9  = recode of survey variables, insufficient data collected to
     compute value, not imputed.

10 = part of reported value reported elsewhere and edited out here
     (e.g., wage income of NPEU member also reported at X5701 along
     with income of PEU resulting in J5702=10) or entire reported
     value reported elsewhere  and edited out here (e.g., all of wage
     income of NPEU member  reported at X5701 resulting in X5701=5,
     J5701=10, X5702=0 and J5702=14).

12 = in case of regular installment loans where term is DK, non-missing
     typical payment moved to monthly payment section.
 
13 = coded value overridden after editing completed
14 = value set to inap given hard-code decision (12, 13 or 15)
15 = hard-coded imputation determined during cleaning.
16 = other reassignment resulting from cleaning that overrides
     reported data (e.g., the cleaning of the institutions grid in
     Section A).
17 = value of originally missing data item implied by other variable(s).

24 = Range Card response: A.  $1 to $100
25 = Range Card response: B.  $101 to $500
26 = Range Card response: C.  $501 to $750
27 = Range Card response: D.  $751 to $1,000
28 = Range Card response: E.  $1,001 to $2,500
29 = Range Card response: F.  $2,501 to $5,000
30 = Range Card response: G.  $5,001 to $7,500
31 = Range Card response: H.  $7,501 to $10,000
32 = Range Card response: I.  $10,001 to $25,000
33 = Range Card response: J.  $25,001 to $50,000
34 = Range Card response: K.  $50,001 to $75,000
35 = Range Card response: L.  $75,001 to $100,000
36 = Range Card response: M.  $100,001 to $250,000
37 = Range Card response: N.  $250,001 to $1,000,000
38 = Range Card response: O.  $1,000,001 to $5,000,000
39 = Range Card response: P.  $5,000,001 to $10,000,000
40 = Range Card response: Q.  $10,000,001 to $25,000,000
41 = Range Card response: R.  $25,000,001 to $50,000,000
42 = Range Card response: S.  $50,000,001 to $100,000,000
43 = Range Card response: T.  More than $100,000,000
44 = Range Card response < 0: A.  -$1 to -$100                  
45 = Range Card response < 0: B.  -$101 to -$500                
46 = Range Card response < 0: C.  -$501 to -$750                
47 = Range Card response < 0: D.  -$751 to -$1,000              
48 = Range Card response < 0: E.  -$1,001 to -$2,500            
49 = Range Card response < 0: F.  -$2,501 to -$5,000
50 = Range Card response < 0: G.  -$5,001 to -$7,500            
51 = Range Card response < 0: H.  -$7,501 to -$10,000           
52 = Range Card response < 0: I.  -$10,001 to -$25,000          
53 = Range Card response < 0: J.  -$25,001 to -$50,000          
54 = Range Card response < 0: K.  -$50,001 to -$75,000
55 = Range Card response < 0: L.  -$75,001 to -$100,000         
56 = Range Card response < 0: M.  -$100,001 to -$250,000        
57 = Range Card response < 0: N.  -$250,001 to -$1,000,000      
58 = Range Card response < 0: O.  -$1,000,001 to -$5,000,000    
59 = Range Card response < 0: P.  -$5,000,001 to -$10,000,000   
60 = Range Card response < 0: Q.  -$10,000,001 to -$25,000,000  
61 = Range Card response < 0: R.  -$25,000,001 to -$50,000,000  
62 = Range Card response < 0: S.  -$50,000,001 to -$100,000,000 
63 = Range Card response < 0: T.  Less than -$100,000,000      

150 = original response was DK.
151 = original response was NA (includes refusals, interviewer errors,
     and missing data resulting from editing decisions).  Does not
     include data missing as a result of missing higher-order questions.
152 = original response missing as a result of missing information for
     a higher-order question (typically a YES/NO cut question).  In
     this case, the higher-order question has been imputed in such
     a way as to render response appropriate.
153 = refused
154 = some, DK how many (see B6).

160 = unresolved data problem (none should remain in final dataset).

179 = data missing because of questionnaire error, or data not collected
180 = recode variable, missing because data not collected for
     sub-group, data to be imputed.
181 = recode variable, some, but not all components originally missing.
182 = recode variable, all components originally missing.

188 = for property value, only assessed value given.

197 = override of reported information with (at least partially)
      imputed data (e.g., number/type of institution in Section A is
      overridden after imputation of institutions to account for new
      institutions).
198 = override of reported/inap. information (e.g., R says has 1 IRA,
      but 2 institution types reported; institution reference refers to
      Section A column, but column inap) -- value set to missing.

199 = used for absent spouse for J104 or J105 when X104 OR X105 < 0.
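For users who want to recover the dollar bounds behind a range-card J-code (24 through 63), a lookup along the following lines may be convenient. This Python sketch is illustrative only: the names RANGE_BOUNDS and range_card_bounds are hypothetical, bracket F is taken to start at $2,501 and bracket K to end at $75,000 so that adjacent brackets do not overlap (an assumption), and the open-ended bracket T is represented with None for the unbounded side, assuming whole-dollar amounts.

```python
# Dollar bounds for range-card letters A through T (J-codes 24 to 43).
RANGE_BOUNDS = [
    (1, 100), (101, 500), (501, 750), (751, 1_000), (1_001, 2_500),
    (2_501, 5_000), (5_001, 7_500), (7_501, 10_000), (10_001, 25_000),
    (25_001, 50_000), (50_001, 75_000), (75_001, 100_000),
    (100_001, 250_000), (250_001, 1_000_000), (1_000_001, 5_000_000),
    (5_000_001, 10_000_000), (10_000_001, 25_000_000),
    (25_000_001, 50_000_000), (50_000_001, 100_000_000),
    (100_000_001, None),  # T: more than $100,000,000
]


def range_card_bounds(j_code):
    """Return (low, high) dollar bounds for a range-card J-code, else None.

    Codes 24-43 are the positive brackets; codes 44-63 are the same
    brackets for negative amounts, returned with low <= high.
    """
    if 24 <= j_code <= 43:
        return RANGE_BOUNDS[j_code - 24]
    if 44 <= j_code <= 63:
        low, high = RANGE_BOUNDS[j_code - 44]
        # Mirror the bracket: negate both ends and swap their order.
        return (None if high is None else -high, -low)
    return None
```

For example, range_card_bounds(32) gives the bounds of bracket I, and any J-code outside 24 through 63 returns None.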

General instructions for J variable coding for recoded variables:
  When a recoded variable is taken directly from another single X
    variable, it should have the same J variable code.
  When a recoded variable may come from a single variable in the
    original X variables, or as the result of a calculation based on
    some number of X variables, it is important to distinguish the
    information content in the J variables.  When the value is taken
    directly, the J variable should have exactly the same value as
    that for the X variable's shadow J variable.  When some
    calculation is involved, this should be reflected in the J
    variable -- codes 8, 181, and 182.
  When a recode cannot be computed because some part of the underlying
    information was not collected for some subset of cases, the
    recode's J variable should be coded 9 or 180.
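The general instructions above could be expressed roughly as the following Python function. It is a sketch under stated assumptions, not the procedure actually used to build the dataset: the set of J-codes treated as "originally missing" (150 through 153) is a guess made for illustration, and the function and parameter names are hypothetical.

```python
# Assumption: these J-codes mark a component as originally missing.
MISSING_CODES = {150, 151, 152, 153}


def recode_j_code(component_j_codes, collected=True, to_be_imputed=False):
    """Suggest a J-code for a recoded variable from its components' J-codes.

    - Data not collected for the case: 180 if the recode is to be
      imputed, otherwise 9.
    - Recode taken directly from one X variable: that variable's J-code.
    - Recode calculated from several variables: 8 if no component was
      missing, 182 if all were missing, 181 otherwise.
    """
    if not collected:
        return 180 if to_be_imputed else 9
    codes = list(component_j_codes)
    if not codes:
        raise ValueError("at least one component J-code is required")
    if len(codes) == 1:
        return codes[0]
    n_missing = sum(code in MISSING_CODES for code in codes)
    if n_missing == 0:
        return 8
    if n_missing == len(codes):
        return 182
    return 181
```

Users who classify range responses or other codes as missing can adjust MISSING_CODES accordingly.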

Last update: October 20, 1999, 5:00pm