L.A.FANS Data FAQs
Questions asked about LAFANS by Users
As interesting questions about the LAFANS come in over time, we try to add them to this list of previously asked questions. Users might browse through these to see if the responses answer any current questions they may have.
We've organized the questions into loose categories that seem to cover the main topic areas to which most questions appear to be related. This organization is not perfect so users may want to look at several category areas in search of a particular question item. The list of categories may be added to over time.
The main categories (in no particular order) are:
- SAMPLING/SAMPLE WEIGHTS
- OCCUPATION CODING
- MARITAL STATUS/COHABITATION
- LINKING DATA
- NEIGHBORHOOD CHARACTERISTICS
- HOUSEHOLD ECONOMY/IMPUTED INCOME DATA
- IDENTIFYING FAMILY MEMBERS
- EHC DATA
- ADULT MODULE
- PARENT MODULE
- WOODCOCK-JOHNSON ASSESSMENTS
QUESTION: I am working on a project and wanted to double check a few things.
1. If I am running simple cross-sectional regressions for all children ages 12-19 in Wave 2, should I use the XWGT_RSCSIB_LAF or XWGT_RSCSIB_LAC?
2. In the same regressions, I need to account for the clustering of individuals. I was originally using the cluster option with the 1990 census tract indicator. Is this correct? Or, if I use weights, do I still need to use the cluster option?
Whether you use "XWGT_RSCSIB_LAF" or "XWGT_RSCSIB_LAC" depends on whether you want your estimates to represent all children ages 12-19 in (a) just the 65 L.A.FANS tracts or (b) all of L.A. County. If it's (a), use the former weight; if it's (b), use the latter.
Yes, you should account for clustering of individuals using the original 1990 census tract indicator, which is the basis for sample selection.
Yes, you should still use the cluster option when you use the weights.
QUESTION: I am working on an analysis using the Wave II adult module (adult2) and have a question about the weights for all of these respondents. I am using the LA County weight, XWGT_ADULT_LAC. According to Wave II codebook p.45 this weight includes RSA, PCG, and RSC/SIB 18+ years. Once I merged this variable from the weight file (indivwgts2) with the Wave II adult module (adult2) I discovered that the individuals with weights (merged based on hhid2 pid2) do not align with the respondents in the adult module. Specifically:
- 455 RSC and SIB 18+ do NOT have weights
- 20 respondents with weights are not in the Wave II adult module at all
- 1864 have a weight and a complete Wave II adult module
Can you help me to understand what might be going on here? Did I select the appropriate weight file? Is there an error?
The user's guide is a bit misleading, it seems; that's something we didn't catch. For RSAs and PCGs, use XWGT_ADULT_LAC as the weight; for the aged-up RSCs and SIBs, use their XWGT_RSCSIB_LAC value. Note that an adult respondent may have a weight even if they did not do the adult module, as long as they did another module such as the PCG or Health Measures.
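The rule above can be sketched as a small lookup. This is only an illustration: the `resp_type` field and the sample values are hypothetical and not part of the L.A.FANS files, so you would need to derive the respondent type from your own merged data.

```python
def adult_analysis_weight(resp):
    """Pick the L.A. County weight for one Wave 2 adult-module respondent.

    `resp_type` is a hypothetical field, not an L.A.FANS variable.
    """
    if resp["resp_type"] in ("RSA", "PCG"):
        return resp.get("XWGT_ADULT_LAC")
    if resp["resp_type"] in ("RSC", "SIB"):
        # Aged-up children keep their child weight.
        return resp.get("XWGT_RSCSIB_LAC")
    return None


# Made-up records for illustration only.
respondents = [
    {"hhid2": 1, "pid2": 1, "resp_type": "RSA",
     "XWGT_ADULT_LAC": 812.4, "XWGT_RSCSIB_LAC": None},
    {"hhid2": 1, "pid2": 3, "resp_type": "RSC",
     "XWGT_ADULT_LAC": None, "XWGT_RSCSIB_LAC": 455.0},
]
weights = [adult_analysis_weight(r) for r in respondents]
```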
QUESTION: I was wondering if you could clarify the differences between the sample weights for the RSC and PCG (WGTRSC and WGTPCG) and the Child and Adult sample weights (WGTKID and WGTADLT).
Is there any sample weight for a combined analysis using both the RSC and the PCG?
The sample weights are explained in the LAFANS codebook (DRU-2400/2--it's the pdf) starting on page 41. Please read the sample weights section for specific details on the LAFANS sampling weights.
The weights are designed for different analysis samples. If you are only working with RSCs, you use the WGTRSC weight. If you are using a sample that pools RSCs and SIBS, you use the WGTKID weight. If you are using a sample only of those who are PCGs (this includes PCGonlys and those RSAs who were also selected as the PCG), you use the WGTPCG. If you are using a sample of all adult respondents (this is all RSAs and all PCGs, remembering that some respondents are both an RSA and a PCG), then you use the WGTADLT sampling weight.
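As a quick reference, the choices in the paragraph above can be written as a lookup table. The sample labels used as keys are hypothetical names chosen for this sketch; only the weight variable names come from the codebook discussion.

```python
# Wave 1 weight variable for each analysis sample described above.
# The dictionary keys are illustrative labels, not L.A.FANS terms.
WEIGHT_FOR_SAMPLE = {
    "rsc_only": "WGTRSC",      # RSCs only
    "rsc_and_sib": "WGTKID",   # pooled RSCs and SIBs
    "pcg_only": "WGTPCG",      # PCG-onlys plus RSAs selected as PCG
    "all_adults": "WGTADLT",   # all RSAs and all PCGs
}


def weight_variable(sample):
    """Return the weight variable name for a named analysis sample."""
    try:
        return WEIGHT_FOR_SAMPLE[sample]
    except KeyError:
        raise ValueError(f"unknown analysis sample: {sample!r}") from None
```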
The sample weights for either the RSC or the PCG are what you need. It's a classic issue in sampling or demography. If your analysis is based on a sample of mothers (or PCGs, to be more precise) and that is the group you want to make inferences to, you should start with the PCG sample, add in information on RSCs, and use the PCG weights. If, on the other hand, your analysis is based on children and that is the group you want to make inferences to, you should start with the RSC (or the RSC/SIB sample), add in information from the PCG for each child, and use the child weights. Before starting any analysis, I strongly recommend thinking about who your sample should be.
My guess is that you want to make inferences about children. In that case, as I said above, your analysis should be based on the child sample and you should use child weights.
For children, you have two choices in LAFANS, as you know. You can use only the RSCs, who are a random sample of one child per household; in that case, use the RSC weights. Or you can use both the RSCs and SIBs, in which case we have provided a "child weight" that corrects for the sampling of children. If you use both RSCs and SIBs, however, you should also use a statistical procedure that corrects for the fact that you can have two children from the same household.
QUESTION: I'd like to know more about the nonresponse adjustment used in LA FANS. I've been reading the documentation and all I could find was a paragraph in the "LA FANS Codebook" (page 46) that says you used a raking procedure on all the two-way cross-classifications of age, gender and race/ethnicity.
Is there any other document with more information on these procedures? Where do these 3 variables come from? (were they completed by interviewer observations during the screener?). I already reviewed the report on nonresponse, the one about sampling design and the one about the survey design, and I cannot find these answers.
What I really want to know is if these 3 variables were self-reports of the respondents or were observations made by the interviewers (without asking the respondents).
During the screener, the interviewers are asked to complete these data by themselves. Since you have these "interviewer observations" for the 7,638 households screened, I wanted to know whether these age/sex/race variables were the ones used in the nonresponse adjustment.
Or did you use the info provided by the respondent on the roster? Or on the adult questionnaire? Since you have to have information on respondents and nonrespondents to do the nonresponse adjustment, I thought you were using the screener data. Am I wrong?
There is no other documentation about the L.A.FANS weighting procedures, other than what's provided in the Codebook.
Just to elaborate on the write-up there, the goal of the weighting procedure was to have the marginals of the two-way tables (which reflect the three variables: age, sex, and race/ethnicity) based on the survey match the same marginal distributions from the 2000 census (for Los Angeles County as a whole). Raking was necessary in order to get all the marginals to match. The weighting procedure thus incorporates an implicit adjustment for non-response.
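For readers unfamiliar with raking, here is a minimal sketch of the general idea (iterative proportional fitting). It is not the L.A.FANS weighting code: it rakes a single two-way table to one-way row and column targets, whereas the codebook describes matching the marginals of several two-way tables, and all numbers are invented.

```python
def rake(table, row_targets, col_targets, iterations=100):
    """Scale a two-way table of weighted counts until its margins match the targets."""
    cells = [row[:] for row in table]  # work on a copy
    n_rows, n_cols = len(cells), len(cells[0])
    for _ in range(iterations):
        # Step 1: rescale each row to its target total.
        for i in range(n_rows):
            factor = row_targets[i] / sum(cells[i])
            cells[i] = [value * factor for value in cells[i]]
        # Step 2: rescale each column to its target total.
        for j in range(n_cols):
            factor = col_targets[j] / sum(cells[i][j] for i in range(n_rows))
            for i in range(n_rows):
                cells[i][j] *= factor
    return cells


# Invented survey counts (rows and columns stand for two demographic variables)
# and invented census margins that sum to the same total.
survey = [[30.0, 20.0], [10.0, 40.0]]
raked = rake(survey, row_targets=[60.0, 40.0], col_targets=[45.0, 55.0])
row_totals = [sum(row) for row in raked]
col_totals = [sum(col) for col in zip(*raked)]
```

Each cell's final value divided by its starting value is the adjustment factor that would be applied to the weights of respondents in that cell.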
The three variables (age, sex, and race/ethnicity) were collected through interviews with L.A.FANS respondents. In other words, respondents reported on their own characteristics and parents (typically mothers) reported on their children's characteristics. The matching population figures were collected in the 2000 Census and reported in the summary files as cross-tabulations.
In constructing the weights, we used self-reported (or parent-reported, in the case of children) information on sex, age, and race/ethnicity from the adult (or parent) questionnaire. In cases where self-reported information was missing, we filled it in with information from the Roster. (Note that item non-response levels on race/ethnicity in the questionnaires were quite low.)
We also had interviewer observations of sampled respondents' race/ethnicity--but these were not used for weighting. In fact, we have multiple different measures of race/ethnicity (roster, self-report, self-report "best" race if multiple races reported, interviewer observation, etc.).
Information from the screener was not used in constructing any of the weights. Recall that the screener respondent was not necessarily one of the sampled respondents. For non-response cases, we don't have any further information on household composition to try to figure out the type of respondent and for many screener refusals and incompletes we didn't have any interviewer observation information (because the interviewers did not complete it).
QUESTION: LOW SES neighborhoods were oversampled… Does this need to be taken into account in analysis or did you guys correct for it in some way?
Please read the section of the L.A.FANS codebook on sampling weights at http://www.rand.org/pubs/drafts/DRU2400.2-1/. The sampling weights take care of both the oversample by neighborhood poverty level and the oversample of households with children. It may also help you to read: http://www.rand.org/pubs/reprints/RP1241/
QUESTION: I have been using Level 2 LAFANS data, and we have some concerns about design effects and generalizability to the larger population. Although I've looked through the codebooks, I don't see any variables or weights that could be used to account for design effects or to weight the data in order to make it match the characteristics of the larger LA population. Can you please let me know if these options are available?
If you look at the L.A.FANS Codebook (http://www.rand.org/pubs/drafts/DRU2400.2-1/) on pp. 41-46 you will find a description of the sample weights that account for the sampling design of the survey. If you are using special estimation procedures, such as "svy" commands in Stata, you can use the weights in conjunction with the stratifying and clustering variables (also described in the manual) to obtain corrected standard errors.
QUESTION: Is there a variable that incorporates the stratification within the sample design? Was any clustering used in the sample design and if so--what variable would contain the sample cluster?
I am using STATA for my analysis and want to ensure that my standard errors are correct. I am therefore using the sample weight variable (in my case the sample weight for the RSC), as well as the stratification variable (POVCAT) to adjust my analysis--however I am wondering what the variable is that incorporates the clustering within the sample design?
The POVCAT variable in the MODSTAT1 and ROSTHH1 files contains the sampling strata in the sample design. The POVCAT variable is also described in Chapter 4 of the main LAFANS codebook under the "Community Characteristics" subsection.
The sample design is described in detail in:
Sastry, Narayan, Bonnie Ghosh-Dastidar, John Adams, and Anne R. Pebley
The Design of a Multilevel Survey of Children, Families, and Communities: The Los Angeles Family and Neighborhood Survey — 2003
This and several other papers documenting the study are available on www.lasurvey.rand.org under publications.
The short answer to your question is that there were two oversamples: (1) we oversampled in poor and very poor strata, and (2) we oversampled households with children. But please see the publication listed above for a more complete description.
If you're using Stata for your analysis, you might want to use the "svy" set of routines, which would involve specifying the following sample design parameters:
pweight: appropriate individual or household weight
strata: POVCAT
psu: TRACT
One thing to note is that if you are using all children (i.e., both RSCs and SIBs) in your analysis, then you will have an additional level of clustering. A similar situation would occur if you are using all adults (i.e., both RSAs and PCGs). There are a number of ways to account for this additional level of clustering--you should probably consult with a statistician to get some specific guidance that best applies to your situation.
Note that to use TRACT, you must have either restricted version 1 data which has the pseudo tract variable TRACTX or the restricted version 2 data that has the actual census tract in TRACTH90.
QUESTION: I want to use the LAFANS data to calculate the number of natives and immigrant (adults) in Los Angeles county. I have identified citizens and immigrants in the data but am not sure on how use the weights to compute the total population numbers. I would appreciate any help with this.
The LAFANS excludes all non-English and non-Spanish speaking households. So using weighted tabulations of native vs immigrant from the LAFANS will not give one an estimate of the native vs immigrant distribution in LA county. One would be missing Asian immigrants, Russian immigrants, etc.
If you were focusing just on natives and immigrants from English- or Spanish-speaking households, you probably could use weighted tabs from the LAFANS and apply them to census numbers based on English- and Spanish-speaking households only.
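The approach in the paragraph above amounts to computing a weighted share from the survey and multiplying it by an external population count. A minimal sketch, assuming a hypothetical `immigrant` flag you have already constructed and a made-up census total for adults in English- or Spanish-speaking households:

```python
def weighted_share(records, flag, weight):
    """Weighted proportion of records for which `flag` is true."""
    total = sum(r[weight] for r in records)
    flagged = sum(r[weight] for r in records if r[flag])
    return flagged / total


# Invented adult records; `immigrant` is a hypothetical derived flag.
adults = [
    {"immigrant": True, "WGTADLT": 300.0},
    {"immigrant": False, "WGTADLT": 500.0},
    {"immigrant": True, "WGTADLT": 200.0},
]
share_immigrant = weighted_share(adults, "immigrant", "WGTADLT")

# Placeholder census count, not a real figure.
census_adults = 1_000_000
estimated_immigrants = share_immigrant * census_adults
```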
QUESTION: I am working with just the RSAs, and want to apply the post-stratification sample weights. My first question is: Do I use the wgtrsa variable? In order to apply the sample weights on Stata, I also need to specify the PSU (primary sampling unit) and strata. So my second question is: Do you know what variables correspond to these in the public use data? I would imagine some variable corresponding with census tract or pseudo census tract would be the PSU, but I can't find either in the public use data.
I'd suggest rereading the section of the codebook on weights. As the codebook says, you use the wgtrsa weights if you are using the RSA only sample and wgtadlt if you are using all adults. The weights all account for the strata oversampling, the household selection probabilities by tract, and tract-specific rates of oversampling of households with children and of household nonresponse. To apply the sample weights in STATA, there are at least two ways to do it. You can use the svy procedures or the weight= subprocedure for each procedure (e.g., logit). In the weight= procedure you do not need to specify the strata or PSUs, since they are already incorporated into the weights.
If you want to use the svy procedures and specify both the PSUs and the strata, you have to use restricted version 1 data and not the public use data. Remember that the public use data does not specify which respondents are in the same census tract (PSU). That's the reason the data are not restricted. The public use data do, however, provide information on which stratum a household is in. This variable is povcat.
So basically you have two options: (1) use the public use data with the weights and the weight= option, correcting the standard errors for stratum only (using povcat), or (2) use the version 1 restricted data with the svy commands and full correction of the standard errors for both PSU and stratum.
QUESTION: In the codebook for wave 1, it indicates that there is no reliable household weight so it is not being released: "household weights are not being released because we were unable to construct satisfactory control totals for use in adjusting the weights to match the 2000 Census" (pg 41). I did find a household weight (WGTHH) in the private data, however. I was wondering if this is a good measure that I can use to compare to Census data. I am also a bit curious to find out more about why the household weight wasn't constructed for public use, or just a bit more detail behind the statement on page 41 in the codebook.
Providing household weights significantly increases the probability of deductive identification of respondents. Therefore, we are not able to release this information in the public use data.
We believe that the household weight in the restricted data is a good weight and you should feel comfortable using it. Given the risk of deductive disclosure, and the fact that LAFANS data were designed to be used as individual and not household data, we originally decided not to release a household weight at all. But subsequently, we found that users of the restricted data were interested in having it, so we have released it in the restricted data. As with other variables in the restricted data, you have to comply with the security restrictions you agreed to while using this variable.
QUESTION: What is the best way to group the occupation codes (occ_ac17) in the adult questionnaire into broader occupational categories? Are these codes in any meaningful order?
Chapter 4 of the LAFANS main codebook describes how the occupation codes were created. They are based on the codes used in 1990 Census of Population and Housing. The footnote citing the source of the occupation codes says: "CPH-R-4, 1990 Census of Population and Housing: Classified Index of Industries and Occupations: 1990. This document is a companion to the Alphabetical Index (CPH-R-3). Industry and occupation titles are arranged by classification code. It presents for each category in the industrial and occupational classification systems the individual titles that constitute the category."
The following is a document that talks about the relationship between the 2000 Census occupation codes and those used in 1990 (which are the codes LAFANS uses). The pdf has appendices listing the 1990 and 2000 census codes. I don't know anything about this document. I just found it via Google when looking for CPH-R-4 and it seemed to illustrate that 1990 and 2000 census codes differed and that the SOC codes are not what are used in the Census. www.census.gov/hhes/www/ioindex/pdfio/tech_0203.pdf
As noted above, the LAFANS did not use the SOC codes but used those from the 1990 Census. The above pdf shows the 2000 SOC equivalents for the 2000 Census codes. It has the 1980 SOC equivalents for the 1990 Census occupation codes. It does not show any crosswalk between 1990 Census occupation codes and 2000 SOC codes.
The 2-digit codes in the LAFANS represent combinations of 3-digit 1990 Census occupation codes and are not just the 1st two digits. There is a crosswalk table (Table A.1) in Appendix A of the LAFANS main codebook that shows which 3-digit codes are used to create each 2-digit code. Appendix A shows what the 3-digit codes represent. Table 4.4 in the LAFANS Main codebook shows what the 2-digit codes represent. All of this is discussed in Chapter 4 of the LAFANS main codebook.
QUESTION: I am trying to identify which PCGs are legally married and which ones are cohabiting.
I determined married vs. cohabiting based on the following: If ae1=1, I coded PCGs as married and If (ae1==2 |ae32==1 |ae41==1), I coded PCGs as cohabiting.
However my numbers aren't consistent with the married variable in the PCG dataset. The documentation suggests using AE1 rather than the married indicator, but I'm worried about the fact that my numbers don't match up very well. Is there a better way to identify whether the PCG is married or cohabiting?
One more thing--- I cross-tabulated spous_id by ae32 and ae41 (both of which explicitly ask about cohabitation) and there are a few discrepancies. It also looks like there are 65 cases with ae1==2 (married and cohabiting), but no spous_id. What do you think is the best way to handle these discrepant cases?
There is always a problem with people cohabitating who say they are married, so it's not always possible to correctly identify those who are truly married and those who are living with a partner.
What I do is use AE1 and SPOUS_ID from the adult module (those with RSA_TYPE=2 or 3 are PCGs). If AE1=1 then the person is married; if AE1>1 and SPOUS_ID>0 then they are cohabitating. If they have no partner listed in the household roster (i.e., SPOUS_ID is blank), then we say they are not cohabitating. It's sort of the best one can do. Remember, the MARITAL field in the roster is only self-reported for the roster respondent. The AE1 is a self-report on marital status for the adult respondents.
AE1=2 does not mean "married and cohabitating". It means separated OR married to someone else and living with another. You can be separated and not living with a partner. Most cases of AE1=2 are separated, as opposed to married but living with someone else.
If AE32 says they are living with someone or AE41 says living with someone, then use that info since it is self-reported. The SPOUS_ID is based on the roster respondent saying who the spouse/partner is. Now if that partner happens to be temporarily absent, the roster respondent would not list them. Thus the AE32/AE41 reports are all we have to know about those living with a partner who is temporarily absent.
I should have mentioned this in my earlier email. The use of AE1>1 and spous_id>. to identify cohabitors is a quick and dirty method. The use of AE32/AE41 helps clean that up a bit.
Now there are likely to be a few cases where the person says they are not living with anyone at AE32 or AE41, but the roster respondent says the person is. At that point, you have to decide which to believe. Since some people don't like to admit living with someone, the roster respondent report might be preferred. It's the analyst's call.
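The rules discussed in this answer can be combined into one classification function. This is a sketch under stated assumptions: it treats a value of 1 on AE32 or AE41 as "living with someone" (as in the question above), represents a blank SPOUS_ID as None, and resolves disagreements by accepting either the self-report or the roster report, which is only one of the analyst's options.

```python
def partnership_status(ae1, spous_id=None, ae32=None, ae41=None):
    """Classify an adult respondent as married, cohabiting, or not cohabiting."""
    if ae1 is None:
        return None  # marital status not reported
    if ae1 == 1:
        return "married"
    # Self-reported cohabitation (assumes code 1 means living with someone).
    self_report = ae32 == 1 or ae41 == 1
    # Partner listed by the roster respondent.
    roster_partner = spous_id is not None and spous_id > 0
    if self_report or roster_partner:
        return "cohabiting"
    return "not cohabiting"
```

To prefer the roster report or the self-report when they conflict, replace the `or` in the final test with whichever source you trust.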
QUESTION: How do I link data about the mother to data about the child?
Please read chapter 5 of the LAFANS-1 main codebook. It describes how to identify and link various types of individuals, including parents and children.
QUESTION: I have been trying to merge imputed family income onto the PCG dataset and keep losing records. Based on the documentation, it looks like I should merge on SAMPLEID and HHRF, but I tried doing so and lose 587 records. If I merge on SAMPLEID and PID, I lose the same records. Is there a better way to merge these files?
There are two ways to add imputed income file data to adult respondents, be they RSAs or PCGs.
One is to use SAMPLEID and HHFAM1 in the adult and pcg files, and match them to SAMPLEID and PID in the imputed income file. HHFAM1 is the person id of the household respondent that goes with the given adult/pcg respondent. For linking HHLD1 data, you would use SAMPLEID and PID as well to link to SAMPLEID and HHFAM1. Ignore HHFAM2 in the adult/pcg files as it's blank for virtually all cases except the odd few where we collected two household modules for the same family. If you care about it, you can look at those two household modules and decide which one you want to use or combine them.
The second method uses SAMPLEID and HHRF to merge imputed income data after one has added HHRF from the ROSTER1 file to the relevant adult respondent file (that merge is by SAMPLEID and PID).
Now remember that in the IMPINC1 file there are 49 SAMPLEID/HHRF combinations that appear more than once, since FIs wound up giving the household module to two people in the same family when they only needed to do one. Thus, you need to remove those duplicates (they all have the same totals for FAMINC) from IMPINC1 first; otherwise, when you merge the income data onto the PCG file, you'll get some PCGs with two income records and the file size will grow unexpectedly.
Note that the HHRF variable is really only needed if you want to link IMPINC1 data to NON-W1 sampled respondents, or if you need to identify those people in the same family. For W1 respondents you don't need to merge on HHRF as you can use HHFAM1, the pid of the household module respondent, to link to the person id (PID) of the respondent in the IMPINC1 file.
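The second method, including the de-duplication step, can be sketched as follows. The records are represented as plain dictionaries with invented values; in practice you would read them from ADULT1/PCG1, ROSTER1, and IMPINC1 with your own tools.

```python
def dedupe_income(impinc):
    """Keep one income record per SAMPLEID/HHRF family."""
    unique = {}
    for rec in impinc:
        # Duplicates carry the same FAMINC, so keeping the first is enough.
        unique.setdefault((rec["SAMPLEID"], rec["HHRF"]), rec)
    return unique


def attach_income(adults, roster, impinc):
    """Add HHRF from the roster, then FAMINC by SAMPLEID/HHRF."""
    hhrf = {(r["SAMPLEID"], r["PID"]): r["HHRF"] for r in roster}
    income = dedupe_income(impinc)
    merged = []
    for adult in adults:
        family = hhrf.get((adult["SAMPLEID"], adult["PID"]))
        rec = income.get((adult["SAMPLEID"], family))
        merged.append({**adult, "HHRF": family,
                       "FAMINC": rec["FAMINC"] if rec else None})
    return merged


# Invented example: family A/1 has two household-module records.
roster = [{"SAMPLEID": "A", "PID": 1, "HHRF": 1},
          {"SAMPLEID": "A", "PID": 2, "HHRF": 1},
          {"SAMPLEID": "B", "PID": 1, "HHRF": 1}]
impinc = [{"SAMPLEID": "A", "HHRF": 1, "PID": 1, "FAMINC": 40000},
          {"SAMPLEID": "A", "HHRF": 1, "PID": 2, "FAMINC": 40000},
          {"SAMPLEID": "B", "HHRF": 1, "PID": 1, "FAMINC": 25000}]
adults = [{"SAMPLEID": "A", "PID": 2}, {"SAMPLEID": "B", "PID": 1}]
merged = attach_income(adults, roster, impinc)
```

Because the income file is reduced to one record per family before the lookup, the merged file has exactly one row per adult respondent.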
QUESTION: I am trying to merge the public imputed income file (IMPINC1.dta) into my child-based dataset. I noticed that HHID seems to be the only logical variable to merge by. However, I am ending up with 86 more cases than I should. Do you have an idea of what could be happening? I have merged all of the public files thus far with no problems (mostly using sampid_n).
If you check the documentation for the impinc1 file (imputed_income.doc or imputed_income.pdf), you'll see that the income info is for FAMILIES within a household. Thus the identifiers to merge by are SAMPLEID and HHRF. As noted in that documentation, the HHRF variable can be found on the ROSTER1 file. You can merge it onto your child-level file using SAMPLEID and PID, regardless of whether PID is the roster id number of the PCG or of the child. The PCG and RSC/SIBs are from the same family.
SAMPLEID and SAMPID_N are the same thing: the former is a string variable, the latter its numeric counterpart. HHID can be used in place of SAMPLEID, as they both identify the household; however, we encourage the use of SAMPLEID/SAMPID_N since it is what we use in all our LAFANS work.
As discussed in the LAFANS documentation (L.A.FANS Codebook introduction.pdf), a household might have more than one household economy module if the RSA (or RSA spouse) was not the PCG of the RSC. In such cases a second household economy module was done to be sure that income information was collected for the family of the RSC. The income questions pertain only to the given respondent, the respondent's spouse and the respondent's children. Other household members are not included.
QUESTION: I have been working with the Adult, Parent, and Child modules. When I merge the Adult and Parent modules (using HHID and PID to merge), there are 66 primary caregivers who do not have observations in the adult module. While a couple of these are documented, most of them are not mentioned in the codebook.
Similarly, there are 36 children who completed the Child module but do not have matching completed Parent modules, and the majority of these are not documented.
Is there a reason these files are not lining up well? I have searched through the documentation for an explanation, to no avail.
The main documentation codebook section on response rates (Section 2) shows that partial completes exist for different types of respondents and explains that partials are those respondents who did not complete all of their requisite modules. I guess we assumed that would be sufficient to alert users to the fact that some respondents won't have all their expected modules.
In section 3 where we briefly discuss the MODSTAT1 file, we note that the final disposition variable, FIN_STAT, has codes of 496-499 for households where not everyone completed everything required of them.
It did not seem necessary to list the individual cases since users could identify problem households from the info in MODSTAT1. The individual module completion status variables in MODSTAT1 show what was done and not done by each type of respondent.
So, as you noticed, there are PCGs who did the Parent module but did not do the Adult module. There are PCGs who did the Adult module but not the Parent module. There are kids in the CHILD module whose mother/PCG did not do the Parent module. If a respondent is not in a module file he/she should be in, it is because the person did not start the module.
The length of the LAFANS sometimes meant that not all modules could be administered in one visit and at the subsequent visits the FI could not connect with the respondent to finish the rest of the modules. Also, some respondents decided it was too much and refused to continue after having done a module or two. You'll also see cases where a respondent started a module and quit shortly thereafter (e.g., in the Adult module). The completion flags in each module file identify those who finished vs those who did not finish the given module.
Because some respondents will not have all the modules required for them, there is obviously a missing data problem that users will have to address. How to handle missing data is up to individual users and their given analyses.
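One way to see which respondents lack an expected module is an anti-join on the person identifiers. The sketch below uses SAMPLEID and PID as keys and invented records; it only lists the unmatched cases and does not explain why a module is missing (MODSTAT1 is the place to check that).

```python
def missing_from(left, right, keys=("SAMPLEID", "PID")):
    """Return records in `left` that have no match in `right` on `keys`."""
    right_keys = {tuple(rec[k] for k in keys) for rec in right}
    return [rec for rec in left
            if tuple(rec[k] for k in keys) not in right_keys]


# Invented example: one PCG did the Parent module but not the Adult module.
parent_module = [{"SAMPLEID": "H1", "PID": 1}, {"SAMPLEID": "H2", "PID": 1}]
adult_module = [{"SAMPLEID": "H1", "PID": 1}]
no_adult_module = missing_from(parent_module, adult_module)
```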
QUESTION: I am trying to merge in the actual census tract numbers from the private data into my public data file. I am using the data in restricted data version 2 called rstrv2_1.dta. It includes the following variables: sampleid pid tracth00 city tracth90. I think this is what I'm looking for, but I am not sure. When I merge my datasets by sampleid and limit my sample to those children who were in the assess1.dta sample, I should get an N of 2,500. Instead, I get over 8 thousand. I am clearly doing something wrong, but I thought all people in the same household should have the same census tract number. Can you please offer any insight?
RSTRV2_1 is a PERSON-LEVEL file, not a household-level file. That's why it has nearly 13,000 records while there are only a little over 3,000 households in the LAFANS data.
To merge on TRACTH90 from it to assess1.dta (which is also a person-level file), you need to merge by SAMPLEID and PID, keeping only those records in the assess1.dta file. By only using SAMPLEID you got not only those people in assess1.dta but everyone else in their households. That's why you had over 8,000 records after the merge.
If you had used RSTHV2_1, a household-level file, and merged TRACTH90 onto assess1.dta by SAMPLEID, keeping only those records in the assess1.dta file, you would have had no problem.
A quick check of the LAFANS restricted Version 2 data documentation, which lists the contents of xxxxV2_1 files, would have shown that RSTHV2_1 is a household-level file because it only has SAMPLEID, and RSTRV2_1 is a person-level file because it has SAMPLEID and PID.
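The difference between the two merges can be seen in a small sketch with invented records. Matching a person-level file on SAMPLEID alone pairs each child with every person in the household, while matching on SAMPLEID and PID keeps one row per child.

```python
def merge_by_household(assess, rstrv2):
    """Incorrect for person-level files: matches on sampleid only."""
    return [{**a, **r} for a in assess for r in rstrv2
            if r["sampleid"] == a["sampleid"]]


def merge_by_person(assess, rstrv2):
    """Matches on sampleid and pid, keeping only records in `assess`."""
    tract = {(r["sampleid"], r["pid"]): r["tracth90"] for r in rstrv2}
    return [{**a, "tracth90": tract.get((a["sampleid"], a["pid"]))}
            for a in assess]


# Invented person-level tract file: 3 people in S1, 2 people in S2.
rstrv2 = [{"sampleid": "S1", "pid": p, "tracth90": "000100"} for p in (1, 2, 3)]
rstrv2 += [{"sampleid": "S2", "pid": p, "tracth90": "000200"} for p in (1, 2)]
# Invented assessment file: one child per household.
assess = [{"sampleid": "S1", "pid": 3}, {"sampleid": "S2", "pid": 2}]

inflated = merge_by_household(assess, rstrv2)
correct = merge_by_person(assess, rstrv2)
```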
QUESTION: I requested and received the public use LAFANS data set from the LAFANS web page, in order to investigate the influence of neighborhood characteristics on English language acquisition amongst immigrant households in the LA area. Am I right in thinking that the files I have from the download page contain a household survey, but no neighborhood survey, no survey of neighborhood characteristics, and no information about the neighborhoods in which the households in the household survey live?
It sounds like you may want to read through the codebook which is available on the website and documents the entire data set (http://www.rand.org/pubs/drafts/DRU2400.2-1/). The description of neighborhood characteristics starts on page 75. There are two neighborhood characteristics in the public use data.
Researchers working with L.A.FANS data generally use two different types of neighborhood characteristics in their analyses. One is census data (available at www.census.gov) or other administrative data that they themselves merge into the L.A.FANS data for individuals and households. To do this merging, you need restricted data version 2 (see the link on the website for "Restricted Data" or the description in the codebook) in order to have access to the actual census tract numbers.
The other is the aggregated responses in section B of the Adult Questionnaire. As described at http://lasurvey.rand.org/documentation/questionnaries/origin_questions.pdf, Section B includes a series of items developed by Robert J. Sampson and his colleagues in the PHDCN study (see Sampson et al. (1999) American Journal of Sociology) to describe social interaction in neighborhoods, plus other neighborhood characteristics as reported by the adult respondents themselves. These items were collected from the RSA sample, which is a stratified probability sample of adults in each neighborhood and is therefore representative of adults living in the neighborhood. The measures are intended to be aggregated across respondents living in each neighborhood and used as neighborhood-level characteristics. To do this aggregation, you need restricted data version 1 (see the link on the website for "Restricted Data") which provides "pseudo-tract IDs." These pseudo-IDs are not actual census tract IDs, but numbers from 1 to 65 that indicate which respondents live in the same tract.
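The aggregation step might look like the following sketch, which averages one Section B item within each pseudo-tract. The item name `b_item` and the values are hypothetical, the pseudo-tract variable is assumed to be TRACTX, and the mean is unweighted for simplicity; whether to weight is an analysis decision this sketch does not settle.

```python
from collections import defaultdict


def neighborhood_means(records, item, tract="TRACTX"):
    """Average a respondent-level item within each pseudo-tract."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if rec[item] is None:
            continue  # skip item non-response
        sums[rec[tract]] += rec[item]
        counts[rec[tract]] += 1
    return {t: sums[t] / counts[t] for t in sums}


# Invented RSA responses; `b_item` is a hypothetical Section B item.
rsa = [
    {"TRACTX": 1, "b_item": 2},
    {"TRACTX": 1, "b_item": 4},
    {"TRACTX": 2, "b_item": 5},
    {"TRACTX": 2, "b_item": None},
]
means = neighborhood_means(rsa, "b_item")
```

The resulting dictionary can then be merged back onto individual records by pseudo-tract to serve as a neighborhood-level covariate.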
QUESTION: A quick question on merging information from the 2000 census to the 65 neighborhoods in the LAFANS sampling frame. I compared maps of the 65 tracts in the 1990 census to the 90 tracts that appear in the 2000 census and confirmed that the most common reason for the increase in tracts between the two censuses was that they had been split (although there were some exceptions). I was curious as to how you have dealt with this issue. How do I get information from the 2000 census for the original 65 tracts as sampled from the 1990 census (and not the 90 tracts that now appear in the 2000 census)?
There are a lot of different methods used by geographers to deal with this problem, including taking weighted averages and more sophisticated approaches. I suggest that you get in touch with a GIS person or geographer at your institution for some advice on how to handle this (although, of course, given that you are working with confidential data, you cannot give the geographer or GIS person the numbers of the tracts within L.A.FANS without getting permission from RAND -- unless the person works for you and signs a Supplementary Agreement).
Another option is to look at the 1990-2000 Census tract relationship and block relationship files available online from the US Census Bureau. These contain proportions to use in converting 2000 tract data back to 1990 tracts.
The URLs are:
For tract relationship files:
For block relationship files:
Another source for this information is the L.A. Neighborhood Services and Characteristics database (L.A.NSC) that is available on the L.A.FANS Web site (http://lasurvey.rand.org/data/contextual/lansc/). This database provides a cross-walk from 2000 tracts into 1990 tracts for all tracts in Los Angeles which is based on the Census files mentioned above. The L.A.NSC also provides detailed information from the censuses and other sources for all Los Angeles census tracts.
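To make the conversion idea concrete, here is a minimal Python sketch of allocating 2000 tract counts back to 1990 tracts using proportions from a crosswalk. The tract labels, the shares, and the three-column layout are invented for illustration; the actual relationship files and the L.A.NSC crosswalk have their own formats, so check their documentation before adapting this.

```python
from collections import defaultdict

# Hypothetical crosswalk rows:
# (2000 tract, 1990 tract, share of the 2000 tract that falls in the 1990 tract)
crosswalk = [
    ("2000_A", "1990_X", 1.0),
    ("2000_B", "1990_X", 1.0),
    ("2000_C", "1990_X", 0.25),
    ("2000_C", "1990_Y", 0.75),
]

# Hypothetical 2000 census counts (for example, total population) by 2000 tract.
pop_2000 = {"2000_A": 1000, "2000_B": 2000, "2000_C": 4000}

def to_1990_counts(counts_2000, rows):
    """Allocate each 2000 tract count to 1990 tracts in proportion to its share."""
    out = defaultdict(float)
    for tract_2000, tract_1990, share in rows:
        out[tract_1990] += counts_2000[tract_2000] * share
    return dict(out)

pop_1990 = to_1990_counts(pop_2000, crosswalk)
```

This kind of allocation suits counts. Rates, medians, and similar measures generally need a weighted average (or one of the more sophisticated approaches mentioned above) rather than a simple sum.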
HOUSEHOLD ECONOMY/IMPUTED INCOME DATA
QUESTION: I have a question about the faminc variable. There are 112 people with faminc==0; they are a screwy bunch; there are only originally about 39 missing faminc variables.
How does one explain a faminc of 0? How was this variable created? These 112 people are an influential bunch of numbers.
I've imputed them as missing in a NON-multilevel analysis, and then income works out as it's supposed to. Could they possibly originally be missing?
As noted in the documentation for IMPINC1/IMPINC1R (imputed_income.doc), faminc is the sum of famearn, alltrans, and astotinc.
In IMPINC1/IMPINC1R, there are 138 households with FAMINC=0. 68 of them did not do a household module (no_hhmod=1). So there are only 70 where faminc is zero and a household module was done.
Some of those 70 faminc=0 might well be true. Remember that this is family not household income. So if, say, the RSA has no spouse or children, and the RSA did not work in the previous year and has no transfer or asset income, then faminc could be legitimately zero. This RSA might be age 27 and living at home with his/her parents. Due to the definitions of who would be given the household module, such an RSA would get the household module and not his/her parents. Another example is an RSA who lives with friends and received no transfer or asset income or earned anything in the previous year. It's those faminc=0 cases where there is only the respondent (and their spouse/partner and children if relevant) in the household where the faminc=0 is a bit more suspect.
On the other hand, some of those with faminc=0 may be cases where the respondent said in the household module that they had no earnings or transfer income or asset income when in fact they did. In those cases, faminc=0 would be a nonresponse issue.
The IMPINC1/IMPINC1R data was created solely from the data collected in the household module. It did not look at earnings that might have been reported in the EHC since we don't have yearly earnings in the EHC. It seems that around 14 of the 70 faminc=0 cases had a job in the previous year based on the EHC. However, those 14 said in the household module that they did not earn anything in the previous year. Those might be cases to treat as "missing" faminc as opposed to a legitimate zero.
The public assistance data in the EHC was also not used since it provides no amounts. However, it could be used to see if the respondent said in the EHC that they received SSI, food stamps or public assistance in the previous year. If it appears from the EHC that this respondent did get public assistance in the previous year, then the faminc=0 case could be treated as "missing" as opposed to a true zero.
You'll need to do a little cross checking with the EHC to see which of the FAMINC=0 might well be missing as opposed to true zeros.
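One way to organize that cross-check is sketched below in Python. Only faminc and no_hhmod are names taken from the files discussed above; the two EHC flags (ehc_job_last_year, ehc_public_assist) are hypothetical indicators you would have to build yourself from the EHC data, and the labels are just one possible classification.

```python
# Hypothetical household-module records after merging in flags derived from the EHC.
households = [
    {"id": "A", "faminc": 0, "no_hhmod": 1, "ehc_job_last_year": False, "ehc_public_assist": False},
    {"id": "B", "faminc": 0, "no_hhmod": 0, "ehc_job_last_year": True, "ehc_public_assist": False},
    {"id": "C", "faminc": 0, "no_hhmod": 0, "ehc_job_last_year": False, "ehc_public_assist": False},
    {"id": "D", "faminc": 18000, "no_hhmod": 0, "ehc_job_last_year": True, "ehc_public_assist": False},
]

def classify_zero(rec):
    """Label a record according to how believable a zero family income is."""
    if rec["faminc"] != 0:
        return "nonzero"
    if rec["no_hhmod"] == 1:
        # No household module was done, so the zero carries no information.
        return "no household module"
    if rec["ehc_job_last_year"] or rec["ehc_public_assist"]:
        # The EHC contradicts the household module report of no income.
        return "treat as missing"
    return "plausible zero"

labels = {rec["id"]: classify_zero(rec) for rec in households}
```

Cases labeled "treat as missing" could then be set to missing before analysis, while "plausible zero" cases are left as reported.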
QUESTION: I am confused about the variable "Roster Age" (AGE_YR) in the HHINC module. Whose age is being reported here? I would assume it is the respondent to the survey, but this module is for household income and there are a number of responses indicating that the "Roster Age" is 0-4 years old (93), 5 - 9 years old (153), etc.
As mentioned in the codebook for the HHINC file, the file has one record for each person in the household who received a type of income listed in the A12-A13 and A16-A17 questions of the household questionnaire.
The PID variable refers to the person receiving the income. Thus the AGE_YR and SEX characteristics are for the person receiving income, as are the variables MARITAL, SPOUS_ID, SP_RA11, RB1, and RB2_1-RB2_6. They are not for the household module respondent.
The RESP_ID variable is the id of the household module respondent for the family to which the person receiving income belongs.
HHRF is the family indicator: it identifies which family (among those for which a household module was done) the person receiving income belongs to.
In the HHLD file, which has one record per household, the PID and PID_B variables are the ids of the household module respondents and the age/sex/etc. characteristics in the HHLD file are for the household module respondents.
QUESTION: I read the codebook and I guess my only source of confusion is how a child ages 0-5 receives any income? For example, I see two records that appear to indicate that children, one age 3 and one age 5, received Social Security Income (HA16D_= 1).
RESPONSE: Regarding young recipients of income in the HHINC file, we can only go by what respondents reported. Some respondents might have interpreted the question about who receives income of particular type to include all those in the family of the person getting the income. While the public assistance check or social security check may go to person A, the respondent may view it as the family of A receiving the money. There might also have been some interviewer error but when there's nothing to prove definitively that such error is the reason for an odd response, we don't change the response.
As you work with household survey data, you'll discover that respondents will give answers that are not what anyone expected. Some respondents will interpret questions differently than expected, others may be confused as to what is being asked but don't ask the interviewer for clarification. Others may be making up answers. Interviewer typos can also happen.
It's up to the analyst to decide how to deal with responses that seem unusual or contradictory.
QUESTION: My understanding is that the imputed income/assets file (IMPINC1R) contains one observation for each household in HHLD1 (even those for which income/assets values were not imputed). So if we merge the imputed values into the household data, we can replace the household income/assets values with the imputed values for all cases. Right?
Also, I cannot seem to locate an imputed "asset" measure; that is, assets that include housing value. I can locate the NONHASST. Do I need to add the house value onto the value of NONHASST to get an imputed composite "assets" measure?
You are correct regarding imputed assets. NONHASST is imputed non-housing assets. You would have to add imputed home value (C_HOUSE) to that for a composite measure. Unfortunately, home value was only asked as a categorical variable. So only a category was imputed. You'll have to figure out how to add a categorical value to NONHASST.
Remember that if the RSA is not the PCG or spouse of the PCG of the RSC, then a second household module was done with the PCG as the respondent (this was only a couple hundred households or so).
Thus, there is one record in IMPINC1R for each selected household module respondent (HHRF is the family number within the household) for households where more than just a roster was completed. If a household only had a roster done and no other module, we did not impute income and assets for them.
QUESTION: The imputed home value (C_HOUSE) is the market value of the house, right? If the individual doesn't fully own the house yet, this is equity plus debt. If we want to measure wealth, we need to subtract off the debt first. Do we have a measure of equity (housing wealth)? I couldn't find an imputed measure of equity or an imputed amount of debt remaining on the house in the new data, and also could not find non-imputed measures in the codebook.
Yes, C_HOUSE is the market value: it's what you think you'd get if you sold the house today. The variable PRINLEFT is the estimate of the amount of the mortgage that would be outstanding (assuming a 30-year fixed rate mortgage); it was calculated using a standard mortgage formula. It is discussed in the documentation for IMPINC1R.
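For readers unfamiliar with that kind of calculation, the Python sketch below shows the textbook formula for the balance remaining on a fixed-rate, fully amortizing loan with monthly payments. It is not a reproduction of how PRINLEFT was built: the function name, the inputs, and the interest rate in the example are assumptions, and the IMPINC1R documentation is the authority on what was actually used.

```python
def remaining_principal(loan_amount, annual_rate, payments_made, term_years=30):
    """Balance left on a fixed-rate, fully amortizing loan after some monthly payments."""
    monthly_rate = annual_rate / 12
    total_payments = term_years * 12
    if monthly_rate == 0:
        # With no interest, the principal is paid down in equal monthly steps.
        return loan_amount * (1 - payments_made / total_payments)
    growth_total = (1 + monthly_rate) ** total_payments
    growth_paid = (1 + monthly_rate) ** payments_made
    return loan_amount * (growth_total - growth_paid) / (growth_total - 1)

# Example with made-up inputs: a 200,000 loan at 7 percent, ten years into a 30-year term.
balance_after_10_years = remaining_principal(200000, 0.07, 120)
```

Home equity would then be home value minus the remaining principal. Because home value was collected and imputed only as a category, the analyst would first have to assign a dollar amount to each category.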
IDENTIFYING FAMILY MEMBERS
QUESTION: I am wondering if you can direct me to the variable in the public use data that lists the total number of people (adults and children) within a family? Is there a way to differentiate between adults and children?
In the ROSTER1 file there is variable called HHRF (it is described in the imputed income documentation file imputed_income.doc) that is the family id number within the household (SAMPLEID defines a household). All people in the same family in a household will have the same value of HHRF. You have to count up the number of people in the ROSTER1 file with the same combination of SAMPLEID and HHRF. The count for that SAMPLEID/HHRF combination is the number of people in the family.
If you look at the codebook for the roster file, you'll see that there is an age variable (AGE_YR) for all those listed on the roster. You can use it to identify those age 18 and over (adults) and those under 18 (children). You can then count up those who are adults that have the same SAMPLEID/HHRF combination and separately count up children with the same combination.
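The counting described above can be sketched in Python as follows. SAMPLEID, HHRF, and AGE_YR are the roster variables named in the response; the sample rows are invented, missing values are represented here as None, and people with a missing HHRF are skipped because they do not belong to a family with a household module.

```python
# Hypothetical roster rows, one per person listed on the household roster.
roster = [
    {"SAMPLEID": 101, "HHRF": 1, "AGE_YR": 34},
    {"SAMPLEID": 101, "HHRF": 1, "AGE_YR": 36},
    {"SAMPLEID": 101, "HHRF": 1, "AGE_YR": 7},
    {"SAMPLEID": 101, "HHRF": None, "AGE_YR": 60},
    {"SAMPLEID": 102, "HHRF": 1, "AGE_YR": 27},
]

def family_counts(rows):
    """Count total members, adults (18+), and children per SAMPLEID/HHRF combination."""
    counts = {}
    for row in rows:
        if row["HHRF"] is None:
            continue  # not in a family for which a household module was done
        key = (row["SAMPLEID"], row["HHRF"])
        fam = counts.setdefault(key, {"total": 0, "adults": 0, "children": 0})
        fam["total"] += 1
        if row["AGE_YR"] is None:
            continue  # age unknown: counted in the total only
        if row["AGE_YR"] >= 18:
            fam["adults"] += 1
        else:
            fam["children"] += 1
    return counts

family_sizes = family_counts(roster)
```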
QUESTION: I am using the public use LA FANS data and I am calculating the number of people in each family. I understand that the variable HHRF in the Roster1 file indicates if a person belongs to the family of the first or second household module respondent and that this field is left blank if a person is in neither family--does this mean that if the field is left blank that all of the respondents within a given household for which there is an empty field on this variable are then part of the same family?
Is there any way to get family income data on those individuals who are not a member of one of the nuclear families for which there is a household module?
Those in the roster with HHRF=. are not necessarily in the same family with each other. They are just people who are not in one of the nuclear families for which we have a household module. For those with HHRF=. in the roster, you'll have to look at the RA7_A, SPOUS_ID, RA21ID, RA25ID, and the like to see if you can determine how those individuals are related to one another and whether they are members of the same nuclear family.
Income information was only collected for those in the nuclear family of the RSA and, if the RSC was not in the RSA's family, the family of the RSC. So we do not have even individual income info for any of those with HHRF=. let alone the family they belong to.
QUESTION: Is there some way to use the roster files to identify whether adults in the household (other than the RSC's parents) are related to the RSC?
The roster ids of the siblings of the RSC are listed in the RA23ID and RA27ID variables (RA23 are sibs with same mother, RA27 are sibs with same father).
The mother and father ids, as noted below, are in RA21ID and RA25ID on the RSC's roster record.
As for anyone other than parents or sibs, you need to use the RA7_A (relationship to household head). You need to see how the RSC is related to the head and how others are related to the head to figure out how the RSC is related to the others. There's no simple way around it.
For example, if the RSC is the grandchild of the head, then you know that the household head and his/her spouse are grandparents, anyone who is the child of the household head (but is not the parent of the child) is an aunt or uncle, etc. Some relationships will be difficult to tell, but you can usually figure out if the person is a relative of some sort or not.
QUESTION: How can I determine whether the RSC's mother and/or father reside in the household?
On the PARENT1 (and CHILD1) data file, you'll find the variables BIOMOMID and BIODADID which are the roster ids of the child's parents if they are in the roster. The RSC is the one with KIDTYPE=RSC. If BIOMOMID has a value, then the mother is in the household; if the BIODADID has a value, the father is in the household.
You can also check the RSC's roster record. The variable RA20=1 if the mother is in the household and the variable RA21ID is the roster person number of the mother if she is in the household. The variable RA24=1 if the father is in the household and the variable RA25ID is the roster person number of the father.
QUESTION: What is the best way to limit analyses to households with children where the PCG is the mother. Do you happen to know what the sample size should be for this population?
On the PARENT1 (and CHILD1) data file, if BIOMOMID = PCG_ID then the PCG is the bio/adopt mother of the child. If you want to include stepmothers, you can include those where PCGNOPAR=0 (PCG is parent or spouse of parent) and PCG_SEX=2 (female).
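A minimal Python sketch of that rule is shown below. The variable names come from the response above; the sample record, the use of None for missing values, and the labels are assumptions. The "stepmother" label is an inference: a female PCG who is a parent or the spouse of a parent, but is not the bio/adopt mother.

```python
def pcg_mother_type(rec):
    """Classify the PCG's relationship to the child as a mother figure."""
    if rec["BIOMOMID"] is not None and rec["BIOMOMID"] == rec["PCG_ID"]:
        return "bio/adopt mother"
    if rec["PCGNOPAR"] == 0 and rec["PCG_SEX"] == 2:
        # Female PCG who is a parent or spouse of a parent, but not the bio/adopt mother.
        return "stepmother"
    return "other"

# Hypothetical record: the PCG (roster id 2) is the child's bio/adopt mother.
example_child = {"BIOMOMID": 2, "PCG_ID": 2, "PCGNOPAR": 0, "PCG_SEX": 2}
example_type = pcg_mother_type(example_child)
```

Restricting the analysis to records where the result is not "other" would keep households where the PCG is a mother or stepmother.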
QUESTION: I just wanted to verify a question concerning the nativity status of each child's father figure (if he is in the household). Is it correct that the only way to get this information (ie if he was US born or not) is if he was a RSA or the PCG. If he is neither of these then there is no way to ascertain his nativity status? I just want to check that I am not missing anything here.
You are correct. If the child's father figure is not an RSA or a PCG (i.e., someone who was given the adult module), then we do not know where the father figure was born.
The Parent module in Wave 2 will ask where a child's parents were born, but alas we did not ask that question in the Wave 1 parent module.
QUESTION: I have another question about the best way to identify single vs dual parent HHs. I was using a combination of MOMID, DADID, RSCDADRS and PE58 but noticed some discrepancies, particularly between RSCDADRS and PE58. Can you help me out with this issue?
I think you really want to look at the roster records for the RSC and SIB for the information that will better identify single vs dual parent households.
On the roster records for the RSC and SIB, RA20=1 if the bio/adopt mother is in the household and RA24=1 if the bio/adopt father is in the household. RA21ID is the mother's id in the roster, RA25ID is the father's id. You can check the resident mother's and/or father's roster records to see if they are greater than half-time residents (RA11=1) or not.
If your definition of "dual parent" includes step parents, if RA20=1 and RA24 ne 1 you can check to see if the child's resident mother has a spouse in the household. Similarly, if RA20 ne 1 and RA24=1, you can check to see if the child's resident father has a spouse in the household.
MOMID and DADID are really mother- and father-figures. So for kids with no bio/adopt parent in the household, they represent the IDs of those who are primarily responsible for the child's care.
Things like RSCDADRS and PE58 in the PARENT1 file are CAPI variables which may be inconsistent with the roster. The roster was cleaned after the survey, but CAPI variables originally generated during the interview were not always adjusted in every case to be consistent with changes made in roster cleaning.
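The roster-based approach described above might be sketched in Python like this. RA20, RA24, RA21ID, RA25ID, and SPOUS_ID are roster variables named in this FAQ, but the assumption that a resident spouse can be detected by finding SPOUS_ID among the household's roster ids is ours, so check it against the roster codebook. The half-time residency check (RA11) is left out for brevity.

```python
def parent_structure(child, roster_by_id, include_step=True):
    """Classify a child's household using the child's own roster record.

    roster_by_id maps roster person ids to roster records for the same household.
    """
    mom_in = child.get("RA20") == 1
    dad_in = child.get("RA24") == 1
    if mom_in and dad_in:
        return "two parents"
    if not mom_in and not dad_in:
        return "no parent"
    parent_id = child["RA21ID"] if mom_in else child["RA25ID"]
    parent = roster_by_id.get(parent_id, {})
    # Assumption: a spouse lives in the household if SPOUS_ID matches a roster id.
    if include_step and parent.get("SPOUS_ID") in roster_by_id:
        return "parent and spouse"
    return "single parent"

# Hypothetical household: persons 1 and 2 are spouses; person 3 has no spouse listed.
household = {1: {"SPOUS_ID": 2}, 2: {"SPOUS_ID": 1}, 3: {"SPOUS_ID": None}}
child_with_stepfather = {"RA20": 1, "RA24": 5, "RA21ID": 1, "RA25ID": None}
structure = parent_structure(child_with_stepfather, household)
```

Setting include_step=False gives a stricter definition in which only two bio/adopt parents count as a dual parent household.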
QUESTION: I am trying to figure out how to develop a variable for foster child status for all children 9-17 years of age. On roster question A7, the child's relationship to hhh is asked and foster child is a possible response. How would I use the right identifier(s) to link a foster child with PCG?
If the PCG is the household head or the spouse of the household head, then any child age 9-17 in the roster where A7 says "foster child" is a foster child of the PCG.
If the PCG is not the household head, then you need to see which children age 9-17 in the roster have RA33=2 (foster mother is like a mom--no bio/adopt mom in hh) and the PCG's id is listed in RA34ID. The PCG would be the foster mother of such children.
These are the instances where the term "foster mother" is specifically used and may thus imply a child fostered out by a government agency.
There may be other children who are cared for by the PCG but who are not formal foster children (such as children being raised by other relatives and not their parents). If you are interested in those children, you need to see for which children age 9-17 in the roster the PCG's id turns up in RA34ID or RA36ID and look at the relationships in RA33, RA35_A, and RA38_A.
QUESTION: I am trying to research grandparenting issues specific to Los Angeles. In searching through your survey, I found only two items related to grandparents: Where does child live: with grandparents.
I could not find any relevance in survey of primary caregiver or other areas. Have you collected any data pertinent to grandparenting in LA neighborhoods?
While the LAFANS did not ask specific questions about grandparents, one can discern information about them if they are involved in the child's care or the child resides or has resided with them recently.
One knows if the primary care giver (the PCG) is the grandparent (RKIDMOM="GRANDMOTHER" OR "GRANDPARENT"), and thus the PCG and parent modules, as well as the adult module and household module, give you information about that grandparent, the child they are raising and the household they all live in.
If the PCG is not the grandparent, but the grandparents provide childcare, then that will be covered in Section G of the Parent module for each child.
The roster will show if the grandparents of the child are in the household, although you will need to review the "relationship to household head codes" and the like to determine that. There is no specific question about who the given child lives with in the current household. In the roster, though, for children age 0-17, we did ask who the mother figure is for the child if the biological/adopted mother is not in the household, and also for the father figure. Those questions (RA32-RA42) also help show if a grandparent is responsible for a child.
The questions you may have seen about "who the child lives with" from the parent module relate to places the child lives or has lived other than the current residence. PB4 is for other places the child regularly stays besides the current residence and PC8 refers to where the child lived before moving in with the PCG. In section J of the parent module, we asked about siblings of the child if the mother was not the PCG and who those siblings lived with if not in the current household.
Indirectly there is some information about grandparents obtained from the parents of the child if the parent was given an adult module. Such info covers where the grandparents were born, if they are still alive and the like. See the adult module section C for details.
Thus there is information related to grandparenting in the LAFANS. You just have to work a little to pull it all together.
QUESTION: I am having some difficulty trying to figure out how to calculate number of siblings (sibship size). I see in the roster1.dta file that ra22_b is a good start. I see the note that only the first child of the bio mother is given a value and the others are not measured, but now I'm not sure how to calculate number of biological siblings. I guess I should also technically include half-siblings that share the same bio mom but not bio dad, and same bio dad but not bio mom. Do you have some advice? Right now I am working with a subsample dataset of just RSCs and SIBs, but perhaps I need to make this calculation in the roster1.dta file and then merge the sibship size variable into my dataset.
You do need to use all children in the roster to locate all siblings and not just the RSC and SIB. Remember that the roster only gives you co-resident siblings, and that only works best for siblings under age 18 since we did not ask about who in the household is the mother or father of those age 18 and up.
What you really want to do is look at the birth histories for the PCG to determine the number of children born as these will be blood siblings of the RSC and SIB. For adopted siblings, you need to look at the PCG's adoptive child history.
If the PCG is the mother of the RSC, then you can find the number of siblings ever born to the PCG in the birth history section (Section E) of the Adult module data. That list of children ever born will include the RSC. These would be siblings of the RSC who have the PCG as their mother.
If the PCG is not the mother of the RSC, then you need to check Section J of the parent module data for the RSC. Section J is where we asked about the siblings of the RSC if the PCG is not the mother of the RSC.
If the PCG is the mother of the RSC and the PCG did not do an adult module, then you only know of the siblings that live in the household with the RSC, but won't know about siblings that live elsewhere or may have died. If the father of the RSC did an adult module, you might be able to get siblings from his fertility section, although you'd need to check his marital history dates and the birthdate of the RSC to check for sibs of the RSC with the same mother.
Remember we defined siblings as having the same mother for purposes of selecting a SIB to interview.
A check of the marital/relationship history in Section E of the adult module against the PCG's birth history will isolate the number of full sibs and half-sibs of the RSC with a different father.
If the father of the RSC was selected as the RSA, then you could also check the children ever born for the father against those of the mother (the PCG of the RSC in this case), both found in their adult module data, to see where there are full siblings. This would also show you any half-sibs with a different mother but the same father as the RSC.
Obviously the logic that applies for the RSC can be used to get SIB siblings as well.
QUESTION: I'm using the public use data and I had a couple of questions on some of the scales that are used:
1) In the primary caregiver questionnaire items B1 a though g--is this a scale that has been used in other studies or was it created specifically for LA FANS? What specific construct are these questions measuring? Is there any information on psychometric properties for this scale?
2) And then I had the same questions for the child questionnaire items A2 through A5
As noted in the PCG individual file codebook, questions GB1A to GB1G are related to the Pearlin Self-Efficacy Scale. The main LAFANS codebook, in the section describing the questionnaire modules, gives the citation for the Pearlin Scale (“The Stress Process” in the Journal of Health and Social Behavior, 22:337-356 by Pearlin et al.).
As for the Child questionnaire items A2-A5, these are questions about the neighborhood, based, in part, on previous surveys. If you are interested in the origin of individual variables, see the “Origin of the L.A.FANS Questions”.
QUESTION: A quick question on questions ph71a-ph71dd on the parent questionnaire. I have looked at past work that uses the child behavior checklist to create validated clinical constructs for various behaviors (e.g. internalizing/externalizing score, anxious depressed score etc). There are also published clinical cutoffs for these conditions based on established t scores. But all of these are created from the entire child behavior checklist. LAFANS only asked a select number of questions from the child behavior checklist. I was wondering if an abbreviated instrument exists that the LAFANS team used to choose which particular questions it asked? Any advice on how to go about creating these constructs without having all of the questions?
What we used is the Behavior Problems Index (BPI) created by Zill and Peterson. They created a subset of questions in the CBCL for use in large scale surveys. If you look on page 86 of the codebook, you'll find a full description plus reference for the BPI.
You can also look at appendix B of this online report which used the BPI from L.A.FANS:
Los Angeles County Young Children's Literacy Experiences, Emotional Well-Being and Skills Acquisition: Results from the Los Angeles Family and Neighborhood Survey — 2003
Sandraluz Lara-Cinisomo, Anne R. Pebley
QUESTION: Could you provide some guidance as to whether the items in the neighborhood observations form specific scales. If so, what evidence is there of their reliability and do they relate to adult and children outcomes?
As described in the codebook, most of these items are drawn from the Project on Human Development in Chicago Neighborhoods (PHDCN). In addition to the codebook, you should also see the link on the LAFANS website that describes where each item in the survey comes from. Specifically, we used neighborhood items from R. J. Sampson et al. (1999) "Beyond Social Capital..." in the American Journal of Sociology. See their work for psychometrics, etc.
QUESTION: I would like to identify individuals who are currently uninsured
You should use the EHC Health Insurance data for those who have an HEYEAR1='cont'. For those that do not have EHC data, or for whom the type of insurance is missing for HEYEAR1='cont', then you can use the RB6 health insurance report from the Roster. Always use the EHC if you have it for health insurance since it is asked directly of the respondent. The RB6 is used only to fill in as needed since if the adult respondent was not also the roster respondent, RB6 is not self-reported.
QUESTION: In some cases in the EHC data there are overlapping dates between 2 homes - does this mean the person lived in both homes during the overlapping period or is it likely bad data?
We know there were a few cases of people who reported more than one current residence, so it's possible that there could be some overlapping of past residences as well. Remember that our respondents are halftime or greater residents so some might have a second residence where they spend time and thus might report in the EHC.
It is always possible some overlaps on residences are due to misreported dates. Unfortunately, we have no way to tell.
If the overlap is a month or less, it may not be a real overlap and may just be an artifact of how the EHC recorded thirds of months when people could not give an exact date.
QUESTION: I have a question on the date variable EHCJ5 in the Adult Public Use Data. For this variable, which stands for the "date moved to current residence," I noticed that there were quite a few observations with a coding of 'EO 3.' I couldn't find anything in the codebooks that explained what this might be. I just wanted to make sure I wasn't missing something before assuming this was just incorrect info.
The "E0 3" is a bad date field generated by the CAPI program. We did not clean up EHCJ5, which was created by the CAPI program from the EHC data. Sometimes there were problems with the CAPI program pulling info from one module into another.
You can fill in the missing date for when the respondent moved to the current residence by looking at ASDATE1 in the EHC1 file for the current residence (the current residence is listed first in the EHC residential history section). Note that ASDATE is YYYY-MM-DD and EHCJ5 is MM-DD-YYYY in terms of how the dates are displayed.
Note that you only need to do this for those with HASEHC=1. There are records with HASEHC=0 and EHCJ5="E0 3" which can be ignored.
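A minimal Python sketch of that repair follows. It assumes the dates are available as strings in the display formats mentioned above (YYYY-MM-DD for ASDATE1, MM-DD-YYYY for EHCJ5) and that the adult record and the matching EHC record for the current residence have already been linked; the function name and record layout are hypothetical.

```python
from datetime import datetime

BAD_DATE = "E0 3"  # bad date value generated by the CAPI program

def move_in_date(adult, ehc_current_residence):
    """Return the move-in date as MM-DD-YYYY, repairing bad EHCJ5 values when possible."""
    if adult["EHCJ5"] != BAD_DATE:
        return adult["EHCJ5"]
    if adult["HASEHC"] != 1:
        return None  # no EHC data to fall back on; these records can be ignored
    parsed = datetime.strptime(ehc_current_residence["ASDATE1"], "%Y-%m-%d")
    return parsed.strftime("%m-%d-%Y")

# Hypothetical example: a bad EHCJ5 value repaired from the EHC residence record.
fixed_date = move_in_date({"EHCJ5": BAD_DATE, "HASEHC": 1}, {"ASDATE1": "1998-03-27"})
```

If the stored values turn out to be numeric dates rather than strings in your copy of the data, the parsing step would need to change accordingly.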
QUESTION: I was wondering how some people who have hasehc==0 (no EHC info.) have date values for the ehcj5 variable since that variable was created from the EHC info. Was their date of current residence gathered some other way and included as an ehcj5 value?
Adult respondents who don't have EHC data but who did not break off at or before the EHC module are people for whom the EHC data was lost due to a computer problem after the interview was completed. Remember that the EHC is a different program than that used for the rest of the adult interview. For those people, they would have things like EHCJ5 as those would have been filled in from the EHC at the time of the interview by CAPI. It was when the case was downloaded from the laptop after the interview was completed where the loss of the collected EHC data occurred.
QUESTION: In the LAFANS-2 Adult module questionnaire, there is no skip listed for AC41=5 (no, do not have a visa, etc.). Do they get asked the AC42 to AC48 questions about visas and other documents?
No, those with AC41=5 were skipped to AC49 in the CAPI program, which is what was supposed to happen. There is just a simple typo in the L.A.FANS-2 Adult questionnaire where the note on the skip was inadvertently omitted.
QUESTION: I've been compiling a longitudinal dataset based on 1055 RSCs and SIBs, aged 11 or over, who completed the Child Module at LAFANS1 (W1). I'm looking now at the data for those respondents who were interviewed at LAFANS2 (W2). Out of the 1055, I've identified 413 who had "aged-up" (age >=18) and completed an Adult Module at W2. I am starting to look at the W2 education-related information (Section D) for these 413 cases. My work involves looking at W2 education reports in comparison to W1 reports. I am running into some significant problems.
I've constructed a data file that incorporates the W2 adult data (for those who responded to it) into the W1 child data. It is confusing because some of the item numbers and skip patterns in the codebook and datafile do not clearly match those in the questionnaire. The most troubling issue is: Although I have data to verify that these 413 cases are panel respondents (a mix of RSCs and SIBs from W1), I am encountering data that are completely missing on items on which I would expect there to be data, and data are present on items that I would expect to not have data. As an example:
For these 413 cases:
AD1A: (CAPI check) Respondent Was Not Interviewed at W1?
Yes (1) = 413
AD1: (CAPI check) Respondent born in USA
Yes (1) = 332
No (5) = 81
AD2-AD10 do appear to have the correct # of respondents (#s are based on the N=81 from AD1 and are reduced further based on response to AD2).
Based on info in the questionnaire, I was expecting only those respondents who had not completed a W1 interview (and who completed some/all schooling in US) to have valid data on AD11. That is not the case as all 413 have a value in AD11.
In contrast, based on info about my sample and the questionnaire, I was expecting ALL of my respondents to have something recorded on AD26 — the item that marks the start of questions confirming (or correcting) educational attainment and status reported at W1 (AD26-AD41). But, that is not the case as AD26-AD41 are missing for all 413 respondents.
Could you please help me understand what is going on? Am I missing something or misunderstanding information that was presented in the codebook or elsewhere?
As you are discovering, the W2 adult module mostly had preloaded information for those who did the W1 adult module. W1 info for the aged up RSC/SIB is largely not in the W2 adult module because preloaded info was mainly pulled from the W1 adult module, which those kids obviously did not do.
The wording for the "2" response on AD1A in the questionnaire has a typo in that it should not have mentioned RSCs or SIBs. What AD1A really is, as confirmed by checking the CAPI coding, is an indicator for whether the person did the W1 ADULT module. That is what it was always supposed to be, so the CAPI did what we had intended.
RSCs and SIBs did not do the adult module in W1, so they would have AD1A=1 (yes, did not do W1 adult module). Thus they would be asked the education questions that immediately follow. This was much simpler than trying to mirror W1 education preloaded info by creating it from W1 child and parent module data and then use it to ask them AD26-AD58.
So all RSC18+ and SIB18+ would do AD1-AD22 and then skip to AD62 where they are then asked the detailed questions about schooling since W1.
The W1_AD19 only applies to those who did the W1 adult module as does the W1_SCHCOMP since those variables were pulled from W1 adult module data. You would have to create something similar for those RSC18+ and SIB18+ using data from their W1 child and parent modules.
QUESTION: This regards an error I found in the L.A. FANS adult data. It seems that the individual with hhid=28659 and pid=1 has a mistake in the adult public use data set. Although their interview date was 10/1/00, and their EHC start date is thus 10/1/98, the variables "bwmo", "bwday" & "bwyear" were coded as 3/27/98. I discovered it when I found an error in a variable I calculated using both the bw variables and the interview date variables.
Case HHID=28659 is one where the person was interviewed twice--probably to finish up an interview. The FI redid the adult module for the person some months later. So the ADATE reflects the second interview as CAPI used it to overwrite the first date. However, the BWMO/BWDAY/BWYEAR, which are fields created by the CAPI program, were apparently set at the first interview and not reset at the second by the CAPI program. So, the BWMO/BWDAY/BWYEAR represent what the CAPI program was using for the start date of the 2 year interval in the Adult module at the time of the later interview. Thus the two year window might be a bit longer depending on when a section using the 2-year window was done.
Needless to say, the CAPI program was not designed to deal with a module interview being stretched out over a long period of time. The assumption is that a module will be done all in one day or within a few days of when the interview started. No one assumed there'd be a case where it might take 6 months.
I wouldn't worry about the possibility that the 2-year window may be 2.5 years for this one case. It's likely not to have made any difference unless the person had an event change in the extra part of the window (i.e., 3/27/98 to 10/1/98). Also, since we have no way to know for sure if any of the dates a respondent reported are when something actually happened (people don't always remember dates that well), there's already a certain amount of error that naturally exists in any computed intervals from recall data.
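For anyone who wants to see how much longer the window is for this case, the dates quoted above can be compared directly. This is only a rough Python sketch: the function name is made up for illustration, and the "nominal" start date is simply the 10/1/98 EHC start date mentioned in the question.

```python
from datetime import date

def lookback_days(window_start, interview):
    """Days from the CAPI window start (BWMO/BWDAY/BWYEAR) to the interview date (ADATE)."""
    return (interview - window_start).days

# Dates quoted in the question for HHID=28659, PID=1.
bw_start = date(1998, 3, 27)       # BWMO/BWDAY/BWYEAR as stored (set at the first interview)
adate = date(2000, 10, 1)          # ADATE, overwritten at the second interview
nominal_start = date(1998, 10, 1)  # start of an exact two-year window before ADATE

actual = lookback_days(bw_start, adate)
nominal = lookback_days(nominal_start, adate)
extra = actual - nominal           # days by which the window is longer than two years

print(f"actual window: {actual} days, nominal: {nominal} days, extra: {extra} days")
```

The extra days are the only part of the window where an event could be counted that a normal two-year window would have excluded.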
QUESTION: There are discrepancies in the numbers of foreign-born and US-born adult respondents, so I am unsure which to use. As far as I can tell, there are 3 variables in the public use data set regarding country of birth, AC34_CYR, CAPI check AC34_4, and AD1. The total number of US-born vs. foreign-born in each of these variables contradict each other.
I am also curious about the numbers of legitimate skips appearing in these questions... In what situation would question C34 and D1 in the adult module be skipped?
With regard to AC34 and AD1, there were some problems in the CAPI program and with FI errors. We tried to clean those up as best we could, but sometimes we did not catch everything. Note that AC34_STR may have a value for California or a US region, but AC34_CYR is missing. AC34_4 is based on the responses to both AC34_STR and AC34_CYR, not on AC34_CYR alone. AD1 had various problems as noted in the adult1 codebook. It has people with AD1=. who have AC34 info that denotes native vs foreign birth. We did not always go in and clean up missings or DK responses because we had info from elsewhere that suggested what the value really should be. If we did not have confirming info from another source, we often left inconsistencies for analysts to decide. As noted in the main codebook intro materials, we tried not to "over clean" the data. That results in these sorts of little discrepancies.
The term "legitimate skip" is used for blank values. 99 percent of the time, blanks mean the person was deliberately not asked that question. However, there were CAPI and FI problems that resulted in blanks where questions should have been asked. We tried to assign -5 values to such cases, but did not always catch all of them. AD1 as noted above had a lot of problems, so it's not surprising that it would show more discrepancies. There are 3 people who have AC34_4=1 but AD1=. and who are also blank on AD2-AD10, which is expected for the native born. Setting AD1=1 for them was just missed.
Basically, AC34_4 is the best measure for actually being born in the US. We did try to clean that up so it was correct as best as we could tell.
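If you want AD1 to agree with AC34_4 for the handful of cases described above, something along these lines would do it. This is a sketch only: the records are made-up examples, blanks (".") are represented as None, and AD1_FILLED is a hypothetical flag added so the change stays visible.

```python
def fill_missed_ad1(records):
    """Return copies of the records with AD1 set to 1 where AC34_4 == 1 and AD1 is blank."""
    fixed = []
    for rec in records:
        rec = dict(rec)  # copy so the original records are left untouched
        missed = rec.get("AC34_4") == 1 and rec.get("AD1") is None
        if missed:
            rec["AD1"] = 1
        rec["AD1_FILLED"] = missed  # hypothetical flag, not an L.A.FANS variable
        fixed.append(rec)
    return fixed

# Made-up example records (None stands for a blank value).
records = [
    {"PID": 1, "AC34_4": 1, "AD1": None},    # US-born per AC34_4, AD1 was missed
    {"PID": 2, "AC34_4": 1, "AD1": 1},       # already consistent
    {"PID": 3, "AC34_4": None, "AD1": 5},    # AD1 says not born in USA; left alone
]

filled = fill_missed_ad1(records)
for rec in filled:
    print(rec)
```

Only the AC34_4=1 cases are touched here because that is the only correction described above; other inconsistencies are left for the analyst to decide.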
QUESTION: I have a set of questions in regard to the sibling questions (G7-G19) of the adult module. What I would like to use is the educational attainment, gender, and age of the sibling closest in age to the respondent, asked about in questions G14-16. However, there are 1,430 legitimate skips in these 3 questions that I cannot explain (I have also included the distribution of these 3 variables from the adult dataset in the attached file). This number of skips does not match the numbers of respondents who reported having no brothers or sisters in question G7 (N=248), or those who reported that all their brothers and sisters are no longer living in question G8 (N=693). Do you know why 1,430 skipped these questions? Were they able to opt out in some way other than reporting they don't have a sibling, refusing to answer, or reporting "don't know"? Since I aim to use these data, I need to be clear what my denominator is and why.
I am also unclear which sibling the respondents are being asked about in questions G10-13. Only 448 respondents are answering these questions, and I cannot figure out from the questionnaire on the LA FANS website or the codebook which of the previously reported siblings in questions G7-9 is being asked about in G10-13. Again, I have included the distributions of these variables from the adult dataset in the attached file to help you recall.
Regarding Section G, if you have read the questionnaire, remember that only RSAs were given Section G. Thus any PCG only or emancipated minor would not do Section G. However, initially, the FIs were giving Section G to PCGonlys/EMs, so before that was corrected, 309 PCGs got Section G. So there are 629 PCGonlys/EMs who actually did not get Section G. There were around 40 or so people who started the Adult module but did not make it through to Section G. Remember that aside from those who said they had no siblings (AG7=0) or had no siblings who are still alive (AG9=1), those who have AG13>. will also skip to AG19 (i.e., those with only one living sibling). Accounting for all of those gets a number close to 1400.
So, basically, the responses in AG14-AG16 are for all the people who were supposed to get those questions. I think you just forgot to check all the skip patterns in Section G.
The questions in G10-G13, if you read the questionnaire, are for those with only one living sibling.
The LAFANS individual codebooks are not meant to supplant the LAFANS questionnaires since the codebooks do not have the actual questions. I think you'll find it helpful to use the questionnaires in conjunction with the codebooks in figuring out response patterns.
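To sort out a denominator for AG14-AG16, it can help to tag each blank with the reason it is expected to be blank. The sketch below follows the skip reasons listed above. It assumes blanks are stored as None, and GOT_SECTION_G is a hypothetical flag you would have to build yourself (false for PCG-only/EM respondents who were not given Section G and for break-offs before Section G).

```python
def ag14_blank_reason(rec):
    """Return why AG14-AG16 are expected to be blank, or None if they should have been asked."""
    if not rec.get("GOT_SECTION_G"):   # hypothetical flag, not an L.A.FANS variable
        return "not given Section G"
    if rec.get("AG7") == 0:
        return "no siblings"
    if rec.get("AG9") == 1:
        return "no living siblings"
    if rec.get("AG13") is not None:    # AG13 > . : only one living sibling, G10-G13 asked
        return "one living sibling"
    return None

# Made-up example records (None stands for a blank value).
examples = [
    {"GOT_SECTION_G": False},
    {"GOT_SECTION_G": True, "AG7": 0},
    {"GOT_SECTION_G": True, "AG7": 3, "AG9": 1},
    {"GOT_SECTION_G": True, "AG7": 1, "AG9": 5, "AG13": 2},
    {"GOT_SECTION_G": True, "AG7": 4, "AG9": 5, "AG13": None},
]

reasons = [ag14_blank_reason(rec) for rec in examples]
print(reasons)
```

Respondents for whom the function returns None are the ones who should have answers in AG14-AG16.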
QUESTION: How can I tell if the spouse of the Adult respondent is an RSA or PCG or neither?
The easiest way to know whether the spouse was the RSA or PCG or neither is to merge the RRSAi and RPCGi variables from ROSTHH1 onto ADULT1 and compare them to SPOUS_ID.
To know if the spouse is in the household, look at SP_RA11 (1 means FT hhld member, 2 means PT hhld member, . means no spouse in roster). To know if the spouse did an adult, pcg or parent module, use the SP_ADLT, SP_PCG and SP_PAR variables in ADULT1.
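The comparison described above might look roughly like this once the roster values have been merged on by household. This is a sketch under assumptions: it treats the roster values as the person ids of the household's RSA and PCG (check the ROSTHH1 codebook), the keys RSA_ID and PCG_ID are placeholders for whatever the merged variables are called, and the data are made up.

```python
def spouse_role(spous_id, rsa_id, pcg_id):
    """Compare the respondent's SPOUS_ID with the household's RSA and PCG person ids."""
    if spous_id is None:
        return "no spouse in roster"
    roles = []
    if spous_id == rsa_id:
        roles.append("RSA")
    if spous_id == pcg_id:
        roles.append("PCG")
    return "/".join(roles) if roles else "neither"

# Made-up household-level roster values keyed by HHID (placeholder key names).
roster = {
    101: {"RSA_ID": 1, "PCG_ID": 2},
    102: {"RSA_ID": 1, "PCG_ID": 1},
}

# Made-up adult records (None stands for a blank SPOUS_ID).
adults = [
    {"HHID": 101, "PID": 1, "SPOUS_ID": 2},
    {"HHID": 101, "PID": 2, "SPOUS_ID": 1},
    {"HHID": 102, "PID": 1, "SPOUS_ID": 3},
    {"HHID": 102, "PID": 4, "SPOUS_ID": None},
]

roles = []
for adult in adults:
    hh = roster[adult["HHID"]]  # the "merge" step: look up the household's roster values
    roles.append(spouse_role(adult["SPOUS_ID"], hh["RSA_ID"], hh["PCG_ID"]))
print(roles)
```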
QUESTION: I wanted to know the immigration status (type of visa etc) and health insurance status of the RSA's spouse but could not find any information except for current health insurance (yes or no) from the household roster. Is this the only information in LAFANS about the spouse's health insurance and immigration or am I missing something?
If the spouse of the RSA also was given the adult module (i.e., the spouse is the PCG or is one of a few cases where the spouse inadvertently was given the adult module), you can get that immigration information from the adult module record for them, and health insurance coverage can also be found in the EHC.
However, if the RSA's spouse was not interviewed, you don't know anything about the spouse's immigration status. With regard to health insurance info other than the roster, you might be able to make some assumptions from the type of coverage mentioned in the EHC by the RSA.
Unfortunately, no specific questions on immigration or any detailed info on health insurance was asked of RSA spouses who were not also adult module respondents.
In cases where the RSA is a married male with children under 18, you'll likely have detailed info on the spouse because she's usually the PCG and thus has an adult module. However, when the RSA is a married female, then you have more limited info on her husband. Similarly, married RSA males without children will also have limited info on their wives.
QUESTION: Is there a way to differentiate between different Latino groups e.g. Mexican Americans and Puerto Ricans?
If you look at the Adult module questionnaire and codebook, you'll see that at question C27 we asked Adult respondents who said they were Latino to tell us if they were Mexican, Central American, Puerto Rican, Cuban, or some other Latin American group. We only asked this question of Adult respondents (i.e., RSAs and PCGs).
In the public use data, we combined Puerto Rican and Cuban to address privacy concerns. The Restricted Version 1 data contains the original more detailed C27 response.
QUESTION: I have a question about the non-response codes. I know from the codebooks (online at http://www.rand.org/pubs/drafts/DRU2400.2-1/ on p. 63, and also the codebooks sent to me) that "-5" is the code used for missing data (where a question should have been asked but was not) and that "." is a legitimate skip code.
But my question is that for some variables, it looks as if they should not have been skipped, yet there are "." values for these variables.
For example, I am looking at RSA's (both RSA only's and RSA/PCG), which is a total sample of n=2,623. I was looking specifically at variable am5 (which asks "Do you smoke cigarettes?"). From what I can tell from the questionnaire and the codebooks, it looks like this question should have been asked of all RSA's, so the frequencies should add up to 2,623. But the total is 2,547, so it looks like 76 are missing (2623 minus 2547=76). So I was wondering why those weren't coded "-5" rather than "." Similarly for some other variables, I see that some are legitimate skips based on the skip pattern, but others seem to be missing data and are coded "." rather than "-5". I see though that with other questions, there are "-5" codes.
Can you provide any clarification on that?
The 76 with AM5=. in your example are adult respondents who never got to the AM5 question due to a breakoff. They all have ADLTCOMP=0. Those with ADLTCOMP=0 were not given -5 missing codes because they never got to those questions and thus were treated like those with ADLTCOMP=1 who legitimately skipped those questions.
The -5 missing code is used for people who actually got to the relevant section of the questionnaire but due to some FI or CAPI problem, a question they should have been asked was accidentally skipped.
It's possible that there may be a few cases where we missed assigning a -5 code to such problems as we had to do these ex post, and as you know there are a lot of variables and a lot of skip patterns in the LAFANS.
Depending on what you're doing, you may want to drop those with ADLTCOMP=0 to avoid such confusion.
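The distinction above can be written down as a small rule, which may help when tabulating blanks. This is an illustrative sketch only: blanks (".") are represented as None, and the category labels are made up for this example.

```python
def classify_am5(am5, adltcomp):
    """Label an AM5 value using the -5 / blank / ADLTCOMP logic described above."""
    if am5 == -5:
        return "missing: should have been asked"
    if am5 is None:
        # Blank: a breakoff if the module was not completed, otherwise a skip
        # (or, rarely, a problem that was not caught and recoded to -5).
        return "breakoff" if adltcomp == 0 else "blank in completed module"
    return "answered"

# Counts quoted in the question: 2,623 RSAs, 2,547 with a value on AM5.
expected_n = 2623
observed_n = 2547
blank_n = expected_n - observed_n  # the respondents with AM5=.

labels = [
    classify_am5(1, 1),
    classify_am5(None, 0),
    classify_am5(None, 1),
    classify_am5(-5, 1),
]
print(blank_n, labels)
```

Dropping the records labeled "breakoff" is equivalent to dropping those with ADLTCOMP=0 and a blank on the item.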
QUESTION: I'm using the parenting style variables from the parent-child module (e.g., ph1-ph10_s) and I'm wondering what the rationale was behind categorizing the age groups as 1-2, 3-5, 6-9 and 10-15?
These are variables that come from the HOME scale and were modified by Jeanne Brooks-Gunn and one of her post-docs for use in LAFANS. The point of asking slightly different questions at different ages is that, given that children change with age, some questions only become appropriate at some ages. For example, you can imagine a parent "putting" a child in his/her room as punishment for a young child, but "sending" an older child to his/her room. Other examples include discussing TV programs with children -- probably something parents would do more with older than younger kids.
Many of the measures are identical across age groups, but not all.
QUESTION: In the parent questionnaire there is a CAPI check (PF1) that determines if the child falls into three age groups (3 or younger, 4-6, 7+). This then determines if the PCG was asked questions about the child's schooling. I have found quite a few cases (more than 10) where the child is coded as being in the first category (3 years or younger) and therefore is not asked any of the schooling questions even though their age (RSC_AGE or SIB_AGE) indicates that they should have fallen in the 7+ years category. Has this been identified as a problem already?
The age variable used by CAPI in the Parent module to determine skips was PAGE, the age of the child based on the reported birthdate, and not RSC_AGE/SIB_AGE which are from the roster. As you know the age in the roster might differ because the roster respondent is not the child's PCG and does not know the exact age or because the roster respondent rounded the age.
PF1 and PAGE are in agreement.
If PAGE was missing (PAGE=.) because no birthdate was reported, then the CAPI code treated PAGE as being "less than 4" and put them in the first PF1 category (PF1=1). The CAPI code unfortunately did not pull in the age from the roster to use if birthdate was not collected. So, you'll have some kids with PF1=1 who have a roster age in RSC_AGE or SIB_AGE (these were variables added by us to the Parent file just before the data was released) that is greater than 3.
There were minor problems with CAPI check variables using missing values when codes were generated by CAPI. The code did not always adequately account for such instances. Also, ex post, some variables originally used by the CAPI program were revised later after fieldwork, and the CAPI check variables using them were not always adjusted for those changes.
Thus, one should always be aware that CAPI check variables especially may have some inconsistencies.
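To find the affected children, you can mimic the CAPI behavior described above and flag the cases where a missing PAGE sits alongside a roster age above 3. This is a sketch, not the actual CAPI code: PF1=1 for the youngest group is taken from the answer above, the codes 2 and 3 for the other two groups are assumed, ages are taken to be whole years, and blanks are represented as None.

```python
def pf1_from_page(page):
    """Mimic the CAPI check: a missing PAGE is treated as 'less than 4'."""
    if page is None or page < 4:
        return 1   # 3 or younger (or PAGE missing)
    if page <= 6:
        return 2   # ages 4-6 (assumed code)
    return 3       # age 7+ (assumed code)

def roster_age_conflict(pf1, page, roster_age):
    """True when PF1=1 only because PAGE was missing, yet the roster age is above 3."""
    return pf1 == 1 and page is None and roster_age is not None and roster_age > 3

# Made-up examples: (PAGE, roster age from RSC_AGE or SIB_AGE).
cases = [(2, 2), (5, 5), (9, 8), (None, 10), (None, 2)]
results = [(pf1_from_page(p), roster_age_conflict(pf1_from_page(p), p, a)) for p, a in cases]
print(results)
```

For the flagged cases, the schooling questions were skipped by CAPI, so those items will be blank regardless of the child's roster age.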
QUESTION: In the PCG data, is there any variable that, if the PCG is the adoptive OR biological parent, indicates whether the PCG is adoptive or biological?
If you look at the PARENT module data (which has one record for the RSC and one for the SIB if present), you'll see there is a variable on that file called PGPNOPAR that equals 0 if the PCG is the bio/adopt parent of the given child in the Parent module and equals 1 if the PCG is not the bio/adopt parent.
Also, if you look at the roster, you'll see that for children in the roster (including the RSC and SIB), the variables RA21ID and RA25ID contain the person id of the bio/adoptive parent of children. If the person id of the PCG equals RA21ID or RA25ID then the PCG is the bio/adoptive parent.
It pays to review the entire LAFANS set of modules since information of interest may be found in numerous places and not always in the module you're presently looking at.
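The roster comparison described above amounts to a simple id match. Here is a minimal sketch, assuming the PCG's person id and the child's RA21ID and RA25ID have already been brought together on one record, with blanks represented as None and made-up ids.

```python
def pcg_is_bio_adopt_parent(pcg_pid, ra21id, ra25id):
    """True if the PCG's person id matches either parent id recorded for the child."""
    if pcg_pid is None:
        return False
    return pcg_pid in (ra21id, ra25id)

# Made-up examples: (PCG person id, RA21ID, RA25ID).
checks = [
    pcg_is_bio_adopt_parent(2, 2, 1),        # PCG matches RA21ID
    pcg_is_bio_adopt_parent(1, 2, 1),        # PCG matches RA25ID
    pcg_is_bio_adopt_parent(5, 2, 1),        # PCG is someone else, e.g. a grandparent
    pcg_is_bio_adopt_parent(5, None, None),  # no parent ids recorded in the roster
]
print(checks)
```

A result of True should line up with PGPNOPAR=0 on the Parent file, so the two sources can be used to cross-check each other.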
QUESTION: A quick question regarding the skip pattern for some of the questions on the parent questionnaire. If the RSC was between the ages of 1-15 then questions ph3-ph71a were not asked in reference to the SIB. Is this correct? And from what I read in the codebook, the reasoning behind this was that the answers to these questions would be the same for both children, so they were only asked in reference to the RSC?
The idea was to ask these questions only once per family -- i.e., we did not have the time nor did we want the respondents to have the burden of answering these questions twice if there were two children in the household. As the codebook says, these questions are part of the "HOME Inventory" designed to measure the home environment. To reduce respondent burden, we decided that measures of this environment for one child could be used as proxies for the home environment in general.
However, the questions were originally designed only for children ages 1 to 15. The effect of the skips in H1 and H2 is that if the RSC is 1 to 15, the questions get asked for the RSC. Remember that the PCG completes the Parent Qx for the RSC first. Then if these questions were not completed for the RSC, there is a SIB, and the SIB is 1 to 15 years old, the questions are completed for the SIB. Since the RSC is selected randomly this means that these questions are essentially asked for one randomly selected child per household.
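The effect of the H1 and H2 skips can be summarized as a rule for which child the HOME questions refer to. The sketch below is a simplification: it uses age eligibility as a stand-in for "the questions were completed for the RSC", and it represents a missing child or missing age as None.

```python
def home_inventory_child(rsc_age, sib_age):
    """Return which child the HOME questions are asked about: 'RSC', 'SIB', or None."""
    def eligible(age):
        return age is not None and 1 <= age <= 15

    if eligible(rsc_age):   # Parent Qx is completed for the RSC first
        return "RSC"
    if eligible(sib_age):   # otherwise asked for the SIB, if there is one aged 1 to 15
        return "SIB"
    return None             # no child aged 1 to 15, so the questions are not asked

# Made-up examples: (RSC age, SIB age or None if no SIB).
targets = [
    home_inventory_child(10, 12),
    home_inventory_child(16, 12),
    home_inventory_child(0, None),
    home_inventory_child(17, 16),
]
print(targets)
```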
QUESTION: For the variables PC7 and PC11 it looks like info. was collected on other residential locations where children lived if they didn't live with their parents (the PCG) for the 2 yrs prior to interview - but it looks like these locations were not geocoded and assigned census tracts - is this correct? I use the locations to assign air monitors so I just want to make sure these are ones that I will not be able to assign exposures to.
PC7 and PC11 were inadvertently missed in geocoding addresses. However, there were only 12 PC7 addresses where geocoding could have been attempted (out of 52) and only 3 PC11 addresses (out of 12) because we were only geocoding LA area addresses and we needed a valid street address. Thus the decision was to not go back and try to geocode those few once we discovered the omission.
QUESTION: In the day care data, it looks like the 3rd non-relative provider and 3rd day care center provider locations did not get geocoded - is this because there were no subjects with more than 2 non-relative providers or day care center providers or were the data just not mapped for this?
There were no third non-relative providers or third daycare center providers to geocode.
QUESTION: I would greatly appreciate it if you could please help me with a couple of questions about the assessment file.
It looks like the assessment file has 1,543 RSCs, whereas the PCG file has 1,603 RSCs between the ages of 3-17 (as determined by the RSC_AGE variable). I've been through the codebook but am still unsure why these numbers don't match. Was there some criteria other than child's age which determined which children received the assessment tests?
Also, when I merged the assessment file (for RSC only) to the PCG file, I have 22 sample IDs that are on the Assessment file but not on the PCG file.
Table 2.11 in the main codebook (not the individual file codebooks) shows that several hundred eligible RSC refused to do the assessment test.
Note that the individual WJR assessment codebook says that the ASSESS1 data represents all those to whom an assessment test was given. As seen in Table 2.11, not all RSC (or SIB or PCG) who were eligible for assessment test were able to be given the test so no assessment booklet was filled out for them---no assessment booklet, no record in ASSESS1. Thus, one would not expect that all possible RSC age 3-17 would appear in the ASSESS1 file. Among RSC, around 100 RSC age 3-17 were not given the assessment test but did have a parent module started and thus they appear in the Parent1 file but not the ASSESS1 file.
Also, there were cases where an assessment test was done for an RSC (or SIB or PCG) where a parent module was not able to be done. The FI did the assessment test before trying to administer the parent module--they weren't supposed to, but that's what they did. That's why you see 20-some odd RSC who are in the assessment data but not in the PARENT1 file.
Because the assessments were not always done right after the parent or child module, a subsequent visit may have been scheduled to do the assessment and that visit was never kept by the respondent. If the FI was not able to meet with the child to attempt to do the assessment test, then no assessment booklet was filled out.
So, if an RSC is not in the ASSESS1 file, then no assessment test could be attempted, essentially due to nonresponse (which can be viewed as a refusal).
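When merging the two files, it is worth tabulating which ids fall in both files, only in the assessment file, or only in the parent file before dropping anything. This is a minimal sketch using made-up (household id, person id) pairs; the key names in your own merge will depend on the id variables in the files you are using.

```python
def compare_files(assess_ids, parent_ids):
    """Split ids into those in both files, only in ASSESS1, and only in PARENT1."""
    return {
        "both": assess_ids & parent_ids,
        "assess_only": assess_ids - parent_ids,  # assessed, but no parent module record
        "parent_only": parent_ids - assess_ids,  # parent module started, no assessment
    }

# Made-up (household id, person id) pairs for RSCs.
assess_ids = {(1001, 3), (1002, 4), (1003, 2)}
parent_ids = {(1001, 3), (1003, 2), (1004, 5)}

overlap = compare_files(assess_ids, parent_ids)
for group, ids in overlap.items():
    print(group, sorted(ids))
```

The "assess_only" group corresponds to the 20-some RSCs described above, and the "parent_only" group to the roughly 100 RSCs age 3-17 with a parent module but no assessment.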