Designing Social Inquiry Part 8

Most qualitative social scientists appreciate the importance of controlling for the possibly spurious effects of other variables when estimating the effect of one variable on another. Ways to effect this control include, among others, John Stuart Mill's (1843) methods of difference and similarity (which, ironically, are referred to by Przeworski and Teune (1982) as most similar and most different systems designs, respectively), Verba's (1967) "disciplined-configurative case comparisons," (which are similar to George's [1982] "structured-focused comparisons"), and diverse ways of using ceteris paribus assumptions and similar counterfactuals. These phrases are frequently invoked, but researchers often have difficulty applying them effectively. Unfortunately, qualitative researchers have few tools for expressing the precise consequences of failing to take into account additional variables in particular research situations: that is, of "omitted variable bias." We provide these tools in this section.

We begin our discussion of this issue with a verbal analysis of the consequences of omitted variable bias and follow it with a formal analysis of this problem. Then we will turn to broader questions of research design raised by omitted variable bias.

5.2.1 Gauging the Bias from Omitted Variables.

Suppose we wish to estimate the causal effect of our explanatory variable X1 on our dependent variable Y. If we are undertaking a quantitative analysis, we denote this causal effect of X1 on Y as β1. One way of estimating β1 is by running a regression equation or another form of analysis, which yields an estimate b1 of β1. If we are carrying out qualitative research, we will also seek to make such an estimate of the causal effect; however, this estimate will depend on verbal argument and the investigator's assessment, based on experience and judgment.

Suppose that after we have made these estimates (quantitatively or qualitatively) a colleague takes a look at our analysis and objects that we have omitted an important control variable, X2. We have been estimating the effect of campaign spending on the proportion of the votes received by a congressional candidate. Our colleague conjectures that our finding is spurious due to "omitted variable bias." That is, she suggests that our estimate b1 of β1 is incorrect since we have failed to take into account another explanatory variable X2 (such as a measure of whether or not the candidate is an incumbent). The true model should presumably control for the effect of the new variable.



How are we to evaluate her claim? In particular, under what conditions would our omission of the variable measuring incumbency affect our estimate of the effect of spending on votes and under what conditions would it have no effect? Clearly, the omission of a term measuring incumbency will not matter if incumbency has no effect on the dependent variable; that is, if X2 is irrelevant, because it has no effect on Y, it will not cause bias. This is the first special case: irrelevant omitted variables cause no bias. Thus, if incumbency had no electoral consequences we could ignore the fact that it was omitted.

The second special case, which also produces no bias, occurs when the omitted variable is uncorrelated with the included explanatory variable. Thus, there is also no bias if incumbency status is uncorrelated with our explanatory variable, campaign spending. Intuitively, when an omitted variable is uncorrelated with the main explanatory variable of interest, controlling for it would not change our estimate of the causal effect of our main variable, since we control for the portion of the variation that the two variables have in common, if any. Thus, we can safely omit control variables, even if they have a strong influence on the dependent variable, as long as they do not vary with the included explanatory variable.60

If these special cases do not hold for some omitted variable (i.e., this variable is correlated with the included explanatory variable and has an effect on the dependent variable), then failure to control for it will bias our estimate (or perception) of the effect of the included variable. In the case at hand, our colleague would be right in her criticism since incumbency is related to both the dependent variable and the independent variable: incumbents get more votes and they spend more.

This insight can be put in formal terms by focusing on the last line of equation (5.5) from the box below:

E(b1) = β1 + Fβ2     (5.3)

This is the equation used to calculate the bias in the estimate of the effect of X1 on the dependent variable Y. In this equation, F represents the degree of correlation between the two explanatory variables X1 and X2.61 If the estimator calculated by using only X1 as an explanatory variable (that is, b1) was unbiased, it would equal β1 on average; that is, it would be true that E(b1) = β1. This estimator is unbiased in the two special cases where the bias term Fβ2 equals zero. It is easy to see that this formalizes the conditions for unbiasedness that we stated above. That is, we can omit a control variable if either

* The omitted variable has no causal effect on the dependent variable (that is, β2 = 0, regardless of the nature of the relationship, F, between the included and excluded variables); or
* The omitted variable is uncorrelated with the included variable (that is, F = 0, regardless of the value of β2).

If we discover an omitted variable that we suspect might be biasing our results, our analysis should not end here. If possible, we should control for the omitted variable. And even if we cannot, because we have no good source of data about the omitted variable, our model can help us to ascertain the direction of bias, which can be extremely helpful. Having an underestimate or an overestimate may substantially bolster or weaken an existing argument.
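A brief simulation can make this concrete. The following sketch is not part of the original text; the variable names, coefficient values, and the use of Python with numpy are all our own illustrative assumptions. It generates data in which X2 affects Y and is correlated with X1, then compares the analysis that omits X2 with the one that controls for it:

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta1, beta2 = 1000, 0.5, 1.0
    x2 = rng.normal(size=n)                     # the omitted variable (say, incumbency)
    x1 = 0.6 * x2 + rng.normal(size=n)          # the included variable, correlated with x2
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    # Analysis omitting X2: regress Y on X1 alone
    b1_short = np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0]
    # Analysis controlling for X2: regress Y on X1 and X2 together
    b_long = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]
    # F: the slope from regressing X2 on X1
    F = np.linalg.lstsq(x1[:, None], x2, rcond=None)[0][0]

    print(b1_short)            # roughly beta1 + F*beta2: biased upward
    print(b_long[0])           # roughly beta1: controlling for X2 removes the bias
    print(beta1 + F * beta2)   # the value equation (5.3) predicts for the omitted-variable analysis

Setting beta2 to zero, or replacing 0.6 with zero so that x1 and x2 are uncorrelated, reproduces the two special cases in which the omission is harmless.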

For example, suppose we study a few sub-Saharan African states and find that coups d'etat appear more frequently in politically repressive regimes-that β1 (the effect of repression on the likelihood of a coup) is positive. That is, the explanatory variable is the degree of political repression, and the dependent variable is the likelihood of a coup. The unit of analysis is the sub-Saharan African country. We might even expand the sample to other African states and come to the same conclusion. However, suppose that we did not consider the possible effects of economic conditions on coups. Although we might have no data on economic conditions, it is reasonable to hypothesize that unemployment would probably increase the probability of a coup d'etat (β2 > 0), and it also seems likely that unemployment is positively correlated with political repression (F > 0). We also assume, for the purposes of this illustration, that economic conditions are prior to our key causal variable, the degree of political repression. If this is the case, the degree of bias in our analysis could be severe. Since unemployment has a positive correlation with both the dependent variable and the explanatory variable (Fβ2 > 0 in this case), excluding that variable would mean that we were inadvertently estimating the effect of repression and unemployment on the likelihood of a coup instead of just repression (β1 + Fβ2 instead of β1). Furthermore, because the joint impact of repression and unemployment is greater than the effect of repression alone (β1 + Fβ2 is greater than β1), the estimate of the effect of repression (b1) will be too large on average. Therefore, this analysis shows that by excluding the effects of unemployment, we overestimated the effects of political repression. (This is different from the consequences of measurement error in the explanatory variables, since omitted variable bias can sometimes cause a negative relationship to be estimated as a positive one.)

Omitting relevant variables does not always result in overestimates of causal effects. For example, we could reasonably hypothesize that in some other countries (perhaps the subject of a new study), political repression and unemployment were inversely related (that F is negative). In these countries, political repression might enable the government to control warring factions, impose peace from above, and put most people to work. This in turn means that the bias introduced by the negative relationship of unemployment and repression (Fβ2) will also be negative, so long as we are still willing to assume that more unemployment will increase the probability of a coup in these countries. The substantive consequence is that the estimated effect of repression on the likelihood of a coup (E(b1)) will now be less than the true effect (β1). Thus, if economic conditions are excluded, b1 will generally be an underestimate of the effect of political repression. If F is sufficiently negative and β2 is sufficiently large, then we might routinely estimate a positive β1 to be negative and incorrectly conclude that more political repression decreases the probability of a coup d'etat! Even if we had insufficient information on unemployment rates to include it in the original study, an analysis like this can still help us generate reasonable substantive conclusions.
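To fix the magnitudes, here is a worked numerical sketch with invented values: suppose the true effect of repression is β1 = 0.3 and the effect of unemployment is β2 = 0.4. Then, for different hypothetical values of F:

    E(b1) = β1 + Fβ2 = 0.3 + (0.5)(0.4)  = 0.50    (positive F: overestimate)
    E(b1) = β1 + Fβ2 = 0.3 + (-0.5)(0.4) = 0.10    (negative F: underestimate)
    E(b1) = β1 + Fβ2 = 0.3 + (-0.9)(0.4) = -0.06   (strongly negative F: sign reversal)

None of these numbers comes from any actual study; they merely show how the sign and size of F and β2 determine the direction and severity of the bias.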

As these examples should make clear, we need not actually run a regression to estimate parameters, to assess the degrees and directions of bias, or to arrive at such conclusions. Qualitative and intuitive estimates are subject to the same kinds of biases as are strictly quantitative ones. This section shows that in both situations, information outside the existing data can help substantially in estimating the degree and direction of bias.

If we know that our research design might suffer from omitted variables but do not know what those variables are, then we may very well have flawed conclusions (and some future researcher is likely to find them). The incentives to find out more are obvious. Fortunately, in most cases, researchers have considerable information about variables outside their analysis. Sometimes this information is detailed but available for only some subunits, or partial but widely applicable, or even from previous research studies. Whatever the source, even incomplete information can help us gauge the likely degree and direction of bias in our estimates of causal effects.

Of course, even scholars who understand the consequences of omitted variable bias may encounter difficulties in identifying variables that might be omitted from their analysis. No formula can be provided to deal with this problem, but we do advise that all researchers, quantitative and qualitative, systematically look for omitted control variables and consider whether they should be included in the analysis. We suggest some guidelines for such a review in this section.

Omitted variables can cause difficulties even when we have adequate information on all relevant variables. Scholars sometimes have such information, and believing the several variables to be positively related to the dependent variable, they estimate the causal effects of these variables sequentially, in separate "bivariate" analyses. It is particularly tempting to use this approach in studies with a small number of observations, since including many explanatory variables simultaneously creates very imprecise estimates or even an indeterminate research design, as discussed in section 4.1. Unfortunately, however, each analysis excludes the other relevant variables, and this omission leads to omitted variable bias in each estimation. The ideal solution is not merely to collect information on all relevant variables, but explicitly and simultaneously to control for all relevant variables. The qualitative researcher must recognize that failure to take into account all relevant variables at the same time leads to biased inferences. Recognition of the sources of bias is valuable, even if small numbers of observations make it impossible to remove them.
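The point can be seen in a small simulation sketch (our own hypothetical illustration in Python with numpy; nothing here comes from the text): with two correlated explanatory variables, each separate "bivariate" analysis is biased, while the simultaneous analysis is not.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 2000
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(size=n)       # the two explanatory variables are correlated
    y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

    # Two sequential "bivariate" analyses, each omitting the other variable
    b_x1_alone = np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0]
    b_x2_alone = np.linalg.lstsq(x2[:, None], y, rcond=None)[0][0]
    # One simultaneous analysis controlling for both
    b_joint = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

    print(b_x1_alone, b_x2_alone)   # both well above 0.5: each bivariate estimate is biased
    print(b_joint)                  # both close to 0.5: simultaneous control removes the bias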

Concern for omitted variable bias, however, should not lead us automatically to include every variable whose omission might cause bias because it is correlated with the independent variable and has an effect on the dependent variable. In general, we should not control for an explanatory variable that is in part a consequence of our key causal variable.

Consider the following example. Suppose we are interested in the causal effect of an additional $10,000 in income (our treatment variable) on the probability that a citizen will vote for the Democratic candidate (our dependent variable). Should we control for whether this citizen reports planning to vote Democratic in an interview five minutes before he arrives at the polls? This control variable certainly affects the dependent variable and is probably correlated with the explanatory variable. Intuitively, the answer is no. If we did control for it, the estimated effect of income on voting Democratic would be almost entirely attributed to the control variable, which in this case is hardly an alternative causal explanation. A blind application of the omitted variable bias rules, above, might incorrectly lead one to control for this variable. After all, this possible control variable certainly has an effect on the dependent variable-voting Democratic-and it is correlated with the key explanatory variable-income. But including this variable would attribute part of the causal effect of our key explanatory variable to the control variable.
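A toy simulation (entirely invented: a continuous "vote" score stands in for the actual vote, and the numbers have no empirical basis) illustrates why controlling for a consequence of the key causal variable is a mistake:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5000
    income = rng.normal(size=n)                           # key causal variable (standardized)
    intention = income + rng.normal(scale=0.1, size=n)    # reported intention: a consequence of income
    vote = intention + rng.normal(scale=0.1, size=n)      # dependent variable, driven by intention

    b_total = np.linalg.lstsq(income[:, None], vote, rcond=None)[0][0]
    b_post = np.linalg.lstsq(np.column_stack([income, intention]), vote, rcond=None)[0]

    print(b_total)     # about 1.0: the total effect of income on the vote
    print(b_post[0])   # about 0.0: "controlling" for the consequence absorbs income's effect

The second estimate is not an arithmetic error; it answers a different and, here, trivial question.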

To take another example, suppose we are interested in the causal effect of a sharp increase in crude-oil prices on public opinion about the existence of an energy shortage. We could obtain measures of oil prices (our key causal variable) from newspapers and use opinion polls as our dependent variable to gauge the public's perception of whether there is an energy shortage. But we might ask whether we should control for the effects of television coverage of energy problems. Certainly television coverage of energy problems is correlated with both the included explanatory variable (crude oil prices) and the dependent variable (public opinion about an energy shortage). However, since television coverage is in part a consequence of real-world oil prices, we should not control for that coverage in assessing the causal influence of oil prices on public opinion about an energy shortage. If instead we were interested in the causal effect of television coverage, we would control for oil prices, since these prices come before the key explanatory variable (which is now coverage).62

Thus, to estimate the total effect of an explanatory variable, we should list all variables that, according to our theoretical model, could cause the dependent variable. To repeat the point made above: in general, we should not control for an explanatory variable that is in part a consequence of our key explanatory variable. Having eliminated these possible explanatory variables, we should then control for other potential explanatory variables that would otherwise cause omitted variable bias-those that are correlated with both the dependent variable and with the included explanatory variables.63

The argument that we should not control for explanatory variables that are consequences of our key explanatory variables has a very important implication for the role of theory in research design. Thinking about this issue, we can see why we should begin with, or at least work toward, a theoretically motivated model rather than "data-mining": running regressions or qualitative analyses with whatever explanatory variables we can think of. Without a theoretical model, we cannot decide which potential explanatory variables should be included in our analysis. Indeed, in the absence of a model, we might get the strongest results by using a trivial explanatory variable-such as intention to vote Democratic five minutes before entering the polling place-and controlling for all other factors correlated with it. Without a theoretically motivated model, we cannot determine whether to control for or ignore possible explanatory variables that are correlated with each other, and we run serious risks either of omitted variable bias or of triviality in our research design.

Choosing when to add additional explanatory variables to our analysis is by no means simple. The number of additional variables is always unlimited, our resources are limited, and, above all, the more explanatory variables we include, the less leverage we have for estimating any of the individual causal effects. Avoiding omitted variable bias is one reason to add additional explanatory variables, since if relevant variables are omitted, our ability to estimate causal effects correctly is limited.

A Formal Analysis of Omitted Variable Bias. Let us begin with a simple model with two explanatory variables:

E(Y) = X1β1 + X2β2     (5.4)

Suppose now that we came upon an important analysis which reported the effect of X1 on Y without controlling for X2. Under what circumstances would we have grounds for criticizing this work or justification for seeking funds to redo the study? To answer this question, we formally evaluate the estimator with the omitted control variable.

The estimator of β1 where we omit X2 is

b1 = Σ X1i Yi / Σ X1i²

To evaluate this estimator, we take the expectation of b1 across hypothetical replications under the model in equation (5.4):

E(b1) = Σ X1i E(Yi) / Σ X1i²
      = Σ X1i (X1i β1 + X2i β2) / Σ X1i²
      = β1 + Fβ2     (5.5)

where F = Σ X1i X2i / Σ X1i², the slope coefficient from the regression of X2 on X1. The last line of this equation is reproduced in the text in equation (5.3) and is discussed in some detail above.

5.2.2 Examples of Omitted Variable Bias.

In this section, we consider several quantitative and qualitative examples, some hypothetical and some from actual research. For example, educational level is one of the best predictors of political participation. Those who have higher levels of education are more likely to vote and more likely to take part in politics in a number of other ways. Suppose we find this to be the case in a new data set but want to go further and see whether the relationship between the two variables is causal and, if so, how education leads to participation.

The first thing we might do would be to see whether there are omitted variables antecedent to education that are correlated with education and at the same time cause participation. Two examples might be the political involvement of the individual's parents and the race of the individual. Parents active in politics might inculcate an interest in participation in their children and at the same time be the kind of parents who foster educational attainment in their children. If we did not include this variable, we might have a spurious relationship between education and political activity or an estimate of the relationship that was too strong.

Race might play the same role. In a racially discriminatory society, blacks might be barred from both educational opportunities and political participation. In such a case, the apparent effect of education on participation would not be real. Ideally, we would want to eliminate all possible omitted variables that might explain away part or all of the relationship between education and participation.

But the fact that the relationship between education and participation diminishes or disappears when we control for an antecedent variable does not necessarily mean that education is irrelevant. Suppose we found that the education-participation link diminished when we controlled for race. One reason might be, as in the example above, that discrimination against blacks meant that race was associated separately with both educational attainment and participation. Under these conditions, no real causal link between education and participation would exist. On the other hand, race might affect political participation through education. Racial discrimination might reduce the access of blacks to education. Education might, in turn, be the main factor leading to participation. In this case, the reduction in the relationship between education and participation that is introduced when the investigator adds race to the analysis does not diminish the importance of education. Rather, it explains how race and education interact to affect participation.

Note that these two situations are fundamentally different. If lower participation on the part of blacks was due to a lack of education, we might expect participation to increase if their average level of education increased. But if the reason for lower participation was direct political discrimination that prevented the participation of blacks as citizens, educational improvement would be irrelevant to changes in patterns of participation.

We might also look for variables that are simultaneous with education or that followed it. We might look for omitted variables that show the relationship between education and participation to be spurious. Or we might look for variables that help explain how education works to foster participation. In the former category might be such a variable as the general intelligence level of the individual (which might lead to doing well in school and to political activity). In the latter category might be variables measuring aspects of education such as exposure to civics courses, opportunities to take part in student government, and learning of basic communications skills. If it were found that one or more of the latter, when included in the analysis, reduced the relationship between educational attainment and participation (when we controlled for communications skills, there was no independent effect of educational attainment on participation), this finding would not mean that education was irrelevant. The requisite communications skills were learned in school, and there would be a difference in such skills across educational levels. What the analysis would tell us would be how education influenced participation.
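A stylized simulation sketch (all relationships invented; the variables only loosely echo the example above) contrasts the two situations: an antecedent variable that explains away the education-participation link versus a mechanism through which education works.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5000

    # Case 1: an antecedent variable (say, discrimination) drives both education
    # and participation; education itself has no effect.
    antecedent = rng.normal(size=n)
    educ1 = antecedent + rng.normal(size=n)
    part1 = antecedent + rng.normal(size=n)

    # Case 2: education works through a mechanism (say, communication skills).
    educ2 = rng.normal(size=n)
    skills = educ2 + rng.normal(scale=0.1, size=n)
    part2 = skills + rng.normal(size=n)

    for educ, control, part in [(educ1, antecedent, part1), (educ2, skills, part2)]:
        b_alone = np.linalg.lstsq(educ[:, None], part, rcond=None)[0][0]
        b_ctrl = np.linalg.lstsq(np.column_stack([educ, control]), part, rcond=None)[0][0]
        print(round(b_alone, 2), round(b_ctrl, 2))
    # In both cases the education coefficient shrinks toward zero once the other
    # variable is controlled, but only in the first case is education causally irrelevant.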

All of these examples illustrate once again why it is necessary to have a theoretical model in mind to evaluate. There is no other way to choose what variables to use in our analysis. A theory of how education affected civic activity would guide us to the variables to include. Though we do not add additional variables to a regression equation in qualitative research, the logic is much the same when we decide what other factors to take into account. Consider the research question we raised earlier: the impact of summit meetings on cooperation between the superpowers. Suppose we find that cooperation between the United States and the USSR was higher in years following a summit than preceding one. How would we know that the effect is real and not the result of some omitted variable? And if we are convinced it is real, can we explicate further how it works?

We might want to consider antecedent variables that would be related to the likelihood of a summit and might also be direct causes of cooperation. Perhaps when leaders in each country have confidence in each other, they meet frequently and their countries cooperate. Or perhaps when the geopolitical ambitions of both sides are limited for domestic political reasons, they schedule meetings and they cooperate. In such circumstances, summits themselves would play no direct role in fostering cooperation, though the scheduling of a summit might be a good indicator that things were going well between the superpowers. It is also possible that summits would be part of a causal sequence, just as race might have affected educational level which in turn affected participation. When the superpower leaders have confidence in one another, they call a summit to reinforce that mutual confidence. This, in turn, leads to cooperation. In this case, the summit is far from irrelevant. Without it, there would be less cooperation. Confidence and summits interact to create cooperation.

Suppose we take such factors into account and find that summits seem to play an independent role-i.e., when we control for the previous mutual confidence of the leaders and their geopolitical ambitions, the conclusion is that a summit seems to lead to more cooperation. We might still go further and ask how that happens. We might compare among summits in terms of characteristics that might make them more or less successful and see if such factors are related to the degree of cooperation that follows. Again we have to select factors to consider, and these might include: the degree of preparation, whether the issues were economic rather than security, the degree of domestic harmony in each nation, the weather at the summit, and the food. Theory would have to guide us; that is, we would need a view of concepts and relationships that would point to relevant explanatory variables and would propose hypotheses consistent with logic and experience about their effects.

For researchers with a small number of observations, omitted variable bias is very difficult to avoid. In this situation, inefficiency is very costly; including too many irrelevant control variables may make a research design indeterminate (section 4.1). But omitting relevant control variables can introduce bias. And a priori the researcher may not know whether a candidate variable is relevant or not.

We may be tempted at this point to conclude that causal inference is impossible with small numbers of observations. In our view, however, the lessons to be learned are more limited and more optimistic. Understanding the difficulty of making valid causal inferences with few observations should make us cautious about making causal assertions. As indicated in chapter 2, good description and descriptive inference are more valuable than faulty causal inference. Much qualitative research would indeed be improved if there were more attention to valid descriptive inference and less impulse to make causal assertions on the basis of inadequate evidence with incorrect assessments of their uncertainty. However, limited progress in understanding causal issues is nevertheless possible, if the theoretical issues with which we are concerned are posed with sufficient clarity and linked to appropriate observable implications. A recent example from international relations research may help make this point.

Helen Milner's study, Resisting Protectionism (1988), was motivated by a puzzle: why was U.S. trade policy more protectionist in the 1920s than in the 1970s despite the numerous similarities between the two periods? Her hypothesis was that international interdependence increased between the 1920s and 1970s and helped to account for the difference in U.S. behavior. At this aggregate level of analysis, however, she had only the two observations that had motivated her puzzle, which could not help her distinguish her hypothesis from many other possible explanations of this observed variation. The level of uncertainty in her theory would therefore have been much too high had she stopped here. Hence she had to look elsewhere for additional observable implications of her theory.

Milner's approach was to elaborate the process by which her causal effect was thought to take place. She hypothesized that economic interdependence between capitalist democracies affects national preferences by influencing the preferences of industries and firms, which successfully lobby for their preferred policies. Milner therefore studied a variety of U.S. industries in the 1920s and 1970s and French industries in the 1970s and found that those with large multinational investments and more export dependence were the least protectionist. These findings helped confirm her broader theory of the differences in overall U.S. policy between the 1920s and 1970s. Her procedures were therefore consistent with a key part of our methodological advice: specify the observable implications of the theory, even if they are not the objects of principal concern, and design the research so that inferences can be made about these implications and used to evaluate the theory. Hence Milner's study is exemplary in many ways.

The most serious problem of research design that Milner faced involved potential omitted variables. The most obvious control variable is the degree of competition from imports, since more intense competition from foreign imports tends to produce more protectionist firm preferences. That is, import competition is likely to be correlated with Milner's dependent variable, and it is in most cases antecedent to or simultaneous with her explanatory variables. If this control variable were also correlated with her key causal explanatory variables, multinational investment and export dependence, her results would be biased. Indeed, a negative correlation between import competition and export dependence would have seemed likely on the principles of comparative advantage, so this hypothetical bias would have become real if import competition were not included as a control.

Milner dealt with this problem by selecting for study only industries that were severely affected by foreign competition. Hence, she held constant the severity of import competition and eliminated, or at least greatly reduced, this problem of omitted variable bias. She could have held this key control variable constant at a different level-such as only industries with moderately high levels of import penetration-so long as it was indeed constant for her observations.

Having controlled for import competition, however, Milner still faced other questions of omitted variables. The two major candidates that she considered most seriously, based on a review of the theoretical and empirical literature in her field, were (1) that changes in U.S. power would account for the differences between outcomes in the 1920s and 1970s, and (2) that changes in the domestic political processes of the United States would do so. Her attempt to control for the first factor was built into her original research design: since the proportion of world trade involving the United States in the 1970s was roughly similar to its trade involvement in the 1920s, she controlled for this dimension of American power at the aggregate level of U.S. policy, as well as at the industry and firm level. However, she did not control for the differences between the political isolationism of the United States in the 1920s and its hegemonic position as alliance leader in the 1970s; these factors could be analyzed further to ascertain their potentially biasing effects.

Milner controlled for domestic political processes by comparing industries and firms within the 1920s and within the 1970s, since all firms within these groups faced the same governmental structures and political processes. Her additional study of six import-competing industries in France during the 1970s obviously did not help her hold domestic political processes constant, but it did help her discover that the causal effect of export dependence on preferences for protectionism did not vary with changes in domestic political processes. By carefully considering several potential sources of omitted variable bias and designing her study accordingly, Milner greatly reduced the potential for bias.

However, Milner did not explicitly control for several other possible omitted variables. Her study focused "on corporate trade preferences and does not examine directly the influence of public opinion, ideology, organized labor, domestic political structure, or other possible factors" (1988: 15-16). Her decision not to control for these variables could have been justified on the theoretical grounds that these omitted variables are unrelated to, or are in part consequences of, the key causal variables (export dependence and multinational investment), or have no effect on the dependent variable (preferences for protectionism at the level of the firm, aggregated to industries). However, if these omitted variables were plausibly linked to both her explanatory and dependent variables and were causally prior to her explanatory variable, she would have had to design her study explicitly to control for them.64

Finally, Milner's procedure for selecting industries risked making her causal inferences inefficient. As we have noted, her case-selection procedure enabled her to control for the most serious potential source of omitted variable bias by holding import competition constant, which on theoretical grounds was expected to be causally prior to and correlated with her key causal variable and to influence her dependent variables. She selected those industries that had the highest levels of import competition and did not stratify by any other variable. She then studied the preferences of each industry in her sample, and of many firms, for protectionism (her dependent variable) and researched the degree of international economic dependence (her explanatory variable).

This selection procedure is inefficient with respect to her causal inferences because her key causal variables varied less than would have been desirable (Milner 1988:39-42). Although this inefficiency turned out not to be a severe problem in her case, it did mean that she had to do more case studies than were necessary to reach the same level of certainty about her conclusions (see section 6.2). Put differently, with the same number of cases, chosen so that they varied widely on her explanatory variable, she could have produced more certain causal inferences. That is, her design would have been more efficient had she chosen some industries and firms with no foreign ties and some with high levels of foreign involvement, all of which suffered from constant levels of economic distress and import penetration.

Researchers can never conclusively reject the hypothesis that omitted variables have biased their analyses. However, Milner was able to make a stronger, more convincing case for her hypothesis than she could have done had she not tried to control for some evident sources of omitted variable bias. Milner's rigorous study indicates that social scientists who work with qualitative material need not despair of making limited causal inferences. Perfection is unattainable, perhaps even undefinable; but careful linking of theory and method can enable studies to be designed in a way that will improve the plausibility of our arguments and reduce the uncertainty of our causal inferences.

5.3 INCLUDING IRRELEVANT VARIABLES: INEFFICIENCY.

Because of the potential problems with omitted variable bias described in section 5.2, we might naively think that it is essential to collect and simultaneously estimate the causal effects of all possible explanatory variables. At the outset, we should remember that this is not the implication of section 5.2. We showed there that omitting an explanatory variable that is uncorrelated with the included explanatory variables does not create bias, even if the variable has a strong causal impact on the dependent variable, and that controlling for variables that are the consequences of explanatory variables is a mistake. Hence, our argument should not lead researchers to collect information on every possible causal influence or to criticize research which fails to do so.

Of course, a researcher might still be uncertain about which antecedent control variables have causal impact or are correlated with the included variables. In this situation, some researchers might attempt to include all control variables that are conceivably correlated with the included explanatory variables as well as all those that might be expected on theoretical grounds to affect the dependent variable. This is likely to be a very long list of variables, many of which may be irrelevant. Such an approach, which appears at first glance to be a cautious and prudent means of avoiding omitted variable bias, would, in fact, risk producing a research design that could only produce indeterminate results. In research with relatively few observations, indeterminacy, as discussed in section 4.1, is a particularly serious problem, and such a "cautious" design would actually be detrimental.

This section discusses the costs of including irrelevant explanatory variables and provides essential qualifications to the "include everything" approach. The inclusion of irrelevant variables can be very costly. Our key point is that even if the control variable has no causal effect on the dependent variable, the more correlated the main explanatory variable is with the irrelevant control variable, the less efficient is the estimate of the main causal effect.

To illustrate, let us focus on two different procedures (or "estimators") for calculating an estimate of the causal effect of an appropriately included explanatory variable. The first estimate of this effect is from an analysis with no irrelevant control variables; the second includes one irrelevant control variable. The formal analysis in the box below provides the following conclusions about the relative worth of these two procedures, in addition to the one already mentioned. First, both estimators are unbiased. That is, even when controlling for an irrelevant explanatory variable, the usual estimator still gives the right answer on average. Second, if the irrelevant control variable is uncorrelated with the main explanatory variable, the estimate of the causal effect of the latter is not only unbiased, but it is as efficient as if the irrelevant variable had not been included. Indeed, if these variables are uncorrelated, precisely the same inference will result. However, if the irrelevant control variable is highly correlated with the main explanatory variable, substantial inefficiency will occur.
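A simulation sketch of this comparison (the data-generating process, sample size, and correlation are all invented for illustration) shows both conclusions at once: the estimates with and without the irrelevant control are centered on the truth, but the former are far more variable.

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps, beta1 = 50, 2000, 1.0
    b_without, b_with = [], []
    for _ in range(reps):
        x2 = rng.normal(size=n)
        x1 = 0.9 * x2 + 0.3 * rng.normal(size=n)    # x1 is highly correlated with x2
        y = beta1 * x1 + rng.normal(size=n)         # x2 is irrelevant: it has no effect on y
        b_without.append(np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0])
        b_with.append(np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0][0])

    print(np.mean(b_without), np.mean(b_with))   # both near 1.0: unbiased either way
    print(np.std(b_without), np.std(b_with))     # the second is far larger: inefficiency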

The costs of controlling for irrelevant variables are therefore high. When we do so, each study we conduct is much more likely to yield estimates far from the true causal effects. When we replicate a study in a new data set in which there is a high correlation between the key explanatory variable and an irrelevant included control variable, we will be likely to find different results, which would suggest different causal inferences. Thus, even if we control for all irrelevant explanatory variables (and make no other mistakes), we will get the right answer on average, but we may be far from the right answer in any single project and possibly every one. On average, the reanalysis will produce the same effect, but the irrelevant variable will increase the inefficiency, just as if we had discarded some of our observations. The implication should be clear: by including an irrelevant variable, we are putting more demands on our finite data set, resulting in less information available for each inference.

As an example, consider again the study of coups d'etat in African states. A preliminary study indicated that the degree of political repression, the main explanatory variable of interest, increased the frequency of coups. Suppose another scholar argued that the original study was flawed because it did not control for whether the state won independence in a violent or negotiated break from colonial rule. Suppose we believe this second scholar is wrong and that the nature of the break from colonial rule had no effect on the dependent variable-the frequency of coups (after the main explanatory variable, political repression, is controlled for). What would be the consequences of controlling for this irrelevant, additional variable?

The answer depends on the relationship between the irrelevant variable, which measures the nature of the break from colonial rule, and the main explanatory variable, which measures political repression. If the correlation between these variables is high-as seems plausible-then including these control variables would produce quite inefficient estimates of the effect of political repression. To understand this, notice that to control for how independence was achieved, the researcher might divide his categories of repressive and nonrepressive regimes according to whether they broke from colonial rule violently or by negotiation. The frequency of coups in each category could be counted to assess the causal effects of political repression, while the means of breaking from colonial rule is controlled. Although this sort of design is a reasonable way to avoid omitted variable bias, it can have high costs: when the additional control variable has no effect on the dependent variable but is correlated with an included explanatory variable, the number of observations in each category is reduced and the main causal effect is estimated much less efficiently. This result means that much of the hard work the researcher has put in was wasted, since unnecessarily reducing efficiency is equivalent to discarding observations. The best solution is to always collect more observations, but if this is not possible, researchers are well-advised to identify irrelevant variables and not control for them.

A Formal Analysis of Included Variable Inefficiencies. Suppose the true model is E(Y) = X1β1 and V(Y) = σ². However, we incorrectly think that a second explanatory variable X2 also belongs in the equation. So we estimate

E(Y) = X1β1 + X2β2     (5.6)

not knowing that in fact β2 = 0. What consequence does a simultaneous estimation of both parameters have for our estimate of β1?

Define b1 as the correct estimator, based only on a regression of Y on X1, and b1* as the first coefficient on X1 from a regression of Y on X1 and X2. It is easy to show that we cannot distinguish between these two estimators on the basis of unbiasedness (being correct on average across many hypothetical experiments), since both are unbiased:

E(b1) = β1 and E(b1*) = β1     (5.7)

The estimators do differ, however, with respect to efficiency. The correct estimator has a variance (calculated in equation [3.9]) of

V(b1) = σ² / Σ X1i²     (5.8)

whereas the other estimator has variance

V(b1*) = σ² / [Σ X1i² (1 - r12²)]
       = V(b1) / (1 - r12²)     (5.9)

where the correlation between X1 and X2 is r12 (see Goldberger 1991:245).

From the last line in equation (5.9), we can see the precise relationship between the variances of the two estimators. If the correlation between the two explanatory variables is zero, then it makes no difference whether you include the irrelevant variable or not, since both estimators have the same variance. However, the more correlated the two variables are, the higher the variance, and thus the lower the efficiency, of b1*.
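A small numerical check of this last line (arbitrary invented data; the variables are centered so that the sample correlation and the regression algebra match exactly):

    import numpy as np

    rng = np.random.default_rng(5)
    n, sigma2 = 100, 1.0
    x2 = rng.normal(size=n)
    x1 = 0.8 * x2 + rng.normal(size=n)
    x1, x2 = x1 - x1.mean(), x2 - x2.mean()       # center both variables for simplicity
    r12 = np.corrcoef(x1, x2)[0, 1]

    var_b1 = sigma2 / np.sum(x1 ** 2)                      # equation (5.8)
    X = np.column_stack([x1, x2])
    var_b1_star = sigma2 * np.linalg.inv(X.T @ X)[0, 0]    # variance of b1* computed directly

    print(var_b1_star, var_b1 / (1 - r12 ** 2))   # the two agree, as the last line of (5.9) says
    print(1 / (1 - r12 ** 2))                     # the variance inflation caused by the irrelevant X2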

5.4 ENDOGENEITY.

Political science research is rarely experimental. We do not usually have the opportunity to manipulate the explanatory variables; we just observe them. One consequence of this lack of control is endogeneity-that the values our explanatory variables take on are sometimes a consequence, rather than a cause, of our dependent variable. With true experimental manipulation, the direction of causality is unambiguous. But for many areas of qualitative and quantitative research, endogeneity is a common and serious problem.65

In the absence of investigator control over the values of the explanatory variables, the direction of causality is always a difficult issue. In nonexperimental research-quantitative or qualitative-explanatory and dependent variables vary because of factors out of the control (and often out of sight) of the researcher. States invade; army officers plot coups; inflation drops; government policies are enacted; candidates decide to run for office; voters choose among candidates. A scholar must try to piece together an argument about what is causing what.

An example is provided by the literature on U.S. congressional elections. Many scholars have argued that the dramatic rise of the electoral advantage of incumbency during the late 1960s was due in large part to the increase in constituency service performed by members of Congress. That is, the franking privilege, budgets for travel to the district, staff in the district to handle specific constituent requests, pork-barrel projects, and other perquisites of office have allowed congressional incumbents to build up support in their districts. Many citizens vote for incumbent candidates on these grounds.

This constituency-service hypothesis seems perfectly reasonable, but does the evidence support it? Numerous scholars have attempted to provide such evidence (for a review of this literature, see Cain, Ferejohn, and Fiorina 1987), but the positive evidence is scarce. The modal study of this question is based on measures of the constituency service performed by a sample of members of Congress and of the proportion of the vote for the incumbent candidate. The researchers then estimate the causal impact of service on the vote through regression analysis. Surprisingly, many of these estimates indicate that the effect is zero or even negative.

It seems likely that the problem of endogeneity accounts for these paradoxical results. In other words, members at highest risk of losing the next election (perhaps because of a scandal or hard times in their district) do extra constituency service. Incumbents who feel secure about being reelected probably focus on other aspects of their jobs, such as policy-making in Washington. The result is that those incumbents who do the most service receive the fewest votes. This does not mean that constituency service reduces the vote, only that a strong expected vote reduces service. Ignoring this feedback effect strongly biases our inferences.

David Laitin outlines an example of an endogeneity problem in one of the classics of early twentieth century social science, Max Weber's The Protestant Ethic and the Spirit of Capitalism. "Weber attempted to demonstrate that a specific type of economic behavior-the capitalist spirit-was (inadvertently) induced by Protestant teachings and doctrines. But . . . Weber and his followers could not answer one objection that was raised to their thesis: namely that the Europeans who already had an interest in breaking the bonds of precapitalist spirit might well have left the church precisely for that purpose. In other words, the economic interests of certain groups could be seen as inducing the development of the Protestant ethic. Without a better controlled study, Weber's line of causation could be turned the other way." (Laitin 1986:187; see also R. H. Tawney 1935, who originated the criticism).

In the remainder of this section, we will discuss five methods of coping with the difficult problem of endogeneity:

* Correcting a biased inference (section 5.4.1);
* Parsing the dependent variable and studying only those parts that are consequences, rather than causes, of the explanatory variable (section 5.4.2);
* Transforming an endogeneity problem into bias due to an omitted variable, and controlling for this variable (section 5.4.3);
* Carefully selecting at least some observations without endogeneity problems (section 5.4.4); and
* Parsing the explanatory variables to ensure that only those parts which are truly exogenous are in the analysis (section 5.4.5).

Each of these five procedures can be viewed as a method of avoiding endogeneity problems, but each can also be seen as a way of clarifying a causal hypothesis. For a causal hypothesis that ignores an endogeneity problem suffers, in the end, from a theoretical problem and requires respecification so that it is at least possible that the explanatory variables could influence the dependent variable. We will discuss the first two solutions to endogeneity in the context of our quantitative constituency-service example and the remaining three with the help of extended examples from qualitative research.

5.4.1 Correcting Biased Inferences.

The last line of equation (5.13) in the box below provides a procedure for assessing the exact direction and degree of bias due to endogeneity. For convenience, we reproduce equation (5.13) here:

E(b) = β + bias

This equation implies that if endogeneity is present, we are not making the causal inference we desire. That is, if the bias term is zero, our method of inference (or estimator b) will be unbiased on average (that is, equal to β). But if we have endogeneity bias, we are estimating the correct inference plus a bias factor. Endogeneity is a problem because we are generally unaware of the size or direction of the bias. This bias factor will be large or small, negative or positive, depending on the specific empirical example. Fortunately, even if we cannot avoid endogeneity bias in the first place, we can sometimes correct for it after the fact by ascertaining the direction and perhaps the degree of the bias.

Equation (5.13) demonstrates that the bias factor depends on the correlation between the explanatory variable and the error term-the part of the dependent variable unexplained by the explanatory variable. For example, if the constituency-service hypothesis is correct, then the causal effect of constituency service on the vote (β in the equation) is positive. If, in addition, the expected vote affects the level of constituency service we observe, then the bias term will be negative. That is, even after the effect of constituency service on the vote is taken into account, constituency service will inversely correlate with the error term because incumbents who have lower expected votes will perform more service. The result is that the bias term is negative, and uncorrected inferences in this case are biased estimates of the causal effect (or, equivalently, unbiased estimates of [β + bias]). Thus, even if the constituency-service hypothesis is true, endogeneity bias would cause us to estimate the effect of service as a smaller positive number than it should be, as zero, or even as negative, depending on the size of the bias factor. Hence, we can conclude that the correct estimate of the effect of service on the vote is larger than we estimated in an analysis conducted with no endogeneity correction. As a result, our uncorrected analysis yields a lower bound on the effect of service, making the constituency-service hypothesis more plausible.
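A stylized sketch of this feedback story (every quantity invented; "safety" stands in for the expected vote absent any service) shows how a positive causal effect can even appear negative:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 2000
    beta = 2.0                                        # true effect of service on the vote
    safety = rng.normal(size=n)                       # expected vote, absent any service
    service = -0.8 * safety + rng.normal(size=n)      # electorally weak incumbents do more service
    vote = 50 + beta * service + 10 * safety + rng.normal(size=n)

    b_naive = np.polyfit(service, vote, 1)[0]         # slope from regressing vote on service
    print(b_naive)   # far below 2.0, and here negative: endogeneity biases the estimate downward

Under this data-generating process the bias term is negative, so the naive estimate is a lower bound on the true effect, just as argued above.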

Thus, even if we cannot avoid endogeneity bias, we can sometimes improve our inferences after the fact by estimating the degree of bias. At a minimum, this enables us to determine the direction of bias, perhaps providing an upper or lower bound on the correct estimate. At best, we can use this technique to produce fully unbiased inferences.

5.4.2 Parsing the Dependent Variable.

One way to avoid endogeneity bias is to reconceptualize the dependent variable as itself containing a dependent and an explanatory component. The explanatory component of the dependent variable interferes with our analysis through a feedback mechanism, that is, by influencing our key causal (explanatory) variable. The other component of our dependent variable is truly dependent, a function, and not a cause, of our explanatory variable. The goal of this method of avoiding endogeneity bias is to identify and measure only the dependent component of our dependent variable.

For example, in a study of the constituency-service hypothesis, King (1991a) separated from the total vote for a member of Congress the portion due solely to incumbency status. In recent years, the electoral advantage of incumbency status is about 8-10 percentage points of the vote, as compared to a base for many incumbents of roughly 52 percent of the two-party vote. Through a statistical procedure, King then estimated the incumbency advantage, which was a solely dependent component of the dependent variable, and he used this figure in place of the raw vote to estimate the effects of constituency service. Since the incumbent's vote advantage, being such a small portion of the entire vote, would not have much of an effect on the propensity for incumbent legislators to engage in constituency service, he avoided endogeneity bias. His results indicated that an extra $10,000 added to the budget of the average state legislator for constituency service (among other things) gives this incumbent an additional 1.54 percentage point advantage (plus or minus about 0.4 percent) in the next election, hence providing the first empirical support for the constituency-service hypothesis.
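The following toy sketch conveys the logic of parsing, though it is emphatically not King's actual procedure (which estimates the incumbency advantage statistically rather than observing it directly); all numbers are invented.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 2000
    beta = 1.5                                            # true effect of service on the advantage
    base = rng.normal(50, 5, size=n)                      # expected vote before any incumbency effects
    service = -0.5 * (base - 50) + rng.normal(size=n)     # weak incumbents do more service (feedback)
    advantage = beta * service + rng.normal(size=n)       # the truly dependent component
    vote = base + advantage                               # the raw vote, which feeds back into service

    print(np.polyfit(service, vote, 1)[0])        # badly biased (here it even comes out negative)
    print(np.polyfit(service, advantage, 1)[0])   # about 1.5: the parsed component recovers the effect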

5.4.3 Transforming Endogeneity into an Omitted Variable Problem.

We can always think of endogeneity as a case of omitted variable bias, as the following famous example from the study of comparative electoral systems demonstrates. One of the great puzzles of political analysis for an earlier generation of political scientists was the fall of the Weimar Republic and its replacement by the Nazi regime in the early 1930s. One explanation, supported by some close and compelling case studies of Weimar Germany, was that the main cause was the imposition of proportional representation as the mode of election in the Weimar Constitution. The argument, briefly stated, is that proportional representation allows small parties representing specific ideological, interest, or religious groups to achieve representation in parliament. Under such an electoral system, there is no need for a candidate to compromise his or her position in order to achieve electoral success such as there is under a single-member-district, winner-take-all electoral system. Hence parliament will be filled with small ideological groups unwilling and unable to work together. The stalemate and frustration would make it possible for one of those groups-in this case the National Socialists-to seize power. (For the classic statement of this theory, see Hermens 1941).

The argument in the above paragraph was elaborated in several important case studies of the fall of the Weimar Republic. Historians and political scientists traced the collapse of Weimar to the electoral success of small ideological parties and their unwillingness to compromise in the Reichstag. There are many problems with the explanation, as of course there would be for an explanation of a complex outcome that is based on a single instance, but let us look only at the problem of endogeneity. The underlying explanation involved a causal mechanism with the following links in the causal chain: proportional representation was introduced and enabled small parties with narrow electoral bases to gain seats in the Reichstag (including parties dedicated to its overthrow, like the National Socialists). As a result, the Reichstag was stalemated and the populace was frustrated. This, in turn, led to a coup by one of the parties.

But further study-of Germany as well as of other observable implications-indicated that party fragmentation was not merely the result of proportional representation. Scholars reasoned that party fragmentation might be not only a consequence of proportional representation but also a cause of its adoption. By applying the same explanatory variable to other observations (following our rule from chapter 1 that evidence should be sought for hypotheses in data other than that in which they were generated), scholars found that societies with a large number of groups with narrow and intense views in opposition to other groups-minority, ethnic, or religious groups, for instance-are more likely to adopt proportional representation, since it is the only electoral system that the various factions in society can agree on. A closer look at German politics before the introduction of proportional representation confirmed this idea by locating many small factions. Proportional representation did not create these factions, although it may have facilitated their parliamentary expression. Nor were the factions the sole cause of proportional representation; however, both the adoption of proportional representation and parliamentary fragmentation seem to have been effects of social fragmentation. (See Lakeman and Lambert 1955:155 for an early explication of this argument.)

Thus, we have transformed an endogeneity problem into omitted variable bias. That is, prior social fragmentation is an omitted variable that causes proportional representation, is causally prior to it, and led in part to the fall of Weimar. By transforming the problem in this way, scholars were able to get a better handle on the problem since they could explicitly measure this omitted variable and control for it in subsequent studies. In this example, once the omitted variable was included and controlled for, scholars found that there was a reasonable probability that the apparent causal relationship between proportional representation and the fall of the Weimar Republic was almost entirely spurious.
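A schematic simulation (purely invented numbers, offered only to mirror the logic of the argument) shows how the spurious association disappears once the previously omitted variable is controlled:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 500
    fragmentation = rng.normal(size=n)                             # prior social fragmentation
    pr = (fragmentation + rng.normal(size=n) > 0).astype(float)    # PR adopted more often when fragmented
    collapse = fragmentation + rng.normal(size=n)                  # collapse risk; PR itself has no effect

    b_bivariate = np.polyfit(pr, collapse, 1)[0]
    X = np.column_stack([pr, fragmentation, np.ones(n)])           # now control for the omitted variable
    b_controlled = np.linalg.lstsq(X, collapse, rcond=None)[0][0]

    print(b_bivariate)    # clearly positive: the spurious PR-collapse association
    print(b_controlled)   # near zero: controlling for fragmentation removes it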
