This section explains how to incorporate survey weights into statistical calculations using the AmericasBarometer datasets. It builds on the work done in the “Data Manipulation” document, available here.
The data that we are going to use should be cited as follows: Source: AmericasBarometer by the Latin American Public Opinion Project (LAPOP), www.LapopSurveys.org. In order to reproduce these calculations, this section starts from scratch, reloading the dataset from the “materials_edu” repository of LAPOP’s GitHub account. We suggest clearing the Environment of any previous dataframes. This can be done with the broom icon.
We use the rio library and its import command to load this dataset again from this repository.
library(rio)
lapop18 = import("https://raw.github.com/lapop-central/materials_edu/main/LAPOP_AB_Merge_2018_v1.0.sav")
lapop18 = subset(lapop18, pais<=35)
We also load the dataset for the 2021 round.
lapop21 = import("https://raw.github.com/lapop-central/materials_edu/main/lapop21.RData")
When a researcher opens a dataset in any statistical software, the software assumes that the data come from a simple random sample. Public opinion data such as the AmericasBarometer do not come from a simple random sample, but from a multistage probabilistic design with stratification, clustering, and quotas. As indicated in the technical report of the 2018/19 round of the AmericasBarometer, available here, the samples in each country were designed using a multi-stage probabilistic design (with household-level quotas for most countries), and were stratified by major regions in the country, size of the municipality, and by urban and rural areas within the municipalities. This complex sampling design has to be incorporated into the calculations; when it is not, the results can differ from those that account for the design. A more detailed explanation of the use of survey weights and the potential consequences of not using them with the AmericasBarometer data can be found in Methodological Note 007 (Castorena, 2021), available here. This Methodological Note describes three scenarios for the use of expansion factors.
As the Methodological Note indicates, “unweighted analyzes may result in biased estimates” (p.9). For example, in the Data Manipulation section we replicated the results on support for democracy in Honduras (45%) and Uruguay (76.2%), for which a recoded variable was calculated and described.
library(car)
lapop18$ing4rec <- car::recode(lapop18$ing4, "1:4=0; 5:7=1")
table(lapop18$ing4rec)
##
## 0 1
## 11463 15623
With this recoded variable, we can calculate the distribution of support for democracy in these two countries and report the rounded percentages.
round(prop.table(table(lapop18$ing4rec[lapop18$pais==4]))*100, 1)
##
## 0 1
## 55 45
round(prop.table(table(lapop18$ing4rec[lapop18$pais==14]))*100, 1)
##
## 0 1
## 23.8 76.2
These results are the same as those that appear in Figure 1.2 of the report “The Pulse of Democracy” (p.12), available here. This is to be expected because, as Table 5 of the Methodological Note indicates, both countries have a self-weighted sample design, so these calculations, which do not include the design, coincide with those of the report, which do include the survey weights.
Brazil is a different case. According to the Methodological Note, it has a weighted sample design, so a survey weight is required to adjust for the oversample in the design. If the distribution of support for democracy in Brazil is calculated without the expansion factor, the result differs from that of the report.
round(prop.table(table(lapop18$ing4rec[lapop18$pais==15]))*100, 1)
##
## 0 1
## 40.2 59.8
In this calculation we obtain 59.8%, while we observe 60.0% in Figure 1.2 of the report. This difference is due to the fact that the table and the prop.table commands do not include the survey weights.
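The effect of leaving out the weights can be illustrated with a minimal base R sketch. The data frame below is a made-up toy example (its values are assumptions, not AmericasBarometer data): the last two respondents belong to an oversampled group and therefore carry smaller weights. The xtabs command sums the weights within each category instead of counting rows.

```r
# Hypothetical toy data: six respondents; the last two come from an
# oversampled group and carry smaller weights (values are made up).
toy <- data.frame(
  support = c(1, 0, 1, 1, 0, 0),
  wt      = c(1.5, 1.5, 1.5, 1.5, 0.5, 0.5)
)

# Unweighted proportions, as table() and prop.table() compute them
unweighted <- prop.table(table(toy$support))

# Weighted proportions: xtabs() sums the weights within each category
weighted <- prop.table(xtabs(wt ~ support, data = toy))

round(unweighted * 100, 1)
round(weighted * 100, 1)

# For a 0/1 variable, the weighted mean equals the weighted proportion of 1s
weighted.mean(toy$support, toy$wt)
```

In this toy example the unweighted share of supporters is 50%, while the weighted share is about 64.3%, because the down-weighted respondents are all non-supporters.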
Some libraries and commands in R allow the inclusion of a weight variable in calculations. The descr package, for example, includes several commands, such as compmeans or crosstab, that allow this weight inclusion. To reproduce the data shown in Figure 1.2 of the report, you can use the compmeans command, which calculates the mean of a variable (such as ing4rec, whose mean is equal to the proportion) by groups of a factor variable, such as “pais”, weighting the results by a variable, such as “weight1500”. The plot = FALSE specification is added to disable plot production.
library(descr)
compmeans(lapop18$ing4rec, lapop18$pais, lapop18$weight1500, plot=FALSE)
## Mean value of "lapop18$ing4rec" according to "lapop18$pais"
## Mean N Std. Dev.
## 1 0.6272307 1436 0.4837099
## 2 0.4888451 1432 0.5000501
## 3 0.5856655 1454 0.4927762
## 4 0.4501005 1436 0.4976772
## 5 0.5153743 1451 0.4999359
## 6 0.7235940 1457 0.4473736
## 7 0.5380612 1479 0.4987179
## 8 0.5978999 1460 0.4904899
## 9 0.5443122 1479 0.4982010
## 10 0.4914110 1454 0.5000983
## 11 0.4926471 1475 0.5001155
## 12 0.5121786 1463 0.5000225
## 13 0.6387097 1419 0.4805438
## 14 0.7619359 1451 0.4260454
## 15 0.5999750 1470 0.4900697
## 17 0.7110368 1468 0.4534353
## 21 0.5922659 1458 0.4915818
## 23 0.5118871 1334 0.5000461
## Total 0.5771191 26078 0.4940263
According to these results, Brazil (country = 15) has a support for democracy of 0.599975. If we transform this number into a percentage, rounded to 1 decimal place, we reproduce the value of 60% that is observed in Figure 1.2 of the report. The data for the rest of the countries are replicated as well. For example, this table shows a support for democracy of 0.6272307 for Mexico (country = 1), or, as a percentage rounded to 1 decimal place, 62.7%, equal to the data in the report.
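The arithmetic behind a weighted mean by group can be sketched in base R. This is only an illustration of the calculation, not a description of how the descr package implements compmeans. The toy data frame is hypothetical: its column names mirror the dataset, but its values are made up.

```r
# Hypothetical toy data; column names mirror the dataset, values are made up
toy <- data.frame(
  pais       = c(1, 1, 1, 2, 2, 2),
  ing4rec    = c(1, 1, 0, 0, 1, 1),
  weight1500 = c(1, 1, 3, 1, 2, 2)
)

# Weighted mean of ing4rec within each country
# (na.rm = TRUE drops missing values of ing4rec, not missing weights)
by_country <- sapply(
  split(toy, toy$pais),
  function(d) weighted.mean(d$ing4rec, d$weight1500, na.rm = TRUE)
)

# Express the means as percentages rounded to 1 decimal place
pct <- round(by_country * 100, 1)
pct

# The same rounding applied to the Brazil value reported above
round(0.599975 * 100, 1)
```

In the toy data, country 1 yields 40% and country 2 yields 80%, and the last line returns 60 for the Brazil value.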
Another way to replicate the results incorporating survey weights is using the survey package, a package specially developed to work with complex sample designs. The Methodological Note includes an appendix with the Stata code to use the survey weights in the AmericasBarometer data. Here we will do the same in R, for which we will use the svydesign command (similar to the svyset command in Stata). With this command a new object called “lapop.design18” is created, which stores the information of the variables contained in the dataframe, together with the specified survey weight. Therefore, if a new variable is created later in the dataframe, this command has to be run again so that the “lapop.design18” object includes the new variable.
This sampling design depends not only on the “weight1500” variable, but also on the variables that define the strata (“estratopri”) and the primary sampling unit (“upm”).
# install.packages("survey")  # run this line once to install the package
library(survey)
lapop.design18 = svydesign(ids = ~upm, strata = ~estratopri, weights = ~weight1500, nest=TRUE, data=lapop18)
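As an alternative to re-running svydesign after creating a variable, the survey package provides an update method for design objects that adds a variable computed from the data already stored in the design. The sketch below uses a hypothetical toy dataset (made-up values, with column names that mirror the AmericasBarometer variables) and assumes the survey package is installed.

```r
library(survey)

# Hypothetical toy data: 2 strata, 2 PSUs per stratum, 2 respondents per PSU
toy <- data.frame(
  estratopri = rep(c(1, 2), each = 4),
  upm        = rep(c(1, 2, 1, 2), each = 2),
  weight1500 = c(1, 1, 2, 2, 1, 1, 2, 2),
  ing4       = c(7, 3, 5, 2, 6, 6, 1, 4)
)

# Design object created before the recoded variable exists
toy.design <- svydesign(ids = ~upm, strata = ~estratopri,
                        weights = ~weight1500, nest = TRUE, data = toy)

# update() adds the recoded variable directly to the design object
toy.design <- update(toy.design, ing4rec = as.numeric(ing4 >= 5))

# Weighted mean of the new variable
toy.mean <- svymean(~ing4rec, toy.design, na.rm = TRUE)
toy.mean
```

With these made-up values the weighted mean is 5/12, about 0.417. Either approach works; update keeps the design object and the variables used in the analysis in one place.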
Once the object “lapop.design18” has been created with the weight included, we can use the native commands of the survey package to make calculations. For example, to calculate the mean of the variable “ing4rec” (support for democracy) in the entire dataset for the 2018/19 round, we use the command svymean.
svymean(~ing4rec, lapop.design18, na.rm=TRUE)
## mean SE
## ing4rec 0.57712 0.0032
In this way, the value of the last row of results of the compmeans command is reproduced, which corresponds to the mean of the entire sample. That is, the same result is found in both ways. To reproduce the results by country, you can use the svyby command, which computes results (such as the mean, using svymean) of a variable (“ing4rec”) by the values of another variable (“pais”).
svyby(~ing4rec, ~pais, design=lapop.design18, svymean, na.rm=TRUE)
## pais ing4rec se
## 1 1 0.6272307 0.01245940
## 2 2 0.4888451 0.01358318
## 3 3 0.5856655 0.01267273
## 4 4 0.4501005 0.01197688
## 5 5 0.5153743 0.01419558
## 6 6 0.7235940 0.01512205
## 7 7 0.5380612 0.01372306
## 8 8 0.5978999 0.01212261
## 9 9 0.5443122 0.01357881
## 10 10 0.4914110 0.01374835
## 11 11 0.4926471 0.01337323
## 12 12 0.5121786 0.01624846
## 13 13 0.6387097 0.01161029
## 14 14 0.7619359 0.01240878
## 15 15 0.5999750 0.01556882
## 17 17 0.7110368 0.01415857
## 21 21 0.5922659 0.01050698
## 23 23 0.5118871 0.01325745
In this case, we see that the means in this table are exactly the same as those reported with compmeans, since both use the same survey weight.
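The match between the two tables concerns the point estimates, which depend only on the weights. The strata and cluster variables affect the standard errors. A minimal sketch with hypothetical toy data (made-up values; the survey package is assumed to be installed) compares a design that declares only the weights with one that also declares strata and primary sampling units.

```r
library(survey)

# Hypothetical toy data: 2 strata, 2 PSUs per stratum, 2 respondents per PSU
toy <- data.frame(
  estratopri = rep(c(1, 2), each = 4),
  upm        = rep(c(1, 2, 1, 2), each = 2),
  weight1500 = c(1, 1, 2, 2, 1, 1, 2, 2),
  ing4rec    = c(1, 0, 1, 0, 1, 1, 0, 0)
)

# Design that declares only the weights (no strata, no clusters)
d.weights <- svydesign(ids = ~1, weights = ~weight1500, data = toy)

# Design that also declares strata and primary sampling units
d.full <- svydesign(ids = ~upm, strata = ~estratopri,
                    weights = ~weight1500, nest = TRUE, data = toy)

m.weights <- svymean(~ing4rec, d.weights)
m.full    <- svymean(~ing4rec, d.full)

# Point estimates coincide because both designs use the same weights
coef(m.weights)
coef(m.full)

# Standard errors are computed from the declared design and may differ
SE(m.weights)
SE(m.full)
```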
Similarly, the results in Figure 1.1 of the report The Pulse of Democracy for the 2021 round do not match those obtained in the previous section. For example, the unweighted results for support for democracy by country indicate that 65.2% of respondents in Mexico support democracy, while the figure in the report indicates 63%. For Uruguay, the report shows 80% support for democracy, while the unweighted results indicate 84.5%.
To replicate the results in Figure 1.1, we follow a procedure similar to that used for the 2018/19 round. First, we recode the variable and then use the compmeans command.
lapop21$ing4rec = car::recode(lapop21$ing4, "1:4=0; 5:7=1")
compmeans(lapop21$ing4rec, lapop21$pais, lapop21$weight1500, plot = FALSE)
## Mean value of "La democracia es mejor que cualquier otra forma de gobierno"
## according to "País"
## Mean N Std. Dev.
## 1 0.6319034 1450 0.4824541
## 2 0.5194371 1397 0.4998009
## 3 0.7251405 1460 0.4465964
## 4 0.4876840 1425 0.5000237
## 5 0.6269484 1464 0.4837809
## 6 0.7142359 1476 0.4519307
## 7 0.6120172 1471 0.4874563
## 8 0.5299566 1471 0.4992716
## 9 0.6317680 1483 0.4824875
## 10 0.6097236 1462 0.4879791
## 11 0.4992388 1490 0.5001672
## 12 0.4997591 1453 0.5001720
## 13 0.6758364 1469 0.4682209
## 14 0.7997897 1479 0.4002930
## 15 0.6666251 1479 0.4715787
## 17 0.6885568 1479 0.4632402
## 21 0.6151482 1450 0.4867282
## 22 0.4551642 903 0.4982617
## 23 0.5699446 1324 0.4952707
## 24 0.6583942 680 0.4745972
## 40 0.7439125 1498 0.4366164
## 41 0.7338750 1500 0.4420778
## Total 0.6256306 30764 0.4839675
We can also create a new object called “lapop.design21”, which stores the information of the variables in this dataset, including a survey weight. The svydesign command does not accept missing values (NA) in the variables that define the design. The dataset of the 2021 round has 10 NAs in the variable “weight1500” and 1426 NAs in the variable “estratopri”. To be able to create a design object, we have to drop the observations with missing values. The dataset without these observations is saved in a new dataframe, “lapop21a”, which is the one passed to svydesign.
lapop21a = subset(lapop21, !is.na(estratopri))
lapop21a = subset(lapop21a, !is.na(weight1500))
lapop.design21 = svydesign(ids = ~upm, strata = ~estratopri, weights = ~weight1500, nest=TRUE, data=lapop21a)
svyby(~ing4rec, ~pais, design=lapop.design21, svymean, na.rm=TRUE)
## pais ing4rec se
## 1 1 0.6316272 0.010566603
## 2 2 0.5199516 0.011260850
## 3 3 0.7256456 0.009307515
## 4 4 0.4871786 0.011431643
## 5 5 0.6269559 0.011877269
## 6 6 0.7156601 0.009841549
## 7 7 0.6114910 0.010348017
## 8 8 0.5291314 0.011788637
## 9 9 0.6319269 0.010361710
## 10 10 0.6096981 0.010533582
## 11 11 0.4987885 0.011084079
## 12 12 0.4997387 0.011035268
## 13 13 0.6762073 0.010349583
## 14 14 0.7997897 0.009234862
## 15 15 0.6667058 0.012588648
## 17 17 0.6883720 0.010656950
## 21 21 0.6143721 0.011124695
## 22 22 0.4487836 0.015521089
## 23 23 0.5698944 0.010519540
## 24 24 0.6610324 0.017120055
## 40 40 0.7439125 0.015649932
## 41 41 0.7338750 0.011610860
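Before dropping rows as in the code above, it can help to count the missing values in each design variable and to check how many observations would be lost. The following base R sketch uses a hypothetical toy data frame (made-up values; the column names mirror the design variables) and removes rows with complete.cases, which gives the same result as chaining subset calls with !is.na.

```r
# Hypothetical toy data with missing values in two design variables
toy <- data.frame(
  upm        = c(1, 1, 2, 2, 3),
  estratopri = c(10, 10, NA, 20, 20),
  weight1500 = c(1.0, NA, 0.8, 1.2, 1.0)
)

design_vars <- c("upm", "estratopri", "weight1500")

# Number of NAs in each variable that defines the design
na_counts <- colSums(is.na(toy[, design_vars]))
na_counts

# Keep only the rows with no NA in any design variable
toy_clean <- toy[complete.cases(toy[, design_vars]), ]

# Number of observations dropped
nrow(toy) - nrow(toy_clean)
```

In this toy example, one NA is found in “estratopri” and one in “weight1500”, and two rows are dropped.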
In this way, we have seen two ways to incorporate the sampling design in basic calculations with the AmericasBarometer data. Later, we will see how to include the survey weights in more complex calculations, such as confidence intervals or regressions. In these documents we will first work with the unweighted version, and then we will present the complex version, including the survey weights in the calculations.