dplyr - Let the treatment of NA values depend on their number relative to the number of available values within groups of a data frame in R
I have a dataset containing contracts between states. The number of contracting states varies from 2 to 94. In the data frame, each state is attributed a value called "polity", although for some states the value is missing.
With the help of this forum, I merged two data frames and summarized the contracts by taking the difference between the min() and max() "polity" values of the contracting states.
Now, I don't want to either ignore or exclude the NA values. I want to treat the polity value of a contract as NA if the number of NA values among the contracting states exceeds a certain fraction of the number of contracting states (for these data frames, it would be convenient if 4/5 of the polity values had to be available for a contract to be included in the analysis).
These are two simplified versions of the data sets:
treaties <- data.frame(
  treaty.id = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4),
  treaty = c("hungary slovenia 1994", "hungary slovenia 1994",
             "taiwan hungary 1994", "taiwan hungary 1994",
             "treaty of izmir 1977", "treaty of izmir 1977", "treaty of izmir 1977",
             "treaty of 5 1909", "treaty of 5 1909", "treaty of 5 1909",
             "treaty of 5 1909", "treaty of 5 1909"),
  scode = c("hun", "slv", "taw", "hun", "irn", "tur", "pak",
            "aus", "aul", "new", "usa", "can"),
  year = c(1994, 1994, 1994, 1994, 1977, 1977, 1977, 1909, 1909, 1909, 1909, 1909),
  pr.dem = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

pol <- data.frame(
  country = c("hungary", "slovenia", "taiwan", "austria", "australia", "new zealand",
              "usa", "canada", "iran", "turkey", "pakistan"),
  scode = c("hun", "slv", "taw", "aus", "aul", "new", "usa", "can", "irn", "tur", "pak"),
  year = c(1994, 1994, 1994, 1909, 1909, 1909, 1909, 1909, 1977, 1977, 1977),
  # missing values must be written as NA (upper case)
  polity = c(7, NA, 9, 8, 8, 10, 10, NA, -10, 9, NA)
)
(Hence, treaties 1 and 3 should show NA for "polity" in the end.)
I joined them together and reduced the multiple rows of the same treaty to one, while taking the difference between the maximum and minimum of the polity values:
library(dplyr)

left_join(treaties, pol, c("scode", "year")) %>%
  group_by(treaty) %>%
  summarise(politydiff = max(polity) - min(polity))
I would like to know whether it is possible to let the treatment of NA values depend on their number as opposed to the number of available values in a grouped data frame.
I tried to include an ifelse() function:
diff <- left_join(treaties, polity, c("scode","year")) %>% group_by(diff, file) summarise(diff,
  polity.diff = max(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE)) -
                min(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE)))
But it returns this error:
Error: index out of bounds
Can I use the ifelse() function after "na.rm = "? Did I make a mistake? I would appreciate any help.
This should do what you want:
left_join(treaties, pol, c("scode", "year")) %>%
  group_by(treaty) %>%
  summarise(polity.diff = max(polity, na.rm = sum(is.na(polity)) >= 0.2*n()) -
                          min(polity, na.rm = sum(is.na(polity)) >= 0.2*n()))
#Source: local data frame [4 x 2]
#
#                 treaty polity.diff
#1 hungary slovenia 1994           0
#2   taiwan hungary 1994           2
#3      treaty of 5 1909           2
#4  treaty of izmir 1977          19
First of all, I use is.na() instead of length(xx = NA). Secondly, I use dplyr's special function n() instead of length(polity). Thirdly, I removed ifelse and just left the logical test there, since it already returns TRUE or FALSE according to your specification. Note that in 3 of the cases the NAs are removed, and in 1 case (taiwan hungary 1994) they are not removed because there are no NAs at all in that group. That's why you end up without NAs in the polity.diff column.
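If the goal is instead the one stated in the question (a treaty should get NA unless at least 4/5 of its polity values are available, so treaties 1 and 3 end up as NA), the logical test can decide between computing the difference and returning NA, rather than being passed to na.rm. The following is only a minimal sketch under that reading of the question. The data frame joined is a hypothetical stand-in for the result of the left_join() above, reduced to the two columns that matter, and the integer comparison 5 * available >= 4 * n() is my own choice to avoid comparing against the floating-point value 0.8.

```r
library(dplyr)

# Hypothetical stand-in for the joined data: one row per contracting state,
# with the polity values taken from the example in the question.
joined <- data.frame(
  treaty = c(rep("hungary slovenia 1994", 2), rep("taiwan hungary 1994", 2),
             rep("treaty of izmir 1977", 3), rep("treaty of 5 1909", 5)),
  polity = c(7, NA, 9, 7, -10, 9, NA, 8, 8, 10, 10, NA)
)

strict <- joined %>%
  group_by(treaty) %>%
  summarise(
    # Compute the difference only if at least 4/5 of the values are present;
    # otherwise the treaty gets NA.
    polity.diff = if (5 * sum(!is.na(polity)) >= 4 * n()) {
      max(polity, na.rm = TRUE) - min(polity, na.rm = TRUE)
    } else {
      NA_real_
    }
  )

print(strict)
```

With the example data, only "taiwan hungary 1994" (no NAs) and "treaty of 5 1909" (4 of 5 values available) keep a numeric polity.diff; the other two treaties become NA.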
You'll notice that the same logical test is used for both max and min. This might be solved more efficiently by first creating a new variable, e.g. nacheck, in the data and then referring to that variable in the na.rm = definition. However, you'd then need to remove that variable afterwards (e.g. using select(-nacheck)).
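The nacheck idea could look roughly like the sketch below. It is one possible arrangement and makes two assumptions: joined is a hypothetical stand-in for the joined data, and nacheck is created inside summarise() itself rather than in a separate step, which works because later expressions in a summarise() call can use summaries created earlier in the same call.

```r
library(dplyr)

# Hypothetical stand-in for the joined data: one row per contracting state,
# with the polity values taken from the example in the question.
joined <- data.frame(
  treaty = c(rep("hungary slovenia 1994", 2), rep("taiwan hungary 1994", 2),
             rep("treaty of izmir 1977", 3), rep("treaty of 5 1909", 5)),
  polity = c(7, NA, 9, 7, -10, 9, NA, 8, 8, 10, 10, NA)
)

result <- joined %>%
  group_by(treaty) %>%
  summarise(
    # Evaluate the logical test once per group ...
    nacheck = sum(is.na(polity)) >= 0.2 * n(),
    # ... and reuse it for both max() and min().
    polity.diff = max(polity, na.rm = nacheck) - min(polity, na.rm = nacheck)
  ) %>%
  select(-nacheck)  # drop the helper column again

print(result)
```

This keeps the same test as the answer above, so it should give the same four polity.diff values (0, 2, 2, 19).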