dplyr - Let treatment of NA-values depend on their number relative to the number of available values among groups in a data frame, in R -

- September 15, 2012

I have a dataset containing contracts between states. The number of contracting states varies from 2 to 94. In the data frame, each state is attributed a value called "polity", although for some states the value is missing.

With the help of this forum, I merged two data frames and summarized the contracts by taking the difference of the min() and max() "polity" values of the contracting states.

Now, I don't want to either ignore or exclude the NA values. I want to treat the polity value of a contract as NA if the number of NA values among the contracting states exceeds a certain fraction of the number of contracting states (for these data frames, it would be convenient if 4/5 of the polity values had to be available in order for the contract to be taken into the analysis).
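For a single group of values, the rule described above could be sketched roughly as follows. The helper name `spread_or_na` and its `min_available` argument are hypothetical and do not come from the original post; the sketch only assumes the 4/5 threshold mentioned in the question.

```r
# Hypothetical helper: difference between max and min of x, or NA when
# fewer than `min_available` of the values are present.
spread_or_na <- function(x, min_available = 4 / 5) {
  n_present <- sum(!is.na(x))
  # Too many missing values (or nothing at all): report NA for the group.
  if (length(x) == 0 || n_present < min_available * length(x)) {
    return(NA_real_)
  }
  # Enough values are available, so the remaining NAs can be dropped.
  max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
}

spread_or_na(c(8, 8, 10, 10, NA))  # 4 of 5 available: returns 2
spread_or_na(c(7, NA))             # 1 of 2 available: returns NA
```

Written this way, the threshold lives in one place, and the same function could be called per group once the data are grouped.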

These are two simplified versions of the data sets:

    treaties <- data.frame(treaty.id=c(1,1,2,2,3,3,3,4,4,4,4,4),
                           treaty=c("hungary slovenia 1994", "hungary slovenia 1994",
                                    "taiwan hungary 1994", "taiwan hungary 1994",
                                    "treaty of izmir 1977", "treaty of izmir 1977",
                                    "treaty of izmir 1977", "treaty of 5 1909",
                                    "treaty of 5 1909", "treaty of 5 1909",
                                    "treaty of 5 1909", "treaty of 5 1909"),
                           scode=c("hun","slv","taw","hun", "irn", "tur", "pak",
                                   "aus","aul","new","usa","can"),
                           year=c(1994, 1994, 1994, 1994, 1977, 1977, 1977, 1909,
                                  1909, 1909, 1909, 1909),
                           pr.dem=c(1,1,0,0,0,0,0,1,1,1,1,1))

    pol <- data.frame(country=c("hungary", "slovenia", "taiwan", "austria",
                                "australia", "new zealand", "usa", "canada",
                                "iran", "turkey", "pakistan"),
                      scode=c("hun", "slv", "taw", "aus", "aul", "new", "usa",
                              "can", "irn", "tur", "pak"),
                      year=c(1994, 1994, 1994, 1909, 1909, 1909, 1909, 1909,
                             1977, 1977, 1977),
                      polity=c(7, NA, 9, 8, 8, 10, 10, NA, -10, 9, NA))

(Hence, treaties 1 and 3 should show NA for "polity" in the end.)

I joined them together and reduced the multiple rows for the same treaty to one, while taking the difference of the maximum and minimum of the polity values:

    require(dplyr)
    left_join(treaties, pol, c("scode","year")) %>%
      group_by(treaty) %>%
      summarise(politydiff=max(polity)-min(polity))

I would like to know whether it is possible to let the treatment of NA values depend on their number as opposed to the number of available values in a grouped data frame.

I tried to include an ifelse() function:

    diff <- left_join(treaties, polity, c("scode","year")) %>%
      group_by(diff, file)
      summarise(diff, polity.diff=max(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE)) -
                min(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE)))

But it returns an error:

    Error: index out of bounds

Can I use the ifelse() function after "na.rm = "? Did I make a mistake? I would appreciate any help.

This should do what you want:

    left_join(treaties, pol, c("scode","year")) %>%
      group_by(treaty) %>%
      summarise(polity.diff = max(polity, na.rm = sum(is.na(polity)) >= 0.2*n()) -
                              min(polity, na.rm = sum(is.na(polity)) >= 0.2*n()))
    #Source: local data frame [4 x 2]
    #
    #                 treaty polity.diff
    #1 hungary slovenia 1994           0
    #2   taiwan hungary 1994           2
    #3      treaty of 5 1909           2
    #4  treaty of izmir 1977          19

First of all, count the missing values with sum(is.na(polity)) instead of length(polity = NA). Secondly, use dplyr's special function n() instead of length(polity). Thirdly, I removed the ifelse() and just left the logical test there, since it already returns TRUE or FALSE according to your specification. Note that in 3 of the cases the NAs were removed, and in 1 case (taiwan hungary 1994) they were not removed because there are no NAs at all in that group. That's why you end up without NAs in the polity.diff column.
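The test used here mirrors the one in the attempted code: NAs are dropped whenever at least 20% of a group is missing, so no result can come out as NA. If the rule stated in the question is meant literally (NA unless at least 4/5 of the values are available), the comparison would need to point the other way. A minimal sketch of that variant, assuming a hand-built stand-in `joined` for the joined data and an integer comparison chosen here to avoid rounding at the 1/5 boundary:

```r
library(dplyr)

# Stand-in for the result of the join: polity values per treaty.
joined <- data.frame(
  treaty = rep(c("hungary slovenia 1994", "taiwan hungary 1994",
                 "treaty of izmir 1977", "treaty of 5 1909"),
               times = c(2, 2, 3, 5)),
  polity = c(7, NA, 9, 7, -10, 9, NA, 8, 8, 10, 10, NA)
)

result <- joined %>%
  group_by(treaty) %>%
  # Drop NAs only when at most 1/5 of the group's values are missing;
  # otherwise max() and min() return NA for the group.
  summarise(polity.diff =
              max(polity, na.rm = 5 * sum(is.na(polity)) <= n()) -
              min(polity, na.rm = 5 * sum(is.na(polity)) <= n()))

result
```

Under this variant the two treaties with too many missing values (hungary slovenia 1994 and treaty of izmir 1977) get NA, while the other two keep a difference of 2.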

You'll notice that the same logical test is used for both max and min. This might be solved more efficiently by first creating a new variable, e.g. nacheck, in the data and then referring to that variable in the na.rm = definition. However, you'd need to remove that variable afterwards (e.g. using select(-nacheck)).
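One way the nacheck idea might look is sketched below. It keeps the same test as the answer's code; the data frame `joined` is a hand-built stand-in for the joined data, and using `nacheck[1]` to pick one value per group is a choice made for this sketch.

```r
library(dplyr)

# Stand-in for the result of the join: polity values per treaty.
joined <- data.frame(
  treaty = rep(c("hungary slovenia 1994", "taiwan hungary 1994",
                 "treaty of izmir 1977", "treaty of 5 1909"),
               times = c(2, 2, 3, 5)),
  polity = c(7, NA, 9, 7, -10, 9, NA, 8, 8, 10, 10, NA)
)

result <- joined %>%
  group_by(treaty) %>%
  # Evaluate the test once per group and store it alongside the data.
  mutate(nacheck = sum(is.na(polity)) >= 0.2 * n()) %>%
  # nacheck is constant within a group, so its first element is enough.
  summarise(polity.diff = max(polity, na.rm = nacheck[1]) -
                          min(polity, na.rm = nacheck[1]))

result
```

In this particular formulation summarise() keeps only the grouping column and polity.diff, so nacheck does not appear in the result; select(-nacheck) would only be needed if the column were carried along, for instance by adding it to group_by().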
