pandas - Efficient way to add to a series without duplicates
I need to add to a DataFrame (or a Series, if that's more efficient) quite often, while making sure the additions don't create duplicates. As the DataFrame grows, it seems inefficient to keep concatenating and calling drop_duplicates, since the whole dataset needs to be checked for duplicates on each addition.
The data has 2 columns, and I'm guessing that turning one of them into an index might speed things up (or both columns into a hierarchical index). Does pandas have a way of disallowing duplicate indexes?
Here is a sample of the problem:

    print(accumulating_result)
       c1  c2
    0   a  x1
    1   b  x2
    2   b  x3
    3   c  x4

    print(new)
       c1  c2
    0   b  x3
    1   c  x4
    2   c  x5

I want to perform the addition of new to accumulating_result and get:

    print(accumulating_result)
       c1  c2
    0   a  x1
    1   b  x2
    2   b  x3
    3   c  x4
    4   c  x5
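For reference, the concat-then-drop_duplicates approach described above could be sketched roughly as follows. The frames are rebuilt from the sample data, and the first c1 value (`a`) is an assumption, since it is not legible in the sample.

```python
import pandas as pd

# Sample data from the question; the first c1 value ("a") is an assumption.
accumulating_result = pd.DataFrame(
    {"c1": ["a", "b", "b", "c"], "c2": ["x1", "x2", "x3", "x4"]}
)
new = pd.DataFrame({"c1": ["b", "c", "c"], "c2": ["x3", "x4", "x5"]})

# Append the new rows, then drop rows that repeat an earlier (c1, c2) pair.
# Every call rescans the whole accumulated frame, which is the cost the
# question is worried about.
accumulating_result = pd.concat(
    [accumulating_result, new], ignore_index=True
).drop_duplicates(ignore_index=True)

print(accumulating_result)
```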
For what it's worth, every entry in column c2 is unique.
Any ideas?
You can use combine_first():
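    import io
    import pandas as pd

    data1 = """   c1  c2
    0   a  x1
    1   b  x2
    2   b  x3
    3   c  x4"""

    data2 = """   c1  c2
    0   x  x3
    1   y  x4
    2   z  x5"""

    # io.StringIO lets read_csv parse the text; sep=r"\s+" splits on whitespace
    df1 = pd.read_csv(io.StringIO(data1), sep=r"\s+")
    df2 = pd.read_csv(io.StringIO(data2), sep=r"\s+")

    # Index both frames by c2, the column whose values are unique
    df1 = df1.set_index("c2")
    df2 = df2.set_index("c2")

    # Keep df1's rows and add the rows of df2 whose index is not in df1
    print(df1.combine_first(df2))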
The output:

       c1
    c2
    x1  a
    x2  b
    x3  b
    x4  c
    x5  z
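On the question of disallowing duplicate indexes: one option, sketched here as an alternative to combine_first rather than as part of the answer above, is to filter the new rows by index and concatenate with `verify_integrity=True`, which makes `pd.concat` raise a `ValueError` if any label would be duplicated. The name `fresh` is made up for illustration, and the `a` in the first row is an assumption.

```python
import pandas as pd

# Sample frames indexed by c2; the "a" in the first row is an assumption.
df1 = pd.DataFrame(
    {"c1": ["a", "b", "b", "c"]},
    index=pd.Index(["x1", "x2", "x3", "x4"], name="c2"),
)
df2 = pd.DataFrame(
    {"c1": ["x", "y", "z"]},
    index=pd.Index(["x3", "x4", "x5"], name="c2"),
)

# Keep only the rows of df2 whose label is not already present in df1.
fresh = df2.loc[~df2.index.isin(df1.index)]

# verify_integrity=True makes concat raise ValueError if any index label
# would end up duplicated, so duplicates cannot slip in silently.
result = pd.concat([df1, fresh], verify_integrity=True)

print(result)
```

Only the new batch is checked against the existing index here, but the concatenation itself still builds a new frame each time.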
But this copies the data every time. Using HDF5 or a database may be a better choice.
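If the copying is the main concern and the data fits in memory, another possibility is to accumulate in a plain dict keyed by c2 and build the DataFrame only when it is needed. This is not the HDF5 or database route, just a minimal in-memory sketch. It relies on c2 being unique, as the question states; the names `seen` and `new_rows` are made up, and the `a` value is an assumption.

```python
import pandas as pd

# Hypothetical accumulator: plain dict keyed by c2, holding the c1 value.
seen = {"x1": "a", "x2": "b", "x3": "b", "x4": "c"}  # "a" is an assumption

# A new batch as (c1, c2) pairs, matching the question's `new` frame.
new_rows = [("b", "x3"), ("c", "x4"), ("c", "x5")]

for c1, c2 in new_rows:
    # setdefault inserts only when c2 is not already a key, so existing
    # entries are kept and each insert is a single hash lookup.
    seen.setdefault(c2, c1)

# Build the DataFrame once, when it is actually needed.
accumulating_result = pd.DataFrame(
    {"c1": list(seen.values()), "c2": list(seen.keys())}
)
print(accumulating_result)
```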