Append mode doesn't replace entire row if key collision happens #351

Open
tsekityam opened this issue Aug 4, 2022 · 1 comment

tsekityam commented Aug 4, 2022

What I did

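import redis

# Assumes an existing SparkSession `spark` and `redis_host` / `redis_port` set elsewhere.
# First write: overwrite mode, columns col_a and col_b
df = (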
  spark
  .sql("SELECT 'test' AS key, 123 AS col_a, 223 AS col_b")
)
(
  df
  .write
  .format("org.apache.spark.sql.redis")
  .option("host", redis_host)
  .option("port", redis_port)
  .option("ssl", "true")
  .option("table", "test_append_behavour")
  .option("key.column", "key")
  .mode("overwrite")
  .save()
)
r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
print(r.hgetall("test_append_behavour:test"))

# Output: {b'col_b': b'223', b'col_a': b'123'}

# Second write: same key, different columns (col_a and col_c), append mode
df2 = (
  spark
  .sql("SELECT 'test' AS key, 324 AS col_a, 423 AS col_c")
)
(
  df2
  .write
  .format("org.apache.spark.sql.redis")
  .option("host", redis_host)
  .option("port", redis_port)
  .option("ssl", "true")
  .option("table", "test_append_behavour")
  .option("key.column", "key")
  .mode("append")
  .save()
)
r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
print(r.hgetall("test_append_behavour:test"))

# {b'col_b': b'223', b'col_a': b'324', b'col_c': b'423'}

What I saw

test_append_behavour:test now has 3 fields

{b'col_b': b'223', b'col_a': b'324', b'col_c': b'423'}

What I expected

test_append_behavour:test should have only the 2 fields from df2

{b'col_a': b'324', b'col_c': b'423'}

The docs say: "Please note, when key collision happens and SaveMode.Append is set, the former row is replaced with a new one."

According to the docs, the row written from df should be replaced by the row from df2 in append mode, because they share the same key.

However, col_b from df is still there after the append, which means the entire row is not replaced. On a key collision, only the fields present in the new row are overwritten; fields that exist only in the old row are kept.
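For anyone who needs whole-row replacement in the meantime, one possible client-side workaround is to delete the hash and write the new fields in a single transaction. This is only a sketch and not part of spark-redis: `replace_row` is a hypothetical helper, and `FakeRedis` / `FakePipeline` are in-memory stand-ins so the example runs without a server. With redis-py, a real `redis.Redis` client could be passed instead, since it offers `pipeline()`, `delete()` and `hset(..., mapping=...)`. It works one key at a time, so it may not be practical for large DataFrames.

```python
class FakePipeline:
    """In-memory stand-in for the small part of a redis-py pipeline used below."""

    def __init__(self, store):
        self.store = store
        self.commands = []

    def delete(self, key):
        # Queue removal of the whole hash stored at `key`.
        self.commands.append(lambda: self.store.pop(key, None))
        return self

    def hset(self, key, mapping):
        # Queue a field-level write; like HSET, it only touches the given fields.
        self.commands.append(lambda: self.store.setdefault(key, {}).update(mapping))
        return self

    def execute(self):
        # Run the queued commands in order, then clear the queue.
        results = [command() for command in self.commands]
        self.commands = []
        return results


class FakeRedis:
    """In-memory stand-in for a Redis client (hypothetical, for illustration only)."""

    def __init__(self):
        self.store = {}

    def pipeline(self, transaction=True):
        return FakePipeline(self.store)

    def hgetall(self, key):
        return dict(self.store.get(key, {}))


def replace_row(client, key, row):
    """Replace the whole hash at `key` with `row` (DEL followed by HSET)."""
    pipe = client.pipeline(transaction=True)
    pipe.delete(key)             # drop every existing field first
    pipe.hset(key, mapping=row)  # then write only the new fields
    pipe.execute()


client = FakeRedis()
key = "test_append_behavour:test"
replace_row(client, key, {"col_a": "123", "col_b": "223"})
replace_row(client, key, {"col_a": "324", "col_c": "423"})
print(client.hgetall(key))  # {'col_a': '324', 'col_c': '423'}
```

The delete step is what removes col_b; without it the second write would merge into the existing hash, as seen above.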

fe2s (Contributor) commented Aug 5, 2022

Hi @tsekityam,
SaveMode.Append uses the hmset command internally, so it may not completely overwrite the row if the schema of the new DataFrame is different. You are right, the documentation is not accurate.
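The merge behaviour can be reproduced without Spark or a Redis server. The sketch below models a Redis hash as a plain Python dict; `hmset_like` is a hypothetical helper that mimics how HMSET sets the fields it is given and leaves every other field of the hash untouched. It is an illustration of the command's semantics, not the spark-redis implementation.

```python
def hmset_like(store, key, fields):
    """Mimic HMSET: set the given fields, leave other existing fields alone."""
    store.setdefault(key, {}).update(fields)
    return store[key]


store = {}

# First write: the key does not exist yet, so the hash gets col_a and col_b.
hmset_like(store, "test_append_behavour:test", {"col_a": "123", "col_b": "223"})

# Second write with a different schema: col_a is overwritten, col_c is added,
# and col_b survives because nothing in the new row refers to it.
merged = hmset_like(store, "test_append_behavour:test", {"col_a": "324", "col_c": "423"})

print(merged)  # {'col_a': '324', 'col_b': '223', 'col_c': '423'}
```

This matches the three fields reported in the issue: the stale col_b remains because the write is field-level rather than row-level.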
