Notes on gotchas to watch for when upgrading Spark versions
When writing a DataFrame containing null values to a CSV file, the written value differs between Spark 3.2 and earlier and Spark 3.3 and later. In Spark 3.2 and earlier, null fields are written to the CSV as [""], a quoted empty string, whereas in Spark 3.3 and later they are written as [], an empty (unquoted) field.
From Migration Guide: SQL, Datasets and DataFrame:
Since Spark 3.3, nulls are written as empty strings in CSV data source by default. In Spark 3.2 or earlier, nulls were written as quoted empty strings, "". To restore the previous behavior, set nullValue to "", or set the configuration spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv to true.
Migration Guide: SQL, Datasets and DataFrame - Spark 3.4.1 Documentation
To get the Spark 3.2-and-earlier behavior (nulls written as "") when using Spark 3.3 or later, set spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv to true.
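A side note on impact: many CSV parsers treat a quoted empty string and an empty field identically on read, so this change mainly matters for byte-level comparisons of output files or for consumers that distinguish null from empty string. A minimal sketch with Python's standard csv module (the sample row is made up, mirroring the test data used in the verification):

```python
import csv
import io

# The same logical record in the two output styles:
spark_32_style = 'zunda,"",M\n'  # Spark 3.2 and earlier: null -> ""
spark_33_style = 'zunda,,M\n'    # Spark 3.3 and later:   null -> empty field

row_32 = next(csv.reader(io.StringIO(spark_32_style)))
row_33 = next(csv.reader(io.StringIO(spark_33_style)))

# Python's csv module parses both to the same empty string,
# so the two styles are indistinguishable to this reader.
print(row_32)            # ['zunda', '', 'M']
print(row_33)            # ['zunda', '', 'M']
print(row_32 == row_33)  # True
```

Note that Spark's own CSV reader has its own nullValue and emptyValue options, so round-trip behavior also depends on the reader-side settings, not just on how the file was written.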
Verification
Spark 2.4.8
Using Python version 3.7.3 (default, Jun 29 2023 18:03:57)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark.version
'2.4.8'
>>> data = [("zunda",None,"M"),("kiritan","tohoku","F"),("goto",None,None)]
>>> columns = ["name","state","gender"]
>>> df = spark.createDataFrame(data,columns)
>>> df.show()
+-------+------+------+
|   name| state|gender|
+-------+------+------+
|  zunda|  null|     M|
|kiritan|tohoku|     F|
|   goto|  null|  null|
+-------+------+------+

>>> df = df.coalesce(1)
>>> df.write.mode('overwrite').option("header",True).csv("output-2.4.8")
>>> exit()

# In 2.4.8, we can confirm that null values are written as "":
root@39ee28c834d4:/opt/spark# cat output-2.4.8/part-00000-77add388-3015-4964-83d8-9c713ecdee5f-c000.csv
name,state,gender
zunda,"",M
kiritan,tohoku,F
goto,"",""
Spark 3.4.1
Using Python version 3.7.3 (default, Jun 29 2023 18:03:57)
Spark context Web UI available at http://484a503f38c4:4040
Spark context available as 'sc' (master = local[*], app id = local-1689155820584).
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark.version
'3.4.1'
>>> data = [("zunda",None,"M"),("kiritan","tohoku","F"),("goto",None,None)]
>>> columns = ["name","state","gender"]
>>> df = spark.createDataFrame(data,columns)
>>> df.show()
+-------+------+------+
|   name| state|gender|
+-------+------+------+
|  zunda|  null|     M|
|kiritan|tohoku|     F|
|   goto|  null|  null|
+-------+------+------+

>>> df = df.coalesce(1)
>>> df.write.mode('overwrite').option("header",True).csv("output-3.4.1")
>>> spark.conf.set("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv", "true")
>>> df.write.mode('overwrite').option("header",True).csv("output-3.4.1-option")
>>> exit()

# In 3.4.1, we can confirm that null values become []:
root@484a503f38c4:/opt/spark# cat output-3.4.1/part-00000-0223bb8c-f991-4f33-a234-40089f6f8af6-c000.csv
name,state,gender
zunda,,M
kiritan,tohoku,F
goto,,

# In output-3.4.1-option, written with nullValueWrittenAsQuotedEmptyStringCsv set to true, null values are "":
root@484a503f38c4:/opt/spark# cat output-3.4.1-option/part-00000-b4e02eaa-94d5-4007-b5a4-dc6ad707906d-c000.csv
name,state,gender
zunda,"",M
kiritan,tohoku,F
goto,"",""