{"id":358,"date":"2018-10-05T00:27:25","date_gmt":"2018-10-05T00:27:25","guid":{"rendered":"http:\/\/wpthemetestdata.wordpress.com\/?p=358"},"modified":"2021-11-26T05:38:30","modified_gmt":"2021-11-26T05:38:30","slug":"handling-nulls-in-apache-spark","status":"publish","type":"post","link":"https:\/\/www.npntraining.com\/blog\/handling-nulls-in-apache-spark\/","title":{"rendered":"Handling Nulls in Apache Spark"},"content":{"rendered":"<p>In this blog post, I will explain how to handle nulls in Apache Spark.<\/p>\n<h1>Introduction<\/h1>\n<p>It is a best practice to always use nulls to represent missing or empty data in a DataFrame. The main reason is that Spark can optimize operations on null values more than it can if you use empty strings or other placeholder values.<\/p>\n<p>The primary way of interacting with null values in a DataFrame is to use the .na subpackage on the DataFrame.<\/p>\n<p>All blank values and empty strings are read into a DataFrame as null by the Spark CSV library.<\/p>\n<p>Let\u2019s look at the following file as an example of how Spark treats blank and empty CSV fields as null values.<\/p>\n<p>We will be using the Scala language for the code.<\/p>\n<pre>name,company,salary\nAnand,Infosys,1500000\nKiran,TCS,2000000\nPawan,Cerner,2100000\n\"\",IBM,700000\nGirish,,7979<\/pre>\n<pre>val employeeDF = spark.read.option(\"header\",\"true\").option(\"inferSchema\",\"true\").csv(\"d:\/data-set\/employee.dat\")\nemployeeDF.show()<\/pre>\n<pre>+-------+-------+-------+\n|   name|company| salary|\n+-------+-------+-------+\n|  Anand|Infosys|1500000|\n|  Kiran|    TCS|2000000|\n|  Pawan| Cerner|2100000|\n|   null|    IBM| 700000|\n| Girish|   null|   7979|\n|Kishore|    TCS|   null|\n|   null|   null|   null|\n+-------+-------+-------+<\/pre>\n<h2>drop<\/h2>\n<p>The simplest function is drop, which removes rows that contain nulls. 
The default is to drop any row in which any value is null; passing \"all\" instead drops a row only if all of its values are null.<\/p>\n<pre>val result = employeeDF.na.drop()\n\/\/ or, equivalently:\n\/\/ val result = employeeDF.na.drop(\"any\")<\/pre>\n<pre>+-----+-------+-------+\n| name|company| salary|\n+-----+-------+-------+\n|Anand|Infosys|1500000|\n|Kiran|    TCS|2000000|\n|Pawan| Cerner|2100000|\n+-----+-------+-------+<\/pre>\n<h2>fill<\/h2>\n<p>Using the fill() function, we can fill one or more columns with a given value. This is done by passing the replacement value together with the set of columns it applies to; fill() also accepts a map from column name to replacement value.<\/p>\n<pre>val result = employeeDF.na.fill(\"NULL IN SOURCE\",Seq(\"name\",\"company\"))\nresult.show()<\/pre>\n<pre>+--------------+--------------+-------+\n|          name|       company| salary|\n+--------------+--------------+-------+\n|         Anand|       Infosys|1500000|\n|         Kiran|           TCS|2000000|\n|         Pawan|        Cerner|2100000|\n|NULL IN SOURCE|           IBM| 700000|\n|        Girish|NULL IN SOURCE|   7979|\n|       Kishore|           TCS|   null|\n|NULL IN SOURCE|NULL IN SOURCE|   null|\n+--------------+--------------+-------+<\/pre>\n<h2>replace<\/h2>\n<p>Beyond filling nulls, replace() is a more flexible option: it substitutes specific values in a column with others, so it works with more than just null values.<\/p>\n<pre>val result = employeeDF.na.replace(\"company\",Map(\"TCS\" -&gt; \"Tata Consultancy Service\"))\nresult.show()<\/pre>\n<pre>+-------+--------------------+-------+\n|   name|             company| salary|\n+-------+--------------------+-------+\n|  Anand|             Infosys|1500000|\n|  Kiran|Tata Consultancy ...|2000000|\n|  Pawan|              Cerner|2100000|\n|   null|                 IBM| 700000|\n| Girish|                null|   7979|\n|Kishore|Tata Consultancy ...|   null|\n|   null|                null|   null|\n+-------+--------------------+-------+<\/pre>\n<h2>nullable columns<\/h2>\n<pre>package com.npntraining.spark_sql\nimport org.apache.spark.sql.SparkSession\nimport org.apache.log4j.Level\nimport 
org.apache.log4j.Logger\nimport org.apache.spark.sql.types.{LongType, StringType, StructType}\nobject DealingNullValues_01 {\n  def main(args: Array[String]): Unit = {\n    \/\/ Reduce Spark's log output to errors only\n    Logger.getLogger(\"org\").setLevel(Level.ERROR)\n    val sparkSession = SparkSession.builder().appName(\"Dealing Null Values\").master(\"local\").getOrCreate()\n    \/\/ The third argument to add() is the nullable flag for that column\n    val customSchema = new StructType().\n      add(\"name\", StringType, true).\n      add(\"company_name\", StringType, false).\n      add(\"salary\", LongType, true)\n    val employeeDF = sparkSession.read.option(\"header\",\"true\").schema(customSchema).csv(\"d:\/data-set\/employee.dat\")\n    employeeDF.show()\n  }\n}<\/pre>\n<p>Also check out our other post on the difference between the cache and persist methods in Apache Spark.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, I will explain how to handle nulls in Apache Spark. 
Introduction It is a best practice to always use nulls to represent missing or empty&hellip;<\/p>\n","protected":false},"author":1,"featured_media":5289,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[60,67],"tags":[],"class_list":["post-358","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-java-programming","category-test-automation"],"_links":{"self":[{"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/posts\/358","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/comments?post=358"}],"version-history":[{"count":13,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/posts\/358\/revisions"}],"predecessor-version":[{"id":6927,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/posts\/358\/revisions\/6927"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/media\/5289"}],"wp:attachment":[{"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/media?parent=358"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/categories?post=358"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.npntraining.com\/blog\/wp-json\/wp\/v2\/tags?post=358"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}