
SPARK DATAFRAME MANIPULATIONS | SPARK INTERVIEW QUESTIONS


In Apache Spark, DataFrames are a key abstraction for working with structured and semi-structured data. A DataFrame is similar to a table in relational databases or a data frame in pandas (Python) or R, but it is distributed across a cluster and optimized by Spark’s Catalyst optimizer.

Here's a breakdown of common Spark DataFrame manipulations:

1. Creating a DataFrame
From RDD (Resilient Distributed Dataset):
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])
df = rdd.toDF(["id", "name"])
df.show()
From a CSV file:
python
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
From a JSON file:
python
df = spark.read.json("data.json")
df.show()
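You can also build a DataFrame directly from local Python data with spark.createDataFrame, optionally supplying an explicit schema instead of relying on inference. The sketch below is illustrative; the column names and types are placeholders.
python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("Example").getOrCreate()

# Supplying a schema up front makes column types explicit instead of letting Spark infer them
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], schema=schema)
df.printSchema()
df.show()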

2. Repartitioning and Coalescing
Repartitioning changes the number of partitions of a DataFrame, which can help balance data across the cluster. repartition performs a full shuffle and can either increase or decrease the partition count, while coalesce merges existing partitions without a full shuffle and is normally used only to reduce them (a short sketch after the examples below shows the partition counts changing).

Repartition:
python
df = df.repartition(5) # Repartition the DataFrame into 5 partitions
Coalesce:
python
df = df.coalesce(1) # Coalesce into fewer partitions (often used before writing to disk)
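
To see the effect of these calls, you can inspect the partition count before and after. This is a minimal sketch that assumes the df loaded earlier; the counts in the comments are illustrative and depend on the input data and cluster.
python
# Number of partitions the DataFrame currently has
print(df.rdd.getNumPartitions())

df_repartitioned = df.repartition(5)            # full shuffle into 5 partitions
print(df_repartitioned.rdd.getNumPartitions())  # 5

df_coalesced = df_repartitioned.coalesce(1)     # merge down to 1 partition, no full shuffle
print(df_coalesced.rdd.getNumPartitions())      # 1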

3. Saving DataFrames
Writing to CSV:
python
df.write.csv("output/path", header=True) # Save as CSV with a header
Writing to Parquet:
python
df.write.parquet("output/path") # Save as Parquet
Writing to JSON:
python
df.write.json("output/path") # Save as JSON
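
By default a write fails if the output path already exists, so in practice a save mode is usually set, and columnar formats are often partitioned by a column. A hedged sketch follows; the paths and the partition column are placeholders.
python
# Overwrite any existing output instead of failing, and split the files by a column
df.write.mode("overwrite").partitionBy("name").parquet("output/partitioned_path")

# Other save modes are "append", "ignore", and the default "errorifexists"
df.write.mode("append").csv("output/append_path", header=True)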

4. Caching and Persisting DataFrames
Spark allows you to cache or persist DataFrames to optimize performance when the same DataFrame is used multiple times in a pipeline.

Cache DataFrame:
python
df.cache() # Cache the DataFrame (MEMORY_AND_DISK is the default storage level for DataFrames)
df.show()
Persist DataFrame:
python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK) # Keep the DataFrame in memory, spilling to disk when it does not fit
df.show()
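
Note that cache() and persist() are lazy: nothing is stored until an action materializes the DataFrame, and the storage can be released explicitly with unpersist(). A minimal sketch, assuming the df from above:
python
df.cache()
df.count()     # the first action populates the cache
df.show()      # later actions read from the cached data

df.unpersist() # release the cached data when it is no longer needed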
These are some of the essential DataFrame operations used to manipulate and process large-scale data efficiently in Spark.
