Databricks | Reading Data - CSV


#databricks #datascience #data #database #dataanalytics #csv

Reading Data - CSV | Free Codes
1. %fs ls /mnt/training/wikipedia/pageviews/
2. %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv
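
Tip: the %fs magic is notebook shorthand for dbutils.fs, so the same checks can be done from Python. A minimal sketch (Databricks notebooks only, where dbutils is predefined):

# List the directory and peek at the file programmatically
# (%fs ls / %fs head are shorthand for these dbutils.fs calls)
for f in dbutils.fs.ls("/mnt/training/wikipedia/pageviews/"):
    print(f.name, f.size)

# Print the first 256 bytes to confirm the delimiter and header row
print(dbutils.fs.head("/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv", 256))
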
3. # A reference to our tab-separated file
csvFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

tempDF = (spark.read      # The DataFrameReader
  .option("sep", "\t")    # Use the tab delimiter (the default is comma)
  .csv(csvFile)           # Creates a DataFrame from the CSV file
)
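
With no options beyond the delimiter, a quick look at the schema shows why the options below matter: Spark generates the column names and treats every column as a string. The output should look roughly like this:

tempDF.printSchema()
# root
#  |-- _c0: string (nullable = true)
#  |-- _c1: string (nullable = true)
#  |-- _c2: string (nullable = true)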

Use the File's Header
4. (spark.read                # The DataFrameReader
  .option("sep", "\t")        # Use the tab delimiter (the default is comma)
  .option("header", "true")   # Use the first line of each file as the header
  .csv(csvFile)               # Creates a DataFrame from the CSV file
  .printSchema()
)
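
With the header option the columns pick up their real names, but every type is still string; fixing that is what inferSchema (next) and the user-defined schema (below) are for. Expect output along the lines of:

# root
#  |-- timestamp: string (nullable = true)
#  |-- site: string (nullable = true)
#  |-- requests: string (nullable = true)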


Infer the Schema
5. (spark.read                    # The DataFrameReader
  .option("header", "true")       # Use the first line of each file as the header
  .option("sep", "\t")            # Use the tab delimiter (the default is comma)
  .option("inferSchema", "true")  # Automatically infer the data types
  .csv(csvFile)                   # Creates a DataFrame from the CSV file
  .printSchema()
)
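
Note: inferring the schema forces Spark to read the data an extra time just to work out each column's type. That's convenient for exploration, but it roughly doubles the read cost on large files; supplying the schema yourself, as shown next, avoids that extra pass.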

Reading from CSV with a User-Defined Schema
6. # Required for StructType, StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

csvSchema = StructType([
  StructField("timestamp", StringType(), False),  # False => not nullable
  StructField("site", StringType(), False),
  StructField("requests", IntegerType(), False)
])



7. (spark.read                # The DataFrameReader
  .option("header", "true")   # Ignore line #1 - it's a header
  .option("sep", "\t")        # Use the tab delimiter (the default is comma)
  .schema(csvSchema)          # Use the specified schema
  .csv(csvFile)               # Creates a DataFrame from the CSV file
  .printSchema()
)
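
Because the schema is supplied up front, Spark never has to scan the data to work out column names or types, so defining this DataFrame triggers no jobs; nothing is actually read until an action such as count() or show() runs. This makes an explicit schema the preferred approach for repeated reads of large files.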

8. csvDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(csvSchema)
  .csv(csvFile)
)
print("Partitions: " + str(csvDF.rdd.getNumPartitions()))
printRecordsPerPartition(csvDF)
print("-"*80)
