Databricks, Part 5: The Databricks File System (DBFS)

Date: 2021-01-12


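The Databricks File System (DBFS) is a distributed file system that is mounted into an Azure Databricks workspace and is available on Azure Databricks clusters. A storage object is a file with a specific format, and different formats have different read and write mechanisms.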

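DBFS is an abstraction on top of scalable object storage and offers the following benefits:

Mount storage objects so that data can be accessed seamlessly without supplying credentials.
Interact with object storage using directory and file semantics instead of storage URLs.
Persist files to object storage, so data is not lost when a cluster is terminated.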

I. The DBFS root

The default storage location in DBFS is called the DBFS root. Several types of data are stored in the following DBFS root locations:

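/FileStore: imported data files, generated plots, and uploaded libraries.
/databricks-datasets: sample public datasets, useful for learning Spark or testing algorithms.
/databricks-results: files generated by downloading the full results of a query.
/tmp: a directory for temporary data.
/user: files belonging to individual users.
/mnt: (not visible by default) storage mounted into DBFS. Data written to a mount point path under /mnt is stored outside the DBFS root.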
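
The folder list above can be turned into a small lookup. The following is only an illustrative sketch: the names DBFS_ROOT_FOLDERS and describe_root_folder are hypothetical helpers, not part of any Databricks API, and the descriptions are simply restated from the list.

```python
# Hypothetical lookup table built from the folder list above (not a Databricks API).
DBFS_ROOT_FOLDERS = {
    "FileStore": "imported data files, generated plots, uploaded libraries",
    "databricks-datasets": "sample public datasets",
    "databricks-results": "files generated by downloading full query results",
    "tmp": "temporary data",
    "user": "per-user files",
    "mnt": "mount points; data is stored outside the DBFS root",
}


def describe_root_folder(path):
    """Return the description of the root folder that `path` falls under, or None."""
    # Drop an optional "dbfs:" scheme, then look at the first path component.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    parts = [p for p in path.split("/") if p]
    if not parts:
        return None
    return DBFS_ROOT_FOLDERS.get(parts[0])
```

For instance, describe_root_folder("/mnt/raw/x.csv") reports that the path sits under a mount point, which is a reminder that the data is not stored in the DBFS root itself.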

In a new workspace, the DBFS root has a set of default folders (the screenshot listing them is not reproduced here).

The DBFS root also contains data that is not visible and cannot be accessed directly, including mount point metadata, credentials, and certain types of logs.

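Two DBFS root locations deserve special mention: FileStore and the Azure Databricks datasets.

FileStore is a storage area for files. It can hold files in many formats, chiefly csv, parquet, orc, and delta.
The datasets are sample data that users can use to test algorithms and Spark.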

DBFS is usually accessed through the pyspark.sql module, dbutils, or SQL.

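II. Accessing DBFS with the pyspark.sql module

With the pyspark.sql module, a Parquet file is referenced by its path relative to the DBFS root, such as "/tmp/foo". The following example writes the Parquet file foo to the DBFS /tmp directory.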

# df is an existing Spark DataFrame.
# Equivalent long form: df.write.format("parquet").save("/tmp/foo", mode="overwrite")
df.write.parquet("/tmp/foo", mode="overwrite")

Then read the contents of the file back through the Spark API:

# Equivalent long form: df = spark.read.format("parquet").load("/tmp/foo")
df = spark.read.parquet("/tmp/foo")
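
The examples above pass a bare path such as "/tmp/foo". On Databricks, Spark resolves such a path against DBFS, so it names the same location as "dbfs:/tmp/foo". The following is a minimal sketch of making that explicit. The helper name to_dbfs_uri is hypothetical and is not part of pyspark or dbutils.

```python
def to_dbfs_uri(path):
    """Return `path` with an explicit dbfs: scheme (hypothetical helper).

    Assumes `path` is either already a dbfs:/ URI or an absolute
    path relative to the DBFS root, such as "/tmp/foo".
    """
    if path.startswith("dbfs:/"):
        return path  # already explicit, leave untouched
    if not path.startswith("/"):
        raise ValueError("expected an absolute path such as '/tmp/foo'")
    return "dbfs:" + path
```

In a notebook, spark.read.parquet(to_dbfs_uri("/tmp/foo")) should then read the same files as the example above.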

III. Accessing DBFS with SQL

For files in delta or parquet format, DBFS can be queried from SQL through delta.`file_path` or parquet.`file_path`:

select *
from delta.`/tmp/delta_file`

select *
from parquet.`/tmp/parquet_file`

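Note that the file format must match the format prefix used in the query, otherwise an error is raised. The file path is enclosed in backticks (``), not in single quotes.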
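
When such queries are built from Python, the backtick quoting is easy to get wrong. The following sketch builds the query string. The names FORMATS_COVERED_HERE and path_query are hypothetical, and the format check is deliberately limited to the two formats discussed in this section.

```python
# Only the two formats discussed in this section (an assumption of this sketch).
FORMATS_COVERED_HERE = {"delta", "parquet"}


def path_query(file_format, file_path):
    """Build a SELECT over a file path, quoting the path with backticks."""
    if file_format not in FORMATS_COVERED_HERE:
        raise ValueError("unsupported format: " + file_format)
    if "`" in file_path:
        # A backtick inside the path would break the quoting.
        raise ValueError("path must not contain a backtick")
    return "SELECT * FROM {}.`{}`".format(file_format, file_path)
```

In a notebook, the resulting string can be passed to spark.sql(...), for example spark.sql(path_query("delta", "/tmp/delta_file")).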

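IV. Accessing DBFS with dbutils

dbutils.fs provides file-system-like commands for accessing files in DBFS. This section gives a few examples of using dbutils.fs commands to write and read files in DBFS.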

1. Listing a DBFS directory

In a Python environment, dbutils.fs can list the files under a path:

display(dbutils.fs.ls("dbfs:/foobar"))
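
The entries returned by dbutils.fs.ls carry path, name, and size attributes, so a listing can also be processed in code rather than only displayed. The sketch below assumes that directory entries have a name ending in "/". The function name summarize_listing is hypothetical.

```python
def summarize_listing(entries):
    """Return (file_count, total_bytes) for entries from dbutils.fs.ls.

    Assumption of this sketch: directory entries have a name that
    ends with "/", so they are skipped when counting files.
    """
    files = [e for e in entries if not e.name.endswith("/")]
    return len(files), sum(e.size for e in files)
```

In a notebook this would be called as summarize_listing(dbutils.fs.ls("dbfs:/foobar")).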

2. Reading and writing data

Files in the DBFS root are written and read as if it were a local file system.

# create folder
dbutils.fs.mkdirs("/foobar/")

# write data
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")

# view head
dbutils.fs.head("/foobar/baz.txt")

# remove file
dbutils.fs.rm("/foobar/baz.txt")

# copy file (assumes /foobar/a.txt already exists)
dbutils.fs.cp("/foobar/a.txt","/foobar/b.txt")
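
The put and head calls above can be combined into one write-then-verify step. This is a minimal sketch: write_and_preview is a hypothetical helper, and its fs argument is expected to be dbutils.fs or any object with compatible put and head methods.

```python
def write_and_preview(fs, path, contents, max_bytes=65536):
    """Write `contents` to `path`, then return the beginning of the file.

    `fs` is expected to be dbutils.fs (or an object with compatible
    put/head methods). The third positional argument of put is the
    overwrite flag, so an existing file is replaced.
    """
    fs.put(path, contents, True)
    return fs.head(path, max_bytes)
```

In a notebook: write_and_preview(dbutils.fs, "/foobar/baz.txt", "Hello, World!"). Passing the file-system object in as an argument also allows the helper to be exercised outside Databricks with a stand-in object.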

3. Command help

dbutils.fs.help()

dbutils.fs consists mainly of two groups of commands: fsutils, for working with files, and mount, for mounting storage.

fsutils

cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

mount

mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
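
A common use of the mounts command listed above is checking whether a mount point already exists before mounting. The sketch below assumes each entry returned by dbutils.fs.mounts() exposes a mountPoint attribute. The function name is_mounted is hypothetical.

```python
def is_mounted(mounts, mount_point):
    """Return True if `mount_point` appears in the result of dbutils.fs.mounts().

    Assumption of this sketch: each entry has a `mountPoint` attribute.
    Trailing slashes are ignored when comparing.
    """
    target = mount_point.rstrip("/")
    return any(m.mountPoint.rstrip("/") == target for m in mounts)
```

In a notebook this would be called as is_mounted(dbutils.fs.mounts(), "/mnt/raw"), where "/mnt/raw" is only an example mount point, and dbutils.fs.mount would be invoked only when it returns False.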