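The Databricks File System (DBFS) is a distributed file system that is mounted into an Azure Databricks workspace and is available on Azure Databricks clusters. A storage object is a file with a specific format, and different formats have different read and write mechanisms.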
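DBFS is an abstraction on top of scalable object storage and offers the following benefits:
The default storage location in DBFS is called the DBFS root. Several types of data are stored in the following DBFS root locations:
In a new workspace, the DBFS root has the following default folders:
The DBFS root also contains data that is not visible and cannot be accessed directly, including mount point metadata, credentials, and certain types of logs.
DBFS has two further special root locations: FileStore and Azure Databricks Dataset.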
Write a file to /tmp in DBFS:
#df.write.format("parquet").save("/tmp/foo",mode="overwrite")
df.write.parquet("/tmp/foo",mode="overwrite")
Then read the file's contents back through the Spark API:
#df = spark.read.format("parquet").load("/tmp/foo")
df = spark.read.parquet("/tmp/foo")
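The write and read calls above can be combined into one helper. This is only a minimal sketch and not part of the original article: `save_and_load` is a hypothetical name, `spark` is assumed to be the SparkSession that a Databricks notebook provides, and `df` is assumed to be an existing Spark DataFrame.

```python
def save_and_load(spark, df, path, fmt="parquet"):
    """Overwrite `path` with `df` in the given format, then read it back.

    Assumes `spark` is an active SparkSession and `df` a Spark DataFrame.
    """
    # For fmt == "parquet" this does the same job as
    # df.write.parquet(path, mode="overwrite").
    df.write.format(fmt).mode("overwrite").save(path)
    # Read with the same format that was used for writing.
    return spark.read.format(fmt).load(path)


# Usage in a notebook (hypothetical path):
# df2 = save_and_load(spark, df, "/tmp/foo")
```

Passing the format as a parameter keeps the writer and the reader in agreement, which matters because each format has its own read and write mechanism.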
For files in Delta or Parquet format, you can access DBFS from SQL through delta.`file_path` or parquet.`file_path`:
select * from delta.`/tmp/delta_file`
select * from parquet.`/tmp/parquet_file`
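Note that the file's format must match the format prefix used in the query, otherwise an error is raised. The file path is enclosed in backticks (``), not in single quotes.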
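The backtick rule is easy to get wrong when a query is assembled in Python. The following sketch builds the query string; `path_query` and `SUPPORTED_FORMATS` are hypothetical names introduced for illustration, and the set is deliberately limited to the two formats this article mentions.

```python
# Formats discussed in this article; an assumption of this sketch.
SUPPORTED_FORMATS = {"delta", "parquet"}


def path_query(fmt, path):
    """Return a SELECT statement that reads the file at `path` directly."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if "`" in path:
        # A backtick inside the path would end the quoted identifier early.
        raise ValueError("path must not contain a backtick")
    # The path is wrapped in backticks, not single quotes.
    return f"select * from {fmt}.`{path}`"


# Usage in a notebook, assuming `spark` is the active SparkSession:
# spark.sql(path_query("parquet", "/tmp/parquet_file"))
```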
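dbutils.fs provides file-system-like commands for accessing files in DBFS. This section gives several examples of how to use dbutils.fs commands to write and read files in DBFS.
You can write and read files in the DBFS root as if it were a local file system.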
# create folder
dbutils.fs.mkdirs("/foobar/")
# write data
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
# view head
dbutils.fs.head("/foobar/baz.txt")
# remove file
dbutils.fs.rm("/foobar/baz.txt")
# copy file
dbutils.fs.cp("/foobar/a.txt","/foobar/b.txt")
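The individual commands above can also be chained into one round trip in which every file that is copied or read has actually been written first. This is a sketch under stated assumptions: `write_copy_read` and the file names are hypothetical, and `dbutils` is assumed to be the utility object that a Databricks notebook provides. Only the mkdirs, put, cp, head and rm commands described in this article are used.

```python
def write_copy_read(dbutils, directory, text):
    """Write `text` to a file, copy it, read the copy, then clean up."""
    base = directory.rstrip("/")
    src = base + "/baz.txt"        # hypothetical file name
    dst = base + "/baz_copy.txt"   # hypothetical file name

    dbutils.fs.mkdirs(directory)
    # Third argument is `overwrite`; True replaces an existing file.
    dbutils.fs.put(src, text, True)
    # Copy only after the source file exists.
    dbutils.fs.cp(src, dst)
    content = dbutils.fs.head(dst)
    # Second argument is `recurse`; True removes the directory and its files.
    dbutils.fs.rm(directory, True)
    return content


# Usage in a notebook:
# write_copy_read(dbutils, "/foobar/", "Hello, World!")
```

Because head returns at most the first maxBytes bytes of a file, this helper suits small text files only.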
dbutils.fs consists mainly of two modules: fsutils, for operating on files, and mount, for mounting storage.
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory
mount
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
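The mounts and unmount commands listed above are often used together to avoid unmounting a path that is not mounted. The following is a hedged sketch rather than a definitive implementation: `is_mounted` and `unmount_if_mounted` are hypothetical helper names, `dbutils` is assumed to be the notebook-provided utility object, and each entry returned by mounts() is assumed to expose a `mountPoint` attribute.

```python
def is_mounted(dbutils, mount_point):
    """Return True if `mount_point` appears in the DBFS mount list."""
    # Assumption: every mount entry has a `mountPoint` attribute.
    return any(m.mountPoint == mount_point for m in dbutils.fs.mounts())


def unmount_if_mounted(dbutils, mount_point):
    """Unmount `mount_point` only when it is currently mounted."""
    if is_mounted(dbutils, mount_point):
        dbutils.fs.unmount(mount_point)
        return True
    return False


# Usage in a notebook (hypothetical mount point):
# unmount_if_mounted(dbutils, "/mnt/example")
```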