Load_Data_Command
1. Import data: load CSV file data as a Spark temporary-table DataSource (no need to create the table in advance, which makes ad-hoc data analysis easy)
This command imports a CSV file into a temporary table. The command format is:
load data '<file path>' table <table name> [options(key=value)]
# In a SQL job, select both SQL statements and execute them together
load data '/user/datacompute/platformtool/resources/169/latest/dc_load_data.csv' table tdl_spark_test options(header=true, inferSchema=true, delimiter=',');
select * from tdl_spark_test where type='login';
Note: the file path is an HDFS path, and the CSV column names must not contain spaces.
options is optional; its keys and values are the same as the options accepted by spark.read:
- header: when set to true, the first line of the files is used to name the columns and is not included in the data. All column types are assumed to be string. Default value is false.
- delimiter: by default columns are delimited with a comma (,), but the delimiter can be set to any character
- quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored
- escape: by default the escape character is \, but can be set to any character. Escaped quote characters are ignored
- parserLib: by default it is "commons"; can be set to "univocity" to use that library for CSV parsing.
- mode: determines the parsing mode. By default it is PERMISSIVE. Possible values are:
- PERMISSIVE: tries to parse all lines: nulls are inserted for missing tokens and extra tokens are ignored.
- DROPMALFORMED: drops lines which have fewer or more tokens than expected or tokens which do not match the schema
- FAILFAST: aborts with a RuntimeException if it encounters any malformed line
- charset: defaults to 'UTF-8' but can be set to other valid charset names
- inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
- comment: skip lines beginning with this character. Default is "#". Disable comments by setting this to null.
- nullValue: specifies a string that indicates a null value; any fields matching this string will be set to null in the DataFrame
- dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
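Because these options map directly onto spark.read options, the same load can also be expressed in a Python job. The following is a minimal illustrative sketch, assuming a Spark 2.x SparkSession (where the spark-csv reader is built in); the mode and nullValue settings are arbitrary examples, not defaults of the load data command.
# Illustrative PySpark equivalent of the load data command above
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = (spark.read
    .option("header", True)           # first line supplies the column names
    .option("delimiter", ",")         # field separator, same as the command default
    .option("inferSchema", True)      # extra pass over the data to infer column types
    .option("mode", "DROPMALFORMED")  # drop rows that do not match the schema
    .option("nullValue", "NA")        # hypothetical: treat the literal string NA as null
    .csv("/user/datacompute/platformtool/resources/169/latest/dc_load_data.csv"))
df.createOrReplaceTempView("tdl_spark_test")  # query it like the temp table above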
Reference: https://github.com/databricks/spark-csv
2. Export table data as a CSV file
export syntax:
export table tablename [PARTITION (part_column="value"[, ...])] TO 'export_file_name.csv' [options(key=value)]
The table can be either a Hive table or a temporary table created via createOrReplaceTempView in the same job. After export, the file is saved under Resource Management -> My Resources. By default the exported CSV uses a comma as the field delimiter; to use a different one, set delimiter in options. The options keys are the same as those listed above.
Example:
export table raw_activity_flat PARTITION (year=2018, month=3, day=12) TO 'activity_20180312.csv' options(delimiter=';')
# This code can only be used in a Python job
sparkSession.sql("select email, idnumber, wifi from raw_activity_flat where year=2018 and month=3 and partnerCode='qunaer' ").createOrReplaceTempView("tdl_raw_activity_qunaer");
sparkSession.sql("export table tdl_raw_activity_qunaer TO 'activity_20180312.csv'")