The Road to Python, Part 1: PySpark

Downloading and installing PySpark (Spark SQL) with pip

pip install pyspark — this may time out during installation; add a longer timeout with the --default-timeout parameter:

pip --default-timeout=100 install -U pyspark

Below is some code I wrote. It runs without problems, but at the moment I don't know how to get at the values inside the RDD and the DataFrame (see the sketch after the code).

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, Row, DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appname = "myappname"
master = "local"
myconf = SparkConf().setAppName(appname).setMaster(master)
sc = SparkContext(conf=myconf)
hc = HiveContext(sc)

# Build a table: parallelize a list and convert each line to a Row.
# A table can be built with applySchema or inferSchema; inferSchema was
# deprecated after 1.5 and replaced by createDataFrame.
datas = ["1 b 28", "3 c 30", "2 d 29"]
source = sc.parallelize(datas)
splits = source.map(lambda line: line.split(" "))
rows = splits.map(lambda words: Row(id=int(words[0]), name=words[1], age=int(words[2])))

myrows = Row(id="a", name="zhangkun", age="28")
# print(myrows.__getitem__(0))
# print(myrows.__getitem__(1))
# print(myrows.__getitem__(2))

# Infer the schema, and register the schema as a table.
fields = []
fields.append(StructField("id", IntegerType(), True))
fields.append(StructField("name", StringType(), True))
fields.append(StructField("age", IntegerType(), True))
schema = StructType(fields)

# Pass the RDD of Rows here, not the single `myrows` Row (its string-typed
# fields would not match the integer schema). Before 1.5 this was inferSchema.
people = hc.createDataFrame(rows, schema)
# people.printSchema()
people.registerTempTable("people")

# SQL can now be run over a SchemaRDD that has been registered as a table.
results = hc.sql("select * from people")
for row in results.collect():  # collect() returns the rows as a Python list
    print(row)
sc.stop()

A new task suddenly came in: deploying a big-data distributed platform with CDH, including installation of the following components: Hadoop, HBase, Hive, Kafka, and Spark. The work above is shelved for now; I'll come back to it when I need it. Mainly my fundamentals are still weak, and I need to study more of the basics first.
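Back to the open question above: how to actually pull values out of an RDD or a DataFrame. Here is a minimal sketch, reusing the rows and results variables from the script (so it must run before sc.stop()); collect(), take(), first(), and show() are the standard accessors.

# Minimal sketch: pulling values out of the RDD and DataFrame built above.
# Assumes `rows` and `results` from the preceding script, before sc.stop().

# RDD side: collect() ships every element to the driver; take(n) and
# first() fetch only a subset.
for row in rows.collect():
    print(row.id, row.name, row.age)   # Row fields are attributes
print(rows.take(2))                    # first two Row objects
print(rows.first())                    # a single Row

# DataFrame side: show() prints a formatted table; collect() returns
# a list of Rows that can be indexed by column name.
results.show()
for row in results.collect():
    print(row["id"], row["name"], row["age"])
print(results.first().asDict())        # Row -> plain Python dict

Note that collect() pulls the whole dataset to the driver, so on anything bigger than toy data prefer take(n) or show().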