### Reading Kafka data from a remote environment with a local Flink streaming job and writing it to that environment's HDFS

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema; // on older Flink versions this class lives in org.apache.flink.streaming.util.serialization
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.StringWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

public static void main(String[] args) throws Exception {
    // set up the streaming execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(5000);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    Properties properties = new Properties();
    // IP address and port of the target environment
    properties.setProperty("bootstrap.servers", "192.168.0.1:9092"); // Kafka brokers
    // Only needed for Kafka 0.8:
    // properties.setProperty("zookeeper.connect", "192.168.0.1:2181"); // ZooKeeper
    properties.setProperty("group.id", "test-consumer-group"); // consumer group id

    // Option 1:
    // Important: point this at the directory containing hdfs-site.xml and core-site.xml.
    // Copy these two Hadoop config files from the target environment to the local machine;
    // here they were placed under the project's resources directory.
    // properties.setProperty("fs.hdfs.hadoopconf", "E:\\Ali-Code\\cn-smart\\cn-components\\cn-flink\\src\\main\\resources");

    // Option 2:
    properties.setProperty("fs.default-scheme", "hdfs://ip:8020");

    // Instantiate the consumer that matches your Kafka version:
    // FlinkKafkaConsumer09<String> flinkKafkaConsumer09 = new FlinkKafkaConsumer09<String>("test0", new SimpleStringSchema(), properties);
    FlinkKafkaConsumer010<String> flinkKafkaConsumer010 =
            new FlinkKafkaConsumer010<String>("test1", new SimpleStringSchema(), properties);
    // flinkKafkaConsumer010.assignTimestampsAndWatermarks(new CustomWatermarkEmitter());

    DataStream<String> keyedStream = env.addSource(flinkKafkaConsumer010);
    keyedStream.print();

    System.out.println("*********** hdfs ***********************");
    BucketingSink<String> bucketingSink = new BucketingSink<>("/var"); // path on HDFS
    bucketingSink.setBucketer((Bucketer<String>) (clock, basePath, value) -> basePath);
    bucketingSink.setWriter(new StringWriter<>())
            .setBatchSize(1024 * 1024)
            .setBatchRolloverInterval(2000);
    keyedStream.addSink(bucketingSink);

    // execute program
    env.execute("test");
}
```

Running this job creates many small directories under /var on the remote HDFS; they contain the data consumed from Kafka.

Issues:

1. HDFS files produced this way cannot be read by Spark SQL.
   Solution: write the data to HDFS in Parquet format instead; see this other post: https://blog.csdn.net/u012798083/article/details/85852830. A hedged sketch is also given after this list.
2. What if a large number of in-progress files show up?
   Solution: increase the data volume, so that the batch size / rollover interval is actually reached and the sink can finish the part files.
3. How do I add window processing?
   Solution: see the other post: https://blog.csdn.net/u012798083/article/details/85852830. A minimal sketch is also given after this list.
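
For issue 1, one common way to produce Parquet files that Spark SQL can read is Flink's `StreamingFileSink` in bulk format (available from Flink 1.6) together with `ParquetAvroWriters` from the `flink-parquet` module. This is only a minimal sketch and not necessarily the approach taken in the linked post; the `Event` POJO, its `raw` field, and the output path are assumptions made up for illustration.

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// Hypothetical POJO for one Kafka record; the single "raw" field is an assumption.
public static class Event {
    public String raw;
    public Event() {}
    public Event(String raw) { this.raw = raw; }
}

// Map the raw strings from the Kafka source (keyedStream above) into POJOs, then write
// them to HDFS as Parquet. Bulk-format sinks roll files on each checkpoint, so
// checkpointing must stay enabled (it is, via enableCheckpointing(5000)).
DataStream<Event> events = keyedStream.map(value -> new Event(value));

StreamingFileSink<Event> parquetSink = StreamingFileSink
        .forBulkFormat(
                new Path("hdfs://ip:8020/var/parquet"),            // assumed output path
                ParquetAvroWriters.forReflectRecord(Event.class))  // Avro reflection -> Parquet
        .build();

events.addSink(parquetSink);
```

This needs the `flink-parquet` dependency on the classpath; the resulting directory can then be loaded from Spark with `spark.read.parquet(...)`.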
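
For issue 3, a minimal windowing sketch on the same stream. The key (the record value itself) and the 10-second tumbling processing-time window are assumptions for illustration; event-time windows would additionally require enabling the commented-out `assignTimestampsAndWatermarks` call above.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Count records per value over 10-second tumbling processing-time windows.
DataStream<Tuple2<String, Long>> counts = keyedStream
        .map(value -> Tuple2.of(value, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))  // lambdas erase tuple types, so declare them
        .keyBy(t -> t.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .sum(1);
counts.print();
```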