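Since TensorFlow 1.8, tensorflow.contrib has included a TensorRT component. Its purpose is to let you load a pb file and call TensorRT to optimize the subgraphs it can handle, while the remaining subgraphs are still executed by TensorFlow. This differs from generating a pb file and then processing it separately with TensorRT's standalone tools.
TensorRT changes considerably between versions. This post is based on the article tensorrt-integration-speeds-tensorflow-inference. TensorFlow 1.12 is built against TensorRT 4.0.1; if you want to use TensorRT 5.0, I recommend using TensorRT on its own instead.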
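To make the idea of "convert what TensorRT can handle, leave the rest to TensorFlow" concrete, here is a toy sketch in plain Python. It is not the real TF-TRT segmenter: the `SUPPORTED_OPS` set, the `partition` function, and the `min_segment_size` parameter are all hypothetical names chosen for illustration, and the model is treated as a simple linear chain of op types.

```python
from itertools import groupby

# Hypothetical whitelist of convertible op types.
# This is NOT the real TF-TRT support list.
SUPPORTED_OPS = {"Conv2D", "BiasAdd", "Relu", "MaxPool"}


def partition(op_types, min_segment_size=2):
    """Split a linear chain of op types into (engine, ops) segments.

    Runs of supported ops with at least `min_segment_size` ops are marked
    "TensorRT"; everything else stays with "TensorFlow".
    """
    segments = []
    for supported, run in groupby(op_types, key=lambda op: op in SUPPORTED_OPS):
        ops = list(run)
        big_enough = len(ops) >= min_segment_size
        engine = "TensorRT" if supported and big_enough else "TensorFlow"
        # Merge with the previous segment when both stay in TensorFlow.
        if segments and segments[-1][0] == engine == "TensorFlow":
            segments[-1][1].extend(ops)
        else:
            segments.append((engine, ops))
    return segments


chain = ["Conv2D", "BiasAdd", "Relu", "MyCustomOp", "Relu", "Softmax"]
for engine, ops in partition(chain):
    print(engine, ops)
```

In this toy chain the first three ops form one TensorRT segment, and the unsupported op plus the too-short run after it stay in TensorFlow.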
Environment:
- TensorRT-4.0.1.6.Ubuntu-14.04.5.x86_64-gnu.cuda-9.0.cudnn7.1.tar.gz;
- tensorflow-gpu 1.12.0;
- CentOS 7.3.
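Below is my modified code. On a P40 card FP16 gives no speedup, because the card does not support FP16; in my measurements INT8 was about twice as fast as FP32.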
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

r"""TF-TensorRT integration sample script

1 - Specify the fraction of GPU memory allowed for TensorFlow. TensorRT can
    use the remaining memory.
2 - Let TensorRT analyze the TensorFlow graph, apply optimizations, and
    replace subgraphs with TensorRT nodes.
"""
import os
import sys
import time
import json
import os.path as osp
import argparse, itertools, datetime

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import data_flow_ops
from tensorflow.python.platform import gfile
from tensorflow.python.client import timeline
import tensorflow.contrib.tensorrt as trt

tf.logging.set_verbosity(tf.logging.INFO)


class TF2TensorRT(object):
    '''Read a pb model produced by TensorFlow and process it with TensorRT.'''

    def __init__(self, percent, batch_size, output_nodes):
        '''Use the new per_process_gpu_memory_fraction parameter of the
        GPUOptions function to specify the GPU memory fraction TensorRT can
        consume. This parameter should be set the first time the
        TensorFlow-TensorRT process starts. As an example, 0.67 would allocate
        67% of GPU memory for TensorFlow, making the remaining 33% available
        for TensorRT engines.
        '''
        self.batch_size = batch_size
        self.output_nodes = output_nodes
        self.gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=percent)
        self.config = tf.ConfigProto(gpu_options=self.gpu_options)

    def read_pb(self, pb_path, graph, sess):
        '''Read the model from a pb file.'''
        self.pb_path = pb_path
        with graph.as_default():
            with gfile.FastGFile(pb_path, 'rb') as fr:
                graph_def = tf.GraphDef()
                graph_def.ParseFromString(fr.read())
        return graph_def

    def _write_pb(self, trt_graph, precision_mode):
        '''Write the converted model into a new pb file.'''
        dir_path, ext = osp.splitext(self.pb_path)
        newpb_filename = '{}{}{}'.format(dir_path, precision_mode, ext)
        with gfile.FastGFile(newpb_filename, 'wb') as fw:
            fw.write(trt_graph.SerializeToString())
        return newpb_filename

    def create_workspace(self):
        graph = tf.Graph()
        with graph.as_default():
            sess = tf.Session(graph=graph, config=self.config)
        return graph, sess

    def close_workspace(self, *args, sess=None):
        sess.close()

    def get_FPxx(self, graph, graph_def, workspace_size=1<<30,
                 precision_mode='FP32', dump=True):
        '''You apply TensorRT optimizations to the frozen graph with the new
        create_inference_graph function. TensorRT then takes a frozen
        TensorFlow graph as input and returns an optimized graph with
        TensorRT nodes.

        You should use the per_process_gpu_memory_fraction and
        max_workspace_size_bytes parameters together for best overall
        application performance. For example, set the
        per_process_gpu_memory_fraction parameter to (12 - 4) / 12 = 0.67 and
        the max_workspace_size_bytes parameter to 4000000000 for a 12GB GPU in
        order to allocate ~4GB for the TensorRT engines.

        TensorRT automatically uses Tensor Cores in Volta GPUs for inference
        when using half-precision arithmetic. The peak performance of Tensor
        Cores on the NVIDIA Tesla V100 is about an order of magnitude (10x)
        faster than double precision (FP64) and about 4 times faster than
        single precision (FP32). Just use FP16 as value for the precision_mode
        parameter in the create_inference_graph function to enable half
        precision.
        ---
        frozen_graph_def: frozen TensorFlow graph
        output_node_name: list of strings with names of output nodes,
            e.g. ["resnet_v1_50/predictions/Reshape_1"]
        max_batch_size: integer, size of input batch, e.g. 16
        max_workspace_size_bytes: integer, maximum GPU memory size available
            for TensorRT
        precision_mode: string, allowed values FP32, FP16 or INT8
        '''
        with graph.as_default():
            trt_graph = trt.create_inference_graph(
                graph_def,
                self.output_nodes,
                max_batch_size=self.batch_size,
                max_workspace_size_bytes=workspace_size,
                precision_mode=precision_mode)
            if dump:
                newpb_path = self._write_pb(trt_graph, precision_mode)
            else:
                newpb_path = ''
        return trt_graph, newpb_path

    def get_INT8(self, graph, calib_graph, workspace_size=1<<30,
                 precision_mode='INT8'):
        '''TensorRT provides capabilities to take models trained in single
        (FP32) and half (FP16) precision and convert them for deployment with
        INT8 quantizations while minimizing accuracy loss.

        HOW TO CALIBRATE THE GRAPH WITH INT8?
        To convert models for deployment with INT8, you need to calibrate the
        trained FP32 model before applying TensorRT's optimizations described
        in the earlier sections. The remaining workflow remains unchanged.

        1 - First use the "create_inference_graph" function with the
            precision_mode parameter set to INT8 to calibrate the model. The
            output of this function is a frozen TensorFlow graph ready for
            calibration.

        2 - Next, execute the calibration graph with calibration data.
            TensorRT uses the distribution of node data to quantize the
            weights for the nodes. It is important to use calibration data
            that closely reflects the distribution of the problem dataset in
            production. We suggest checking for error accumulation during
            inference when first using models calibrated with INT8.

                trt_graph = trt.create_inference_graph(
                    getNetwork(network_file_name), outputs,
                    max_batch_size=batch_size,
                    max_workspace_size_bytes=workspace_size,
                    precision_mode="INT8")

        3 - After executing the graph on calibration data, apply TensorRT
            optimizations to the calibration graph with the
            "calib_graph_to_infer_graph" function. This function also replaces
            the TensorFlow subgraph with a TensorRT node optimized for INT8.
            The output of the function is a frozen TensorFlow graph that can
            be used for inference as usual.

                trt_graph = trt.calib_graph_to_infer_graph(calibGraph)

        4 - And that's it! These two commands enable INT8 precision inference
            with your TensorFlow model.
        '''
        with graph.as_default():
            trt_graph = trt.calib_graph_to_infer_graph(calib_graph)
            newpb_path = self._write_pb(trt_graph, precision_mode)
        return trt_graph, newpb_path

    def convert_NHWC2NCHW(self, graph, sess, tensor_input):
        with graph.as_default():
            tensor_output = tf.transpose(tensor_input, perm=(0, 3, 1, 2))
            tensor_output = sess.run(tensor_output)
        return tensor_output

    def read_tensor_from_image_file(self, graph, sess, file_name,
                                    input_height=224, input_width=224,
                                    input_mean=0, input_std=255,
                                    input_name="file_reader",
                                    output_name="normalized"):
        """Read a jpg image file and return a tensor."""
        with graph.as_default():
            file_reader = tf.read_file(file_name, input_name)
            image_reader = tf.image.decode_png(file_reader, channels=3,
                                               name='jpg_reader')
            float_caster = tf.cast(image_reader, tf.float32)
            dims_expander = tf.expand_dims(float_caster, 0)
            resized = tf.image.resize_bilinear(dims_expander,
                                               [input_height, input_width])
            normalized = tf.divide(tf.subtract(resized, [input_mean]),
                                   [input_std])
            normalized_NHWC = sess.run(normalized)
            normalized_NCHW = self.convert_NHWC2NCHW(graph, sess, normalized_NHWC)
        return normalized_NHWC, normalized_NCHW

    def run(self, graph, graph_def, sess, num_loops, tensor_input):
        tf.logging.info('Starting execution')
        with graph.as_default():
            # The following input-pipeline lines are required; without them
            # the run fails with an error.
            inc = tf.constant(tensor_input, dtype=tf.float32)
            dataset = tf.data.Dataset.from_tensors(inc)
            dataset = dataset.repeat()
            iterator = dataset.make_one_shot_iterator()
            next_element = iterator.get_next()
            output = tf.import_graph_def(graph_def=graph_def,
                                         input_map={"input": next_element},
                                         return_elements=self.output_nodes)
            # This line is specific to resnet 50; it must be changed when
            # loading inceptionv3.
            output = output[0].outputs[0]
            # Simulated workload.
            for i in range(num_loops):
                st = time.time()
                ans = sess.run(output)
                print('the {} run take {} seconds'.format(i, time.time() - st))
        return ans


def topX(arr, X):
    ind = np.argsort(arr)[:, -X:][:, ::-1]
    ind = ind.squeeze()
    return arr[np.arange(np.shape(arr)[0])[:, np.newaxis], ind], ind


def getLabels(labels, ids):
    return [labels[str(x + 1)] for x in ids]


if "__main__" == __name__:
    parser = argparse.ArgumentParser(prog="convert pb model file into uff!")
    parser.add_argument('--FP32', action='store_true')
    parser.add_argument('--FP16', action='store_true')
    parser.add_argument('--INT8', action='store_true')
    parser.add_argument('--native', action='store_true')
    parser.add_argument('--num_loops', type=int, default=20)
    parser.add_argument('--data_dir', type=str, default='./data')
    parser.add_argument('--pb_path', type=str, default='resnetV150_frozen.pb')
    parser.add_argument('--output_nodes', action='append',
                        default=['InceptionV3/Predictions/Reshape_1:0'])
    parser.add_argument('--mem_percent', type=float, default=0.5)
    parser.add_argument('--topN', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=1)
    parser.add_argument('--workspace_size', type=int, default=1<<10,
                        help="workspace size in MB")
    f, unparsed = parser.parse_known_args()

    batch_size = f.batch_size
    pb_path = f.pb_path
    mem_percent = f.mem_percent
    workspace_size = f.workspace_size
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    print('===============start==================')
    print("Starting at", datetime.datetime.now())
    output_nodes = f.output_nodes
    # The command-line value is overridden here with the resnet 50 output node.
    output_nodes = ['resnet_v1_50/predictions/Reshape_1']
    print(output_nodes)

    tft = TF2TensorRT(mem_percent, batch_size, output_nodes)

    # To keep the branches independent, each one below deliberately repeats
    # code, e.g. re-reading the image and closing the session every time.
    if f.native:
        print('===native mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(
            graph, sess, imageName,
            input_height=224, input_width=224, input_mean=0, input_std=1.0)
        image_input = image_input[0]
        ans = tft.run(graph, graph_def, sess, 2, image_input)
        tft.close_workspace(graph, graph_def, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

    if f.FP32:
        print('===FP32 mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        trt_graph_FP32, newpb_path = tft.get_FPxx(
            graph, graph_def, workspace_size=1<<30, precision_mode='FP32')
        tft.close_workspace(graph, graph_def, trt_graph_FP32, sess=sess)

        # read the converted pb file
        graph, sess = tft.create_workspace()
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(
            graph, sess, imageName,
            input_height=224, input_width=224, input_mean=0, input_std=1.0)
        image_input = image_input[0]
        graph_def_FP32 = tft.read_pb(newpb_path, graph, sess)
        ans = tft.run(graph, graph_def_FP32, sess, 2, image_input)
        tft.close_workspace(graph, graph_def_FP32, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

    if f.FP16:
        print('===FP16 mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        trt_graph_FP16, newpb_path = tft.get_FPxx(
            graph, graph_def, workspace_size=1<<30, precision_mode='FP16')
        tft.close_workspace(graph, graph_def, trt_graph_FP16, sess=sess)

        # read the converted pb file
        graph, sess = tft.create_workspace()
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(
            graph, sess, imageName,
            input_height=224, input_width=224, input_mean=0, input_std=1.0)
        image_input = image_input[0]
        graph_def_FP16 = tft.read_pb(newpb_path, graph, sess)
        ans = tft.run(graph, graph_def_FP16, sess, 2, image_input)
        tft.close_workspace(graph, graph_def_FP16, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

    if f.INT8:
        print('===INT8 mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        print('Reading the pb file, then creating calibGraph; '
              'many production samples must be fed at this point')
        calibGraph, _ = tft.get_FPxx(
            graph, graph_def, workspace_size=1<<30, precision_mode='INT8',
            dump=False)

        print("==========Running Calibration")
        print('Calibration means running the code below on many production '
              'samples; TensorRT calibrates internally from the activations '
              'of each layer')
        print('Here a single image is run 20 times to simulate calibration')
        print('The normal procedure is: 1) change the 20 runs below to 1; '
              '2) loop over many production samples to complete calibration')
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(
            graph, sess, imageName,
            input_height=224, input_width=224, input_mean=0, input_std=1.0)
        image_input = image_input[0]
        ans = tft.run(graph, calibGraph, sess, 20, image_input)
        print('Calibration finished; preparing the final inference model')

        print("=========Creating inference graph")
        int8Graph, newpb_path = tft.get_INT8(graph, calibGraph, workspace_size)
        tft.close_workspace(graph, graph_def, calibGraph, int8Graph, sess=sess)

        # read the converted pb file
        graph, sess = tft.create_workspace()
        graph_def_INT8 = tft.read_pb(newpb_path, graph, sess)
        ans = tft.run(graph, graph_def_INT8, sess, 2, image_input)
        tft.close_workspace(graph, graph_def_INT8, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])
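The docstrings in the script describe splitting GPU memory between TensorFlow and TensorRT: on a 12GB GPU, reserving about 4GB for TensorRT gives a TensorFlow fraction of (12 - 4) / 12 = 0.67 and a workspace of 4000000000 bytes. A minimal sketch of that arithmetic follows; the helper name `split_gpu_memory` is hypothetical, and it assumes decimal gigabytes as in the docstring's example. The two returned values would be the ones passed as per_process_gpu_memory_fraction and max_workspace_size_bytes.

```python
def split_gpu_memory(total_gb, trt_gb):
    """Return (tf_fraction, trt_workspace_bytes) for a GPU of `total_gb` GB.

    `trt_gb` is the amount to leave for TensorRT engines. Decimal gigabytes
    (10**9 bytes) are assumed, matching the 4000000000 example.
    """
    if not 0 < trt_gb < total_gb:
        raise ValueError("trt_gb must be between 0 and total_gb")
    tf_fraction = (total_gb - trt_gb) / total_gb
    trt_workspace_bytes = int(trt_gb * 10**9)
    return tf_fraction, trt_workspace_bytes


fraction, workspace = split_gpu_memory(total_gb=12, trt_gb=4)
print(round(fraction, 2), workspace)  # prints: 0.67 4000000000
```

Note that the script itself takes the fraction from --mem_percent and hard-codes the workspace as 1<<30 in most calls, so a helper like this would only be a convenience for choosing those values.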
If the input-pipeline code above (in run) is left out, you get the following result; for the cause, see "Visualize Optimized Graph in TensorBoard":
INFO:tensorflow:Starting execution
2019-03-15 05:59:37.410106: E tensorflow/core/common_runtime/executor.cc:623] Executor failed to create kernel. Not found: No registered 'TRTEngineOp' OpKernel for CPU devices compatible with node {{node import/resnet_v1_50/my_trt_op_0}} = TRTEngineOp[InT=[DT_FLOAT], OutT=[DT_FLOAT], cached_engine_batches=[4], calibration_data="", fixed_input_size=true, input_shapes=[[?,3,230,230]], max_cached_engines_count=1, output_shapes=[[?,1000,1,1]], precision_mode="FP32", segment_funcdef_name="resnet_v1_50/my_trt_op_0_native_segment", serialized_segment="8\177\224\...00\000\000", static_engine=true, workspace_size_bytes=2147483648](import/resnet_v1_50/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer)
. Registered: device='GPU'
[[{{node import/resnet_v1_50/my_trt_op_0}} = TRTEngineOp[InT=[DT_FLOAT], OutT=[DT_FLOAT], cached_engine_batches=[4], calibration_data="", fixed_input_size=true, input_shapes=[[?,3,230,230]], max_cached_engines_count=1, output_shapes=[[?,1000,1,1]], precision_mode="FP32", segment_funcdef_name="resnet_v1_50/my_trt_op_0_native_segment", serialized_segment="8\177\224\...00\000\000", static_engine=true, workspace_size_bytes=2147483648](import/resnet_v1_50/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer)]]
Below is the log from an INT8 run:
python tf_trt.py --INT8
===============start==================
Starting at 2019-03-15 07:00:05.756805
['resnet_v1_50/predictions/Reshape_1']
2019-03-15 07:00:05.758165: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-15 07:00:06.554246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:84:00.0
totalMemory: 22.38GiB freeMemory: 22.22GiB
2019-03-15 07:00:06.554439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-15 07:00:07.119839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-15 07:00:07.119905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-15 07:00:07.119921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-15 07:00:07.120522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11459 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
WARNING:tensorflow:From tf_trt.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
=========reading the pb file,then creating the calibGraph
INFO:tensorflow:Running against TensorRT version 4.0.1
2019-03-15 07:00:07.936861: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-03-15 07:00:07.938337: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-03-15 07:00:07.939184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-15 07:00:07.939224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-15 07:00:07.939242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-15 07:00:07.939294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-15 07:00:07.939869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11459 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
2019-03-15 07:00:09.016877: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope 'resnet_v1_50/', converted to graph
2019-03-15 07:00:09.016966: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-03-15 07:00:35.699442: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine resnet_v1_50/my_trt_op_0 creation for segment 0, composed of 452 nodes succeeded.
2019-03-15 07:00:36.704760: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-03-15 07:00:36.944306: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-03-15 07:00:37.046735: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-03-15 07:00:37.046820: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 461 nodes (-267), 477 edges (-267), time = 476.292ms.
2019-03-15 07:00:37.046852: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 468 nodes (7), 479 edges (2), time = 127.892ms.
2019-03-15 07:00:37.046865: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 17 nodes (-451), 12 edges (-467), time = 26932.1719ms.
2019-03-15 07:00:37.046877: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 12 nodes (-5), 12 edges (0), time = 114.593ms.
2019-03-15 07:00:37.046889: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 12 nodes (0), 12 edges (0), time = 266.66ms.
2019-03-15 07:00:37.046909: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: resnet_v1_50/my_trt_op_0_native_segment
2019-03-15 07:00:37.046921: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 453 nodes (0), 468 edges (0), time = 282.458ms.
2019-03-15 07:00:37.046941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-03-15 07:00:37.046952: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 453 nodes (0), 468 edges (0), time = 35.437ms.
2019-03-15 07:00:37.046969: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 453 nodes (0), 468 edges (0), time = 204.084ms.
2019-03-15 07:00:37.046984: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 453 nodes (0), 468 edges (0), time = 36.173ms.
==========Running Calibration
INFO:tensorflow:Starting execution
2019-03-15 07:00:43.482560: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:578] Starting calibration thread on device 0, Calibration Resource @ 0x7f794c001850
====take 6.6967267990112305 seconds===
====take 0.011368751525878906 seconds===
====take 0.05899786949157715 seconds===
====take 0.06058168411254883 seconds===
====take 0.060442447662353516 seconds===
====take 0.06051158905029297 seconds===
====take 0.060460805892944336 seconds===
====take 0.060431480407714844 seconds===
====take 0.06432700157165527 seconds===
====take 0.06402254104614258 seconds===
====take 0.06392884254455566 seconds===
====take 0.06446218490600586 seconds===
====take 0.06404638290405273 seconds===
====take 0.0639350414276123 seconds===
====take 0.06392097473144531 seconds===
====take 0.06390523910522461 seconds===
====take 0.06399869918823242 seconds===
====take 0.06429791450500488 seconds===
====take 0.06387209892272949 seconds===
====take 0.06392908096313477 seconds===
=========Creating inference graph
2019-03-15 07:00:48.772447: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:155] Starting Calib Conversion
2019-03-15 07:00:48.845717: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:161] Construction of static int8 engine is not implemented yet!. Dynamic engine will be constructed
==================================================
2019-03-15 07:01:48.746487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-15 07:01:48.746545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-15 07:01:48.746555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-15 07:01:48.746563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-15 07:01:48.747006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11459 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
INFO:tensorflow:Starting execution
2019-03-15 07:01:55.221824: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:502] import/resnet_v1_50/my_trt_op_0 Constructing a new engine with batch size 1
====take 48.35376954078674 seconds===
====take 0.0026242733001708984 seconds===
====take 0.002024412155151367 seconds===
====take 0.0019381046295166016 seconds===
====take 0.0018923282623291016 seconds===
====take 0.0019183158874511719 seconds===
====take 0.001911163330078125 seconds===
====take 0.0019626617431640625 seconds===
====take 0.001909494400024414 seconds===
====take 0.001890420913696289 seconds===
====take 0.0018913745880126953 seconds===
====take 0.0019071102142333984 seconds===
====take 0.001940011978149414 seconds===
====take 0.001964569091796875 seconds===
====take 0.0019214153289794922 seconds===
====take 0.0019118785858154297 seconds===
====take 0.0018911361694335938 seconds===
====take 0.00193023681640625 seconds===
====take 0.0019140243530273438 seconds===
====take 0.0019001960754394531 seconds===
==================================================
(array([[0.47768646]], dtype=float32), array([[457]]))
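In the log above, the first timed run of each phase is much slower than the rest (6.7 s at the start of calibration, 48.4 s while the INT8 engine is constructed), so it should be excluded when judging steady-state latency. The following is a minimal sketch of how the "====take ... seconds===" lines could be summarized; the function name `steady_state_latency` and the choice of the median are my own assumptions, and the sample text is just the first five inference timings copied from the log.

```python
import re
import statistics

# Matches the timing lines printed in the log above.
TAKE_RE = re.compile(r"====take ([0-9.]+) seconds===")


def steady_state_latency(log_text, warmup=1):
    """Median of the per-run timings, ignoring the first `warmup` runs."""
    timings = [float(value) for value in TAKE_RE.findall(log_text)]
    steady = timings[warmup:]
    if not steady:
        raise ValueError("not enough timing lines after warm-up")
    return statistics.median(steady)


sample = """
====take 48.35376954078674 seconds===
====take 0.0026242733001708984 seconds===
====take 0.002024412155151367 seconds===
====take 0.0019381046295166016 seconds===
====take 0.0018923282623291016 seconds===
"""
print("median latency: {:.4f} s".format(steady_state_latency(sample)))
```

The median is used here because it is less sensitive than the mean to a single slow run that slips past the warm-up cut-off.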
If the following situation occurs: