[Translation] Automatically Generating HTML Code With Deep Learning - Part 1

Date: 2019-11-17

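Original article: Turning Design Mockups Into Code With Deep Learning - Part 1

Original author: Emil Wallner

Translation from: the Juejin Translation Project (掘金翻譯計劃)

Permanent link to this article: github.com/xitu/gold-m…

Translator: sakila1012

Proofreaders: sunshine940326, wzy816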

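Within the next three years, deep learning will change front-end development. It will speed up prototyping and lower the barrier to building software.

Tony Beltramelli published the pix2code paper last year, and Airbnb released sketch2code.

Currently, the largest barrier to automating front-end development is computing power. However, we can use current deep learning algorithms, together with synthesized training data, to start exploring artificial front-end automation.

In this post, we will teach a neural network to code a basic HTML and CSS website from an image of a design mockup. Here is a brief overview of the process:

1) Give a design image to the trained neural network

2) The neural network converts the image into HTML markup

3) Render the output

We will build the neural network in three iterations.

In the first version, we build a bare-minimum version to get the hang of the moving parts. The second version, HTML, focuses on automating all the steps and explaining the neural network layers. In the final version, Bootstrap, we create a model that can generalize, and explore the LSTM layer.

All the code is available on GitHub and on FloydHub as Jupyter notebooks. All the FloydHub notebooks are inside the floydhub directory, and the local equivalents are under the local directory.

The models in this post are based on Beltramelli's pix2code paper and Jason Brownlee's image captioning tutorials. The code is written in Python and Keras, a framework on top of TensorFlow.

If you are new to deep learning, I recommend getting a feel for Python, backpropagation, and convolutional neural networks. My earlier posts on the FloydHub blog are a good place to start [1] [2] [3].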

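Core Logic

Let's recap our goal. We want to build a neural network that generates the HTML/CSS markup corresponding to a screenshot.

When you train the neural network, you give it several screenshots with the matching HTML code.

The network learns by predicting all the matching HTML markup tags one by one. When it predicts the next markup tag, it receives the screenshot as well as all the correct markup tags up to that point.

Here is a simple example of the training data in a Google Sheet.

Creating a model that predicts word by word is the most common approach today. There are other approaches, but this is the method we use throughout this tutorial.

Note that for each prediction the network receives the same screenshot. If it has to predict 20 words, it gets the same design mockup 20 times. For now, don't worry about how the neural network works. Focus on grasping its input and output.

Let's look at the previous markup first. Say we train the network to predict the sentence "I can code". When it receives "I", it predicts "can". The next time, it receives "I can" and predicts "code". It receives all the previous words and only has to predict the next word.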
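As a rough sketch of this idea (not code from the project), the snippet below splits one sentence into the input/target pairs described above. The helper name make_training_pairs is hypothetical and only used for illustration.

```python
# Build (previous words -> next word) training pairs for one sentence.
def make_training_pairs(sentence):
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        # Everything before position i is the input; word i is the target.
        pairs.append((words[:i], words[i]))
    return pairs

pairs = make_training_pairs("I can code")
for context, target in pairs:
    print(context, "->", target)
```

In the real model, every pair would also be accompanied by the same screenshot.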

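The neural network creates features from the data. It builds these features to link the input data with the output data. It has to create representations to understand what is in each screenshot and the HTML syntax it needs to predict. This builds the knowledge for predicting the next tag.

Using the trained model in the real world is similar to training it. The markup is generated one tag at a time, with the same screenshot each time. Instead of feeding the network the correct HTML tags, it receives the markup it has generated so far and then predicts the next tag. The prediction starts with a "start tag" and stops when the network predicts an "end tag" or reaches a maximum limit.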
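The generation loop can be sketched as follows. This is a minimal illustration under stated assumptions: predict_next is a hypothetical stand-in that reads from a fixed lookup table, whereas a trained network would use both the screenshot and all previous tokens.

```python
# Hypothetical stand-in for a trained model: it looks up the next token in a
# fixed table instead of running a neural network.
NEXT_TOKEN = {
    "start": "<HTML>",
    "<HTML>": "Hello",
    "Hello": "World!",
    "World!": "</HTML>",
    "</HTML>": "end",
}

def predict_next(screenshot, tokens_so_far):
    # A real model would use the screenshot and every previous token.
    return NEXT_TOKEN[tokens_so_far[-1]]

def generate_markup(screenshot, max_tokens=20):
    tokens = ["start"]                      # generation begins with the start tag
    while len(tokens) < max_tokens:         # stop at the maximum limit...
        next_token = predict_next(screenshot, tokens)
        tokens.append(next_token)
        if next_token == "end":             # ...or when the end tag is predicted
            break
    return tokens

result = generate_markup(screenshot=None)
print(" ".join(result))
```

The only difference from training is where the previous tokens come from: here they are the model's own earlier predictions.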

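Hello World Version

Let's build a Hello World version. We will feed the neural network a screenshot of a website displaying "Hello World!" and teach it to generate the corresponding markup.

First, the neural network maps the design mockup into a list of pixel values. Each pixel has three channels (red, green, and blue), and each channel holds a value from 0 to 255.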
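To make the pixel representation concrete, here is a small sketch that uses random values in place of a real screenshot. The 224 x 224 size is taken from the Hello World code in this post; the random data is purely illustrative.

```python
import numpy as np

# A stand-in 224x224 screenshot with three channels (red, green, blue),
# filled with random values instead of a real image.
rng = np.random.default_rng(0)
screenshot = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(screenshot.shape)   # (224, 224, 3)
print(screenshot[0, 0])   # the RGB values of the top-left pixel
```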

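To represent the markup in a way the neural network understands, I use one-hot encoding. The sentence "I can code" can then be mapped as shown below.

In the graphic above, the encoding includes the start and end tags. These tags tell the network where to start its predictions and where to stop.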
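A minimal sketch of one-hot encoding for this sentence, assuming a five-token vocabulary in the order shown (the ordering is my assumption, not taken from the original graphic):

```python
import numpy as np

vocabulary = ["start", "I", "can", "code", "end"]  # assumed ordering

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

encoded = np.array([one_hot(w) for w in ["start", "I", "can", "code", "end"]])
print(encoded)
```

Each row is one token, and each column corresponds to one entry of the vocabulary.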

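For the input data, we use sentences, starting with the first word and then adding one word at a time. The output data is always a single word.

Sentences follow the same logic as words. They also need a fixed input length. Instead of being limited by the vocabulary, they are bound by the maximum sentence length. If a sentence is shorter than the maximum length, you fill it up with empty words, that is, words consisting only of zeros.

As you can see, the words are printed from right to left. This forces each word to change position in every training round, so the model has to learn the sequence instead of memorizing the position of each word.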
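The padding scheme can be sketched like this. The helper pad_left is hypothetical, and it assumes the sentence is never longer than max_caption_len. The values match the three-token setup used in the Hello World code in this post.

```python
import numpy as np

max_caption_len = 3
vocab_size = 3
start = [1., 0., 0.]
hello = [0., 1., 0.]

def pad_left(tokens):
    # Empty words (all zeros) fill the front; real tokens sit at the end.
    padded = np.zeros((max_caption_len, vocab_size))
    if tokens:
        padded[-len(tokens):] = tokens
    return padded

first_input = pad_left([start])          # "start" is in the last slot
second_input = pad_left([start, hello])  # "start" has moved one slot to the left
print(first_input)
print(second_input)
```

Note how the start token occupies a different position in each input, which is what keeps the model from memorizing positions.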

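The graphic below shows four predictions. Each row is one prediction. On the left are the images, represented in their three color channels (red, green, and blue), together with the previous words. Outside the brackets are the predictions, one by one, ending with a red square that marks the end.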

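    # Imports (Keras 2.x-style module paths, assumed; adjust to your installation)
    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input
    from keras.layers import Input, Dense, LSTM, RepeatVector, concatenate
    from keras.models import Model
    from keras.preprocessing.image import img_to_array, load_img

    # Length of longest sentence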
    max_caption_len = 3
    #Size of vocabulary 
    vocab_size = 3

    # Load one screenshot for each word and turn them into digits 
    images = []
    for i in range(2):
        images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
    images = np.array(images, dtype=float)
    # Preprocess input for the VGG16 model
    images = preprocess_input(images)

    #Turn start tokens into one-hot encoding
    html_input = np.array(
                [[[0., 0., 0.], #start
                 [0., 0., 0.],
                 [1., 0., 0.]],
                 [[0., 0., 0.], #start <HTML>Hello World!</HTML>
                 [1., 0., 0.],
                 [0., 1., 0.]]])

    #Turn next word into one-hot encoding
    next_words = np.array(
                [[0., 1., 0.], # <HTML>Hello World!</HTML>
                 [0., 0., 1.]]) # end

    # Load the VGG16 model trained on imagenet and output the classification feature
    VGG = VGG16(weights='imagenet', include_top=True)
    # Extract the features from the image
    features = VGG.predict(images)

    #Load the feature to the network, apply a dense layer, and repeat the vector
    vgg_feature = Input(shape=(1000,))
    vgg_feature_dense = Dense(5)(vgg_feature)
    vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
    # Extract information from the input sequence
    language_input = Input(shape=(max_caption_len, vocab_size))
    language_model = LSTM(5, return_sequences=True)(language_input)

    # Concatenate the information from the image and the input
    decoder = concatenate([vgg_feature_repeat, language_model])
    # Extract information from the concatenated output
    decoder = LSTM(5, return_sequences=False)(decoder)
    # Predict which word comes next
    decoder_output = Dense(vocab_size, activation='softmax')(decoder)
    # Compile and run the neural network
    model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Train the neural network
    model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

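In the Hello World version, we use three tokens: "start", "Hello World", and "end". Character-level models require a smaller vocabulary but constrain the neural network, while word-level tokens tend to perform better here.

Here is the code that makes the prediction: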

    # Create an empty sentence and insert the start token
    sentence = np.zeros((1, 3, 3)) # [[0,0,0], [0,0,0], [0,0,0]]
    start_token = [1., 0., 0.] # start
    sentence[0][2] = start_token # place start in empty sentence

    # Making the first prediction with the start token
    second_word = model.predict([np.array([features[1]]), sentence])

    # Put the second word in the sentence and make the final prediction
    sentence[0][1] = start_token
    sentence[0][2] = np.round(second_word)
    third_word = model.predict([np.array([features[1]]), sentence])

    # Place the start token and our two predictions in the sentence 
    sentence[0][0] = start_token
    sentence[0][1] = np.round(second_word)
    sentence[0][2] = np.round(third_word)

    # Transform our one-hot predictions into the final tokens
    vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
    for i in sentence[0]:
        print(vocabulary[np.argmax(i)], end=' ')

Output

10 epochs: start start start
100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end
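
Mistakes I made:

**Build the first working version before gathering the data.** Early in this project, I managed to get a copy of an old archive of the Geocities hosting website, which had 38 million websites. But I ignored the huge workload required to reduce the 100K-sized vocabulary.
**Dealing with a terabyte of data requires good hardware or a lot of patience.** After my Mac ran into several problems, I ended up using a powerful remote server. Expect to rent a machine with 8 modern CPU cores and a 1 Gbps internet connection to have a decent workflow.
**Nothing made sense until I understood the input and output data.** The input, X, is one screenshot and the previous markup tags. The output, Y, is the next markup tag. Once I understood this, everything else became easier to figure out. It also became easier to experiment with different architectures.
**Beware of rabbit holes.** Because this project touches many areas of deep learning, I got stuck in plenty of rabbit holes along the way. I spent a week programming RNNs from scratch, got too fascinated by embedding vector spaces, and was seduced by exotic implementations.
**Picture-to-code networks are image captioning models in disguise.** Even after I realized this, I still skipped many of the image captioning papers because they seemed less cool. Once I gained some perspective, my understanding of the problem space deepened much faster.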

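Running the code on FloydHub

FloydHub is a training platform for deep learning. I came across it when I first started learning deep learning, and I have used it since to train and manage my deep learning experiments. You can install it and run your first model within 10 minutes. In my experience, it is the best option for training models on cloud GPUs.

If you are new to FloydHub, you can do the 2-minute installation or watch the 5-minute video walkthrough.

Clone the repository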

git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git

Log in and initialize the FloydHub command-line tool

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Run a Jupyter notebook on a FloydHub cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

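All the notebooks are in the floydhub directory. The local equivalents are under the local directory. Once it is running, you can find the first notebook here: floydhub/Hello_world/hello_world.ipynb.

If you want more detailed instructions and an explanation of the flags, check my earlier post.

HTML Version

In this version, we automate many of the steps from the Hello World model and focus on creating a scalable neural network model.

This version cannot predict HTML from random websites, but it is still an essential step for exploring the dynamics of the problem.

Overview

If we expand the previous architecture, it looks like the structure in the figure below.

There are two major sections. First, the encoder. This is where we create the image features and the previous markup features. Features are the building blocks the network creates to connect the design mockups with the markup. At the end of the encoder, we attach the image features to each word in the previous markup.

The decoder then takes the combined design and markup features and creates a next-tag feature. This feature is run through a fully connected layer to predict the next tag.

Design mockup features

Since we need to insert one screenshot for each word, this becomes a bottleneck when training the network. Instead of using the images directly, we extract the information needed to generate the markup.

This information is encoded into image features using a pre-trained convolutional neural network (CNN). The model is pre-trained on Imagenet.

We extract the features from the layer before the final classification.

We end up with 1536 feature maps of 8x8 each. Although they are hard for us to interpret, a neural network can extract the objects and the positions of the elements from these features.

Markup features

In the Hello World version, we used one-hot encoding to represent the markup. In this version, we use a word embedding for the input and keep the one-hot encoding for the output.

The way we structure each sentence stays the same, but how we map each token changes. One-hot encoding treats each word as an isolated unit. A word embedding instead represents each word in the input data as a list of real numbers that capture the relationships between the markup tags.

The dimension of the word embedding above is eight, but it typically varies between 50 and 500 depending on the size of the vocabulary.

The eight values for each word are weights, similar to those in a vanilla neural network. They are tuned to map how the words relate to each other (Mikolov et al., 2013).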
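As an illustration, an embedding layer is essentially a lookup table with one row of weights per vocabulary entry. The sketch below uses a randomly initialized table and a hypothetical four-token vocabulary. In the real model these values are learned during training.

```python
import numpy as np

vocabulary = ["start", "<html>", "<body>", "end"]  # hypothetical tokens
embedding_dim = 8

# One row of eight weights per token; random here, trained in practice.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocabulary), embedding_dim))

# Embedding a sentence means picking the row for each token index.
token_ids = [vocabulary.index(t) for t in ["start", "<html>", "<body>"]]
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (3, 8)
```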

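This is how we start developing markup features. Features are what the neural network develops to link the input data with the output data. For now, don't worry about what they are. We will dig deeper into this in the next section.

The Encoder

We take the word embeddings and run them through an LSTM, which returns a sequence of markup features. These are then run through a TimeDistributed dense layer. Think of it as a dense layer with multiple inputs and outputs.

In parallel, the image features are first flattened into a single vector and then fed into a dense layer to form high-level features. These image features are then concatenated with the markup features to form the output of the encoder.

This can be hard to wrap your mind around, so let me break it down step by step.

Markup features

As shown in the graphic below, we run the word embeddings through the LSTM layer. All the sentences are padded to the maximum size of three tokens.

To mix signals and find higher-level patterns, we apply a TimeDistributed dense layer to the markup features. A TimeDistributed dense layer is the same as a regular dense layer, but with multiple inputs and outputs.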
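One way to picture a TimeDistributed dense layer is as the same dense layer, with the same weights, applied to every timestep of the sequence. The NumPy sketch below illustrates this with toy sizes and random weights. It is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, in_features, out_features = 3, 4, 2

markup_sequence = rng.normal(size=(timesteps, in_features))  # LSTM output (toy)
weights = rng.normal(size=(in_features, out_features))       # shared weights
bias = np.zeros(out_features)

def dense_relu(x):
    # A dense layer with ReLU activation.
    return np.maximum(x @ weights + bias, 0.0)

# Applying the layer step by step...
per_step = np.stack([dense_relu(step) for step in markup_sequence])
# ...gives the same result as applying it to the whole sequence at once.
all_at_once = dense_relu(markup_sequence)
print(per_step.shape)  # (3, 2)
```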

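Image features

In parallel, we prepare the images. We take all the small image features and flatten them into one long vector. The information is not changed, only reorganized.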
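A small sketch of the flattening step, using the 8 x 8 x 1536 feature shape mentioned earlier and placeholder values instead of real CNN output:

```python
import numpy as np

# Placeholder image features with the shape produced by the CNN: 8 x 8 x 1536.
image_features = np.arange(8 * 8 * 1536, dtype=np.float32).reshape(8, 8, 1536)

# Flattening only reorganizes the values into one long vector.
flat = image_features.reshape(-1)
print(flat.shape)  # (98304,)
```

Reshaping the vector back to (8, 8, 1536) recovers the original array, which shows that no information is lost.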

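Again, to mix signals and extract higher-level notions, we apply a dense layer. Since we are only dealing with one input value, a regular dense layer is enough.

In this case, we have three markup features. Thus, we end up with an equal number of image features and markup features.

Concatenating the image and markup features

All the sentences are padded to create three markup features. Since we have already prepared the image features, we can now add one image feature to each markup feature.

After copying the image feature to each corresponding markup feature, we end up with three image-markup features. This is the input we feed into the decoder.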
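The copy-and-concatenate step can be sketched in NumPy as follows. The feature sizes are toy values chosen for readability. In the Keras model this corresponds to RepeatVector followed by concatenate.

```python
import numpy as np

timesteps = 3
image_feature = np.array([0.5, 0.1])                    # one image feature vector (toy size)
markup_features = np.arange(9.0).reshape(timesteps, 3)  # three markup feature vectors

# Copy the image feature once per markup feature, like RepeatVector.
repeated = np.tile(image_feature, (timesteps, 1))       # shape (3, 2)

# Glue each copy to its markup feature along the feature axis.
image_markup = np.concatenate([repeated, markup_features], axis=-1)
print(image_markup.shape)  # (3, 5)
```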

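Decoder

Here we use the combined image-markup features to predict the next tag.

In the example below, we use three image-markup feature pairs and output one next-tag feature.

Note that this LSTM layer has return_sequences set to False. Instead of returning a vector as long as the input sequence, it only predicts one feature. In our case, this is the feature for the next tag, and it contains the information for the final prediction.

Final prediction

The dense layer works like a traditional feedforward neural network. It connects the 512 values in the next-tag feature with the four final predictions. Say we have four words in our vocabulary: start, hello, world, and end.

The vocabulary prediction could be [0.1, 0.1, 0.1, 0.7]. The softmax activation in the dense layer produces a probability between 0 and 1 for each of the four classes, with all predictions summing to 1. In this case, it predicts that the fourth word is the next tag. You can then translate the one-hot encoding [0, 0, 0, 1] into its mapped value, which is "end".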
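A minimal sketch of this last step. The raw scores are made up and chosen so that softmax yields the probabilities from the example above.

```python
import numpy as np

vocabulary = ["start", "hello", "world", "end"]

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up raw outputs of the dense layer, picked to give [0.1, 0.1, 0.1, 0.7].
scores = np.log(np.array([0.1, 0.1, 0.1, 0.7]))
probabilities = softmax(scores)

# The most likely class is mapped back to its token.
predicted = vocabulary[int(np.argmax(probabilities))]
print(probabilities, predicted)
```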

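    # Imports (Keras 2.x-style module paths, assumed; adjust to your installation)
    from os import listdir
    import numpy as np
    from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
    from keras.layers import Input, Dense, LSTM, Embedding, Flatten, RepeatVector, TimeDistributed, concatenate
    from keras.models import Model
    from keras.preprocessing.image import img_to_array, load_img
    from keras.preprocessing.sequence import pad_sequences
    from keras.preprocessing.text import Tokenizer
    from keras.utils import to_categorical

    # Load the images and preprocess them for inception-resnet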
    images = []
    all_filenames = listdir('images/')
    all_filenames.sort()
    for filename in all_filenames:
        images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
    images = np.array(images, dtype=float)
    images = preprocess_input(images)

    # Run the images through inception-resnet and extract the features without the classification layer
    IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
    features = IR2.predict(images)

    # We will cap each input sequence to 100 tokens
    max_caption_len = 100
    # Initialize the function that will create our vocabulary 
    tokenizer = Tokenizer(filters='', split=" ", lower=False)

    # Read a document and return a string
    def load_doc(filename):
        with open(filename, 'r') as file:
            return file.read()

    # Load all the HTML files
    X = []
    all_filenames = listdir('html/')
    all_filenames.sort()
    for filename in all_filenames:
        X.append(load_doc('html/'+filename))

    # Create the vocabulary from the html files
    tokenizer.fit_on_texts(X)

    # Add +1 to leave space for empty words
    vocab_size = len(tokenizer.word_index) + 1
    # Translate each word in text file to the matching vocabulary index
    sequences = tokenizer.texts_to_sequences(X)
    # The longest HTML file
    max_length = max(len(s) for s in sequences)

    # Initialize our final input to the model
    X, y, image_data = list(), list(), list()
    for img_no, seq in enumerate(sequences):
        for i in range(1, len(seq)):
            # Add the entire sequence to the input and only keep the next word for the output
            in_seq, out_seq = seq[:i], seq[i]
            # If the sentence is shorter than max_length, fill it up with empty words
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # Map the output to one-hot encoding
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Add an image corresponding to the HTML file
            image_data.append(features[img_no])
            # Cut the input sentence to 100 tokens, and add it to the input data
            X.append(in_seq[-100:])
            y.append(out_seq)

    X, y, image_data = np.array(X), np.array(y), np.array(image_data)

    # Create the encoder
    image_features = Input(shape=(8, 8, 1536,))
    image_flat = Flatten()(image_features)
    image_flat = Dense(128, activation='relu')(image_flat)
    ir2_out = RepeatVector(max_caption_len)(image_flat)

    language_input = Input(shape=(max_caption_len,))
    language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
    language_model = LSTM(256, return_sequences=True)(language_model)
    language_model = LSTM(256, return_sequences=True)(language_model)
    language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)

    # Create the decoder
    decoder = concatenate([ir2_out, language_model])
    decoder = LSTM(512, return_sequences=False)(decoder)
    decoder_output = Dense(vocab_size, activation='softmax')(decoder)

    # Compile the model
    model = Model(inputs=[image_features, language_input], outputs=decoder_output)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Train the neural network
    model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)

    # map an integer to a word
    def word_for_id(integer, tokenizer):
        for word, index in tokenizer.word_index.items():
            if index == integer:
                return word
        return None

    # generate a description for an image
    def generate_desc(model, tokenizer, photo, max_length):
        # seed the generation process
        in_text = 'START'
        # iterate over the whole length of the sequence
        for i in range(900):
            # integer encode input sequence
            sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
            # pad input
            sequence = pad_sequences([sequence], maxlen=max_length)
            # predict next word
            yhat = model.predict([photo,sequence], verbose=0)
            # convert probability to integer
            yhat = np.argmax(yhat)
            # map integer to word
            word = word_for_id(yhat, tokenizer)
            # stop if we cannot map the word
            if word is None:
                break
            # append as input for generating the next word
            in_text += ' ' + word
            # Print the prediction
            print(' ' + word, end='')
            # stop if we predict the end of the sequence
            if word == 'END':
                break
        return

    # Load an image, preprocess it for IR2, extract features and generate the HTML
    test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
    test_image = np.array(test_image, dtype=float)
    test_image = preprocess_input(test_image)
    test_features = IR2.predict(np.array([test_image]))
    generate_desc(model, tokenizer, np.array(test_features), 100)