Using deep learning to automatically generate HTML code


Author: Emil Wallner

Compiled by 机器之心 (Synced)


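How to generate code from a front-end page mockup is a question we have followed for a long time. Building on papers such as pix2code, the author of this article constructs a front-end code generation model and explains in detail how to use LSTMs and CNNs to turn a design mockup into an HTML and CSS website.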


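Project link: github.com/emilwallner…

Within the next three years, deep learning will change front-end development.
It will speed up prototyping and lower the barrier to building software.

Tony Beltramelli published the paper "pix2code: Generating Code from a Graphical User Interface Screenshot" last year, and Airbnb has released Sketch2code (airbnb.design/sketching-i…).

At present, the biggest obstacle to automating front-end development is computing power.
Even so, we can already use current deep learning algorithms, together with synthetic training data, to explore how artificial intelligence can build front ends automatically.
In this article, the author teaches a neural network to write an HTML and CSS website from a picture of a design mockup.
Here is a brief overview of the process:

1) Give a design image to the trained neural network

2) The neural network converts the image into HTML markup

3) Render the output

We will build three models in three steps, from easy to hard. First, we build the simplest possible version to get a handle on the moving parts.
The second version, HTML, focuses on automating all the steps and briefly explains the neural network layers.
In the final version, Bootstrap, we create a model that can generalize, and we explore the LSTM layer.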

Code:

github.com/emilwallner…
www.floydhub.com/emilwallner…

All FloydHub notebooks are in the floydhub directory, and the local notebooks are in the local directory.

The models in this article build on Beltramelli's paper "pix2code: Generating Code from a Graphical User Interface Screenshot" and on Jason Brownlee's image captioning tutorials, and they are implemented in Python and Keras.

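Core logic

Our goal is to build a neural network that generates the HTML/CSS markup corresponding to a screenshot.

To train the network, you give it several screenshots together with their matching HTML.
It learns by predicting the matching HTML markup tags one at a time.
When it predicts the next markup tag, it receives the screenshot and all the correct markup tags up to that point.

Here is a simple example of the training data: docs.google.com/spreadsheet….

Building a model that predicts word by word is the most common approach today, and it is the one used in this tutorial.

Note: for each prediction the network receives the same screenshot.
So if it has to predict 20 words, it gets the same design mockup 20 times.
For now, don't worry about how the neural network works; focus only on its inputs and outputs.

Let's look at the previous markup first.
Say we train the network to predict the sentence "I can code".
When it receives "I", it predicts "can".
The next time, it receives "I can" and predicts "code".
It receives all the previous words but only predicts the next word.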
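The next-word setup described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `next_word_pairs` is hypothetical and does not come from the project code, and the screenshot that accompanies every pair is left out.

```python
def next_word_pairs(tokens):
    """Return (previous words, next word) pairs for one token sequence."""
    pairs = []
    for i in range(1, len(tokens)):
        # Everything before position i is the input, token i is the target
        pairs.append((tokens[:i], tokens[i]))
    return pairs


pairs = next_word_pairs(["I", "can", "code"])
for previous, target in pairs:
    print(previous, "->", target)
```

A sentence with n tokens therefore yields n - 1 training examples, each of which would be paired with the same screenshot.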

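The neural network creates features from the data.
It builds these features to link the input data with the output data.
It has to create representations to understand what is in each screenshot and the HTML syntax it needs to predict; all of this builds the knowledge needed to predict the next tag.
Applying the trained model in the real world is much like training it.

At that point we no longer feed in the correct HTML markup: the network receives the markup it has generated so far and predicts the next tag.
The prediction starts with a "start tag" and stops when the network predicts an "end tag" or reaches a maximum limit.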
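The generation loop described above might look roughly like the following sketch. The trained network is replaced by a hypothetical stand-in, `toy_predictor`, which simply follows a fixed script; the token names and the `max_tokens` parameter are likewise assumptions made for illustration.

```python
def generate(predict_next, max_tokens=10):
    """Generate tokens one at a time until <END> or the length limit."""
    tokens = ["<START>"]
    while len(tokens) < max_tokens:
        # The predictor only ever sees what has been generated so far
        next_token = predict_next(tokens)
        tokens.append(next_token)
        if next_token == "<END>":
            break
    return tokens


def toy_predictor(tokens):
    """Hypothetical stand-in for the trained network."""
    script = {
        "<START>": "<HTML>",
        "<HTML>": "Hello",
        "Hello": "World!",
        "World!": "</HTML>",
        "</HTML>": "<END>",
    }
    return script[tokens[-1]]


result = generate(toy_predictor)
capped = generate(toy_predictor, max_tokens=3)
print(" ".join(result))
print(" ".join(capped))
```

The second call shows the length limit at work: generation stops after three tokens even though no end tag has been predicted.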

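Hello World version

Now let's build a Hello World implementation.
We feed the neural network a screenshot of a page displaying "Hello World!" and train it to generate the corresponding markup.

First, the neural network maps the design mockup into a list of pixel values.
Each pixel has three channels (red, green, and blue), and each channel value lies between 0 and 255.

To represent the markup in a way the neural network understands, I use one-hot encoding.
The sentence "I can code" can therefore be mapped to a sequence of one-hot vectors, one per word.

In the original figure, the encoding also includes start and end tags.
These tags tell the network where to start its predictions and where to stop.
The original figure also shows the various combinations of these tags and their corresponding one-hot encodings.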
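As a rough illustration of one-hot encoding with start and end tags, the sketch below encodes the sentence "I can code". The vocabulary order and the helper `one_hot` are assumptions chosen for this example; they are not taken from the project code.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical vocabulary: the position in this list is the index of each word
vocabulary = ["start", "I", "can", "code", "end"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

sentence = ["start", "I", "can", "code", "end"]
encoded = [one_hot(word_to_index[word], len(vocabulary)) for word in sentence]

for word, vector in zip(sentence, encoded):
    print(f"{word:>5} {vector}")
```

Each row contains exactly one 1.0, so the size of every vector equals the size of the vocabulary.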

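We make each word change position from one training round to the next, which lets the model learn the sequence instead of memorizing the position of each word.
The original figure shows four predictions, one per row.
On the left are the three RGB channels of the image and the previous words; on the right are the prediction and the end tag, marked in red.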

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.layers import Dense, Input, LSTM, RepeatVector, concatenate
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img

# Length of longest sentence
max_caption_len = 3
# Size of vocabulary
vocab_size = 3

# Load one screenshot for each word and turn them into digits
images = []
for i in range(2):
    images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
images = preprocess_input(images)

# Turn start tokens into one-hot encoding
html_input = np.array(
    [[[0., 0., 0.],  # start
      [0., 0., 0.],
      [1., 0., 0.]],
     [[0., 0., 0.],  # start <HTML>Hello World!</HTML>
      [1., 0., 0.],
      [0., 1., 0.]]])

# Turn next word into one-hot encoding
next_words = np.array(
    [[0., 1., 0.],   # <HTML>Hello World!</HTML>
     [0., 0., 1.]])  # end

# Load the VGG16 model trained on imagenet and output the classification feature
VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
features = VGG.predict(images)

# Load the feature to the network, apply a dense layer, and repeat the vector
vgg_feature = Input(shape=(1000,))
vgg_feature_dense = Dense(5)(vgg_feature)
vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)

# Extract information from the input sequence
# (timesteps x vocabulary size; both happen to be 3 in this example)
language_input = Input(shape=(max_caption_len, vocab_size))
language_model = LSTM(5, return_sequences=True)(language_input)

# Concatenate the information from the image and the input
decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
decoder_output = Dense(vocab_size, activation='softmax')(decoder)

# Compile and run the neural network
model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

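The Hello World version uses three tokens: "start", "Hello World", and "end".
A character-level model needs a smaller vocabulary but constrains the neural network, whereas word-level tokens tend to perform better here.

Here is the code that makes the prediction: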

# Create an empty sentence and insert the start token
sentence = np.zeros((1, 3, 3))  # [[0,0,0], [0,0,0], [0,0,0]]
start_token = [1., 0., 0.]  # start
sentence[0][2] = start_token  # place start in empty sentence

# Making the first prediction with the start token
second_word = model.predict([np.array([features[1]]), sentence])

# Put the second word in the sentence and make the final prediction
sentence[0][1] = start_token
sentence[0][2] = np.round(second_word)
third_word = model.predict([np.array([features[1]]), sentence])

# Place the start token and our two predictions in the sentence
sentence[0][0] = start_token
sentence[0][1] = np.round(second_word)
sentence[0][2] = np.round(third_word)

# Transform our one-hot predictions into the final tokens
vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
    print(vocabulary[np.argmax(i)], end=' ')

Output

10 epochs: start start start
100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

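Mistakes I made:

Build the first working version before gathering the data. Early in this project, I managed to get a copy of an old archive of the Geocities hosting website, which held 38 million websites. But I overlooked the huge amount of work needed to reduce the 100K-sized vocabulary.

Training on a terabyte of data requires good hardware or a lot of patience. After my Mac ran into several problems, I ended up using a powerful remote server. Expect to rent a rig with 8 modern CPU cores and a 1 Gbps internet connection to have a decent workflow.

Nothing made sense until I understood the input and output data. The input, X, is one screenshot plus the previous markup tags; the output, Y, is the next markup tag. Once I understood this, everything else became easier to figure out. Trying out different architectures also became easier.

Image-to-code networks are really image captioning models in disguise. Even after I realized this, I still overlooked many image captioning papers because they did not seem cool enough. Once I took that perspective, my understanding of the problem space became much deeper.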

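Running the code on FloydHub

FloydHub is a training platform for deep learning. I came across it when I first started learning deep learning, and I have used it since to train and manage my deep learning experiments.
You can install it and run your first model within 10 minutes; in my view it is the best option for training models on cloud GPUs.
If you have not used FloydHub before, it takes about 10 minutes to install it and get familiar with it.

FloydHub: www.floydhub.com/

Clone the repository:

git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git

Log in and initialize the FloydHub command-line tool:

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Run a Jupyter notebook on a FloydHub cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

All the notebooks are in the floydhub directory.
Once the job is running, you can find the first notebook at floydhub/Helloworld/helloworld.ipynb.
For more detail on the flags, see the author's earlier write-up on this project.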

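HTML version

In this version, we focus on creating a scalable neural network model.
This version cannot yet predict HTML from arbitrary websites, but it is an essential step in exploring the dynamics of the problem.

Overview

If we expand the previous architecture into the structure shown in the original figure, it handles the recognition and conversion process more efficiently.

The architecture has two main parts: the encoder and the decoder.
The encoder is where we create the image features and the previous markup features.
Features are the building blocks the network creates to connect design mockups with markup.
At the end of the encoder, we attach the image features to each word of the previous markup.
The decoder then combines the design mockup features and the markup features to create a feature for the next tag, which is run through a fully connected layer to predict the next tag.

Design mockup features

Because we need to insert one screenshot for every word, this becomes a bottleneck when training the network.
So instead of using the images directly, we extract the information needed to generate the markup.
This information is encoded into image features by a pre-trained CNN, and we take the output of the layer just before the classification layer as the features.

We end up with 1536 feature maps of 8×8 pixels. Although they are hard for us to interpret intuitively, the neural network can extract the objects and the positions of the elements from these features.

Markup features

In the Hello World version, we used one-hot encoding to represent the markup.
In this version, we use a word embedding for the input and keep one-hot encoding for the output.
The way we structure each sentence stays the same, but the way we map each token changes.
One-hot encoding treats each word as an isolated unit, whereas a word embedding represents each input word as a list of real numbers that capture the relationships between markup tags.

The word embedding illustrated in the original article has 8 dimensions, but in practice the dimension usually varies between 50 and 500 depending on the size of the vocabulary.
The eight numbers for each word behave much like weights in a neural network, and they are tuned to capture how the words relate to each other (Mikolov et al., 2013).
This is how we start developing the markup features; these features, trained by the neural network, link the input data to the output data.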
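To make the idea of a word embedding concrete, here is a minimal sketch of an embedding lookup table. It assumes a tiny hypothetical vocabulary and random initial values; in a real model (for example a Keras `Embedding` layer) these numbers start out random as well and are then adjusted during training.

```python
import random

random.seed(0)

# Hypothetical vocabulary; index 0 is reserved for padding
vocabulary = ["<pad>", "start", "<div>", "</div>", "end"]
embedding_dim = 8

# One row of 8 real numbers per token (untrained, randomly initialized)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in vocabulary
]


def embed(token_ids):
    """Replace each token index with its row in the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([1, 2, 3, 4])
print(len(vectors), "tokens,", len(vectors[0]), "numbers each")
```

Unlike a one-hot vector, whose length grows with the vocabulary, the embedding keeps a fixed length (here 8) no matter how many tokens exist.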

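Encoder

We now feed the word embeddings into an LSTM, which returns a sequence of markup features.
These markup features are then fed into a TimeDistributed dense layer, which can be thought of as a fully connected layer with multiple inputs and outputs.

In parallel with the embedding and LSTM layers, the image features are first flattened into a single vector and then fed into a fully connected layer to extract high-level features.
These image features are then concatenated with the markup features to form the output of the encoder.

Markup features

As the original figure shows, we run the word embeddings through the LSTM layer, and every sentence is padded with zeros so that they all have the same vector length.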
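The zero padding mentioned above can be sketched without any library. The helper `pad_left` below is a hypothetical illustration that mirrors the default behaviour of Keras's `pad_sequences` (zeros are added on the left, and sequences that are too long keep only their last tokens).

```python
def pad_left(sequence, max_len, pad_value=0):
    """Left-pad a list of token ids with zeros, keeping the last max_len ids."""
    sequence = sequence[-max_len:]
    return [pad_value] * (max_len - len(sequence)) + sequence


print(pad_left([5, 7], 4))
print(pad_left([1, 2, 3, 4, 5], 3))
```

Padding on the left keeps the most recent tokens at the end of the input, right next to the position the model has to predict.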

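To mix signals and find higher-level patterns, we apply a TimeDistributed dense layer to the markup features.
A TimeDistributed dense layer is the same as a regular dense layer, except that it handles multiple inputs and outputs, one per timestep.

Image features

In the parallel branch, we flatten all the values of the image features into a single vector; the information is not changed, only reorganized.

As before, we mix the signals with a fully connected layer and extract higher-level notions.
Since we are dealing with only one input value here, a regular dense layer is enough.

Concatenating the image and markup features

All sentences are padded to create three markup features.
Since we have already prepared the image features, we can add one copy of the image features to each markup feature.

After copying the image features onto each markup feature, we get the new image-markup features, which are the input we feed into the decoder.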
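The copy-and-concatenate step can be sketched with plain lists. In the article's Keras code this is done by `RepeatVector` followed by `concatenate`; the function name and the tiny feature sizes below are assumptions chosen only to keep the example readable.

```python
def repeat_and_concatenate(image_feature, markup_features):
    """Attach one copy of the image feature to every markup feature."""
    return [image_feature + markup for markup in markup_features]


# Hypothetical sizes: a 2-value image feature and three 3-value markup features
image_feature = [0.9, 0.1]
markup_features = [
    [0.0, 0.0, 0.0],  # padding
    [0.2, 0.4, 0.6],
    [0.1, 0.3, 0.5],
]

combined = repeat_and_concatenate(image_feature, markup_features)
for row in combined:
    print(row)
```

Every timestep now carries both the image information and its own markup information, so the decoder sees them side by side.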

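Decoder

We now use the image-markup features to predict the next tag.

In the example from the original figure, we use three image-markup feature pairs and output one feature for the next tag.
Note that the LSTM layer should not return a vector as long as the input sequence; it only needs to predict one feature.
In our case, this feature is for the next tag, and it contains the information for the final prediction.

The final prediction

The dense layer works like a traditional feedforward network: it connects the 512 values in the next-tag feature to the four final predictions, one for each of the four words in our vocabulary: start, hello, world, and end.
The softmax function at the end of the dense layer produces a probability distribution over the four classes; for example, [0.1, 0.1, 0.1, 0.7] predicts the fourth word as the next tag.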
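A small sketch of the softmax step, assuming four made-up raw scores from the dense layer (the scores are hypothetical; only the four-word vocabulary comes from the text):

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


vocabulary = ["start", "hello", "world", "end"]
scores = [1.0, 1.0, 1.0, 3.0]  # hypothetical outputs of the dense layer

probabilities = softmax(scores)
predicted = vocabulary[probabilities.index(max(probabilities))]
print([round(p, 3) for p in probabilities], "->", predicted)
```

The word with the highest probability is chosen as the next tag, which is what `np.argmax` does in the article's prediction code.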

from os import listdir

import numpy as np
from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
from keras.layers import (Dense, Embedding, Flatten, Input, LSTM, RepeatVector,
                          TimeDistributed, concatenate)
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# Load the images and preprocess them for inception-resnet
images = []
all_filenames = listdir('images/')
all_filenames.sort()
for filename in all_filenames:
    images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
images = np.array(images, dtype=float)
images = preprocess_input(images)

# Run the images through inception-resnet and extract the features without the classification layer
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
features = IR2.predict(images)

# We will cap each input sequence to 100 tokens
max_caption_len = 100
# Initialize the function that will create our vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)

# Read a document and return a string
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

# Load all the HTML files
X = []
all_filenames = listdir('html/')
all_filenames.sort()
for filename in all_filenames:
    X.append(load_doc('html/'+filename))

# Create the vocabulary from the html files
tokenizer.fit_on_texts(X)

# Add +1 to leave space for empty words
vocab_size = len(tokenizer.word_index) + 1
# Translate each word in text file to the matching vocabulary index
sequences = tokenizer.texts_to_sequences(X)
# The longest HTML file
max_length = max(len(s) for s in sequences)

# Initialize our final input to the model
X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
    for i in range(1, len(seq)):
        # Add the entire sequence to the input and only keep the next word for the output
        in_seq, out_seq = seq[:i], seq[i]
        # If the sentence is shorter than max_length, fill it up with empty words
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # Map the output to one-hot encoding
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # Add an image corresponding to the HTML file
        image_data.append(features[img_no])
        # Cut the input sentence to 100 tokens, and add it to the input data
        X.append(in_seq[-100:])
        y.append(out_seq)

X, y, image_data = np.array(X), np.array(y), np.array(image_data)

# Create the encoder
image_features = Input(shape=(8, 8, 1536,))
image_flat = Flatten()(image_features)
image_flat = Dense(128, activation='relu')(image_flat)
ir2_out = RepeatVector(max_caption_len)(image_flat)

language_input = Input(shape=(max_caption_len,))
language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)

# Create the decoder
decoder = concatenate([ir2_out, language_model])
decoder = LSTM(512, return_sequences=False)(decoder)
decoder_output = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[image_features, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)

# Map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# Generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # Seed the generation process
    in_text = 'START'
    # Iterate over the whole length of the sequence
    for i in range(900):
        # Integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
        # Pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # Predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # Convert probability to integer
        yhat = np.argmax(yhat)
        # Map integer to word
        word = word_for_id(yhat, tokenizer)
        # Stop if we cannot map the word
        if word is None:
            break
        # Append as input for generating the next word
        in_text += ' ' + word
        # Print the prediction
        print(' ' + word, end='')
        # Stop if we predict the end of the sequence
        if word == 'END':
            break
    return

# Load an image, preprocess it for IR2, extract features and generate the HTML
test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
test_image = np.array(test_image, dtype=float)
test_image = preprocess_input(test_image)
test_features = IR2.predict(np.array([test_image]))
generate_desc(model, tokenizer, np.array(test_features), 100)

Output

Links to the websites generated after different numbers of training epochs:

250 epochs: emilwallner.github.io/html/250_ep…
350 epochs: emilwallner.github.io/html/350_ep…
450 epochs: emilwallner.github.io/html/450_ep…
550 epochs: emilwallner.github.io/html/550_ep…

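Mistakes I made:

I find LSTMs harder to understand than CNNs. Once I unrolled the LSTMs, they became easier to follow. Also, focus on the input and output features before trying to understand how LSTMs work.

Building a vocabulary from the ground up is much easier than narrowing down a huge vocabulary. This covers everything from fonts, div sizes, and hex colors to variable names and ordinary words.

Most libraries are built to parse text documents rather than code. Their documentation tells you how to split on spaces, but code needs custom parsing.

You can extract features with a model pre-trained on ImageNet. However, the loss is about 30% higher than with a pix2code model trained from scratch. I would also be interested in using an inception-resnet type of network pre-trained on web screenshots.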
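The point about custom parsing can be illustrated with a short regular-expression tokenizer. This is only a sketch under simple assumptions (well-formed tags, no `<` inside text) and is not the parsing used in the project; the function name `tokenize_markup` is hypothetical.

```python
import re


def tokenize_markup(html):
    """Split markup into tags and words without relying on spaces."""
    # Either a complete tag such as <H1> or </center>, or a run of text
    # characters that contains neither "<" nor whitespace
    return re.findall(r"<[^>]+>|[^<\s]+", html)


tokens = tokenize_markup("<HTML><center><H1>Hello World!</H1></center></HTML>")
print(tokens)
```

Splitting this string on spaces alone would give only two tokens, with tags and words glued together, which is why a text-oriented tokenizer is a poor fit for code.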

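Bootstrap version

In the final version, we use the dataset of generated Bootstrap websites from the pix2code paper.
By using Twitter's Bootstrap library (getbootstrap.com/), we can combine HTML and CSS and reduce the size of the vocabulary.

We will use this version to generate markup for screenshots it has not seen before.
We also take a closer look at how it builds prior knowledge about the screenshots and the markup.

Instead of training on the Bootstrap markup, we use 17 simplified tokens, which are then compiled into HTML and CSS.
The dataset (github.com/tonybeltram…) includes 1500 test screenshots and 250 validation screenshots.
Each screenshot has 65 tokens on average, giving 96925 training examples in total.

By tweaking the model from the pix2code paper slightly, we get it to predict the web components with 97% accuracy.

End-to-end approach

Extracting features from a pre-trained model works well in image captioning models.
But after a few experiments, I found that pix2code's end-to-end approach works better.
In our model, we replace the pre-trained image features with a lightweight convolutional neural network.
Instead of using max pooling to increase information density, we increase the strides.
This preserves the position and the color of the front-end elements.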
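To see how strides alone shrink the image, the sketch below traces the spatial size through the convolution stack of the Bootstrap model. The layer list (filters, strides, padding, 3×3 kernels, 256×256 input) is read from the article's model code; the helper `conv_output_size` is a hypothetical illustration of the standard output-size rules for "same" and "valid" padding.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    # "valid": no padding, the kernel must fit entirely inside the input
    return (size - kernel) // stride + 1


# (filters, stride, padding) for each Conv2D layer, all with 3x3 kernels
layers = [
    (16, 1, "valid"),
    (16, 2, "same"),
    (32, 1, "same"),
    (32, 2, "same"),
    (64, 1, "same"),
    (64, 2, "same"),
    (128, 1, "same"),
]

size = 256
for filters, stride, padding in layers:
    size = conv_output_size(size, 3, stride, padding)
    print(f"{filters:>3} filters, stride {stride}: {size} x {size}")

flattened = size * size * layers[-1][0]
print("values after Flatten:", flattened)
```

The three stride-2 layers take over the downsampling that pooling layers would otherwise do, but with learned weights instead of a fixed maximum.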

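There are two core models: convolutional neural networks (CNN) and recurrent neural networks (RNN).
The most commonly used recurrent neural network is the long short-term memory (LSTM) network.
I covered CNNs in an earlier article, so this article focuses on LSTMs.

Understanding timesteps in LSTMs

One of the harder things to grasp about LSTMs is timesteps.
A vanilla neural network can be thought of as having two timesteps: if you give it "Hello", it predicts "World".
But it struggles to predict more timesteps.
In the example from the original figure, the input has four timesteps, one for each word.

LSTMs are made for input with timesteps; they are neural networks tailored to information in sequence.
When you unroll the model, as in the original figure, you keep the same weights for every recurrent step.

The weighted input and output features are concatenated and passed through an activation function, and the result is the output for the current timestep.
Because we reuse the same weights, they draw information from several inputs and build up knowledge of the sequence.
The original article illustrates a simplified version of the process at each timestep of an LSTM.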
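The idea of reusing the same weights at every timestep can be sketched with a plain recurrent cell. Note that this is deliberately a simple recurrent unit, not a full LSTM (it has no gates or cell state), and the scalar weights are hypothetical values chosen for illustration.

```python
import math


def run_recurrent_cell(inputs, w_input, w_state, bias=0.0):
    """Unroll a one-unit recurrent cell over a sequence of scalar inputs."""
    state = 0.0
    outputs = []
    for x in inputs:
        # The same two weights are applied at every timestep
        state = math.tanh(w_input * x + w_state * state + bias)
        outputs.append(state)
    return outputs


# Four timesteps; only the first input carries a signal
outputs = run_recurrent_cell([1.0, 0.0, 0.0, 0.0], w_input=1.0, w_state=0.5)
print([round(value, 4) for value in outputs])
```

Even though the later inputs are zero, the outputs stay above zero: the state carries information from the first timestep forward, which is the "knowledge of the sequence" the text refers to.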

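Understanding the units in LSTM layers

The number of units in each LSTM layer determines its capacity to memorize, and it also corresponds to the size of each output feature.
Each unit in an LSTM layer learns to keep track of a different aspect of the syntax.
The original article visualizes one LSTM unit that keeps track of the information in the row div; this is the simple markup we use to train the bootstrap model.

Each LSTM unit maintains a cell state, which you can think of as its memory.
The weights and activations modify the state in different ways, which lets the LSTM layers fine-tune by keeping or forgetting information from each input.
Besides processing the current input and output, the LSTM unit also has to update its memory state and pass it on to the next timestep.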
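How a cell state is kept or forgotten can be sketched with a single scalar LSTM unit that follows the standard gate equations. All weights here are hypothetical; they are set by hand so that the forget gate is either almost fully open or almost fully closed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One timestep of a one-unit LSTM with scalar weights."""
    forget = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])
    insert = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])
    output = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])
    candidate = math.tanh(w["g_x"] * x + w["g_h"] * h_prev + w["g_b"])
    # Keep part of the old memory and add part of the new candidate
    c = forget * c_prev + insert * candidate
    h = output * math.tanh(c)
    return h, c


# Hypothetical weights: input and recurrent weights are zero, biases set the gates
base = {name: 0.0 for name in
        ["f_x", "f_h", "i_x", "i_h", "o_x", "o_h", "g_x", "g_h", "o_b", "g_b"]}
remember = dict(base, f_b=10.0, i_b=-10.0)  # forget gate open, input gate closed
forget_all = dict(base, f_b=-10.0, i_b=-10.0)  # both gates closed

_, kept = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, w=remember)
_, erased = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, w=forget_all)
print(round(kept, 4), round(erased, 4))
```

With the forget gate open the memory of 0.8 survives almost unchanged, and with it closed the memory is wiped; in a trained network these gate values depend on the input and the previous output instead of being fixed.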

from os import listdir

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import (Conv2D, Dense, Dropout, Embedding, Flatten, Input, LSTM,
                          RepeatVector, concatenate)
from keras.models import Model, Sequential
from keras.optimizers import RMSprop
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

dir_name = 'resources/eval_light/'

# Read a file and return a string
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

def load_data(data_dir):
    text = []
    images = []
    # Load all the files and order them
    all_filenames = listdir(data_dir)
    all_filenames.sort()
    for filename in all_filenames:
        if filename[-3:] == "npz":
            # Load the images already prepared in arrays
            image = np.load(data_dir+filename)
            images.append(image['features'])
        else:
            # Load the bootstrap tokens and wrap them in a start and end tag
            syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
            # Separate all the words with a single space
            syntax = ' '.join(syntax.split())
            # Add a space before each comma so it becomes its own token
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

train_features, texts = load_data(dir_name)

# Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

# Add one spot for the empty word in the vocabulary
vocab_size = len(tokenizer.word_index) + 1
# Map the input sentences into the vocabulary indexes
train_sequences = tokenizer.texts_to_sequences(texts)
# The longest set of bootstrap tokens
max_sequence = max(len(s) for s in train_sequences)
# Specify how many tokens to have in each input sentence
max_length = 48

def preprocess_data(sequences, features):
    X, y, image_data = list(), list(), list()
    for img_no, seq in enumerate(sequences):
        for i in range(1, len(seq)):
            # Add the sentence until the current count(i) and add the current count to the output
            in_seq, out_seq = seq[:i], seq[i]
            # Pad all the input token sentences to max_sequence
            in_seq = pad_sequences([in_seq], maxlen=max_sequence)[0]
            # Turn the output into one-hot encoding
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Add the corresponding image to the bootstrap token file
            image_data.append(features[img_no])
            # Cap the input sentence to 48 tokens and add it
            X.append(in_seq[-48:])
            y.append(out_seq)
    return np.array(X), np.array(y), np.array(image_data)

X, y, image_data = preprocess_data(train_sequences, train_features)

# Create the encoder
image_model = Sequential()
image_model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(256, 256, 3,)))
image_model.add(Conv2D(16, (3,3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(32, (3,3), activation='relu', padding='same'))
image_model.add(Conv2D(32, (3,3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
image_model.add(Conv2D(64, (3,3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(128, (3,3), activation='relu', padding='same'))

image_model.add(Flatten())
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))

image_model.add(RepeatVector(max_length))

visual_input = Input(shape=(256, 256, 3,))
encoded_image = image_model(visual_input)

language_input = Input(shape=(max_length,))
language_model = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
language_model = LSTM(128, return_sequences=True)(language_model)
language_model = LSTM(128, return_sequences=True)(language_model)

# Create the decoder
decoder = concatenate([encoded_image, language_model])
decoder = LSTM(512, return_sequences=True)(decoder)
decoder = LSTM(512, return_sequences=False)(decoder)
decoder = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[visual_input, language_input], outputs=decoder)
optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

# Save the model for every 2nd epoch
filepath = "org-weights-epoch-{epoch:04d}--val_loss-{val_loss:.4f}--loss-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_weights_only=True, period=2)
callbacks_list = [checkpoint]

# Train the model
model.fit([image_data, X], y, batch_size=64, shuffle=False, validation_split=0.1, callbacks=callbacks_list, verbose=1, epochs=50)

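Test accuracy

It is tricky to find a fair way to measure accuracy.
Say you compare word by word: if one word in your prediction is out of sync, you might get 0% accuracy.
If you remove one word so that the prediction lines up again, you might end up with 99/100.

I used the BLEU score, which is best practice in machine translation and image captioning models.
It breaks the sentence into four kinds of n-grams, sequences of 1 to 4 words.
In the prediction shown in the original figure, "cat" is supposed to be "code".

To get the final score, you multiply each score by 25%: (4/5) × 0.25 + (2/4) × 0.25 + (1/3) × 0.25 + (0/2) × 0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408.
The sum is then multiplied by a sentence-length penalty.
Since the length is correct in our example, the sum is our final score.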
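The arithmetic above can be reproduced with a short script. The token sequences are an assumption: a five-token reference in which "code" is the fourth token is one arrangement that yields exactly the fractions 4/5, 2/4, 1/3, and 0/2. The script follows the article's simplified weighted sum; standard BLEU implementations such as NLTK's `corpus_bleu` combine the precisions with a geometric mean instead, so their result for this example would differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token windows of a sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(reference, prediction, n):
    """Share of predicted n-grams that also occur in the reference."""
    ref_counts = Counter(ngrams(reference, n))
    pred_counts = Counter(ngrams(prediction, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matches / max(sum(pred_counts.values()), 1)


# Hypothetical token sequences chosen to match the fractions in the text
reference = ["start", "I", "can", "code", "end"]
prediction = ["start", "I", "can", "cat", "end"]

precisions = [ngram_precision(reference, prediction, n) for n in range(1, 5)]
score = sum(precision * 0.25 for precision in precisions)
print([round(p, 3) for p in precisions], round(score, 3))
```

Because the prediction has the same length as the reference, no length penalty applies and the weighted sum is the final score in this simplified scheme.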

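You could increase the number of n-grams; a model with four n-grams is the one that corresponds best to human translations.
I recommend reading the code below: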

import os

import numpy as np
from keras.models import model_from_json
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from nltk.translate.bleu_score import corpus_bleu
from numpy import argmax
from tqdm import tqdm
# Compiler ships with the project repository (import path assumed)
from compiler.classes.Compiler import Compiler

# Create a function to read a file and return its content
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

def load_data(data_dir):
    text = []
    images = []
    files_in_folder = os.listdir(data_dir)
    files_in_folder.sort()
    for filename in tqdm(files_in_folder):
        # Add an image
        if filename[-3:] == "npz":
            image = np.load(data_dir+filename)
            images.append(image['features'])
        else:
            # Add text and wrap it in a start and end tag
            syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
            # Separate each word with a space
            syntax = ' '.join(syntax.split())
            # Add a space before each comma so it becomes its own token
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

# Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary in a specific order
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

dir_name = '../../../../eval/'
train_features, texts = load_data(dir_name)

# Load model and weights
json_file = open('../../../../model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# Load weights into new model
loaded_model.load_weights("../../../../weights.hdf5")
print("Loaded model from disk")

# Map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

print(word_for_id(17, tokenizer))

# Generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    photo = np.array([photo])
    # Seed the generation process
    in_text = '<START> '
    # Iterate over the whole length of the sequence
    print('\nPrediction---->\n\n<START> ', end='')
    for i in range(150):
        # Integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # Pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # Predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # Convert probability to integer
        yhat = argmax(yhat)
        # Map integer to word
        word = word_for_id(yhat, tokenizer)
        # Stop if we cannot map the word
        if word is None:
            break
        # Append as input for generating the next word
        in_text += word + ' '
        # Stop if we predict the end of the sequence
        print(word + ' ', end='')
        if word == '<END>':
            break
    return in_text

max_length = 48

# Evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # Step over the whole set
    for i in range(len(descriptions)):
        yhat = generate_desc(model, tokenizer, photos[i], max_length)
        # Store actual and predicted
        print('\n\nReal---->\n\n' + descriptions[i])
        actual.append([descriptions[i].split()])
        predicted.append(yhat.split())
    # Calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu, actual, predicted

bleu, actual, predicted = evaluate_model(loaded_model, texts, train_features, tokenizer, max_length)

# Compile the tokens into HTML and CSS
dsl_path = "compiler/assets/web-dsl-mapping.json"
compiler = Compiler(dsl_path)
compiled_website = compiler.compile(predicted[0], 'index.html')

print(compiled_website)
print(bleu)

Output

Links to sample output:

Generated website 1 - Original 1 (emilwallner.github.io/bootstrap/r…)
Generated website 2 - Original 2 (emilwallner.github.io/bootstrap/r…)
Generated website 3 - Original 3 (emilwallner.github.io/bootstrap/r…)
Generated website 4 - Original 4 (emilwallner.github.io/bootstrap/r…)
Generated website 5 - Original 5 (emilwallner.github.io/bootstrap/r…)

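Mistakes I made:

Understand the weaknesses of the models instead of testing random models. At first I applied random things such as batch normalization and bidirectional networks, and I tried implementing attention. After looking at the test data and seeing that the model could not predict color and position with high accuracy, I realized there was a weakness in the CNN. This led me to replace max pooling with increased strides. The validation loss went from 0.12 to 0.02, and the BLEU score rose from 85% to 97%.

Only use pre-trained models if they are relevant. Given the small dataset, I thought a pre-trained image model would improve performance. From my experiments, an end-to-end model is slower to train and requires more memory, but it is 30% more accurate.

Plan for slight differences when you run your model on a remote server. On my Mac, the files were read in alphabetical order. On the server, however, they were read in random order. This created a mismatch between the code and the screenshots.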
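One way to guard against the ordering problem in the last point is to sort the directory listing and pair files by name rather than by position. The sketch below is an assumption-laden illustration: the `.npz` extension appears in the article's loading code, while the `.gui` extension for the token files and the helper name `paired_files` are hypothetical.

```python
import os
import tempfile


def paired_files(data_dir):
    """Pair each image archive with the markup file that shares its name."""
    names = sorted(os.listdir(data_dir))  # os.listdir gives no order guarantee
    markup_names = {name for name in names if name.endswith(".gui")}
    pairs = []
    for name in names:
        if not name.endswith(".npz"):
            continue
        markup = os.path.splitext(name)[0] + ".gui"
        if markup not in markup_names:
            raise ValueError("no markup file for " + name)
        pairs.append((name, markup))
    return pairs


# Demo with empty placeholder files created in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["b.gui", "a.npz", "b.npz", "a.gui"]:
        open(os.path.join(demo_dir, name), "w").close()
    pairs = paired_files(demo_dir)

print(pairs)
```

Matching on the shared file name makes a mismatch fail loudly instead of silently training on the wrong screenshot for a piece of markup.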

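Next steps

Front-end development is an ideal space in which to apply deep learning.
Data is easy to generate, and current deep learning algorithms can map most of the logic.
One of the most exciting areas is applying attention to LSTMs.
This will not only improve accuracy but also let us visualize where the CNN focuses as it generates the markup.
Attention is also key for communication between markup, stylesheets, scripts, and eventually the backend.
Attention layers can keep track of variables, enabling the network to communicate between programming languages.

But in the near future, the biggest impact will come from a scalable way to synthesize data.
Then you can add fonts, colors, and animations step by step.
So far, most progress has come from taking sketches and turning them into template apps.
In less than two years, we will be able to draw a sketch and have the corresponding front end within a second.
The Airbnb design team and Uizard have already built two working prototypes.
Here are some possible experiments: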

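Experiments

Getting started

Run all the models
Try different hyperparameters
Test a different CNN architecture
Add bidirectional LSTM models
Implement the model with a different dataset

Further experiments

Create a solid random app/web generator with the corresponding syntax.

Create data for a sketch-to-app model. Automatically convert the app/web screenshots into sketches and use a GAN to create variety.

Apply an attention layer to visualize the focus on the image for each prediction, similar to this model.

Create a framework for a modular approach. For example, have one encoder model for fonts, one for color, and another for layout, and combine them with one decoder. Solid image features would be a good start.

Feed the network simple HTML components and teach it to generate animations using CSS. It would be fascinating to use an attention approach and visualize the focus on both input sources.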

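Original article: blog.floydhub.com/turning-des…

This article was compiled by 机器之心 (Synced). Please contact this official account for authorization before reprinting.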