Graph Convolutional Neural Networks for Web-Scale Recommender Systems
ABSTRACT
Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. However, making these methods practical and scalable to web-scale recommendation tasks with billions of items and hundreds of millions of users remains a challenge.
Here we describe a large-scale deep recommendation engine that we developed and deployed at Pinterest. We develop a data-efficient Graph Convolutional Network (GCN) algorithm PinSage, which combines efficient random walks and graph convolutions to generate embeddings of nodes (i.e., items) that incorporate both graph structure as well as node feature information. Compared to prior GCN approaches, we develop a novel method based on highly efficient random walks to structure the convolutions and design a novel training strategy that relies on harder-and-harder training examples to improve robustness and convergence of the model.
We deploy PinSage at Pinterest and train it on 7.5 billion examples on a graph with 3 billion nodes representing pins and boards, and 18 billion edges. According to offline metrics, user studies and A/B tests, PinSage generates higher-quality recommendations than comparable deep learning and graph-based alternatives. To our knowledge, this is the largest application of deep graph embeddings to date and paves the way for a new generation of web-scale recommender systems based on graph convolutional architectures.
1 INTRODUCTION
Deep learning methods have an increasingly critical role in recommender system applications, being used to learn useful low-dimensional embeddings of images, text, and even individual users [9, 12]. The representations learned using deep models can be used to complement, or even replace, traditional recommendation algorithms like collaborative filtering, and these learned representations have high utility because they can be re-used in various recommendation tasks. For example, item embeddings learned using a deep model can be used for item-item recommendation and also to recommend themed collections (e.g., playlists, or “feed” content).
Recent years have seen significant developments in this space—especially the development of new deep learning methods that are capable of learning on graph-structured data, which is fundamental for recommendation applications (e.g., to exploit user-to-item interaction graphs as well as social graphs) [6, 19, 21, 24, 29, 30].
Most prominent among these recent advancements is the success of deep learning architectures known as Graph Convolutional Networks (GCNs) [19, 21, 24, 29]. The core idea behind GCNs is to learn how to iteratively aggregate feature information from local graph neighborhoods using neural networks (Figure 1). Here a single “convolution” operation transforms and aggregates feature information from a node’s one-hop graph neighborhood, and by stacking multiple such convolutions information can be propagated across far reaches of a graph. Unlike purely content-based deep models (e.g., recurrent neural networks [3]), GCNs leverage both content information as well as graph structure. GCN-based methods have set a new standard on countless recommender system benchmarks (see [19] for a survey). However, these gains on benchmark tasks have yet to be translated to gains in real-world production environments.
The main challenge is to scale both the training as well as inference of GCN-based node embeddings to graphs with billions of nodes and tens of billions of edges. Scaling up GCNs is difficult because many of the core assumptions underlying their design are violated when working in a big data environment. For example, all existing GCN-based recommender systems require operating on the full graph Laplacian during training—an assumption that is infeasible when the underlying graph has billions of nodes and whose structure is constantly evolving.
Present work. Here we present a highly-scalable GCN framework that we have developed and deployed in production at Pinterest. Our framework, a random-walk-based GCN named PinSage, operates on a massive graph with 3 billion nodes and 18 billion edges—a graph that is 10,000× larger than typical applications of GCNs. PinSage leverages several key insights to drastically improve the scalability of GCNs:
• On-the-fly convolutions: Traditional GCN algorithms perform graph convolutions by multiplying feature matrices by powers of the full graph Laplacian. In contrast, our PinSage algorithm performs efficient, localized convolutions by sampling the neighborhood around a node and dynamically constructing a computation graph from this sampled neighborhood. These dynamically constructed computation graphs (Fig. 1) specify how to perform a localized convolution around a particular node, and alleviate the need to operate on the entire graph during training.
• Producer-consumer minibatch construction: We develop a producer-consumer architecture for constructing minibatches that ensures maximal GPU utilization during model training. A large-memory, CPU-bound producer efficiently samples node network neighborhoods and fetches the necessary features to define local convolutions, while a GPU-bound TensorFlow model consumes these pre-defined computation graphs to efficiently run stochastic gradient descent.
• Efficient MapReduce inference: Given a fully-trained GCN model, we design an efficient MapReduce pipeline that can distribute the trained model to generate embeddings for billions of nodes, while minimizing repeated computations.
Figure 1: Overview of our model architecture using depth-2 convolutions (best viewed in color). Left: A small example input graph. Right: The 2-layer neural network that computes the embedding hA(2) of node A using the previous-layer representations, hA(1), of node A and of its neighborhood N(A) (nodes B, C, D). (However, the notion of neighborhood is general and not all neighbors need to be included (Section 3.2).) Bottom: The neural networks that compute embeddings of each node of the input graph. While the neural networks differ from node to node, they all share the same set of parameters (i.e., the parameters of the convolve(1) and convolve(2) functions; Algorithm 1). Boxes with the same shading patterns share parameters; γ denotes an importance pooling function; and thin rectangular boxes denote densely-connected multi-layer neural networks.
In addition to these fundamental advancements in scalability, we also introduce new training techniques and algorithmic innovations. These innovations improve the quality of the representations learned by PinSage, leading to significant performance gains in downstream recommender system tasks:
• Constructing convolutions via random walks: Taking full neighborhoods of nodes to perform convolutions (Fig. 1) would result in huge computation graphs, so we resort to sampling. However, random sampling is suboptimal, and we develop a new technique using short random walks to sample the computation graph. An additional benefit is that each node now has an importance score, which we use in the pooling/aggregation step.
• Importance pooling: A core component of graph convolutions is the aggregation of feature information from local neighborhoods in the graph. We introduce a method to weigh the importance of node features in this aggregation based upon random-walk similarity measures, leading to a 46% performance gain in offline evaluation metrics.
• Curriculum training: We design a curriculum training scheme, where the algorithm is fed harder-and-harder examples during training, resulting in a 12% performance gain.
We have deployed PinSage for a variety of recommendation tasks at Pinterest, a popular content discovery and curation application where users interact with pins, which are visual bookmarks to online content (e.g., recipes they want to cook, or clothes they want to purchase). Users organize these pins into boards, which contain collections of similar pins. Altogether, Pinterest is the world’s largest user-curated graph of images, with over 2 billion unique pins collected into over 1 billion boards.
Through extensive offline metrics, controlled user studies, and A/B tests, we show that our approach achieves state-of-the-art performance compared to other scalable deep content-based recommendation algorithms, in both an item-item recommendation task (i.e., related-pin recommendation), as well as a “homefeed” recommendation task. In offline ranking metrics we improve over the best performing baseline by more than 40%, in head-to-head human evaluations our recommendations are preferred about 60% of the time, and the A/B tests show 30% to 100% improvements in user engagement across various settings.
To our knowledge, this is the largest-ever application of deep graph embeddings and paves the way for a new generation of recommendation systems based on graph convolutional architectures.
2 RELATED WORK
Our work builds upon a number of recent advancements in deep learning methods for graph-structured data.
The notion of neural networks for graph data was first outlined in Gori et al. (2005) [15] and further elaborated on in Scarselli et al. (2009) [27]. However, these initial approaches to deep learning on graphs required running expensive neural “message-passing” algorithms to convergence and were prohibitively expensive on large graphs. Some limitations were addressed by Gated Graph Sequence Neural Networks [22]—which employs modern recurrent neural architectures—but the approach remains computationally expensive and has mainly been used on graphs with <10,000 nodes.
More recently, there has been a surge of methods that rely on the notion of “graph convolutions” or Graph Convolutional Networks (GCNs). This approach originated with the work of Bruna et al. (2013), which developed a version of graph convolutions based on spectral graph theory [7]. Following on this work, a number of authors proposed improvements, extensions, and approximations of these spectral convolutions [6, 10, 11, 13, 18, 21, 24, 29, 31], leading to new state-of-the-art results on benchmarks such as node classification, link prediction, as well as recommender system tasks (e.g., the MovieLens benchmark [24]). These approaches have consistently outperformed techniques based upon matrix factorization or random walks (e.g., node2vec [17] and DeepWalk [26]), and their success has led to a surge of interest in applying GCN-based methods to applications ranging from recommender systems [24] to drug design [20, 31]. Hamilton et al. (2017b) [19] and Bronstein et al. (2017) [6] provide comprehensive surveys of recent advancements.
However, despite the successes of GCN algorithms, no previous works have managed to apply them to production-scale data with billions of nodes and edges—a limitation that is primarily due to the fact that traditional GCN methods require operating on the entire graph Laplacian during training. Here we fill this gap and show that GCNs can be scaled to operate in a production-scale recommender system setting involving billions of nodes/items. Our work also demonstrates the substantial impact that GCNs have on recommendation performance in a real-world environment.
In terms of algorithm design, our work is most closely related to Hamilton et al. (2017a)’s GraphSAGE algorithm [18] and the closely related follow-up work of Chen et al. (2018) [8]. GraphSAGE is an inductive variant of GCNs that we modify to avoid operating on the entire graph Laplacian. We fundamentally improve upon GraphSAGE by removing the limitation that the whole graph be stored in GPU memory, using low-latency random walks to sample
graph neighborhoods in a producer-consumer architecture. We also introduce a number of new training techniques to improve performance and a MapReduce inference pipeline to scale up to graphs with billions of nodes.
Lastly, also note that graph embedding methods like node2vec [17] and DeepWalk [26] cannot be applied here. First, these are unsupervised methods. Second, they cannot include node feature information. Third, they directly learn embeddings of nodes and thus the number of model parameters is linear with the size of the graph, which is prohibitive for our setting.
3 METHOD
In this section, we describe the technical details of the PinSage architecture and training, as well as a MapReduce pipeline to efficiently generate embeddings using a trained PinSage model.
The key computational workhorse of our approach is the notion of localized graph convolutions. To generate the embedding for a node (i.e., an item), we apply multiple convolutional modules that aggregate feature information (e.g., visual, textual features) from the node’s local graph neighborhood (Figure 1). Each module learns how to aggregate information from a small graph neighborhood, and by stacking multiple such modules, our approach can gain information about the local network topology. Importantly, parameters of these localized convolutional modules are shared across all nodes, making the parameter complexity of our approach independent of the input graph size.
3.1 Problem Setup
Pinterest is a content discovery application where users interact with pins, which are visual bookmarks to online content (e.g., recipes they want to cook, or clothes they want to purchase). Users organize these pins into boards, which contain collections of pins that the user deems to be thematically related. Altogether, the Pinterest graph contains 2 billion pins, 1 billion boards, and over 18 billion edges (i.e., memberships of pins to their corresponding boards).
Our task is to generate high-quality embeddings or representa-tions of pins that can be used for recommendation (e.g., via nearest-neighbor lookup for related pin recommendation, or for use in a downstream re-ranking system). In order to learn these embed-dings, we model the Pinterest environment as a bipartite graph consisting of nodes in two disjoint sets, I (containing pins) and C (containing boards). Note, however, that our approach is also naturally generalizable, with I being viewed as a set of items and C as a set of user-defined contexts or collections.
In addition to the graph structure, we also assume that the pins/items u ∈ I are associated with real-valued attributes, xu ∈ Rd. In general, these attributes may specify metadata or content information about an item, and in the case of Pinterest, we have that pins are associated with both rich text and image features. Our goal is to leverage both these input attributes as well as the structure of the bipartite graph to generate high-quality embeddings. These embeddings are then used for recommender system candidate generation via nearest neighbor lookup (i.e., given a pin, find related pins) or as features in machine learning systems for ranking the candidates.
For notational convenience and generality, when we describe the PinSage algorithm, we simply refer to the node set of the full graph with V = I ∪ C and do not explicitly distinguish between pin and board nodes (unless strictly necessary), using the more general term “node” whenever possible.
3.2 Model Architecture
We use localized convolutional modules to generate embeddings for nodes. We start with input node features and then learn neural networks that transform and aggregate features over the graph to compute the node embeddings (Figure 1).
Forward propagation algorithm. We consider the task of generating an embedding, zu, for a node u, which depends on the node’s input features and the graph structure around this node.
The core of our PinSage algorithm is a localized convolution operation, where we learn how to aggregate information from u’s neighborhood (Figure 1). This procedure is detailed in Algorithm 1 convolve. The basic idea is that we transform the representations zv, ∀v ∈ N(u), of u’s neighbors through a dense neural network and then apply an aggregator/pooling function (e.g., an element-wise mean or weighted sum, denoted as γ) on the resulting set of vectors (Line 1). This aggregation step provides a vector representation, nu, of u’s local neighborhood, N(u). We then concatenate the aggregated neighborhood vector nu with u’s current representation hu and transform the concatenated vector through another dense neural network layer (Line 2). Empirically we observe significant performance gains when using the concatenation operation instead of the average operation as in [21]. Additionally, the normalization in Line 3 makes training more stable, and it is more efficient to perform approximate nearest neighbor search for normalized embeddings (Section 3.5). The output of the algorithm is a representation of u that incorporates both information about itself and its local graph neighborhood.
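To make this concrete, the following is a minimal NumPy sketch of a single convolve step as described above. The parameter names (Q, q, W, w) and the three numbered steps follow Algorithm 1, but the shapes, the ReLU nonlinearity, and the weighted-mean choice of γ are illustrative assumptions rather than the production implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def convolve(h_u, h_neighbors, alpha, Q, q, W, w):
    """Sketch of one localized convolution (cf. Algorithm 1).

    h_u         : (d,) current representation of node u
    h_neighbors : (T, d) representations of u's importance-based neighbors
    alpha       : (T,) L1-normalized random-walk visit counts (importance weights)
    Q, q, W, w  : learned layer parameters, shared across all nodes
    """
    # Line 1: transform each neighbor, then importance-pool (weighted mean, i.e., gamma)
    transformed = relu(h_neighbors @ Q.T + q)            # (T, m)
    n_u = (alpha[:, None] * transformed).sum(axis=0)     # (m,) weights sum to 1

    # Line 2: concatenate with u's own representation and transform again
    z_u = relu(np.concatenate([h_u, n_u]) @ W.T + w)     # (m,)

    # Line 3: normalize, for training stability and efficient ANN search
    return z_u / np.linalg.norm(z_u)
```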
Importance-based neighborhoods. An important innovation in our approach is how we define node neighborhoods N(u), i.e., how we select the set of neighbors to convolve over in Algorithm 1. Whereas previous GCN approaches simply examine k-hop graph neighborhoods, in PinSage we define importance-based neighborhoods, where the neighborhood of a node u is defined as the T nodes that exert the most influence on node u. Concretely, we simulate random walks starting from node u and compute the L1-normalized visit count of nodes visited by the random walk [14]. The neighborhood of u is then defined as the top T nodes with the highest normalized visit counts with respect to node u.
The advantages of this importance-based neighborhood definition are two-fold. First, selecting a fixed number of nodes to aggregate from allows us to control the memory footprint of the algorithm during training [18]. Second, it allows Algorithm 1 to take into account the importance of neighbors when aggregating the vector representations of neighbors. In particular, we implement γ in Algorithm 1 as a weighted-mean, with weights defined according to the L1 normalized visit counts. We refer to this new approach as importance pooling.
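A hedged sketch of how such an importance-based neighborhood could be derived from simulated random walks follows. The walk count, walk length, and T below are illustrative, and in the bipartite pin-board setting one would typically keep only visited nodes of the relevant type.

```python
import random
from collections import Counter

def importance_neighborhood(graph, u, T=50, num_walks=200, walk_length=3):
    """Sketch: the T nodes most visited by short random walks starting from u.

    graph : dict mapping each node to a list of its neighbors
    Returns (neighbors, weights), where weights are L1-normalized visit counts.
    """
    visits = Counter()
    for _ in range(num_walks):
        v = u
        for _ in range(walk_length):
            v = random.choice(graph[v])
            visits[v] += 1
    visits.pop(u, None)                            # do not count the query node itself
    top = visits.most_common(T)                    # T most-visited nodes
    total = sum(count for _, count in top)
    neighbors = [v for v, _ in top]
    weights = [count / total for _, count in top]  # importance weights used by gamma
    return neighbors, weights
```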
Stacking convolutions. Each time we apply the convolve operation (Algorithm 1) we get a new representation for a node, and we can stack multiple such convolutions on top of each other in order to gain more information about the local graph structure around node u. In particular, we use multiple layers of convolutions, where the inputs to the convolutions at layer k depend on the representations output from layer k − 1 (Figure 1) and where the initial (i.e., “layer 0”) representations are equal to the input node features. Note that the model parameters in Algorithm 1 (Q, q, W, and w) are shared across the nodes but differ between layers.
3.3 Model Training
We first describe our margin-based loss function in detail. Following this, we give an overview of several techniques we developed that lead to the computational efficiency and fast convergence rate of PinSage, allowing us to train on billion-node graphs and billions of training examples. Finally, we describe our curriculum-training scheme, which improves the overall quality of the recommendations.
Loss function. In order to train the parameters of the model, we use a max-margin-based loss function. The basic idea is that we want to maximize the inner product of positive examples, i.e., the embedding of the query item and the corresponding related item. At the same time we want to ensure that the inner product of negative examples—i.e., the inner product between the embedding of the query item and an unrelated item—is smaller than that of the positive sample by some pre-defined margin. The loss function for a single pair of node embeddings (zq, zi), (q, i) ∈ L, is hence

JG(zq, zi) = E_{nk ∼ Pn(q)} max{0, zq · znk − zq · zi + Δ},    (1)

where Pn(q) denotes the distribution of negative examples for item q, and Δ denotes the margin hyper-parameter.
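For reference, a minimal NumPy sketch of this hinge loss for one training pair and a shared set of sampled negatives; the margin value below is a placeholder, not the production setting.

```python
import numpy as np

def max_margin_loss(z_q, z_pos, z_negs, margin=0.1):
    """Sketch of the max-margin loss in Equation (1).

    z_q    : (d,) embedding of the query item
    z_pos  : (d,) embedding of the related (positive) item
    z_negs : (N, d) embeddings of sampled negative items
    margin : the hyper-parameter Delta (illustrative value)
    """
    pos_score = z_q @ z_pos                # inner product with the positive item
    neg_scores = z_negs @ z_q              # inner products with the negative items
    # Hinge: penalize negatives whose score comes within `margin` of the positive score
    return np.maximum(0.0, neg_scores - pos_score + margin).mean()
```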
Multi-GPU training with large minibatches. To make full use of multiple GPUs on a single machine for training, we run the forward and backward propagation in a multi-tower fashion. With multiple GPUs, we first divide each minibatch (Figure 1 bottom) into equal-sized portions. Each GPU takes one portion of the minibatch and performs the computations using the same set of parameters. After backward propagation, the gradients for each parameter across all GPUs are aggregated together, and a single step of synchronous SGD is performed. Due to the need to train on an extremely large number of examples (on the scale of billions), we run our system with large batch sizes, ranging from 512 to 4096.
We use techniques similar to those proposed by Goyal et al. [16] to ensure fast convergence and maintain training and generalization accuracy when dealing with large batch sizes. We use a gradual warmup procedure that increases the learning rate from a small value to a peak value in the first epoch according to the linear scaling rule. Afterwards the learning rate is decreased exponentially.
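A minimal sketch of such a warmup-then-decay schedule; the peak rate, base rate, and per-epoch decay factor are illustrative placeholders rather than the values used in production.

```python
def learning_rate(step, steps_per_epoch, base_lr=1e-5, peak_lr=1e-3, decay=0.95):
    """Linear warmup to peak_lr over the first epoch, then exponential decay per epoch."""
    epoch = step / steps_per_epoch
    if epoch < 1.0:
        return base_lr + (peak_lr - base_lr) * epoch   # linear warmup (scaling rule)
    return peak_lr * decay ** (epoch - 1.0)            # exponential decay afterwards
```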
Producer-consumer minibatch construction. During training, the adjacency list and the feature matrix for billions of nodes are placed in CPU memory due to their large size. However, during the convolve step of PinSage, each GPU process needs access to the neighborhood and feature information of nodes in the neighborhood. Accessing the data in CPU memory from the GPU is not efficient. To solve this problem, we use a re-indexing technique to create a sub-graph G′ = (V′, E′) containing nodes and their neighborhood, which will be involved in the computation of the current minibatch. A small feature matrix containing only node features relevant to computation of the current minibatch is also extracted such that the order is consistent with the index of nodes in G′. The adjacency list of G′ and the small feature matrix are fed into GPUs at the start of each minibatch iteration, so that no communication between the GPU and CPU is needed during the convolve step, greatly improving GPU utilization.
The training procedure has alternating usage of CPUs and GPUs. The model computations are in GPUs, whereas extracting features, re-indexing, and negative sampling are computed on CPUs. In addition to parallelizing GPU computation with multi-tower training, and CPU computation using OpenMP [25], we design a producer-consumer pattern to run GPU computation at the current iteration and CPU computation at the next iteration in parallel. This further reduces the training time by almost a half.
Sampling negative items. Negative sampling is used in our loss function (Equation (1)) as an approximation of the normalization factor of edge likelihood [23]. To improve efficiency when training with large batch sizes, we sample a set of 500 negative items to be shared by all training examples in each minibatch. This drastically reduces the number of embeddings that need to be computed during each training step, compared to running negative sampling for each node independently. Empirically, we do not observe a difference between the performance of the two sampling schemes.
In the simplest case, we could just uniformly sample negative examples from the entire set of items. However, ensuring that the inner product of the positive example (pair of items (q, i)) is larger than the inner product between q and each of the 500 negative items is too “easy” and does not provide fine enough “resolution” for the system to learn. In particular, our recommendation algorithm should be capable of finding the 1,000 most relevant items to q among the catalog of over 2 billion items. In other words, our model should be able to distinguish/identify 1 item out of 2 million items. But with 500 random negative items, the model’s resolution is only 1 out of 500. Thus, if we sample 500 random negative items out of 2 billion items, the chance of any of these items being even slightly related to the query item is small. Therefore, with large probability the learning will not make good parameter updates and will not be able to differentiate slightly related items from the very related ones.
To solve the above problem, for each positive training example (i.e., item pair (q, i)), we add “hard” negative examples, i.e., items that are somewhat related to the query item q, but not as related as the positive item i. We call these “hard negative items”. They are generated by ranking items in a graph according to their Personalized PageRank scores with respect to query item q [14]. Items ranked at 2000-5000 are randomly sampled as hard negative items. As illustrated in Figure 2, the hard negative examples are more similar to the query than random negative examples, and are thus challenging for the model to rank, forcing the model to learn to distinguish items at a finer granularity.
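A hedged sketch of this hard-negative selection, assuming per-query Personalized PageRank scores are already available; the rank window follows the 2000-5000 range described above, and the sample size of 6 mirrors the setup reported in Section 4.1.

```python
import random

def sample_hard_negatives(ppr_scores, query, positive, k=6, lo=2000, hi=5000):
    """Sketch: sample k hard negatives for `query` from items ranked lo..hi by PPR.

    ppr_scores : dict mapping item -> Personalized PageRank score w.r.t. `query`
    """
    ranked = sorted(ppr_scores, key=ppr_scores.get, reverse=True)   # best first
    candidates = [v for v in ranked[lo:hi] if v not in (query, positive)]
    return random.sample(candidates, k)
```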
Using hard negative items throughout the training procedure doubles the number of epochs needed for the training to converge. To help with convergence, we develop a curriculum training scheme [4]. In the first epoch of training, no hard negative items are used, so that the algorithm quickly finds an area in the parameter space where the loss is relatively small. We then add hard negative items in subsequent epochs, focusing the model to learn how to distinguish highly related pins from only slightly related ones. At epoch n of the training, we add n − 1 hard negative items to the set of negative items for each item.
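A small sketch of the schedule this implies; the helper below is illustrative and simply combines the shared random negatives with a growing number of hard negatives per epoch.

```python
def negatives_for_epoch(epoch, shared_negatives, hard_negative_pool):
    """Curriculum sketch: epoch 1 uses only the shared random negatives;
    at epoch n, n - 1 hard negatives are added for each training example."""
    n_hard = max(0, epoch - 1)
    return list(shared_negatives) + list(hard_negative_pool[:n_hard])
```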
Figure 2: Random negative examples and hard negative examples. Notice that the hard negative example is significantly more similar to the query than the random negative example, though not as similar as the positive example.
3.4 Node Embeddings via MapReduce
After the model is trained, it is still challenging to directly apply the trained model to generate embeddings for all items, including those that were not seen during training. Naively computing embeddings for nodes using Algorithm 2 leads to repeated computations caused by the overlap between K-hop neighborhoods of nodes. As illustrated in Figure 1, many nodes are repeatedly computed at multiple layers when generating the embeddings for different target nodes. To ensure efficient inference, we develop a MapReduce approach that runs model inference without repeated computations.
We observe that inference of node embeddings very nicely lends itself to MapReduce computational model. Figure 3 details the data flow on the bipartite pin-to-board Pinterest graph, where we assume the input (i.e., “layer-0”) nodes are pins/items (and the layer-1 nodes are boards/contexts). The MapReduce pipeline has two key parts:
(1) One MapReduce job is used to project all pins to a low-dimensional latent space, where the aggregation operation will be performed (Algorithm 1, Line 1).
(2) Another MapReduce job is then used to join the resulting pin representations with the ids of the boards they occur in, and the board embedding is computed by pooling the features of its (sampled) neighbors.
Note that our approach avoids redundant computations and that the latent vector for each node is computed only once. After the embeddings of the boards are obtained, we use two more MapReduce jobs to compute the second-layer embeddings of pins, in a similar fashion as above, and this process can be iterated as necessary (up to K convolutional layers).
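A toy, single-process analogue of the two jobs described above, meant only to illustrate why every latent vector is computed exactly once; the project and aggregate arguments stand in for Algorithm 1, Line 1 and the neighborhood pooling, and are assumptions for illustration.

```python
def mapreduce_inference_sketch(pins, boards, pin_features, board_members, project, aggregate):
    """Toy analogue of the two MapReduce jobs for one layer of inference."""
    # Job 1 (map): project every pin into the latent space exactly once
    pin_latent = {p: project(pin_features[p]) for p in pins}
    # Job 2 (join + reduce): pool the (sampled) member-pin vectors of each board
    board_emb = {b: aggregate([pin_latent[p] for p in board_members[b]]) for b in boards}
    return pin_latent, board_emb
```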
3.5 Efficient nearest-neighbor lookups
The embeddings generated by PinSage can be used for a wide range of downstream recommendation tasks, and in many settings we can directly use these embeddings to make recommendations by performing nearest-neighbor lookups in the learned embedding space. That is, given a query item q, we can recommend items whose embeddings are the K-nearest neighbors of the query item’s embedding. Approximate KNN can be obtained efficiently via locality-sensitive hashing [2]. After the hash function is computed, retrieval of items can be implemented with a two-level retrieval process based on the Weak AND operator [5]. Given that the PinSage model is trained offline and all node embeddings are computed via MapReduce and saved in a database, the efficient nearest-neighbor lookup operation enables the system to serve recommendations in an online fashion.
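Because the embeddings are L2-normalized (Section 3.2), cosine similarity reduces to a dot product. The brute-force sketch below illustrates the lookup semantics only; the deployed system instead relies on locality-sensitive hashing with the two-level Weak AND retrieval.

```python
import numpy as np

def top_k_related(query_emb, item_embs, k=500):
    """Exact K-nearest-neighbor lookup over L2-normalized embeddings (sketch)."""
    scores = item_embs @ query_emb            # cosine similarity, since embeddings are unit-norm
    top = np.argpartition(-scores, k)[:k]     # indices of the K best (unordered)
    return top[np.argsort(-scores[top])]      # those K indices, sorted by score descending
```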
4 EXPERIMENTS
To demonstrate the efficiency of PinSage and the quality of the embeddings it generates, we conduct a comprehensive suite of experiments on the entire Pinterest object graph, including offline experiments, production A/B tests as well as user studies.
4.1 Experimental Setup
We evaluate the embeddings generated by PinSage in two tasks: recommending related pins and recommending pins in a user’s home/news feed. To recommend related pins, we select the K nearest neighbors to the query pin in the embedding space. We evaluate performance on this related-pin recommendation task using both offline ranking measures as well as a controlled user study. For the homefeed recommendation task, we select the pins that are closest in the embedding space to one of the most recently pinned items by the user. We evaluate performance of a fully-deployed production system on this task using A/B tests to measure the overall impact on user engagement.
Training details and data preparation. We define the set, L, of positive training examples (Equation (1)) using historical user engagement data. In particular, we use historical user engagement data to identify pairs of pins (q, i), where a user interacted with pin i immediately after she interacted with pin q. We use all other pins as negative items (and sample them as described in Section 3.3). Overall, we use 1.2 billion pairs of positive training examples (in addition to 500 negative examples per batch and 6 hard negative examples per pin). Thus in total we use 7.5 billion training examples.
Since PinSage can efficiently generate embeddings for unseen data, we only train on a subset of the Pinterest graph and then generate embeddings for the entire graph using the MapReduce pipeline described in Section 3.4. In particular, for training we use a randomly sampled subgraph of the entire graph, containing 20% of all boards (and all the pins touched by those boards) and 70% of the labeled examples. During hyperparameter tuning, a remaining 10% of the labeled examples are used. And, when testing, we run inference on the entire graph to compute embeddings for all 2 billion pins, and the remaining 20% of the labeled examples are used to test the recommendation performance of our PinSage in the offline evaluations. Note that training on a subset of the full graph drastically decreased training time, with a negligible impact on final performance. In total, the full datasets for training and evaluation are approximately 18TB in size with the full output embeddings being 4TB.
Features used for learning. Each pin at Pinterest is associated with an image and a set of textual annotations (title, description). To generate feature representation xq for each pin q, we concatenate visual embeddings (4,096 dimensions), textual annotation embeddings (256 dimensions), and the log degree of the node/pin in the graph. The visual embeddings are the 6-th fully connected layer of a classification network using the VGG-16 architecture [28]. Textual annotation embeddings are trained using a Word2Vec-based model [23], where the context of an annotation consists of other annotations that are associated with each pin.
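A minimal sketch of the resulting feature-vector construction, assuming the visual and annotation embeddings of the stated dimensions have already been computed:

```python
import numpy as np

def pin_feature_vector(visual_emb, annotation_emb, degree):
    """Sketch: concatenate the 4,096-d visual embedding, the 256-d annotation
    embedding, and the log degree of the pin, giving a 4,353-d input x_q."""
    assert visual_emb.shape == (4096,) and annotation_emb.shape == (256,)
    return np.concatenate([visual_emb, annotation_emb, [np.log(max(degree, 1))]])
```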
Baselines for comparison. We evaluate the performance of Pin-Sage against the following state-of-the-art content-based, graph-based and deep learning baselines that generate embeddings of pins:
(1) Visual embeddings (Visual): Uses nearest neighbors of deep visual embeddings for recommendations. The visual features are described above.
(2) Annotation embeddings (Annotation): Recommends based on nearest neighbors in terms of annotation embeddings. The annotation embeddings are described above.
(3) Combined embeddings (Combined): Recommends based on concatenating visual and annotation embeddings, and using a 2-layer multi-layer perceptron to compute embeddings that capture both visual and annotation features.
(4) Graph-based method (Pixie): This random-walk-based method [14] uses biased random walks to generate ranking scores by simulating random walks starting at query pin q. Items with top K scores are retrieved as recommendations. While this approach does not generate pin embeddings, it is currently the state-of-the-art at Pinterest for certain recommendation tasks [14] and thus an informative baseline.
The visual and annotation embeddings are state-of-the-art deep learning content-based systems currently deployed at Pinterest to generate representations of pins. Note that we do not compare against other deep learning baselines from the literature simply due to the scale of our problem. We also do not consider non-deep learning approaches for generating item/content embeddings, since other works have already proven state-of-the-art performance of deep learning approaches for generating such embeddings [9, 12, 24].
We also conduct ablation studies and consider several variants of PinSage when evaluating performance:
• max-pooling uses the element-wise max as a symmetric aggregation function (i.e., γ = max) without hard negative samples;
• mean-pooling uses the element-wise mean as a symmetric aggregation function (i.e., γ = mean);
• mean-pooling-xent is the same as mean-pooling but uses the cross-entropy loss introduced in [18];
• mean-pooling-hard is the same as mean-pooling, except that it incorporates hard negative samples as detailed in Section 3.3;
• PinSage uses all optimizations presented in this paper, including the use of importance pooling in the convolution step.
The max-pooling and cross-entropy settings are extensions of the best-performing GCN model from Hamilton et al. [18]—other variants (e.g., based on Kipf et al. [21]) performed significantly worse in development tests and are omitted for brevity. For all the above variants, we used K = 2, hidden dimension size m = 2048, and set the embedding dimension d to be 1024.
Computation resources. Training of PinSage is implemented in TensorFlow [1] and run on a single machine with 32 cores and 16 Tesla K80 GPUs. To ensure fast fetching of item’s visual and annotation features, we store them in main memory, together with the graph, using Linux HugePages to increase the size of virtual memory pages from 4KB to 2MB. The total amount of memory used in training is 500GB. Our MapReduce inference pipeline is run on a Hadoop2 cluster with 378 d2.8xlarge Amazon AWS nodes.
Table 1: Hit-rate and MRR for PinSage and content-based deep learning baselines. Overall, PinSage improves the hit rate by 150% and the MRR by 60% over the best baseline.
4.2 Offline Evaluation
To evaluate performance on the related pin recommendation task, we define the notion of hit-rate. For each positive pair of pins (q, i) in the test set, we use q as a query pin and then compute its top K nearest neighbors NNq from a sample of 5 million test pins. We then define the hit-rate as the fraction of queries q where i was ranked among the top K of the test sample (i.e., where i ∈ NNq). This metric directly measures the probability that recommendations made by the algorithm contain the items related to the query pin q. In our experiments K is set to be 500.
We also evaluate the methods using Mean Reciprocal Rank (MRR), which takes into account the rank of the item i among recommended items for query item q:

MRR = (1/n) Σ_{(q,i)∈L} 1 / ⌈Ri,q / 100⌉.    (2)
Due to the large pool of candidates (more than 2 billion), we use a scaled version of the MRR in Equation (2), where Ri,q is the rank of item i among recommended items for query q, and n is the total number of labeled item pairs. The scaling factor 100 ensures that, for example, the difference between rank at 1,000 and rank at 2,000 is still noticeable, instead of being very close to 0.
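A short sketch of both offline metrics as defined above, assuming the rank Ri,q of each positive item among the recommendations for its query has already been computed:

```python
import math

def offline_metrics(ranks, K=500):
    """Sketch: hit-rate@K and the scaled MRR of Equation (2).

    ranks : list of ranks R_{i,q} (1-based) of each positive item i among the
            recommendations returned for its query q.
    """
    hit_rate = sum(r <= K for r in ranks) / len(ranks)
    mrr = sum(1.0 / math.ceil(r / 100) for r in ranks) / len(ranks)   # Equation (2)
    return hit_rate, mrr
```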
Table 1 compares the performance of the various approaches using the hit rate as well as the MRR. PinSage with our new importance-pooling aggregation and hard negative examples achieves the best performance at 67% hit-rate and 0.59 MRR, outperforming the top baseline by 40% absolute (150% relative) in terms of the hit rate and also 22% absolute (60% relative) in terms of MRR. We also observe that combining visual and textual information works much better than using either one alone (60% improvement of the combined approach over visual/annotation only).
Embedding similarity distribution. Another indication of the effectiveness of the learned embeddings is that the distances between random pairs of item embeddings are widely distributed. If all items are at about the same distance (i.e., the distances are tightly clustered) then the embedding space does not have enough “resolution” to distinguish between items of different relevance. Figure 4 plots the distribution of cosine similarities between pairs of items using annotation, visual, and PinSage embeddings. This distribution of cosine similarity between random pairs of items demonstrates the effectiveness of PinSage, which has the most spread out distribution. In particular, the kurtosis of the cosine similarities of PinSage embeddings is 0.43, compared to 2.49 for annotation embeddings and 1.20 for visual embeddings.
Another important advantage of having such a wide spread in the embeddings is that it reduces the collision probability of the subsequent LSH algorithm, thus increasing the efficiency of serving the nearest neighbor pins during recommendation.
Figure 4: Probability density of pairwise cosine similarity for visual embeddings, annotation embeddings, and PinSage embeddings.
4.3 User Studies
We also investigate the effectiveness of PinSage by performing head-to-head comparison between different learned representations. In the user study, a user is presented with an image of the query pin, together with two pins retrieved by two different recommendation algorithms. The user is then asked to choose which of the two candidate pins is more related to the query pin. Users are instructed to find various correlations between the recommended items and the query item, in aspects such as visual appearance, object category and personal identity. If both recommended items seem equally related, users have the option to choose “equal”. If no consensus is reached among 2/3 of users who rate the same question, we deem the result as inconclusive.
Table 2 shows the results of the head-to-head comparison between PinSage and the 4 baselines. Among items for which the user has an opinion of which is more related, around 60% of the preferred items are recommended by PinSage. Figure 5 gives examples of recommendations and illustrates strengths and weaknesses of the different methods. The image to the left represents the query item. Each row to the right corresponds to the top recommendations made by the visual embedding baseline, annotation embedding baseline, Pixie, and PinSage. Although visual embeddings generally predict categories and visual similarity well, they occasionally make large mistakes in terms of image semantics. In this example, visual information confused plants with food, and tree logging with war photos, due to similar image style and appearance. The graph-based Pixie method, which uses the graph of pin-to-board relations, correctly understands that the category of query is “plants” and it recommends items in that general category. However, it does not find the most relevant items. Combining both visual/textual and graph information, PinSage is able to find relevant items that are both visually and topically similar to the query item.
In addition, we visualize the embedding space by randomly choosing 1000 items and computing the 2D t-SNE coordinates from the PinSage embedding, as shown in Figure 6. We observe that the proximity of the item embeddings corresponds well with the similarity of content, and that items of the same category are embedded into the same part of the space. Note that items that are visually different but have the same theme are also close to each other in the embedding space, as seen by the items depicting different fashion-related items on the bottom side of the plot.
4.4 Production A/B Test
Lastly, we also report on the production A/B test experiments, which compared the performance of PinSage to other deep learning content-based recommender systems at Pinterest on the task of homefeed recommendations. We evaluate the performance by observing the lift in user engagement. The metric of interest is repin rate, which measures the percentage of homefeed recommendations that have been saved by the users. A user saving a pin to a board is a high-value action that signifies deep engagement of the user. It means that a given pin presented to a user at a given time was relevant enough for the user to save that pin to one of their boards so that they can retrieve it later.
We find that PinSage consistently recommends pins that are more likely to be re-pinned by the user than the alternative methods. Depending on the particular setting, we observe 10-30% improvements in repin rate over the Annotation and Visual embedding based recommendations.
4.5 Training and Inference Runtime Analysis
One advantage of GCNs is that they can be made inductive [19]: at the inference (i.e., embedding generation) step, we are able to compute embeddings for items that were not in the training set. This allows us to train on a subgraph to obtain model parameters, and then generate embeddings for nodes that have not been observed during training. Also note that it is easy to compute embeddings of new nodes that get added into the graph over time. This means that recommendations can be made on the full (and constantly growing) graph. Experiments on development data demonstrated that training on a subgraph containing 300 million items could achieve the best performance in terms of hit-rate (i.e., further increases in the training set size did not seem to help), reducing the runtime by a factor of 6 compared to training on the full graph.
Table 3 shows the effect of the batch size of the minibatch SGD on the runtime of the PinSage training procedure, using the mean-pooling-hard variant. For varying batch sizes, the table shows: (1) the computation time, in milliseconds, for each minibatch; (2) the number of iterations needed for the model to converge; and (3) the total estimated time for the training procedure. Experiments show that a batch size of 2048 makes training most efficient.
When training the PinSage variant with importance pooling, another trade-off comes from choosing the size of the neighborhood T. Table 4 shows the runtime and performance of PinSage when T = 10, 20, and 50. We observe a diminishing return as T increases, and find that a two-layer GCN with neighborhood size 50 can best capture the neighborhood information of nodes, while still being computationally efficient.
After training completes, due to the highly efficient MapReduce inference pipeline, the whole inference procedure to generate embeddings for 3 billion items can finish in less than 24 hours.
Figure 5: Examples of Pinterest pins recommended by different algorithms. The image to the left is the query pin. Recommended items to the right are computed using Visual embeddings, Annotation embeddings, graph-based Pixie, and PinSage.
Figure 6: t-SNE plot of item embeddings in 2 dimensions.
Table 3: Runtime comparisons for different batch sizes.
Table 4: Performance tradeoffs for importance pooling.
5 CONCLUSION
We proposed PinSage, a random-walk graph convolutional network (GCN). PinSage is a highly-scalable GCN algorithm capable of learning embeddings for nodes in web-scale graphs containing billions of objects. In addition to new techniques that ensure scalability, we introduced the use of importance pooling and curriculum training that drastically improved embedding performance. We deployed PinSage at Pinterest and comprehensively evaluated the quality of the learned embeddings on a number of recommendation tasks, with offline metrics, user studies and A/B tests all demonstrating a substantial improvement in recommendation performance. Our work demonstrates the impact that graph convolutional methods can have in a production recommender system, and we believe that PinSage can be further extended in the future to tackle other graph representation learning problems at large scale, including knowledge graph reasoning and graph clustering.
Acknowledgments
The authors acknowledge Raymond Hsu, Andrei Curelea and Ali Altaf for performing various A/B tests in the production system, Jerry Zitao Liu for providing data used by Pixie [14], and Vitaliy Kulikov for help in nearest neighbor query of the item embeddings.
REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] A. Andoni and P. Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS.
[3] T. Bansal, D. Belanger, and A. McCallum. 2016. Ask the GRU: Multi-task learning for deep text recommendations. In RecSys. ACM.
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. 2009. Curriculum learning. In ICML.
[5] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In CIKM.
[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine 34, 4 (2017).
[7] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. 2014. Spectral networks and locally connected networks on graphs. In ICLR.
[8] J. Chen, T. Ma, and C. Xiao. 2018. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. ICLR (2018).
[9] P. Covington, J. Adams, and E. Sargin. 2016. Deep neural networks for youtube recommendations. In RecSys. ACM.
[10] H. Dai, B. Dai, and L. Song. 2016. Discriminative Embeddings of Latent Variable Models for Structured Data. In ICML.
[11] M. Defferrard, X. Bresson, and P. Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS.
[12] A. Van den Oord, S. Dieleman, and B. Schrauwen. 2013. Deep content-based music recommendation. In NIPS.
[13] D. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS.
[14] C. Eksombatchai, P. Jindal, J. Z. Liu, Y. Liu, R. Sharma, C. Sugnet, M. Ulrich, and J. Leskovec. 2018. Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time. WWW (2018).
[15] M. Gori, G. Monfardini, and F. Scarselli. 2005. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks.
[16] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
[17] A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD.
[18] W. L. Hamilton, R. Ying, and J. Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
[19] W. L. Hamilton, R. Ying, and J. Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. IEEE Data Engineering Bulletin (2017).
[20] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. 2016. Molecular graph convolutions: moving beyond fingerprints. CAMD 30, 8.
[21] T. N. Kipf and M. Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
[22] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. 2015. Gated graph sequence neural networks. In ICLR.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
[24] F. Monti, M. M. Bronstein, and X. Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In NIPS.
[25] OpenMP Architecture Review Board. 2015. OpenMP Application Program Interface Version 4.5. (2015).
[26] B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. DeepWalk: Online learning of social representations. In KDD.
[27] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2009), 61–80.
[28] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] R. van den Berg, T. N. Kipf, and M. Welling. 2017. Graph Convolutional Matrix Completion. arXiv preprint arXiv:1706.02263 (2017).
[30] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec. 2018. GraphRNN: Generating Realistic Graphs using Deep Auto-regressive Models. ICML (2018).
[31] M. Zitnik, M. Agrawal, and J. Leskovec. 2018. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics (2018).