The concept of the knowledge graph was first proposed by Google in 2012 with the aim of building a more intelligent search engine, and after 2013 it began to spread in both academia and industry. Today, with the rapid development of artificial intelligence, knowledge graphs are widely used in search, recommendation, advertising, risk control, intelligent scheduling, speech recognition, robotics, and other fields.
As a core driving technology of artificial intelligence, the knowledge graph can ease deep learning's dependence on massive training data and large-scale computing power. It adapts to a wide range of downstream tasks and offers good interpretability. For these reasons, large Internet companies around the world are actively building their own knowledge graphs.
For example, in 2013 Facebook released Open Graph for intelligent search on its social network; in 2014 Baidu launched its knowledge graph, mainly used in search, assistant, and B2B (toB) business scenarios; in 2015 Alibaba launched its product knowledge graph, which plays a key role in front-end shopping guidance, platform governance, and intelligent Q&A; in 2017 Tencent launched Tencent Cloud Knowledge Graph, which supports scenarios such as financial search and entity risk prediction; and in 2018 Meituan launched Meituan Brain, whose knowledge graph has been deployed in multiple businesses such as intelligent search and recommendation and intelligent merchant operations.
Currently, domain graphs are concentrated mainly in business fields such as e-commerce, medical care, and finance, while semantic networks for automotive knowledge, and systematic guidance on how to build such knowledge graphs, are still lacking. This article takes knowledge in the automotive field as an example, focusing on entities and relationships such as car series, models, dealers, manufacturers, and brands. It offers an approach to building a domain graph from scratch, describes the steps and methods of knowledge graph construction in detail, and introduces several typical applications based on the graph.
The data source is the Autohome website. Autohome is an automotive service platform composed of multiple sections such as shopping guides, news, reviews, and word-of-mouth. It has accumulated a large amount of data on viewing, buying, and using cars. Building a knowledge graph organizes and mines this car-centered content, provides rich knowledge, gives a structured and precise description of interests, and supports several stages of recommendation, such as user cold start, recall, ranking, and display, to improve business results.
A knowledge graph is a semantic representation of the real world. Its basic unit is a triple, either [entity-relationship-entity] or [entity-attribute-attribute value]. Entities are connected to each other through relationships, forming a semantic network. Constructing the graph poses considerable challenges, but once built it offers rich application value in scenarios such as data analysis, recommendation computation, and interpretability.
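As a minimal sketch of the two triple forms described above, the snippet below stores a few facts as plain tuples and indexes them by entity. The relation name `belongs_to_brand`, the attribute name `price_10k_cny`, and the sample values are illustrative assumptions, not taken from the Autohome graph.

```python
from collections import defaultdict

# Hypothetical triples: (head, relation, tail) and (entity, attribute, value).
relation_triples = [
    ("Sylphy", "belongs_to_brand", "Nissan"),
    ("Corolla", "belongs_to_brand", "Toyota"),
]
attribute_triples = [
    ("Sylphy", "price_10k_cny", 10),
]

def build_graph(relations, attributes):
    """Index triples by head entity so neighbors and attributes are easy to read."""
    edges = defaultdict(list)
    props = defaultdict(dict)
    for head, rel, tail in relations:
        edges[head].append((rel, tail))
    for entity, attr, value in attributes:
        props[entity][attr] = value
    return edges, props

edges, props = build_graph(relation_triples, attribute_triples)
print(edges["Sylphy"])
print(props["Sylphy"])
```

Keeping relation triples and attribute triples separate mirrors the edge/property split that a property-graph store uses later in the article.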
Construction challenges:
Benefit:
The technical architecture is mainly divided into three layers: construction layer, storage layer and application layer. The architecture diagram is as follows:
According to the architecture diagram, the construction process can be divided into four steps: ontology design, knowledge acquisition, knowledge storage, and the design and use of application services.
An ontology is an agreed-upon collection of concepts. Constructing the ontology means building the ontology structure and knowledge framework of the knowledge graph according to the definition of the ontology.
The main reasons for constructing a graph based on ontology are as follows:
According to their knowledge coverage, knowledge graphs can be divided into general knowledge graphs and domain knowledge graphs. There are many examples of general knowledge graphs, such as Google's Knowledge Graph and Microsoft's Satori and Probase. Domain graphs cover specific industries such as finance and e-commerce. General graphs emphasize breadth and the integration of more entities, but they have lower accuracy requirements, and it is difficult to reason over them with the axioms, rules, and constraints of an ontology library. Domain graphs cover less knowledge but go deeper, and are usually built for a particular professional field.
Because of the accuracy requirements, domain ontologies tend to be constructed manually, using methods such as the representative seven-step method and the IDEF5 method [1]. The core idea of these methods is to start from existing structured data, carry out ontology analysis, summarize and construct an ontology that meets the application purpose and scope, and then optimize and verify it to obtain the first version of the ontology definition. To obtain a larger domain ontology, it can be supplemented from unstructured corpora. Because manual construction is labor-intensive, this article takes the automotive field as an example and presents a semi-automatic method of ontology construction. The detailed steps are as follows:
The above method uses deep learning techniques such as BERT to better capture the internal relationships within the corpus, uses clustering to build each module of the ontology hierarchically, and, supplemented by manual intervention, can complete the preliminary ontology construction quickly and accurately. The following figure is a schematic diagram of semi-automated ontology construction:
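As a rough code-level illustration of the clustering step (not a substitute for the figure), the sketch below groups terms by average-linkage agglomerative clustering over cosine distance. The 3-dimensional vectors are made-up stand-ins for BERT embeddings, and the threshold is an assumed value; the article does not specify which clustering algorithm or parameters were used.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def agglomerative(vectors, threshold):
    """Average-linkage clustering: merge the closest pair of clusters
    until no pair is closer than `threshold`."""
    clusters = [[name] for name in vectors]

    def linkage(c1, c2):
        dists = [cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        dist, i, j = min(
            ((linkage(c1, c2), i, j)
             for i, c1 in enumerate(clusters)
             for j, c2 in enumerate(clusters) if i < j),
            key=lambda t: t[0],
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy vectors standing in for BERT embeddings (assumed values).
toy_embeddings = {
    "sedan": [0.9, 0.1, 0.0],
    "SUV": [0.8, 0.2, 0.1],
    "dealer": [0.1, 0.9, 0.1],
    "manufacturer": [0.0, 0.8, 0.3],
}
concept_groups = agglomerative(toy_embeddings, threshold=0.2)
print(concept_groups)
```

Each resulting group is a candidate ontology module that a human reviewer would then name, split, or merge.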
With the Protégé ontology construction tool [2], ontology concept classes, relationships, attributes, and instances can be built. The following figure is a visual example of ontology construction:
This article divides the top-level ontology concepts in the automotive field into three categories: entities, events, and the label system:
1) The entity class represents conceptual entities with specific meanings, including vocabulary entities and automobile entities; automobile entities in turn include sub-entity types such as organizations and automobile concepts;
2) The label system represents label systems along various dimensions, including content classification, concept tags, interest tags, and other tags that describe the material;
3) The event class represents objective facts involving one or more roles, and there are evolutionary relationships between different types of events.
Protégé can export different types of schema configuration files; the owl.xml structure configuration file is shown in the figure below. This configuration file can be loaded directly into MySQL and JanusGraph to create the schema automatically.
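To make the idea of reading a schema out of an exported file concrete, here is a minimal sketch that parses a small hand-written OWL/XML fragment with Python's standard library. The class and property names are hypothetical, and the article does not describe its actual loader, so this only shows how class, relation, and subclass declarations could be collected before creating vertex and edge labels.

```python
import xml.etree.ElementTree as ET

OWL_NS = "{http://www.w3.org/2002/07/owl#}"

# A tiny hand-written OWL/XML fragment (hypothetical names).
OWL_XML = """<Ontology xmlns="http://www.w3.org/2002/07/owl#">
  <Declaration><Class IRI="#Entity"/></Declaration>
  <Declaration><Class IRI="#CarSeries"/></Declaration>
  <Declaration><ObjectProperty IRI="#belongsToBrand"/></Declaration>
  <SubClassOf><Class IRI="#CarSeries"/><Class IRI="#Entity"/></SubClassOf>
</Ontology>"""

def read_schema(owl_xml):
    """Collect class names, object properties, and (child, parent) pairs."""
    root = ET.fromstring(owl_xml)
    classes, relations, subclass_of = [], [], []
    for decl in root.findall(OWL_NS + "Declaration"):
        for child in decl:
            name = child.get("IRI").lstrip("#")
            if child.tag == OWL_NS + "Class":
                classes.append(name)
            elif child.tag == OWL_NS + "ObjectProperty":
                relations.append(name)
    for sub in root.findall(OWL_NS + "SubClassOf"):
        child_cls, parent_cls = (c.get("IRI").lstrip("#") for c in sub)
        subclass_of.append((child_cls, parent_cls))
    return classes, relations, subclass_of

classes, relations, subclass_of = read_schema(OWL_XML)
print(classes, relations, subclass_of)
```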
The data sources of a knowledge graph usually fall into three types: structured data, semi-structured data, and unstructured data. The key knowledge extraction techniques, and the technical difficulties to be solved, differ for each type of source.
Structured data is the most direct source of knowledge for the graph. It can generally be used after a preliminary conversion, and it costs the least compared with other types of data, so graph construction usually gives priority to structured data. Structured data may come from multiple databases and usually needs ETL to convert it into the target model. ETL stands for Extract, Transform, and Load. Extraction reads data from the various original business systems and is the prerequisite for all later work; transformation converts the extracted data according to pre-designed rules so that originally heterogeneous formats are unified; loading imports the transformed data into the data warehouse, incrementally or in full, as planned.
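The three ETL stages can be sketched with an in-memory SQLite database. The table names (`raw_series`, `series_entity`), the columns, and the cleaning rules are assumptions made for illustration; a production pipeline would read from the real business systems instead.

```python
import sqlite3

def extract(conn):
    """Extract: read raw rows from a (hypothetical) source table."""
    return conn.execute(
        "SELECT series_name, brand, price FROM raw_series").fetchall()

def transform(rows):
    """Transform: trim names, normalise brand casing, convert price to float."""
    return [(name.strip(), brand.strip().title(), float(price))
            for name, brand, price in rows]

def load(conn, rows):
    """Load: write the unified rows into an intermediate entity table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS series_entity "
        "(name TEXT PRIMARY KEY, brand TEXT, price REAL)")
    conn.executemany(
        "INSERT OR REPLACE INTO series_entity VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_series (series_name TEXT, brand TEXT, price TEXT)")
conn.executemany(
    "INSERT INTO raw_series VALUES (?, ?, ?)",
    [(" Sylphy ", "nissan", "10.0"), ("Corolla", " toyota", "11.5")])

load(conn, transform(extract(conn)))
result = conn.execute("SELECT * FROM series_entity ORDER BY name").fetchall()
print(result)
```

Using `INSERT OR REPLACE` keyed on the entity name makes the load step safe to rerun, which suits incremental imports.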
Through the above ETL process, data from different sources can be landed in intermediate tables to facilitate subsequent knowledge storage. The following figure is an example of the car series entity attribute table and relationship table:
Car series and brand relationship table:
In addition to structured data, unstructured data also contains a large amount of knowledge (triples). Generally speaking, an enterprise holds far more unstructured data than structured data, so mining knowledge from unstructured data can greatly expand and enrich the knowledge graph.
Challenges of triple extraction algorithms
Problem 1: Within a single domain, document content and formats are diverse, so a large amount of annotated data is needed and the cost is high.
Problem 2: Transfer across domains works poorly, and scaling to new domains is costly.
Models are mostly built for specific scenarios in specific industries; when the scenario changes, performance drops significantly.
Solution: the pre-train and fine-tune paradigm. Pre-training: a heavyweight base lets the model "see more", making full use of large-scale, multi-industry unlabeled documents to train a unified pre-trained base that strengthens the model's ability to represent and understand many kinds of documents.
Fine-tuning: a lightweight document structuring algorithm. On top of the pre-trained base, a lightweight document-oriented structuring algorithm is built to reduce labeling costs.
Pre-training method for documents
Pre-trained models for documents already exist. If the text is short, BERT can encode the entire document; however, our actual documents are usually relatively long, and many of the attribute values to be extracted exceed 1024 characters, so BERT's encoding would truncate them.
Advantages and shortcomings of long text pre-training methods
The Sparse Attention method optimizes self-attention to reduce its O(n^2) computation to O(n), greatly increasing the input text length. Although this raises the input length from the 512 of ordinary models to 4096, it still cannot fully solve the fragmentation caused by truncating text. Baidu proposed ERNIE-DOC [3], which uses a Recurrence Transformer and can in theory model text of unlimited length; however, because all of the text has to be fed in for modeling, it is very time-consuming.
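The gap between O(n^2) and O(n) can be made concrete by counting how many token pairs each scheme scores. The sketch below assumes a sliding-window pattern, which is one common form of sparse attention; the article does not say which sparse pattern is used, and the window size of 64 is an arbitrary choice.

```python
def full_attention_pairs(n):
    """Full self-attention scores every token against every token."""
    return n * n

def sliding_window_pairs(n, window):
    """Each token attends only to tokens within `window` positions
    on either side (clipped at the sequence edges)."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total

for n in (512, 4096):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=64))
```

With a fixed window, the pair count grows in proportion to n, which is why the usable input length can rise from 512 to 4096 without the quadratic blow-up.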
Neither of these two long-text pre-training methods considers document characteristics such as spatial and visual information. Their pre-training tasks are also designed for plain text as a whole and do not account for the logical structure of a document.
To address these shortcomings, we introduce DocBert [4], a pre-training model for long documents. The DocBert model is designed as follows:
Large-scale (million-level) unlabeled document data is used for pre-training, and self-supervised learning tasks are built on the document's text semantics (Text), layout information (Layout), and visual features (Visual), so that the model better understands document semantics and structure.
1. Layout-Aware MLM: takes the position and font size of the text into account in the masked language model to achieve layout-aware semantic understanding.
2. Text-Image Alignment: fuses the document's visual features and reconstructs the text masked in the image, helping the model learn the alignment between the text, layout, and image modalities.
3. Title Permutation: builds a title reconstruction task in a self-supervised manner to strengthen the model's understanding of the document's logical structure.
4. Sparse Transformer Layers: uses the Sparse Attention method to strengthen the model's ability to process long documents.
In addition to obtaining triples from structured and unstructured text, Autohome also mines the categories, concept tags, and interest keyword tags contained in materials, and establishes associations between materials and vehicle entities, which brings new knowledge to the automotive knowledge graph. The following describes some of Autohome's content understanding work and thinking from the perspective of classification, concept tags, and interest word tags.
The classification system is the basis for content description and divides materials at a coarse granularity. The unified content system is largely defined manually, and materials are assigned to categories by AI models. For classification, we use active learning to label data that is difficult to classify, and we also use data augmentation, adversarial training, and keyword fusion to improve classification performance.
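One simple way to pick "difficult to classify" samples for labeling is uncertainty sampling by prediction entropy. This is a minimal sketch under that assumption; the article does not state which active learning strategy is used, and the document ids and probabilities below are made up.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, k):
    """Return the ids of the k samples whose predictions are most uncertain."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

# (sample id, predicted class probabilities) from a hypothetical classifier.
predictions = [
    ("doc_a", [0.98, 0.01, 0.01]),
    ("doc_b", [0.40, 0.35, 0.25]),
    ("doc_c", [0.70, 0.20, 0.10]),
]
to_label = select_for_labeling(predictions, k=2)
print(to_label)
```

Samples with near-uniform predictions are sent to annotators first, so each labeled example adds as much information as possible.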
Concept labels sit between classification and interest word labels in granularity: they are finer than categories and describe points of interest more completely than interest words. We established three perspectives (car, person, and content), which enrich the label dimensions and refine label granularity. Rich and specific material tags make it easier to optimize tag-based search and recommendation models, and they can also be used for external tag placement to attract users and drive secondary traffic. Concept tags are mined by combining machine mining on important data such as queries with generalization analysis; after manual review, we obtain a set of concept tags and use a multi-label model for classification.
Interest word tags are the most fine-grained tags and map to user interests, so personalized recommendations can be made more precisely according to each user's preferences. Keyword mining combines several interest word mining methods, including KeyBERT to extract key substrings, together with methods such as TextRank, PositionRank, SingleRank, TopicRank, and MultipartiteRank, to generate interest word candidates.
The mined words are often highly similar to one another, so synonyms need to be identified, and doing this purely by hand is inefficient. We therefore use clustering to identify semantic similarity automatically. The features used for clustering include word2vec, BERT embeddings, and other hand-crafted features. After clustering and a final round of manual correction, we generated a batch of high-quality keywords offline.
For labels of different granularities, we still need to associate the labels with cars at the material level. First, we compute the labels of the title and the article separately, then identify the entities in the title and the article, which gives a number of label-entity pseudo-labels. Finally, over a large corpus, labels with a high co-occurrence probability are marked as labels of the entity. Through the above three tasks, we have obtained a rich and massive set of labels. Associating these tags with car series and entities greatly enriches our car graph and establishes car tags that reflect what the media and users pay attention to.
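The co-occurrence step can be sketched as a conditional frequency count over label-entity pseudo-labels. The threshold `min_prob`, the sample labels, and the use of P(label | entity) as the "co-occurrence probability" are assumptions for illustration; the article does not give the exact statistic.

```python
from collections import Counter

def associate_labels(docs, min_prob):
    """docs: list of (labels, entities) per article.
    Keep a label for an entity when P(label | entity) >= min_prob."""
    entity_count = Counter()
    pair_count = Counter()
    for labels, entities in docs:
        for entity in set(entities):
            entity_count[entity] += 1
            for label in set(labels):
                pair_count[(entity, label)] += 1
    result = {}
    for (entity, label), n in pair_count.items():
        if n / entity_count[entity] >= min_prob:
            result.setdefault(entity, []).append(label)
    return {entity: sorted(labels) for entity, labels in result.items()}

# Hypothetical pseudo-labels extracted from four articles.
docs = [
    (["fuel-efficient", "family car"], ["Sylphy"]),
    (["fuel-efficient"], ["Sylphy", "Corolla"]),
    (["off-road"], ["Corolla"]),
    (["fuel-efficient"], ["Sylphy"]),
]
entity_labels = associate_labels(docs, min_prob=0.6)
print(entity_labels)
```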
As training samples grow in scale, how to obtain better model quality and how to deal with high labeling costs and long labeling cycles have become urgent problems. First, we can use semi-supervised learning to pre-train on massive unlabeled data. Then active learning is used to maximize the value of annotated data by iteratively selecting high-information samples for annotation. Finally, distant supervision can exploit existing knowledge and uncover correlations between tasks. For example, once the graph and the titles are available, distant supervision can be used to construct NER training data from the graph.
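Constructing NER training data from the graph by distant supervision can be sketched as dictionary matching: entity names from the graph are looked up in tokenized titles and converted to BIO tags. The entity dictionary, the entity types, and the whitespace tokenization are assumptions; real titles would need a proper tokenizer and handling of ambiguous names.

```python
def distant_ner_labels(tokens, entity_dict):
    """Tag tokens with BIO labels by longest-match lookup.
    entity_dict maps a tuple of tokens to an entity type."""
    tags = ["O"] * len(tokens)
    max_len = max(len(name) for name in entity_dict)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token names win.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + length])
            if span in entity_dict:
                etype = entity_dict[span]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + etype
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return tags

# Hypothetical entity names exported from the graph.
graph_entities = {
    ("Nissan",): "BRAND",
    ("Sylphy",): "SERIES",
    ("Corolla", "Cross"): "SERIES",
}
title = ["Nissan", "Sylphy", "vs", "Toyota", "Corolla", "Cross"]
tags = distant_ner_labels(title, graph_entities)
print(list(zip(title, tags)))
```

Tokens missing from the graph (here "Toyota") stay tagged `O`, which is the typical source of noise in distantly supervised data.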
Knowledge in the knowledge graph is represented through the RDF structure, whose basic unit is a fact. Each fact is a triple (S, P, O). In practical systems, knowledge graph storage can be divided by storage method into storage based on the RDF table structure and storage based on the property graph structure. Graph databases mostly use the property graph structure. Common storage systems include Neo4j, JanusGraph, OrientDB, InfoGrid, and others.
After comparing JanusGraph with several mainstream graph databases such as Neo4j, ArangoDB, and OrientDB, we chose JanusGraph as the graph database for the project. The main reasons for choosing JanusGraph are as follows:
JanusGraph [5] is a graph database engine. It focuses on compact graph serialization, rich graph data modeling, and efficient query execution. The composition of the graph schema can be expressed by the following formula:
JanusGraph schema = vertex labels + edge labels + property keys
Note that property keys are usually what graph indexes are built on.
To achieve better graph query performance, JanusGraph provides indexes. Indexes are divided into graph indexes and vertex-centric indexes. Graph indexes include the composite index and the mixed index.
A composite index supports only equality lookups. It does not need an external index backend and is served by the primary storage backend (HBase, Cassandra, or BerkeleyDB, whichever is configured).
Example:
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()  // build a composite index on name and age
g.V().has('age', 30).has('name', '小明')  // find vertices whose name is 小明 (Xiao Ming) and whose age is 30
A mixed index requires Elasticsearch (ES) as the index backend and supports multi-condition queries beyond equality (equality queries are also supported, but a composite index is faster for them). Depending on whether tokenization is needed, mixed-index search is divided into full-text search and string search.
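The difference between equality-only and range-capable lookups can be illustrated with two toy in-memory indexes. This is only an analogy written for this article and not how JanusGraph or Elasticsearch implement their indexes: a hash map keyed by the exact value tuple stands in for the composite index, and a sorted list searched with `bisect` stands in for a range-capable index.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical vertices.
vertices = [
    {"id": 1, "name": "xuanyi", "price": 10},
    {"id": 2, "name": "kaluola", "price": 11},
    {"id": 3, "name": "xuanyi", "price": 15},
]

# Equality-only index: a hash map keyed by the exact (name, price) pair.
composite = defaultdict(list)
for v in vertices:
    composite[(v["name"], v["price"])].append(v["id"])

# Range-capable index: (price, id) pairs kept in sorted order.
by_price = sorted((v["price"], v["id"]) for v in vertices)
prices = [p for p, _ in by_price]

def price_between(low, high):
    """Return ids with low < price < high using binary search."""
    start = bisect_right(prices, low)
    end = bisect_left(prices, high)
    return [vid for _, vid in by_price[start:end]]

print(composite[("xuanyi", 10)])
print(price_between(8, 12))
```

The hash map answers an exact match in one step but cannot answer "price between 8 and 12" without scanning every key, which is the intuition behind routing range and text queries to a mixed index.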
Understanding how JanusGraph stores data helps us make better use of the graph database. JanusGraph stores graphs in adjacency list format, which means that a graph is stored as a collection of vertices together with their adjacency lists.
The adjacency list of a vertex contains all incident edges (and attributes) of the vertex.
JanusGraph stores each adjacency list as a row in the underlying storage backend. The (64-bit) vertex ID (uniquely assigned to each vertex by JanusGraph) is the key pointing to the row containing the vertex's adjacency list.
Each edge and attribute is stored as a separate cell in the row, allowing efficient insertion and deletion. Therefore, the maximum number of cells allowed per row in a particular storage backend is also the maximum degree of vertices that JanusGraph can support for that backend.
If the storage backend supports key order, the adjacency lists are sorted by vertex ID, and JanusGraph can assign vertex IDs so that the graph is partitioned effectively. IDs are assigned so that vertices that are frequently accessed together have IDs with a small absolute difference.
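The row-per-vertex layout described above can be mimicked with nested dictionaries. This is a toy model under simplifying assumptions (no serialization, no sort order, made-up cell keys and edge label); it only shows that each property and each incident edge occupies its own cell in the vertex's row, and that an edge appears in the rows of both endpoints.

```python
class AdjacencyStore:
    """Toy key-value layout: one row per vertex id, one cell per property or edge."""

    def __init__(self):
        self.rows = {}  # vertex_id -> {cell_key: cell_value}

    def add_vertex(self, vid, **props):
        row = self.rows.setdefault(vid, {})
        for key, value in props.items():
            row[("prop", key)] = value

    def add_edge(self, src, label, dst):
        # Written into both endpoint rows so each vertex sees all incident edges.
        self.rows.setdefault(src, {})[("out", label, dst)] = True
        self.rows.setdefault(dst, {})[("in", label, src)] = True

    def degree(self, vid):
        """Number of edge cells in the vertex's row."""
        return sum(1 for cell in self.rows[vid] if cell[0] in ("out", "in"))

store = AdjacencyStore()
store.add_vertex(1, name="xiaoming")
store.add_vertex(2, name="around 10w")
store.add_edge(1, "interested_in", 2)  # hypothetical edge label
print(store.rows[1])
print(store.degree(1), store.degree(2))
```

Because every edge is a cell in a row, the per-row cell limit of the backend bounds the degree a vertex can have, as the text notes.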
JanusGraph uses the Gremlin language for graph queries. We provide a unified graph query service so that external users do not need to care about the Gremlin implementation and can query through a common interface. We divide it into three interfaces: a conditional search interface, a node-centered outward query interface, and an inter-node path query interface. The following are several examples of the Gremlin implementation:
g.V().has('price', gt(8)).has('price', lt(12)).order().by('sales', desc).valueMap().limit(1)
Output:
==>{name=[xuanyi], price=[10], sales=[45767]}
The Sylphy (xuanyi) has the highest sales volume, 45767.
g.V(xiaoming).repeat(out()).times(2).valueMap()
g.V(xiaoming).repeat(out().simplePath()).until(or(has("car", 'name', 'kaluola'), has("car", 'name', 'xuanyi'))).path().by("name")
Output:
==>path[xiaoming, around 10w, kaluola]
==>path[xiaoming, around 10w, xuanyi]
The result shows that Xiao Ming is connected to each of these two cars through one intermediate node, "around 10w" (about 100,000).
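The same two-hop lookup can be sketched outside the graph database. The snippet below is only an illustration in plain Python: the node names are taken from the query output above, while the adjacency list and the helper `paths_of_length` are assumptions made for this example and are not part of the article's actual graph schema.

```python
from collections import deque

# Toy adjacency list. Node names follow the query output; the edges are assumed.
graph = {
    "xiaoming": ["around 10w"],
    "around 10w": ["kaluola", "xuanyi"],
    "kaluola": [],
    "xuanyi": [],
}

def paths_of_length(graph, start, hops):
    """Return every simple path that starts at `start` and has exactly `hops` edges."""
    results = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 == hops:
            results.append(path)
            continue
        for neighbor in graph.get(path[-1], []):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                queue.append(path + [neighbor])
    return results

two_hop_paths = paths_of_length(graph, "xiaoming", 2)
for path in two_hop_paths:
    print(path)
```

The middle element of each returned path is the shared node that links the user to a candidate car, which is the kind of evidence a path-based recommender can reuse as an explanation.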
A knowledge graph contains a large amount of non-Euclidean data. KG-based recommendation applications make effective use of this non-Euclidean data to improve the accuracy of the recommendation system, allowing it to achieve results that traditional systems cannot. KG-based recommendation methods fall into three categories: methods based on KG representation learning (KGE), path-based methods, and graph neural networks. This chapter introduces the applications of KG in recommendation systems, along with related papers, from three aspects: cold start, recommendation reasons, and ranking.
Beyond user-item interactions, a knowledge graph can model the high-order relationships hidden in the KG. This alleviates the data sparsity caused by users having only a limited number of behaviors, so it can be applied to the cold start problem. The industry has also studied this issue.
Sang et al. [6] proposed a dual-channel neural interaction method called knowledge graph enhanced neural collaborative filtering with residual recurrent network (KGNCF-RRN), which exploits both the long-term relational dependencies of the KG context and user-item interactions to make recommendations.
(1) For the KG context interaction channel, a residual recurrent network (RRN) is proposed to construct context-based path embeddings. Residual learning is integrated into a traditional recurrent neural network (RNN) to encode the long-term relational dependencies of the KG effectively. A self-attention network is then applied to the path embeddings to capture the polysemy of various user interaction behaviors.
(2) For the user-item interaction channel, the user and item embeddings are fed into a newly designed two-dimensional interaction map.
(3) Finally, on top of the dual-channel neural interaction matrix, a convolutional neural network learns the complex correlations between users and items. This method captures rich semantic information as well as the complex implicit relationships between users and items for recommendation.
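To make the idea of integrating residual learning into an RNN more concrete, here is a minimal sketch of a recurrent step with a skip connection on the hidden state. It is an assumption-laden illustration and not the exact formulation of [6]: the update rule `h_t = h_prev + tanh(W_h h_prev + W_x x_t)`, the function names, and the toy weights are all hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def residual_rnn_step(h_prev, x_t, W_h, W_x):
    """One step with a skip connection: h_t = h_prev + tanh(W_h h_prev + W_x x_t)."""
    pre = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]
    return [h + math.tanh(p) for h, p in zip(h_prev, pre)]

def encode_path(path_embeddings, W_h, W_x):
    """Fold the embeddings of the nodes/relations along one KG path into a vector."""
    h = [0.0] * len(W_h)
    for x_t in path_embeddings:
        h = residual_rnn_step(h, x_t, W_h, W_x)
    return h

# Hypothetical 2-dimensional example: a path with two identical step embeddings.
W_h = [[0.0, 0.0], [0.0, 0.0]]   # assumed recurrent weights (zero for readability)
W_x = [[1.0, 0.0], [0.0, 1.0]]   # assumed input weights (identity)
path_vector = encode_path([[0.5, -0.5], [0.5, -0.5]], W_h, W_x)
print(path_vector)
```

Because the previous hidden state is added back unchanged, information from early steps of a long path reaches the final embedding without passing through every nonlinearity, which is the usual motivation for residual connections.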
Du et al. [7] proposed MetaKG, a new solution to the cold start problem based on a meta-learning framework. It includes a collaborative-aware meta learner and a knowledge-aware meta learner, which capture user preferences and cold-start knowledge about entities. The learning task of the collaborative-aware meta learner aggregates each user's preference knowledge representation. In contrast, the learning task of the knowledge-aware meta learner generalizes the knowledge representations of different users' preferences globally. Guided by the two learners, MetaKG can effectively capture high-order collaborative relationships and semantic representations, and adapts easily to cold start scenarios. In addition, the authors designed an adaptive task that selects KG information for learning adaptively, to keep the model from being disturbed by noisy information. The MetaKG architecture is shown in the figure below.
Recommendation reasons improve the interpretability of a recommendation system: they let users understand how the recommendation results were computed and can also explain why an item is popular. When users understand through recommendation reasons how the results are generated, their confidence in the system's recommendations grows, and they become more tolerant of mistakes when a recommendation is wrong.
The earliest explainable recommendations were template-based. Templates guarantee readability and high accuracy, but they must be compiled manually and generalize poorly, so the explanations feel repetitive. Later, free-form explanations that need no preset templates were developed, and knowledge graphs were introduced, with one of the paths in the graph used as the explanation. With the help of annotations, generative methods that incorporate KG paths also emerged, in which every node or edge selected by the model is a reasoning step that can be shown to the user.
Recently, Chen et al. [8] proposed ECR, an incremental multi-task learning framework that achieves close collaboration between recommendation prediction, explanation generation, and user feedback integration. It consists of two parts. The first part, incremental cross knowledge modeling, learns the cross knowledge transferred between the recommendation task and the explanation task, and describes how this cross knowledge is updated through incremental learning. The second part, incremental multi-task prediction, describes how explanations are generated from the cross knowledge and how recommendation scores are predicted from the cross knowledge and user feedback.
Through the interactions between users and items, the user-item graph and the KG can be combined into one large graph, which captures the high-order connections between items. Traditional recommendation methods model the problem as a supervised learning task. This ignores the intrinsic relationships between items (such as the competing-product relationship between Camry and Accord) and cannot obtain collaborative signals from user behavior. The following introduces two papers on applying KG to recommendation ranking.
Wang et al. [9] designed the KGAT algorithm. First, a GNN iteratively propagates and updates the embeddings so that high-order connections can be captured quickly. Second, an attention mechanism is used during aggregation to learn the weight of each neighbor during propagation, which reflects the importance of the high-order connections. Finally, N rounds of propagation yield N implicit representations of the user and item, with different layers representing connectivity information of different orders. KGAT can therefore capture richer, non-specific high-order connections.
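The attention step can be sketched as follows. This is a simplified, single-layer illustration of the knowledge-aware attention as we understand it from the KGAT paper [9]: the score of a triple (h, r, t) is computed as (W_r e_t)^T tanh(W_r e_h + e_r), normalized with a softmax over the neighbors of h, and used to weight the tail embeddings. The function names, the toy embeddings, and the identity relation matrices are assumptions made for this example; the final aggregator and training procedure are omitted.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_weights(e_h, neighbors):
    """Softmax-normalized attention of head h over its (W_r, e_r, e_t) neighbors."""
    scores = []
    for W_r, e_r, e_t in neighbors:
        proj_h = matvec(W_r, e_h)          # head projected into relation space
        proj_t = matvec(W_r, e_t)          # tail projected into relation space
        gate = [math.tanh(a + b) for a, b in zip(proj_h, e_r)]
        scores.append(sum(p * g for p, g in zip(proj_t, gate)))
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neighborhood_embedding(e_h, neighbors):
    """Attention-weighted sum of the tail embeddings around head h."""
    weights = attention_weights(e_h, neighbors)
    dim = len(e_h)
    return [sum(w * e_t[k] for w, (_, _, e_t) in zip(weights, neighbors))
            for k in range(dim)]

# Hypothetical 2-dimensional embeddings; identity relation matrices for readability.
identity = [[1.0, 0.0], [0.0, 1.0]]
e_head = [1.0, 0.0]
neighbors = [
    (identity, [0.0, 0.0], [1.0, 0.0]),   # tail similar to the head
    (identity, [0.0, 0.0], [0.0, 1.0]),   # tail orthogonal to the head
]
weights = attention_weights(e_head, neighbors)
e_neighborhood = neighborhood_embedding(e_head, neighbors)
print(weights, e_neighborhood)
```

Stacking this step N times, each time feeding the updated embeddings back in, is what lets information from N-hop neighbors reach a node.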
Wang et al. [10] proposed the RippleNet model. The key idea is interest propagation: RippleNet uses the user's historical interests as a seed set in the KG, then expands the user's interests outward along the KG's connections to form the distribution of user interests over the KG. The biggest advantage of RippleNet is that it automatically mines possible paths from the items a user has clicked in the past to candidate items, without any manual design of meta-paths or meta-graphs.
RippleNet takes user U and item V as input and outputs the predicted probability that user U clicks item V. For user U, the historical interests $V_{u}$ serve as seeds; in the figure there are two initial seed points, from which interest keeps spreading outward. Given item V and each triple $\left(h_{i},r_{i},t_{i}\right)$ in the 1-hop ripple set $V_{u}^{1}$ of user U, an association probability is assigned to the triple by comparing V with the head node $h_{i}$ and the relation $r_{i}$.
After the association probabilities are obtained, the tails $t_{i}$ of the triples in $V_{u}^{1}$ are multiplied by their association probabilities and summed, which gives the first-order response $o_{u}^{1}$ of user U's historical interests to V. The user's interest is thereby transferred from $V_{u}$ to $o_{u}^{1}$; $o_{u}^{2}, o_{u}^{3}, \ldots, o_{u}^{n}$ can be computed in the same way, and the feature of U with respect to item V is then obtained by fusing the responses of all orders.
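The propagation described above can be sketched in a few lines. This is a simplified reading of the RippleNet paper [10], stated as an assumption rather than the reference implementation: the association probability of a triple is a softmax over v^T R_i h_i, the k-th order response is the probability-weighted sum of the tails, the responses of all orders are fused by summation, and the click probability is a sigmoid of the inner product with the item embedding. All function names and toy values below are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hop_response(query, ripple_set):
    """Weighted sum of tails; ripple_set is a list of (h, R, t) triples."""
    probs = softmax([dot(query, matvec(R, h)) for h, R, _ in ripple_set])
    dim = len(ripple_set[0][2])
    return [sum(p * t[k] for p, (_, _, t) in zip(probs, ripple_set))
            for k in range(dim)]

def predict_click(v, ripple_sets):
    """Fuse the responses of all hops and score them against item embedding v."""
    query = v
    user = [0.0] * len(v)
    for ripple_set in ripple_sets:        # hop 1, hop 2, ...
        o = hop_response(query, ripple_set)
        user = [a + b for a, b in zip(user, o)]
        query = o                          # the next hop is queried with this response
    return 1.0 / (1.0 + math.exp(-dot(user, v)))

# Hypothetical 2-dimensional example with a single hop and a single triple.
identity = [[1.0, 0.0], [0.0, 1.0]]
item = [1.0, 0.0]
one_hop = [([1.0, 0.0], identity, [1.0, 0.0])]
probability = predict_click(item, [one_hop])
print(probability)
```

In a real model the entity embeddings and relation matrices are learned jointly with the click prediction loss; here they are fixed only to keep the arithmetic easy to follow.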
In summary, this article focused on recommendation, described the process of graph construction in detail, and analyzed the difficulties and challenges involved. It also summarized a large body of important work and gave concrete solutions, ideas, and suggestions. Finally, it introduced applications of the knowledge graph, in particular its role and use in recommendation, including cold start, interpretability, and recall and ranking.
References:
[1] Kim S, Oh S G. Extracting and Applying Evaluation Criteria for Ontology Quality Assessment[J]. Library Hi Tech, 2019.
[2] Protege: https://www.php.cn/link/9d405c24be657bbf7a5244815a908922
[3] Ding S, Shang J, Wang S, et al. ERNIE-DOC: The Retrospective Long-Document Modeling Transformer[J]. 2020.
[4] Adhikari A, Ram A, Tang R, et al. DocBERT: BERT for Document Classification[J]. 2019.
[5] JanusGraph: https://www.php.cn/link/fc0de4e0396fff257ea362983c2dda5a
[6] Sang L, Xu M, Qian S, et al. Knowledge graph enhanced neural collaborative filtering with residual recurrent network[J]. Neurocomputing, 2021, 454: 417-429.
[7] Du Y, Zhu X, Chen L, et al. MetaKG: Meta-learning on Knowledge Graph for Cold-start Recommendation[J]. arXiv e-prints, 2022.
[8] Chen Z, Wang X, Xie X, et al. Towards Explainable Conversational Recommendation[C]// Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI-20). 2020.
[9] Wang X, He X, Cao Y, et al. KGAT: Knowledge Graph Attention Network for Recommendation[J]. ACM, 2019.
[10] Wang H, Zhang F, Wang J, et al. RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems[J]. ACM, 2018.