How to build a system?
How to build a system for extracting structured information and data from unstructured text? What methods use this type of behavior? Which corpora are suitable for this work? Is it possible to train and evaluate the model?
Information extraction, especially structured information extraction, can be compared to database records. The corresponding relationship binds the corresponding data information. For unstructured data such as natural language, in order to obtain the corresponding relationship, the special relationship corresponding to the entity should be searched and recorded using some data structures such as strings and elements.
Entity recognition: chunking technology
For example: We saw the yellow dog, according to the idea of chunking, the last three words will be divided into NP, and the three words inside Each word corresponds to DT/JJ/NN respectively; saw is divided into VBD; We is divided into NP. For the last three words, NP is the chunk (larger set). In order to achieve this, you canuse NLTK's own chunking syntax, similar to regular expressions, to implement sentence chunking.
Construction of chunking syntax
Just pay attention to three points:
- ##Basic chunking:
Chunking:{ Sub-block under the block}
(similar to:
"NP: {- ?
* }" A string like this). And ?*+ saves the meaning of the regular expression.
- ?
import nltk sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('brak','VBD')] grammer = "NP: {<DT>?<JJ>*<NN>}"cp = nltk.RegexpParser(grammer) #生成规则result = cp.parse(sentence) #进行分块print(result) result.draw() #调用matplotlib库画出来
- You can define a ## for sequences of identifiers that are not included in chunks. #gap
:}
##+{ import nltk sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('bark','VBD'),('at','IN'),('the','DT'),('cat','NN')] grammer = """NP: {<DT>?<JJ>*<NN>} }<VBD|NN>+{ """ #加缝隙,必须保存换行符cp = nltk.RegexpParser(grammer) #生成规则result = cp.parse(sentence) #进行分块print(result)
Copy after login
- VP: {
- . At this time, the parameter
loop
Tree diagramof the
RegexpParserfunction can be set to 2 and looped multiple times to prevent omissions.
print(type(result))
to view the type, you will find that it isnltk.tree. Tree. As you can tell from the name, this is a tree-like structure.
nltk.Tree can realize tree structure, and supports splicing technology, providing node query and tree drawing.
<div class="code" style="position:relative; padding:0px; margin:0px;"><pre class='brush:php;toolbar:false;'>tree1 = nltk.Tree(&#39;NP&#39;,[&#39;Alick&#39;])print(tree1)
tree2 = nltk.Tree(&#39;N&#39;,[&#39;Alick&#39;,&#39;Rabbit&#39;])print(tree2)
tree3 = nltk.Tree(&#39;S&#39;,[tree1,tree2])print(tree3.label()) #查看树的结点tree3.draw()</pre><div class="contentsignin">Copy after login</div></div>
IOB mark
Developing and evaluating chunkers
NLTK already provides us with chunkers, reducing manual building rules. At the same time, it also provides content that has been divided into chunks for reference when we build our own rules.
#这段代码在python2下运行from nltk.corpus import conll2000print conll2000.chunked_sents('train.txt')[99] #查看已经分块的一个句子text = """ he /PRP/ B-NP accepted /VBD/ B-VP the DT B-NP position NN I-NP of IN B-PP vice NN B-NP chairman NN I-NP of IN B-PP Carlyle NNP B-NP Group NNP I-NP , , O a DT B-NP merchant NN I-NP banking NN I-NP concern NN I-NP . . O"""result = nltk.chunk.conllstr2tree(text,chunk_types=['NP'])
For the previously defined rules
cp
cp.evaluate(conll2000.chunked_sents(' train.txt')[99]) to test the accuracy. Using the Unigram tagger we learned before, we can segment noun phrases into chunks and test the accuracy
<div class="code" style="position:relative; padding:0px; margin:0px;"><pre class='brush:php;toolbar:false;'>class UnigramChunker(nltk.ChunkParserI):""" 一元分块器, 该分块器可以从训练句子集中找出每个词性标注最有可能的分块标记, 然后使用这些信息进行分块 """def __init__(self, train_sents):""" 构造函数 :param train_sents: Tree对象列表 """train_data = []for sent in train_sents:# 将Tree对象转换为IOB标记列表[(word, tag, IOB-tag), ...]conlltags = nltk.chunk.tree2conlltags(sent)# 找出每个词性标注对应的IOB标记ti_list = [(t, i) for w, t, i in conlltags]
train_data.append(ti_list)# 使用一元标注器进行训练self.__tagger = nltk.UnigramTagger(train_data)def parse(self, tokens):""" 对句子进行分块 :param tokens: 标注词性的单词列表 :return: Tree对象 """# 取出词性标注tags = [tag for (word, tag) in tokens]# 对词性标注进行分块标记ti_list = self.__tagger.tag(tags)# 取出IOB标记iob_tags = [iob_tag for (tag, iob_tag) in ti_list]# 组合成conll标记conlltags = [(word, pos, iob_tag) for ((word, pos), iob_tag) in zip(tokens, iob_tags)]return nltk.chunk.conlltags2tree(conlltags)
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
unigram_chunker = UnigramChunker(train_sents)print(unigram_chunker.evaluate(test_sents))</pre><div class="contentsignin">Copy after login</div></div>
Named entity recognition and information extraction
Named entity: an exact noun phrase that refers to a specific type of individual, such as a date, person, organization, etc.
. If you go to Xu Yan classifier by yourself, you will definitely have a big head (ˉ▽ ̄~)~~. NLTK provides a trained classifier--nltk.ne_chunk(tagged_sent[,binary=False]). If binary is set to True, then named entities are only tagged as NE; otherwise the tags are a bit more complex. <div class="code" style="position:relative; padding:0px; margin:0px;"><pre class='brush:php;toolbar:false;'>sent = nltk.corpus.treebank.tagged_sents()[22]print(nltk.ne_chunk(sent,binary=True))</pre><div class="contentsignin">Copy after login</div></div>
If the named entity is determined,
Relationship extraction
The above is the detailed content of How to build a system?. For more information, please follow other related articles on the PHP Chinese website!#请在Python2下运行import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):for rel in nltk.sem.extract_rels('ORG','LOC',doc,corpus='ieer',pattern = IN):print nltk.sem.show_raw_rtuple(rel)

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

How to delete Xiaohongshu notes? Notes can be edited in the Xiaohongshu APP. Most users don’t know how to delete Xiaohongshu notes. Next, the editor brings users pictures and texts on how to delete Xiaohongshu notes. Tutorial, interested users come and take a look! Xiaohongshu usage tutorial How to delete Xiaohongshu notes 1. First open the Xiaohongshu APP and enter the main page, select [Me] in the lower right corner to enter the special area; 2. Then in the My area, click on the note page shown in the picture below , select the note you want to delete; 3. Enter the note page, click [three dots] in the upper right corner; 4. Finally, the function bar will expand at the bottom, click [Delete] to complete.

As a Xiaohongshu user, we have all encountered the situation where published notes suddenly disappeared, which is undoubtedly confusing and worrying. In this case, what should we do? This article will focus on the topic of "What to do if the notes published by Xiaohongshu are missing" and give you a detailed answer. 1. What should I do if the notes published by Xiaohongshu are missing? First, don't panic. If you find that your notes are missing, staying calm is key and don't panic. This may be caused by platform system failure or operational errors. Checking release records is easy. Just open the Xiaohongshu App and click "Me" → "Publish" → "All Publications" to view your own publishing records. Here you can easily find previously published notes. 3.Repost. If found

How to add product links in notes in Xiaohongshu? In the Xiaohongshu app, users can not only browse various contents but also shop, so there is a lot of content about shopping recommendations and good product sharing in this app. If If you are an expert on this app, you can also share some shopping experiences, find merchants for cooperation, add links in notes, etc. Many people are willing to use this app for shopping, because it is not only convenient, but also has many Experts will make some recommendations. You can browse interesting content and see if there are any clothing products that suit you. Let’s take a look at how to add product links to notes! How to add product links to Xiaohongshu Notes Open the app on the desktop of your mobile phone. Click on the app homepage

This tutorial shows you how to find specific text or phrases on all open tabs in Chrome or Edge on Windows. Is there a way to do a text search on all open tabs in Chrome? Yes, you can use a free external web extension in Chrome to perform text searches on all open tabs without having to switch tabs manually. Some extensions like TabSearch and Ctrl-FPlus can help you achieve this easily. How to search text across all tabs in Google Chrome? Ctrl-FPlus is a free extension that makes it easy for users to search for a specific word, phrase or text across all tabs of their browser window. This expansion

The Charm of Learning C Language: Unlocking the Potential of Programmers With the continuous development of technology, computer programming has become a field that has attracted much attention. Among many programming languages, C language has always been loved by programmers. Its simplicity, efficiency and wide application make learning C language the first step for many people to enter the field of programming. This article will discuss the charm of learning C language and how to unlock the potential of programmers by learning C language. First of all, the charm of learning C language lies in its simplicity. Compared with other programming languages, C language

Learn Pygame from scratch: complete installation and configuration tutorial, specific code examples required Introduction: Pygame is an open source game development library developed using the Python programming language. It provides a wealth of functions and tools, allowing developers to easily create a variety of type of game. This article will help you learn Pygame from scratch, and provide a complete installation and configuration tutorial, as well as specific code examples to get you started quickly. Part One: Installing Python and Pygame First, make sure you have

When editing text content in Word, you sometimes need to enter formula symbols. Some guys don’t know how to input the root number in Word, so Xiaomian asked me to share with my friends a tutorial on how to input the root number in Word. Hope it helps my friends. First, open the Word software on your computer, then open the file you want to edit, and move the cursor to the location where you need to insert the root sign, refer to the picture example below. 2. Select [Insert], and then select [Formula] in the symbol. As shown in the red circle in the picture below: 3. Then select [Insert New Formula] below. As shown in the red circle in the picture below: 4. Select [Radical Formula], and then select the appropriate root sign. As shown in the red circle in the picture below:

As a lifestyle sharing platform, Xiaohongshu covers notes in various fields such as food, travel, and beauty. Many users want to share their notes on Xiaohongshu but don’t know how to do it. In this article, we will detail the process of posting notes on Xiaohongshu and explore how to block specific users on the platform. 1. How to publish notes tutorial on Xiaohongshu? 1. Register and log in: First, you need to download the Xiaohongshu APP on your mobile phone and complete the registration and login. It is very important to complete your personal information in the personal center. By uploading your avatar, filling in your nickname and personal introduction, you can make it easier for other users to understand your information, and also help them pay better attention to your notes. 3. Select the publishing channel: At the bottom of the homepage, click the "Send Notes" button and select the channel you want to publish.
