Table of Contents
Entity recognition: chunking technology
Construction of chunking syntax
print(type(result))
Named entity: an exact noun phrase that refers to a specific type of individual, such as a date, person, organization, etc.
Home Backend Development Python Tutorial How to build a system?

How to build a system?

Jun 20, 2017 am 11:00 AM
nltk information study text notes

How to build a system for extracting structured information and data from unstructured text? What methods use this type of behavior? Which corpora are suitable for this work? Is it possible to train and evaluate the model?

Information extraction, especially structured information extraction, can be compared to database records. The corresponding relationship binds the corresponding data information. For unstructured data such as natural language, in order to obtain the corresponding relationship, the special relationship corresponding to the entity should be searched and recorded using some data structures such as strings and elements.

Entity recognition: chunking technology

For example: We saw the yellow dog, according to the idea of ​​chunking, the last three words will be divided into NP, and the three words inside Each word corresponds to DT/JJ/NN respectively; saw is divided into VBD; We is divided into NP. For the last three words, NP is the chunk (larger set). In order to achieve this, you canuse NLTK's own chunking syntax, similar to regular expressions, to implement sentence chunking.

Construction of chunking syntax

Just pay attention to three points:

  • ##Basic chunking:

    Chunking:{ Sub-block under the block} (similar to: "NP: {

    ?*}"A string like this). And ?*+ saves the meaning of the regular expression.

import nltk
sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('brak','VBD')]
grammer = "NP: {<DT>?<JJ>*<NN>}"cp = nltk.RegexpParser(grammer) #生成规则result = cp.parse(sentence) #进行分块print(result)

result.draw() #调用matplotlib库画出来
Copy after login


  • You can define a ## for sequences of identifiers that are not included in chunks. #gap

    }+{

    ##
    import nltk
    sentence = [(&#39;the&#39;,&#39;DT&#39;),(&#39;little&#39;,&#39;JJ&#39;),(&#39;yellow&#39;,&#39;JJ&#39;),(&#39;dog&#39;,&#39;NN&#39;),(&#39;bark&#39;,&#39;VBD&#39;),(&#39;at&#39;,&#39;IN&#39;),(&#39;the&#39;,&#39;DT&#39;),(&#39;cat&#39;,&#39;NN&#39;)]
    grammer = """NP:             {<DT>?<JJ>*<NN>}            }<VBD|NN>+{            """  #加缝隙,必须保存换行符cp = nltk.RegexpParser(grammer) #生成规则result = cp.parse(sentence) #进行分块print(result)
    Copy after login


can be called recursively, which conforms to the recursive nesting in the language structure. For example:
    VP: {*} PP:{}
  • . At this time, the parameter

    loop of the RegexpParser function can be set to 2 and looped multiple times to prevent omissions.

    Tree diagram
If you call to view the type, you will find that it is

nltk.tree. Tree. As you can tell from the name, this is a tree-like structure. nltk.Tree can realize tree structure, and supports splicing technology, providing node query and tree drawing. <div class="code" style="position:relative; padding:0px; margin:0px;"><pre class='brush:php;toolbar:false;'>tree1 = nltk.Tree(&amp;#39;NP&amp;#39;,[&amp;#39;Alick&amp;#39;])print(tree1) tree2 = nltk.Tree(&amp;#39;N&amp;#39;,[&amp;#39;Alick&amp;#39;,&amp;#39;Rabbit&amp;#39;])print(tree2) tree3 = nltk.Tree(&amp;#39;S&amp;#39;,[tree1,tree2])print(tree3.label()) #查看树的结点tree3.draw()</pre><div class="contentsignin">Copy after login</div></div>

IOB mark

stands for internal, external, and beginning respectively (the first letter of the English word). For classifications such as NP and NN mentioned above, you only need to add I-/B-/O- in front. This allows collections outside the rules to be exposed, similar to adding gaps above.

Developing and evaluating chunkers


NLTK already provides us with chunkers, reducing manual building rules. At the same time, it also provides content that has been divided into chunks for reference when we build our own rules.

#这段代码在python2下运行from nltk.corpus import conll2000print conll2000.chunked_sents(&#39;train.txt&#39;)[99] #查看已经分块的一个句子text = """   he /PRP/ B-NP   accepted /VBD/ B-VP   the DT B-NP   position NN I-NP   of IN B-PP   vice NN B-NP   chairman NN I-NP   of IN B-PP   Carlyle NNP B-NP   Group NNP I-NP   , , O   a DT B-NP   merchant NN I-NP   banking NN I-NP   concern NN I-NP   . . O"""result = nltk.chunk.conllstr2tree(text,chunk_types=[&#39;NP&#39;])
Copy after login

For the previously defined rules
cp

, you can use

cp.evaluate(conll2000.chunked_sents(' train.txt')[99]) to test the accuracy. Using the Unigram tagger we learned before, we can segment noun phrases into chunks and test the accuracy<div class="code" style="position:relative; padding:0px; margin:0px;"><pre class='brush:php;toolbar:false;'>class UnigramChunker(nltk.ChunkParserI):&quot;&quot;&quot; 一元分块器, 该分块器可以从训练句子集中找出每个词性标注最有可能的分块标记, 然后使用这些信息进行分块 &quot;&quot;&quot;def __init__(self, train_sents):&quot;&quot;&quot; 构造函数 :param train_sents: Tree对象列表 &quot;&quot;&quot;train_data = []for sent in train_sents:# 将Tree对象转换为IOB标记列表[(word, tag, IOB-tag), ...]conlltags = nltk.chunk.tree2conlltags(sent)# 找出每个词性标注对应的IOB标记ti_list = [(t, i) for w, t, i in conlltags] train_data.append(ti_list)# 使用一元标注器进行训练self.__tagger = nltk.UnigramTagger(train_data)def parse(self, tokens):&quot;&quot;&quot; 对句子进行分块 :param tokens: 标注词性的单词列表 :return: Tree对象 &quot;&quot;&quot;# 取出词性标注tags = [tag for (word, tag) in tokens]# 对词性标注进行分块标记ti_list = self.__tagger.tag(tags)# 取出IOB标记iob_tags = [iob_tag for (tag, iob_tag) in ti_list]# 组合成conll标记conlltags = [(word, pos, iob_tag) for ((word, pos), iob_tag) in zip(tokens, iob_tags)]return nltk.chunk.conlltags2tree(conlltags) test_sents = conll2000.chunked_sents(&quot;test.txt&quot;, chunk_types=[&quot;NP&quot;]) train_sents = conll2000.chunked_sents(&quot;train.txt&quot;, chunk_types=[&quot;NP&quot;]) unigram_chunker = UnigramChunker(train_sents)print(unigram_chunker.evaluate(test_sents))</pre><div class="contentsignin">Copy after login</div></div>

Named entity recognition and information extraction

Named entity: an exact noun phrase that refers to a specific type of individual, such as a date, person, organization, etc.

. If you go to Xu Yan classifier by yourself, you will definitely have a big head (ˉ▽ ̄~)~~. NLTK provides a trained classifier--

nltk.ne_chunk(tagged_sent[,binary=False]). If binary is set to True, then named entities are only tagged as NE; otherwise the tags are a bit more complex. <div class="code" style="position:relative; padding:0px; margin:0px;"><pre class='brush:php;toolbar:false;'>sent = nltk.corpus.treebank.tagged_sents()[22]print(nltk.ne_chunk(sent,binary=True))</pre><div class="contentsignin">Copy after login</div></div>

If the named entity is determined,
Relationship extraction

can be implemented to extract information. One way is to find all triples (X,a,Y). Among them, X and Y are named entities, and a is a string representing the relationship between the two. The example is as follows:

#请在Python2下运行import re
IN = re.compile(r&#39;.*\bin\b(?!\b.+ing)&#39;)for doc in nltk.corpus.ieer.parsed_docs(&#39;NYT_19980315&#39;):for rel in nltk.sem.extract_rels(&#39;ORG&#39;,&#39;LOC&#39;,doc,corpus=&#39;ieer&#39;,pattern = IN):print nltk.sem.show_raw_rtuple(rel)
Copy after login

The above is the detailed content of How to build a system?. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to delete Xiaohongshu notes How to delete Xiaohongshu notes Mar 21, 2024 pm 08:12 PM

How to delete Xiaohongshu notes? Notes can be edited in the Xiaohongshu APP. Most users don’t know how to delete Xiaohongshu notes. Next, the editor brings users pictures and texts on how to delete Xiaohongshu notes. Tutorial, interested users come and take a look! Xiaohongshu usage tutorial How to delete Xiaohongshu notes 1. First open the Xiaohongshu APP and enter the main page, select [Me] in the lower right corner to enter the special area; 2. Then in the My area, click on the note page shown in the picture below , select the note you want to delete; 3. Enter the note page, click [three dots] in the upper right corner; 4. Finally, the function bar will expand at the bottom, click [Delete] to complete.

What should I do if the notes I posted on Xiaohongshu are missing? What's the reason why the notes it just sent can't be found? What should I do if the notes I posted on Xiaohongshu are missing? What's the reason why the notes it just sent can't be found? Mar 21, 2024 pm 09:30 PM

As a Xiaohongshu user, we have all encountered the situation where published notes suddenly disappeared, which is undoubtedly confusing and worrying. In this case, what should we do? This article will focus on the topic of &quot;What to do if the notes published by Xiaohongshu are missing&quot; and give you a detailed answer. 1. What should I do if the notes published by Xiaohongshu are missing? First, don't panic. If you find that your notes are missing, staying calm is key and don't panic. This may be caused by platform system failure or operational errors. Checking release records is easy. Just open the Xiaohongshu App and click &quot;Me&quot; → &quot;Publish&quot; → &quot;All Publications&quot; to view your own publishing records. Here you can easily find previously published notes. 3.Repost. If found

How to add product links in notes in Xiaohongshu Tutorial on adding product links in notes in Xiaohongshu How to add product links in notes in Xiaohongshu Tutorial on adding product links in notes in Xiaohongshu Mar 12, 2024 am 10:40 AM

How to add product links in notes in Xiaohongshu? In the Xiaohongshu app, users can not only browse various contents but also shop, so there is a lot of content about shopping recommendations and good product sharing in this app. If If you are an expert on this app, you can also share some shopping experiences, find merchants for cooperation, add links in notes, etc. Many people are willing to use this app for shopping, because it is not only convenient, but also has many Experts will make some recommendations. You can browse interesting content and see if there are any clothing products that suit you. Let’s take a look at how to add product links to notes! How to add product links to Xiaohongshu Notes Open the app on the desktop of your mobile phone. Click on the app homepage

How to search for text across all tabs in Chrome and Edge How to search for text across all tabs in Chrome and Edge Feb 19, 2024 am 11:30 AM

This tutorial shows you how to find specific text or phrases on all open tabs in Chrome or Edge on Windows. Is there a way to do a text search on all open tabs in Chrome? Yes, you can use a free external web extension in Chrome to perform text searches on all open tabs without having to switch tabs manually. Some extensions like TabSearch and Ctrl-FPlus can help you achieve this easily. How to search text across all tabs in Google Chrome? Ctrl-FPlus is a free extension that makes it easy for users to search for a specific word, phrase or text across all tabs of their browser window. This expansion

Revealing the appeal of C language: Uncovering the potential of programmers Revealing the appeal of C language: Uncovering the potential of programmers Feb 24, 2024 pm 11:21 PM

The Charm of Learning C Language: Unlocking the Potential of Programmers With the continuous development of technology, computer programming has become a field that has attracted much attention. Among many programming languages, C language has always been loved by programmers. Its simplicity, efficiency and wide application make learning C language the first step for many people to enter the field of programming. This article will discuss the charm of learning C language and how to unlock the potential of programmers by learning C language. First of all, the charm of learning C language lies in its simplicity. Compared with other programming languages, C language

Getting Started with Pygame: Comprehensive Installation and Configuration Tutorial Getting Started with Pygame: Comprehensive Installation and Configuration Tutorial Feb 19, 2024 pm 10:10 PM

Learn Pygame from scratch: complete installation and configuration tutorial, specific code examples required Introduction: Pygame is an open source game development library developed using the Python programming language. It provides a wealth of functions and tools, allowing developers to easily create a variety of type of game. This article will help you learn Pygame from scratch, and provide a complete installation and configuration tutorial, as well as specific code examples to get you started quickly. Part One: Installing Python and Pygame First, make sure you have

Let's learn how to input the root number in Word together Let's learn how to input the root number in Word together Mar 19, 2024 pm 08:52 PM

When editing text content in Word, you sometimes need to enter formula symbols. Some guys don’t know how to input the root number in Word, so Xiaomian asked me to share with my friends a tutorial on how to input the root number in Word. Hope it helps my friends. First, open the Word software on your computer, then open the file you want to edit, and move the cursor to the location where you need to insert the root sign, refer to the picture example below. 2. Select [Insert], and then select [Formula] in the symbol. As shown in the red circle in the picture below: 3. Then select [Insert New Formula] below. As shown in the red circle in the picture below: 4. Select [Radical Formula], and then select the appropriate root sign. As shown in the red circle in the picture below:

How to publish notes tutorial on Xiaohongshu? Can it block people by posting notes? How to publish notes tutorial on Xiaohongshu? Can it block people by posting notes? Mar 25, 2024 pm 03:20 PM

As a lifestyle sharing platform, Xiaohongshu covers notes in various fields such as food, travel, and beauty. Many users want to share their notes on Xiaohongshu but don’t know how to do it. In this article, we will detail the process of posting notes on Xiaohongshu and explore how to block specific users on the platform. 1. How to publish notes tutorial on Xiaohongshu? 1. Register and log in: First, you need to download the Xiaohongshu APP on your mobile phone and complete the registration and login. It is very important to complete your personal information in the personal center. By uploading your avatar, filling in your nickname and personal introduction, you can make it easier for other users to understand your information, and also help them pay better attention to your notes. 3. Select the publishing channel: At the bottom of the homepage, click the &quot;Send Notes&quot; button and select the channel you want to publish.

See all articles