php - How to search PDF content?
大家讲道理 2017-04-11 09:42:09

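The client wants site-wide keyword search, and the contents of PDF documents must be searchable too. My current solution is to convert each PDF to text, write the text into the database, and then search that database field. If a PDF does not contain real text (for example, scanned pages), it cannot be converted and therefore cannot be searched. Is there a better solution?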
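
For reference, the current approach described above (extract the text, store it, search the stored field) can be sketched roughly as follows. This is only a minimal illustration using Python's built-in sqlite3 module. The table name `pdf_docs`, its columns, and the helper function names are assumptions made for the example, not the actual schema. A plain `LIKE` match is used for simplicity; a production site would more likely rely on the database's full-text index.

```python
import sqlite3


def create_index(conn):
    # Hypothetical table: one row per PDF, holding its extracted text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_docs ("
        "path TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )


def add_document(conn, path, text):
    # INSERT OR REPLACE so re-indexing a changed PDF overwrites the old text.
    conn.execute(
        "INSERT OR REPLACE INTO pdf_docs (path, content) VALUES (?, ?)",
        (path, text),
    )


def search(conn, keyword):
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = (
        keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT path FROM pdf_docs "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY path",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


# Example data (invented for illustration).
conn = sqlite3.connect(":memory:")
create_index(conn)
add_document(conn, "docs/a.pdf", "Annual report 2016, revenue summary")
add_document(conn, "docs/b.pdf", "User manual for the 100% cotton product line")
print(search(conn, "revenue"))
print(search(conn, "100%"))
```

Each PDF's extracted text would be passed to `add_document` once at upload time, so the search itself never has to open the PDF files.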


All replies (2)
刘奇

Hmm, how about using tags? I'm surprised a site-wide search is expected to cover PDFs at all. Following this thread.

迷茫
# Python: convert a PDF to text
# Written for Python 3 with the pdfminer.six package and requests installed.
from io import BytesIO, StringIO

import requests
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def pdf_txt(url):
    """Download the PDF at `url` and return its extracted text."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    # The PDF is binary data, so wrap the downloaded bytes in BytesIO.
    fp = BytesIO(requests.get(url).content)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


txt = pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
print(txt)
# If the PDF contains Chinese and the console shows garbled characters,
# write the text to a file instead:
# open('pdf.txt', 'w', encoding='utf-8').write(txt)
'''
CHAPTER I
"Well, Prince, so Genoa and Lucca are now just family estates of
theBuonapartes. But I warn you, if you don't tell me that this
means war,if you still try to defend the infamies and horrors
perpetrated bythat Antichrist- I really believe he is Antichrist- I will
havenothing more to do with you and you are no longer my friend,
no longermy 'faithful slave,' as you call yourself! But how do you
do? I seeI have frightened you- sit down and tell me all the news."
It was in July, 1805, and the speaker was the well-known
AnnaPavlovna Scherer, maid of honor and favorite of the
Empress MaryaFedorovna. With these words she greeted Prince
Vasili Kuragin, a manof high rank and importance, who was the
first to arrive at herreception. Anna Pavlovna had had a cough for
some days. She was, asshe said, suffering from la grippe; grippe
being then a new word inSt. Petersburg, used only by the elite.
All her invitations without exception, written in French,
anddelivered by a scarlet-liveried footman that morning, ran as
''' 
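
The extracted text above keeps the PDF's hard line breaks, and a scanned PDF would yield little or no text at all, which is the case the question worries about. As a rough sketch only, the text could be normalized before it is indexed, and documents with almost no extractable text could be flagged for a separate OCR step. The function names and the `min_chars` threshold are assumptions for illustration; they are not part of pdfminer.

```python
import re


def normalize_text(text):
    # Collapse runs of whitespace (including line breaks and form feeds)
    # into single spaces so a phrase split across lines can still match.
    return re.sub(r"\s+", " ", text).strip()


def needs_ocr(text, min_chars=20):
    # Heuristic (assumed threshold): fewer than min_chars non-whitespace
    # characters suggests the PDF has no usable text layer.
    return len(re.sub(r"\s+", "", text)) < min_chars


# Invented sample strings for illustration.
sample = "Well, Prince, so Genoa and Lucca are now just family estates of\nthe Buonapartes."
print(normalize_text(sample))
print(needs_ocr(sample))
print(needs_ocr("\n\x0c\n"))
```

Documents for which `needs_ocr` returns True would still need OCR or manual tagging before they can be searched; this check only identifies them.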