Extracting Word and PDF text content with node.js on Linux: a worked example

黄舟

Published: 2017-06-18 09:11:01

This article shows how to extract text from Word (doc/docx) and PDF files on a Linux system using node.js. It walks through the command-line tools involved, with sample commands and output for reference.

Foreword

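Building a full-text search engine requires extracting the content of documents such as Word and PDF files. For PDF, open-source solutions such as xpdf already exist.

Word documents are somewhat more complicated.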

Extracting PDF text content

Installation on Debian Linux is straightforward:

apt-get install xpdf

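Only the pdftotext command is used here. Running it without arguments prints its help (on this system the binary is the Poppler build, as the banner shows):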

root@raspberrypi:/var/www# pdftotext
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
 -f <int>   : first page to convert
 -l <int>   : last page to convert
 -r <fp>   : resolution, in DPI (default is 72)
 -x <int>   : x-coordinate of the crop area top left corner
 -y <int>   : y-coordinate of the crop area top left corner
 -W <int>   : width of crop area in pixels (default is 0)
 -H <int>   : height of crop area in pixels (default is 0)
 -layout   : maintain original physical layout
 -fixed <fp>  : assume fixed-pitch (or tabular) text
 -raw    : keep strings in content stream order
 -htmlmeta   : generate a simple HTML file, including the meta information
 -enc <string>  : output text encoding name
 -listenc   : list available encodings
 -eol <string>  : output end-of-line convention (unix, dos, or mac)
 -nopgbrk   : don't insert page breaks between pages
 -bbox    : output bounding box for each word and page size to html. Sets -htmlmeta
 -opw <string>  : owner password (for encrypted files)
 -upw <string>  : user password (for encrypted files)
 -q    : don't print any messages or errors
 -v    : print copyright and version info
 -h    : print usage information
 -help    : print usage information
 --help   : print usage information
 -?    : print usage information

Try it out:

root@raspberrypi:/var/www# pdftotext onceai.pdf onceai.txt
root@raspberrypi:/var/www# cat onceai.txt
产品介绍 顽石智能科技（上海）有限公司
....

From node.js, call this command directly with child_process. pdftotext writes its result to a text file, so an extra step may be needed to read that file back. The specific code is omitted here.
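Since the article omits the code, the following is only a minimal sketch of one way to do it, not the author's implementation. It assumes pdftotext is on the PATH; the function names `buildPdftotextArgs` and `extractPdfText` and the temporary-file naming are hypothetical. Only flags that appear in the help output above (`-enc`, `-q`, `-layout`) are used.

```javascript
const { execFile } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Build the pdftotext argument list (hypothetical helper).
// -enc, -q and -layout are taken from the help output shown earlier.
function buildPdftotextArgs(pdfPath, txtPath, { layout = false, encoding = 'UTF-8' } = {}) {
  const args = ['-enc', encoding, '-q'];
  if (layout) args.push('-layout');
  args.push(pdfPath, txtPath);
  return args;
}

// Run pdftotext, read back the text file it wrote, then delete that file.
// execFile passes arguments without a shell, so file names need no quoting.
function extractPdfText(pdfPath, options, callback) {
  const txtPath = path.join(os.tmpdir(), `pdftotext-${process.pid}-${Date.now()}.txt`);
  execFile('pdftotext', buildPdftotextArgs(pdfPath, txtPath, options), (err) => {
    if (err) return callback(err);
    fs.readFile(txtPath, 'utf8', (readErr, text) => {
      // Remove the temporary file whether or not the read succeeded.
      fs.unlink(txtPath, () => callback(readErr, text));
    });
  });
}
```

A call might look like `extractPdfText('onceai.pdf', { layout: true }, (err, text) => { ... })`; pass `{}` or `undefined` as the options to keep the defaults.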

Extracting .doc content with antiword

The open-source tool antiword is used here to extract the content of older Word 2003 (.doc) files. Installation is just as simple:

apt-get install antiword

View the help:

root@raspberrypi:/var/www# antiword
 Name: antiword
 Purpose: Display MS-Word files
 Author: (C) 1998-2005 Adri van Os
 Version: 0.37 (21 Oct 2005)
 Status: GNU General Public License
 Usage: antiword [switches] wordfile1 [wordfile2 ...]
 Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
  -f formatted text output
  -t text output (default)
  -a <paper size name> Adobe PDF output
  -p <paper size name> PostScript output
   paper size like: a4, letter or legal
  -x <dtd> XML output
   like: db (DocBook)
  -m <mapping> character mapping file
  -w <width> in characters of text output
  -i <level> image level (PostScript only)
  -L use landscape mode (PostScript only)
  -r Show removed text
  -s Show hidden (by Word) text

antiword prints the content of the Word document directly to the console:

root@raspberrypi:/var/www# antiword spec.doc

SYNC Mobile – Ford APA
Project Number: DFYST
Requirements Specification

Likewise, call this command from node.js with child_process.
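Because antiword writes to standard output, the call can capture the text directly, with no intermediate file. The sketch below assumes antiword is on the PATH; `buildAntiwordArgs` and `extractDocText` are hypothetical names, and the 64 MB `maxBuffer` is an arbitrary choice for large documents. The `-t` and `-w` switches are the ones listed in the help output above.

```javascript
const { execFile } = require('child_process');

// Build the antiword argument list (hypothetical helper).
// -t selects text output, -w sets the output width in characters.
function buildAntiwordArgs(docPath, { width } = {}) {
  const args = ['-t'];
  if (Number.isInteger(width)) args.push('-w', String(width));
  args.push(docPath);
  return args;
}

// Run antiword and hand its stdout (the extracted text) to the callback.
function extractDocText(docPath, options, callback) {
  execFile(
    'antiword',
    buildAntiwordArgs(docPath, options),
    { maxBuffer: 64 * 1024 * 1024 }, // raise the stdout limit for big files
    (err, stdout) => callback(err, stdout)
  );
}
```

For example, `extractDocText('spec.doc', { width: 80 }, (err, text) => { ... })` would deliver the same text that the console session above prints.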

.docx

A .docx document is itself a zip archive, so it can be unzipped in node.js first and its text then parsed out. Only the word/document.xml file inside the archive is needed.
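The article does not show how the unzip-and-parse step is done, so what follows is a rough sketch under stated assumptions rather than the author's method. It assumes the `unzip` command-line tool is installed (its `-p` option writes an archive member to stdout) instead of unzipping inside node.js, and it pulls text out of the `<w:t>` elements with regular expressions rather than a real XML parser. The names `decodeXmlEntities`, `documentXmlToText` and `extractDocxText` are hypothetical. Nested paragraphs (for example in text boxes), tabs, and numeric character references are not handled.

```javascript
const { execFile } = require('child_process');

// Decode the five predefined XML entities.
function decodeXmlEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

// Join the <w:t> text runs of each <w:p> paragraph, one paragraph per line.
function documentXmlToText(xml) {
  // "<w:p" must be followed by a space or ">" so that <w:pPr> is not matched.
  const paragraphs = xml.match(/<w:p[ >][\s\S]*?<\/w:p>/g) || [];
  return paragraphs
    .map((p) => {
      // Matches <w:t> and <w:t xml:space="preserve">, but not <w:tab/> or <w:tbl>.
      const runs = p.match(/<w:t(?: [^>]*)?>[^<]*<\/w:t>/g) || [];
      return runs.map((r) => decodeXmlEntities(r.replace(/<[^>]+>/g, ''))).join('');
    })
    .join('\n');
}

// Stream word/document.xml out of the archive with `unzip -p` and convert it.
function extractDocxText(docxPath, callback) {
  execFile(
    'unzip',
    ['-p', docxPath, 'word/document.xml'],
    { maxBuffer: 64 * 1024 * 1024 },
    (err, xml) => callback(err, err ? undefined : documentXmlToText(xml))
  );
}
```

Keeping `documentXmlToText` separate from the process call makes the parsing easy to check on a hand-written XML string, and lets the unzip step be swapped for an in-process zip library later.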
