How to extract text from Word and PDF files with node.js on Linux: a worked example

黄舟

Jun 18, 2017, 09:11 AM

This article shows how to extract the text of Word (doc/docx) and PDF files with node.js on a Linux system, with the commands and their output given for reference.

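Preface

To build a full-text search engine, you first need to extract the content of Word, PDF, and similar files. For PDF there are open-source options such as xpdf.

Word files are somewhat more complicated.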

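Extracting text from PDF files

XPDF is a free, open-source program for displaying PDF files. It can also convert PDFs to text, images, and other formats, and a Windows version is available as well. Installing it on Debian Linux is very simple: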

apt-get install xpdf


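We only use the pdftotext tool here (the build shown below identifies itself as coming from Poppler). Running it without arguments prints the help: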

root@raspberrypi:/var/www# pdftotext
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
 -f <int>   : first page to convert
 -l <int>   : last page to convert
 -r <fp>   : resolution, in DPI (default is 72)
 -x <int>   : x-coordinate of the crop area top left corner
 -y <int>   : y-coordinate of the crop area top left corner
 -W <int>   : width of crop area in pixels (default is 0)
 -H <int>   : height of crop area in pixels (default is 0)
 -layout   : maintain original physical layout
 -fixed <fp>  : assume fixed-pitch (or tabular) text
 -raw    : keep strings in content stream order
 -htmlmeta   : generate a simple HTML file, including the meta information
 -enc <string>  : output text encoding name
 -listenc   : list available encodings
 -eol <string>  : output end-of-line convention (unix, dos, or mac)
 -nopgbrk   : don't insert page breaks between pages
 -bbox    : output bounding box for each word and page size to html. Sets -htmlmeta
 -opw <string>  : owner password (for encrypted files)
 -upw <string>  : user password (for encrypted files)
 -q    : don't print any messages or errors
 -v    : print copyright and version info
 -h    : print usage information
 -help    : print usage information
 --help   : print usage information
 -?    : print usage information


A quick test:

root@raspberrypi:/var/www# pdftotext onceai.pdf onceai.txt
root@raspberrypi:/var/www# cat onceai.txt
产品介绍 顽石智能科技（上海）有限公司
....


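From node.js, you can then call this command directly with child_process. Because pdftotext writes its result to a text file, a few extra steps may be needed to read the content back. The original code is omitted here.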
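Since the code is omitted above, here is a minimal sketch of what such a call might look like. It is not the author's implementation: the helper names `buildPdftotextArgs` and `extractPdfText`, the temporary-directory handling, and the choice to pass `-enc UTF-8` explicitly are all assumptions made for illustration. Only options that appear in the help output above (`-f`, `-l`, `-layout`, `-enc`) are used.

```javascript
// Sketch: run pdftotext from node.js, then read back the text file it writes.
const { execFile } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Build the argument list for: pdftotext [options] <PDF-file> <text-file>
// firstPage / lastPage / layout map to the -f / -l / -layout options.
function buildPdftotextArgs(pdfPath, txtPath, { firstPage, lastPage, layout } = {}) {
  const args = ['-enc', 'UTF-8']; // assumption: we want UTF-8 output
  if (firstPage !== undefined) args.push('-f', String(firstPage));
  if (lastPage !== undefined) args.push('-l', String(lastPage));
  if (layout) args.push('-layout');
  args.push(pdfPath, txtPath);
  return args;
}

// Hypothetical helper: resolves with the extracted text of pdfPath.
function extractPdfText(pdfPath, options = {}) {
  return new Promise((resolve, reject) => {
    // Use a private temp directory so concurrent calls do not overwrite each other.
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'pdftotext-'));
    const txtPath = path.join(dir, 'out.txt');
    const cleanup = () => fs.rmSync(dir, { recursive: true, force: true });

    execFile('pdftotext', buildPdftotextArgs(pdfPath, txtPath, options), (err) => {
      if (err) {
        cleanup();
        return reject(err);
      }
      fs.readFile(txtPath, 'utf8', (readErr, text) => {
        cleanup();
        if (readErr) return reject(readErr);
        resolve(text);
      });
    });
  });
}
```

With the test file from above, a call could look like `extractPdfText('onceai.pdf').then(console.log)`. `execFile` is used rather than `exec` so that file names are passed as arguments and never interpreted by a shell.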

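Extracting .doc content with antiword

Here we use the open-source antiword tool to extract the content of files in the Word 2003 and earlier formats. Installation is just as simple: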

apt-get install antiword


View the help:

root@raspberrypi:/var/www# antiword
 Name: antiword
 Purpose: Display MS-Word files
 Author: (C) 1998-2005 Adri van Os
 Version: 0.37 (21 Oct 2005)
 Status: GNU General Public License
 Usage: antiword [switches] wordfile1 [wordfile2 ...]
 Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
  -f formatted text output
  -t text output (default)
  -a <paper size name> Adobe PDF output
  -p <paper size name> PostScript output
   paper size like: a4, letter or legal
  -x <dtd> XML output
   like: db (DocBook)
  -m <mapping> character mapping file
  -w <width> in characters of text output
  -i <level> image level (PostScript only)
  -L use landscape mode (PostScript only)
  -r Show removed text
  -s Show hidden (by Word) text


antiword prints the Word content directly to the console:

root@raspberrypi:/var/www# antiword spec.doc

SYNC Mobile – Ford APA
Project Number: DFYST
Requirements Specification


As before, call this command from node.js with child_process.
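Because antiword writes to standard output, the node.js side can simply capture stdout. The following is a minimal sketch under that assumption. The helper names `buildAntiwordArgs` and `extractDocText` and the 64 MiB buffer limit are hypothetical choices, not taken from the original article. Only the `-t` and `-w` switches listed in the help output above are used.

```javascript
// Sketch: run antiword from node.js and capture the text it prints to stdout.
const { execFile } = require('child_process');

// Build the argument list for: antiword [switches] wordfile
// width maps to the -w switch (width in characters of the text output).
function buildAntiwordArgs(docPath, { width } = {}) {
  const args = ['-t']; // plain text output (antiword's default, made explicit)
  if (width !== undefined) args.push('-w', String(width));
  args.push(docPath);
  return args;
}

// Hypothetical helper: resolves with the text of a .doc file.
function extractDocText(docPath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'antiword',
      buildAntiwordArgs(docPath, options),
      // Raise the stdout limit so long documents are not cut off with an error.
      { maxBuffer: 64 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

For the file used above, `extractDocText('spec.doc').then(console.log)` should print the same text that the console session shows.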

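Parsing and extracting .docx content

A .docx file is itself a zip archive, so in node.js you only need to unzip it first and then parse the word/document.xml file inside it (for example text.docx\word\document.xml).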
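The article gives no code for this step, so the following is only a rough sketch of one way to do it. It assumes the `unzip` command-line tool is installed and uses `unzip -p` to stream word/document.xml to stdout. The XML is then reduced to text with regular expressions that collect the `<w:t>` runs of each `<w:p>` paragraph. The helper names are hypothetical, and the regex approach ignores tabs, tables, numeric character references, and other details that a real XML parser would handle.

```javascript
// Sketch: pull word/document.xml out of a .docx and reduce it to plain text.
const { execFile } = require('child_process');

// Decode the five predefined XML entities (numeric references are not handled).
function decodeXmlEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Join the text of every <w:t> run, emitting one line per <w:p> paragraph.
function documentXmlToText(xml) {
  // "[ >]" keeps <w:pPr>, <w:pStyle> and similar tags from matching as paragraphs.
  const paragraphs = xml.match(/<w:p[ >][\s\S]*?<\/w:p>/g) || [];
  return paragraphs
    .map((p) => {
      const runs = p.match(/<w:t(?: [^>]*)?>[\s\S]*?<\/w:t>/g) || [];
      return runs
        .map((r) => decodeXmlEntities(r.replace(/<[^>]+>/g, '')))
        .join('');
    })
    .join('\n');
}

// Hypothetical helper: resolves with the text of a .docx file.
// Assumes the unzip CLI is available; -p writes the named entry to stdout.
function extractDocxText(docxPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'unzip',
      ['-p', docxPath, 'word/document.xml'],
      { maxBuffer: 64 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(documentXmlToText(stdout)))
    );
  });
}
```

Keeping `documentXmlToText` separate from the process call means the parsing can be tested on an XML string alone, and the unzip step can later be swapped for a zip library without touching the parser.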
