How to Extract Text from Word, Excel, and PowerPoint Files (.doc, .docx, .xlsx, .pptx) in PHP
Extracting text from uploaded documents is useful for tasks such as searching within them, particularly when users upload CVs or resumes as Word files. This article outlines an approach that covers Word, Excel, and PowerPoint files.
Doc/Docx File Extraction
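The two Word formats are stored differently. A .doc file is a legacy binary format, so it can be read as raw bytes with the fopen function and then filtered for readable text. A .docx file is a ZIP archive containing XML files, so it can be opened with the zip_open function; the document body is stored in word/document.xml. Note that the procedural zip_open API is deprecated as of PHP 8.0, and the ZipArchive class is the current alternative.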
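A minimal sketch of both cases, under stated assumptions. The function names docxToText and docToText are hypothetical and are not taken from the article's class. The .docx helper uses ZipArchive rather than zip_open and assumes the zip extension is installed. The .doc helper is only a rough heuristic: it assumes paragraphs are separated by carriage returns and that any chunk containing a NUL byte is binary structure, so it will miss or mangle text in many real files.

```php
<?php
// Hypothetical helper names, for illustration only.

/** Returns the body text of a .docx file, or null if it cannot be read. */
function docxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // archive has no document body
    }
    // Turn paragraph ends into newlines before the tags are stripped.
    $xml = str_replace('</w:p>', "\n", $xml);
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}

/** Rough heuristic for legacy .doc files: keeps NUL-free chunks of bytes. */
function docToText(string $path): ?string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $bytes = stream_get_contents($handle);
    fclose($handle);
    if ($bytes === false) {
        return null;
    }
    $out = '';
    foreach (explode("\r", $bytes) as $chunk) {
        // Assumption: chunks containing NUL bytes are binary structure, not text.
        if ($chunk !== '' && strpos($chunk, "\0") === false) {
            $out .= $chunk . ' ';
        }
    }
    // Drop remaining control characters and non-ASCII bytes.
    return preg_replace('/[^\x20-\x7E\t\n]/', '', $out);
}
```

Both helpers return null instead of raising an error, so a caller handling uploads can skip unreadable files without a try/catch.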
Excel File Extraction
To extract text from XLSX files, we focus on a specific XML file inside the archive, xl/sharedStrings.xml, which holds the workbook's string cell values. We extract the content of this file and strip the XML tags to obtain plain text.
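The shared-strings step can be sketched as follows. The function name xlsxToText is hypothetical, and ZipArchive is used here as an assumption about the environment (the zip extension must be installed). Each string is written on its own line.

```php
<?php
// Hypothetical helper name, for illustration only.

/** Returns the shared strings of an .xlsx file, one per line, or null on failure. */
function xlsxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    $xml = $zip->getFromName('xl/sharedStrings.xml');
    $zip->close();
    if ($xml === false) {
        return null; // workbook has no shared strings table
    }
    // Each <si> element holds one shared string; end it with a newline.
    $xml = str_replace('</si>', "\n", $xml);
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

Because only the shared strings table is read, numeric cell values, which are stored in the individual sheet XML files, do not appear in the output.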
PowerPoint File Extraction
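PPTX files follow a similar approach. Each slide is stored as its own XML file inside the archive (ppt/slides/slide1.xml, ppt/slides/slide2.xml, and so on). We iterate through the slide files, extract the text of each, and concatenate the results.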
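One way the slide loop might look, as a sketch. The function name pptxToText is hypothetical, and ZipArchive is assumed to be available. Slides are ordered by the number in their file name, which is an assumption: the actual presentation order is defined elsewhere in the archive and can differ after slides are rearranged.

```php
<?php
// Hypothetical helper name, for illustration only.

/** Returns the text of all slides in a .pptx file, or null if it cannot be opened. */
function pptxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    $slides = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Collect only the slide parts, keyed by slide number.
        if ($name !== false && preg_match('#^ppt/slides/slide(\d+)\.xml$#', $name, $m)) {
            $xml = $zip->getFromIndex($i);
            if ($xml !== false) {
                $slides[(int) $m[1]] = $xml;
            }
        }
    }
    $zip->close();
    ksort($slides); // slide1, slide2, ..., slide10 in numeric order

    $text = '';
    foreach ($slides as $xml) {
        // Paragraph ends become newlines before the tags are stripped.
        $xml = str_replace('</a:p>', "\n", $xml);
        $text .= html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    }
    return $text;
}
```

Keying the array by the parsed slide number avoids the string-sort pitfall where slide10.xml would come before slide2.xml.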
Class Implementation
We provide a PHP class named DocxConversion that encapsulates these extraction methods. The class accepts a file path as a constructor argument, and its public entry point is the convertToText() method.
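The class itself is not reproduced in this article, so the following is only a guess at its shape, not the original implementation. The private helper readEntries, the dispatch on file extension, and the exception for unsupported types are all assumptions. The sketch covers the three ZIP-based formats and leaves out .doc.

```php
<?php
// Sketch only: an assumed structure for a class like DocxConversion.
final class DocxConversion
{
    /** @var string Path of the file to convert */
    private $filename;

    public function __construct(string $filename)
    {
        $this->filename = $filename;
    }

    /** Picks an extraction strategy from the file extension (an assumption). */
    public function convertToText(): string
    {
        $ext = strtolower(pathinfo($this->filename, PATHINFO_EXTENSION));
        switch ($ext) {
            case 'docx':
                return $this->readEntries('#^word/document\.xml$#', '</w:p>');
            case 'xlsx':
                return $this->readEntries('#^xl/sharedStrings\.xml$#', '</si>');
            case 'pptx':
                return $this->readEntries('#^ppt/slides/slide\d+\.xml$#', '</a:p>');
            default:
                throw new InvalidArgumentException("Unsupported file type: .$ext");
        }
    }

    /** Hypothetical helper: concatenates the text of all matching archive entries. */
    private function readEntries(string $pattern, string $breakTag): string
    {
        $zip = new ZipArchive();
        if ($zip->open($this->filename) !== true) {
            throw new RuntimeException("Cannot open {$this->filename} as a ZIP archive");
        }
        $names = [];
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $name = $zip->getNameIndex($i);
            if ($name !== false && preg_match($pattern, $name)) {
                $names[] = $name;
            }
        }
        sort($names, SORT_NATURAL); // slide2.xml before slide10.xml
        $text = '';
        foreach ($names as $name) {
            // The closing tag passed in marks a line break in the output.
            $xml = str_replace($breakTag, "\n", (string) $zip->getFromName($name));
            $text .= html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
        }
        $zip->close();
        return $text;
    }
}
```

Sharing one helper across the three formats works because they differ only in which archive entries hold the text and which closing tag marks a line break.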
Usage
To use this class, instantiate it with the file path and call the convertToText() method. The method returns the extracted text as a string.
Example:
$docObj = new DocxConversion("test.docx");
$docText = $docObj->convertToText();
echo $docText;
This script will extract the text from the specified .docx file and display it.