How to extract content from files within a ZIP archive using Apache Tika?

Susan Sarandon
Release: 2024-10-29 06:59:02

Extracting Content from Files Within a Zip Using Apache Tika

To read and extract content from files inside a ZIP archive with Apache Tika, your current code needs a few adjustments. The overall approach is sound; the problem is how the InputStream for each entry in the archive is obtained.

Here's an updated version of your code that addresses this:

<code class="java">import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class SampleZipExtractNew {

    public static void main(String[] args) throws IOException, SAXException, TikaException {

        List<String> tempString = new ArrayList<>();
        StringBuffer sbf = new StringBuffer();

        // Backslashes must be escaped inside a Java string literal.
        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");

        Parser parser = new AutoDetectParser();

        // try-with-resources closes the archive even if parsing fails.
        try (ZipFile zipFile = new ZipFile(file)) {

            Enumeration<? extends ZipEntry> entries = zipFile.entries();

            while (entries.hasMoreElements()) {

                ZipEntry entry = entries.nextElement();
                String name = entry.getName();

                // Only parse the file types of interest.
                if (!(name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx"))) {
                    continue;
                }

                try (InputStream inputStream = zipFile.getInputStream(entry)) {
                    // Use a fresh handler and metadata per entry so that text from
                    // earlier files does not accumulate in later results.
                    BodyContentHandler textHandler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    parser.parse(inputStream, textHandler, metadata, new ParseContext());
                    tempString.add(textHandler.toString());
                }
            }
        }

        for (String text : tempString) {
            System.out.println("Apache Tika - Converted input string : " + text);
            sbf.append(text);
        }

        // Print the combined text once, after every file has been appended.
        System.out.println("Final text from all the three files " + sbf.toString());
    }
}</code>

In this revised code:

  • We initialize a ZipFile instance with the zip file to be processed.
  • We iterate through the ZIP file's entries using an Enumeration, which provides access to each entry.
  • For each entry that ends with ".txt", ".pdf", or ".docx", we retrieve its InputStream.
  • Within the InputStream try-with-resources block, we invoke the Apache Tika parser to parse the content and extract the text.
  • Each file's extracted text is added to a list, and the list items are then appended to a StringBuffer to build one combined string.
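
The entry-iteration steps above can also be tried without Tika on the classpath. The following is a minimal sketch that uses only java.util.zip. The class name ZipEntryWalker, the TextExtractor interface, and the readUtf8 helper are hypothetical names introduced for illustration; TextExtractor merely stands in for the Tika parse call, which in a real program would also need to handle SAXException and TikaException. Unlike the code above, this sketch skips directory entries and matches suffixes case-insensitively, which is an assumption about what is wanted rather than part of the original.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipEntryWalker {

    /** Hypothetical stand-in for the Tika parse step: turns one entry's stream into text. */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** Maps entry name to extracted text for each non-directory entry with a wanted suffix. */
    static Map<String, String> extractAll(String zipPath, TextExtractor extractor, String... suffixes)
            throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        // Closing the ZipFile also releases any entry streams still open.
        try (ZipFile zipFile = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !hasSuffix(entry.getName(), suffixes)) {
                    continue;
                }
                // Each entry gets its own stream, obtained from the ZipFile.
                try (InputStream in = zipFile.getInputStream(entry)) {
                    result.put(entry.getName(), extractor.extract(in));
                }
            }
        }
        return result;
    }

    /** Case-insensitive suffix check (an assumption; the original code is case-sensitive). */
    static boolean hasSuffix(String name, String... suffixes) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Trivial extractor for plain-text entries: reads the whole stream as UTF-8. */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipEntryWalker <archive.zip>");
            return;
        }
        Map<String, String> texts = extractAll(args[0], ZipEntryWalker::readUtf8, ".txt");
        texts.forEach((name, text) -> System.out.println(name + ": " + text.length() + " chars"));
    }
}
```

Returning a map keyed by entry name keeps each file's text separate, so the caller can decide whether and how to concatenate the results. To plug Tika in, the extractor passed to extractAll would wrap the parser.parse call with a new BodyContentHandler for each entry.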
