
How can I extract content from files within a zip archive using Apache Tika in Java?

Barbara Streisand
Release: 2024-10-30 03:33:28

Extracting Content from Files within a Zip Archive Using Apache Tika

Problem:

Develop a Java program that reads the contents of files stored within a zip archive utilizing Apache Tika. The zip archive contains various file formats (such as txt, pdf, and docx).

Solution:

To achieve the desired functionality, follow these steps:

  1. Parse the Zip Archive:

    • Utilize ZipInputStream to iterate through the entries in the zip archive.
    • Extract only the files of interest (e.g., txt, pdf, docx).
  2. Invoke Apache Tika:

    • Create a content handler (e.g., BodyContentHandler) for each file to capture its extracted text.
    • Instantiate a parser (e.g., AutoDetectParser) to identify the file type and apply the appropriate parsing method.
  3. Extract and Convert Content:

    • Pass each matching entry's stream to the parser, which writes the extracted text into the content handler.
    • Call toString() on the handler to obtain that text as a String.
  4. Consolidate Extracted Content:

    • Store the extracted content from all the files into a temporary list.
    • Combine the contents of the individual files into a single String for further processing or display.
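
Step 1 can be tried on its own before Tika is involved. The following is a minimal sketch using only the JDK; the class name ZipEntryFilterSketch, its helper methods, and the in-memory sample archive are hypothetical and exist only for illustration. It assumes the extension check should ignore case and skip directory entries, which the steps above do not state explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryFilterSketch {

    // Extensions of interest, taken from the problem statement.
    private static final Set<String> WANTED = Set.of(".txt", ".pdf", ".docx");

    /** Returns true when the name ends with one of the wanted extensions, ignoring case. */
    static boolean isWanted(String entryName) {
        String lower = entryName.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        return dot >= 0 && WANTED.contains(lower.substring(dot));
    }

    /** Walks the archive and collects the names of file entries that pass the filter. */
    static List<String> matchingEntryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && isWanted(entry.getName())) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a small in-memory archive so the sketch runs without a file on disk. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : List.of("notes.txt", "docs/", "docs/report.PDF", "image.png")) {
                out.putNextEntry(new ZipEntry(name));
                if (!name.endsWith("/")) {
                    out.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                }
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Only the .txt and .PDF file entries pass the filter.
        System.out.println(matchingEntryNames(sampleZip()));
    }
}
```

Keeping the filter in a separate method makes it easy to change the set of extensions, or to drop the filter entirely and let the parser's type detection decide what it can handle.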

Code Snippet (Modified):

<code class="java">import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ImprovedZipExtractor {

    public static void main(String[] args) {
        // One element per extracted file.
        List<String> extractedTexts = new ArrayList<>();

        // Backslashes must be escaped inside a Java string literal.
        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
        Parser parser = new AutoDetectParser();

        // try-with-resources closes both streams, even if parsing fails.
        try (InputStream input = new FileInputStream(file);
             ZipInputStream zip = new ZipInputStream(input)) {

            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx")) {
                    // getSize() may report -1 when the size is not yet known.
                    System.out.println("entry=" + name + " " + entry.getSize());

                    // Fresh handler and metadata per entry, so files do not mix.
                    // The default handler stops after 100,000 characters;
                    // use new BodyContentHandler(-1) to remove that limit.
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();

                    // The zip stream is positioned at the current entry; parse()
                    // consumes it but leaves closing to the caller.
                    parser.parse(zip, handler, metadata, new ParseContext());
                    extractedTexts.add(handler.toString());
                }
            }
        } catch (IOException | SAXException | TikaException e) {
            // FileNotFoundException is a subclass of IOException.
            e.printStackTrace();
        }

        // Combine the per-file texts only after every entry has been parsed.
        StringBuilder combined = new StringBuilder();
        for (String text : extractedTexts) {
            System.out.println("Apache Tika - Converted input string : " + text);
            combined.append(text);
        }
        System.out.println("Final text from all the files " + combined);
    }
}</code>

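Note: Create a new content handler for each zip entry and add its text (handler.toString()) to the list before moving to the next entry. Reusing one handler across entries, or reading from a buffer that nothing has written to, mixes or loses the extracted content. Concatenate the per-file strings into the final result only after the loop has finished.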
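
The per-entry pattern in the note can be checked without Tika on the classpath. The sketch below is an illustration under stated assumptions, not the article's method: the class PerEntryCollectSketch and its helpers are hypothetical, and readAllBytes() (Java 9 or later) stands in for the Tika parse call. It relies on ZipInputStream reporting end-of-stream at the end of the current entry, so a consumer that reads to the end sees exactly one file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class PerEntryCollectSketch {

    /** Reads every file entry into its own String: one list element per entry. */
    static List<String> readEntries(byte[] zipBytes) {
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Stand-in for the parser: reads to the end of the current entry only.
                byte[] data = zip.readAllBytes();
                // A new String per entry plays the role of a fresh content handler.
                texts.add(new String(data, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    /** Joins the per-file texts after the loop, separated by newlines. */
    static String combine(List<String> texts) {
        return String.join("\n", texts);
    }

    /** Builds a two-file archive in memory so the sketch needs no file on disk. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write("alpha".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.txt"));
            out.write("beta".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        List<String> texts = readEntries(sampleZip());
        System.out.println(texts.size() + " entries read");
        System.out.println(combine(texts));
    }
}
```

Keeping the per-file strings in a list until the end also leaves room to process each file separately later, instead of committing to one concatenated string inside the loop.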

source:php.cn