How can I extract content from files within a zip archive using Apache Tika in Java?

Oct 30, 2024 am 03:33 AM

Extracting Content from Files within a Zip Archive Using Apache Tika

Problem:

Develop a Java program that reads the contents of files stored within a zip archive utilizing Apache Tika. The zip archive contains various file formats (such as txt, pdf, and docx).

Solution:

To achieve the desired functionality, follow these steps:

  1. Parse the Zip Archive:

    • Utilize ZipInputStream to iterate through the entries in the zip archive.
    • Extract only the files of interest (e.g., txt, pdf, docx).
  2. Invoke Apache Tika:

    • Create a new content handler (e.g., BodyContentHandler) for each file to capture the extracted text.
    • Instantiate a parser (e.g., AutoDetectParser) that detects the file type and delegates to the matching format-specific parser.
  3. Extract and Convert Content:

    • Pass each matching entry's stream to the parser, which writes the extracted text into the handler.
    • Read the extracted text back from the handler as a String (e.g., handler.toString()).
  4. Consolidate Extracted Content:

    • Store the extracted content from all the files into a temporary list.
    • Combine the contents of the individual files into a single String for further processing or display.
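
Step 1 can be tried on its own, without Tika on the classpath. The following is a minimal sketch using only the JDK: it builds a small archive in memory and lists the entries whose extension matches. The class name ZipEntryFilter, the helper names, and the sample entry names are hypothetical and chosen for illustration; the case-insensitive extension check is an assumption that goes beyond the steps above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntryFilter {

    // Hypothetical helper: true for the extensions named in the problem statement.
    static boolean isWanted(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".txt") || lower.endsWith(".pdf") || lower.endsWith(".docx");
    }

    // Returns the names of the matching file entries, in archive order.
    static List<String> listWantedEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && isWanted(entry.getName())) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    // Builds a small archive in memory so the sketch needs no file on disk.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"notes.txt", "image.png", "docs/REPORT.PDF"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write("hello".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Prints: [notes.txt, docs/REPORT.PDF]
        System.out.println(listWantedEntries(new ByteArrayInputStream(sampleZip())));
    }
}
```

In a real program the in-memory archive would be replaced by a FileInputStream on the zip file, and each matching entry would be handed to the parser instead of only being listed.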

Code Snippet (Modified):

<code class="java">import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ImprovedZipExtractor {

    public static void main(String[] args) {
        List&lt;String&gt; tempString = new ArrayList&lt;&gt;();
        StringBuilder sbf = new StringBuilder();

        // Backslashes in a Java string literal must be escaped.
        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");

        // try-with-resources closes both streams, even when parsing fails.
        try (InputStream input = new FileInputStream(file);
                ZipInputStream zip = new ZipInputStream(input)) {

            Parser parser = new AutoDetectParser();
            ZipEntry entry;

            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx")) {
                    // getSize() returns -1 when the size is not known yet.
                    System.out.println("entry=" + name + " " + entry.getSize());

                    // A fresh handler and Metadata per entry keep the files separate;
                    // -1 disables BodyContentHandler's default write limit.
                    BodyContentHandler handler = new BodyContentHandler(-1);
                    Metadata metadata = new Metadata();

                    // ZipInputStream reports end-of-stream at the end of the current
                    // entry, so the parser reads only this file's bytes.
                    parser.parse(zip, handler, metadata, new ParseContext());
                    tempString.add(handler.toString());
                }
            }
        } catch (IOException | SAXException | TikaException e) {
            // FileNotFoundException is a subclass of IOException.
            e.printStackTrace();
        }

        // Combine the per-file texts and print the result once, after the loop.
        for (String text : tempString) {
            System.out.println("Apache Tika - Converted input string : " + text);
            sbf.append(text);
        }
        System.out.println("Final text from all the three files " + sbf);
    }
}</code>

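Note: Give each entry its own content handler and read that entry's text from the handler once parsing finishes. Append to the combined buffer (sbf) only after all entries have been processed, so that the text of one file neither overwrites nor gets mixed into the text of another.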
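
The per-entry pattern from the note can be illustrated without Tika. In the sketch below, a plain UTF-8 decode stands in for the Tika parse step (an assumption made only to keep the example JDK-only), and the class name PerEntryCollector, its helpers, and the sample data are hypothetical. Each entry gets a fresh buffer, and the combined text is built once at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PerEntryCollector {

    // Reads every entry into its own String. The UTF-8 decode is a stand-in
    // for the Tika parse-and-handler step, not a replacement for it.
    static List<String> readEntries(InputStream in) throws IOException {
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                // Fresh buffer per entry, like a fresh content handler per file.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                // read() returns -1 at the end of the current entry.
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                texts.add(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return texts;
    }

    // Combines the per-entry texts once, after all entries have been read.
    static String combine(List<String> texts) {
        return String.join("\n", texts);
    }

    // Builds a two-entry archive in memory for the demonstration.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            String[][] files = {{"a.txt", "alpha"}, {"b.txt", "beta"}};
            for (String[] f : files) {
                out.putNextEntry(new ZipEntry(f[0]));
                out.write(f[1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> texts = readEntries(new ByteArrayInputStream(sampleZip()));
        System.out.println(texts);          // [alpha, beta]
        System.out.println(combine(texts)); // alpha and beta on separate lines
    }
}
```

Keeping the per-entry list separate from the combined String makes it possible to report or post-process each file individually before merging.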
