


How can I extract content from files within a zip archive using Apache Tika in Java?
Oct 30, 2024 03:33 AM
Extracting Content from Files within a Zip Archive Using Apache Tika
Problem:
Develop a Java program that reads the contents of files stored within a zip archive utilizing Apache Tika. The zip archive contains various file formats (such as txt, pdf, and docx).
Solution:
To achieve the desired functionality, follow these steps:
- Parse the Zip Archive:
  - Use ZipInputStream to iterate through the entries in the zip archive.
  - Process only the files of interest (e.g., txt, pdf, docx), selected by entry name.
- Invoke Apache Tika:
  - Create a content handler (e.g., BodyContentHandler) to capture the extracted text.
  - Instantiate a parser (e.g., AutoDetectParser), which detects the file type and delegates to the appropriate parser.
- Extract and Convert Content:
  - Pass each selected entry's stream to the parser, which writes the extracted text into the handler.
  - Call toString() on the handler to obtain that text as a String.
- Consolidate Extracted Content:
  - Add the text extracted from each file to a list.
  - Combine the individual texts into a single String for further processing or display.
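The archive-walking and consolidation steps can be sketched with the JDK alone. This is a minimal illustration, not the article's program: `ZipWalkSketch`, `TextExtractor`, `sampleZip`, and `runDemo` are hypothetical names, and `TextExtractor` merely stands in for the point where the Tika parse call would go. Unlike the article's snippet, the extension check here ignores case, which is an assumption about what "files of interest" should include.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipWalkSketch {

    /** Hypothetical stand-in for the Tika parse call; not part of Tika. */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** True for the extensions named in the article, ignoring case. */
    static boolean isOfInterest(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".txt") || lower.endsWith(".pdf") || lower.endsWith(".docx");
    }

    /** Walks the archive and returns one extracted string per matching entry. */
    static List<String> extractAll(InputStream archive, TextExtractor extractor) throws IOException {
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && isOfInterest(entry.getName())) {
                    // While an entry is current, reads on zip stop at that entry's end.
                    texts.add(extractor.extract(zip));
                }
            }
        }
        return texts;
    }

    /** Builds a tiny in-memory archive so the sketch runs without files on disk. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            String[][] files = {{"a.txt", "alpha"}, {"skip.png", "ignored"}, {"docs/b.TXT", "beta"}};
            for (String[] file : files) {
                out.putNextEntry(new ZipEntry(file[0]));
                out.write(file[1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Runs the walk over the sample archive with a plain-text extractor. */
    static List<String> runDemo() {
        try {
            TextExtractor plain = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return extractAll(new ByteArrayInputStream(sampleZip()), plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Consolidation step: join the per-file texts into a single String.
        System.out.println(String.join("\n", runDemo()));
    }
}
```

Keeping the extractor behind a small interface means the zip traversal can be exercised without Tika on the classpath; in a real program the lambda would be replaced by a call that parses the entry stream into a content handler and returns the handler's text.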
Code Snippet (Modified):
<code class="java">import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ImprovedZipExtractor {
    public static void main(String[] args) {
        List<String> tempString = new ArrayList<>();
        StringBuilder sbf = new StringBuilder();
        // Backslashes must be escaped inside a Java string literal.
        String path = "C:\\Users\\xxx\\Desktop\\abc.zip";
        Parser parser = new AutoDetectParser();

        // try-with-resources closes both streams even if parsing fails.
        try (InputStream input = new FileInputStream(path);
             ZipInputStream zip = new ZipInputStream(input)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx")) {
                    System.out.println("entry=" + name + " " + entry.getSize());
                    // Fresh handler and metadata per entry so results do not mix.
                    BodyContentHandler textHandler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    parser.parse(zip, textHandler, metadata, new ParseContext());
                    tempString.add(textHandler.toString());
                }
            }
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }

        // Combine the per-file texts once, after the archive has been read.
        for (String text : tempString) {
            System.out.println("Apache Tika - Converted input string : " + text);
            sbf.append(text);
        }
        System.out.println("Final text from all the files " + sbf);
    }
}</code>
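Note: Use a fresh content handler for each entry and add its text to the list immediately after parsing. Reusing one handler or buffer across iterations mixes the content of different files, and reading from a buffer that was never written to yields empty strings. Build the concatenated result from the list once, after the loop has finished.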
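The difference between a shared buffer and a per-file buffer can be seen in isolation with plain strings. This is only a sketch of the pattern: `BufferReuseSketch` and its methods are hypothetical names, and `append` stands in for a parse call that writes into a handler's buffer.

```java
import java.util.ArrayList;
import java.util.List;

class BufferReuseSketch {

    /** Problem pattern: one buffer lives across iterations, so text accumulates. */
    static List<String> collectWithSharedBuffer(List<String> documents) {
        StringBuilder shared = new StringBuilder();
        List<String> collected = new ArrayList<>();
        for (String document : documents) {
            shared.append(document);          // stands in for a parse call writing into the buffer
            collected.add(shared.toString()); // snapshot also contains every earlier document
        }
        return collected;
    }

    /** Fixed pattern: a new buffer per iteration keeps each result separate. */
    static List<String> collectWithFreshBuffer(List<String> documents) {
        List<String> collected = new ArrayList<>();
        for (String document : documents) {
            StringBuilder fresh = new StringBuilder();
            fresh.append(document);
            collected.add(fresh.toString());
        }
        return collected;
    }

    /** Combines the per-file texts once, after the loop has finished. */
    static String combine(List<String> texts) {
        return String.join("\n", texts);
    }

    public static void main(String[] args) {
        List<String> documents = List.of("a", "b", "c");
        System.out.println(collectWithSharedBuffer(documents)); // [a, ab, abc]
        System.out.println(collectWithFreshBuffer(documents));  // [a, b, c]
        System.out.println(combine(collectWithFreshBuffer(documents)));
    }
}
```

With the shared buffer, the second and third list elements repeat earlier text, so joining the list afterwards would duplicate content; with a fresh buffer per file, each element holds exactly one file's text.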