jtd format file conversion analysis
In the project that I have been busy with since 2016, the module I am mainly responsible for is the file parsing part. When I was working on it, I made all kinds of mistakes and troubles. At least it is finally over. Now I have put all the parts in the project together. Let’s summarize the parsing of these files for future reference. The main documents parsed in this project include office files, pdf, csv, rtf, txt, jtd, and emails in eml, msg and pst formats, as well as rar and zip compression. When decompressing the package, there is actually a file in the mlf format. However, after my research and the research of the company's bosses, I can't overcome the difficulty for the time being, so I can only give up the file in this format for the time being, and other analysis has not been done. It has been done, mainly these. I will summarize them all one by one later. Regarding file parsing, I use Tika of Apache.
Today we will first take a look at the analysis of this jtd file. Some people may not know what this jtd file is. Let me explain it first:
jtd格式文件是由日本的文字处理软件一太郎生成的文件格式
It can be understood as a jtd format file. The word we usually use does not need to be edited and opened with Itaro software. Let me show you what this Itaro software looks like:
I was very surprised when I first saw this requirement. Embarrassing. How to do this? It’s still a Japanese software. I can’t understand it even if I check the information. I can’t find it on Baidu and stackoverflow. At this time, thanks to a big boss in the company who can understand Japanese, this The boss found a solution on a Japanese website. The website address is http://d.hatena.ne.jp/satorufujimori/20070227/1172549793
. The solution is to use vbs script to convert the jtd format file Convert to txt file, and then parse the corresponding txt to obtain the content. The script on the website is as follows:
//taro2txt.vbs Set taro = CreateObject("JXW.Application") taro.Visible = True taro.Documents.Open "c:\taro\a.jtd" taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" //※1 taro.Quit
Everyone pays attention to the 10, which is an identifier. 10 means converting the jtd format file into txt Format files, if you want to convert jtd format files into files in other formats, you need to replace 10 with other identifiers, but what is more embarrassing is that we did not find a specific document explaining which number represents which document, and then at that time I tried from 0 to 100, and a lot of messy formats came out. The only useful one is 10, which means that it can only convert jtd format files into txt format files. In this case, all the pictures in the original file will disappear. However, our business is to read the file content and enter it into Solr for retrieval, so if there is no picture, there will be no picture. Later, we adopted this method to solve the problem.
Through the above script, you can convert jtd files without passwords into txt files, but the most embarrassing thing is that our jtd format files have passwords. This is embarrassing, but fortunately it was solved in the end. , I forgot how to solve it at the time, but the solution is as follows:
//taro2txt.vbs Set taro = CreateObject("JXW.Application") taro.Visible = True taro.Documents.Open "c:\taro\a.jtd",password//在此处加上密码 taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" //※1 taro.Quit
After the script is completed, just click Run to convert the specific jtd file into a txt file, and then Just process the txt file and extract the content (the content extraction of txt format files will be explained in another article later).
The above problem has been solved, but there is still a problem. I can’t create a script file for all jtd files. Besides, I don’t know what files the customer has, so I thought of adding it to vbs. The script passes parameters. Although I don’t know the syntax of VBS, I still wrote it according to what is said on the Internet. The specific script content is as follows:
Option Explicit Dim a0 : a0 = WScript.Arguments(0) Dim a1 : a1 = WScript.Arguments(1) Dim a2 : a2 = WScript.Arguments(2) Dim taro ExchangeFile a0, a1, a2 Sub ExchangeFile(src,dest,password) Set taro = CreateObject("JXW.Application") taro.Visible = True taro.Documents.Open src,password taro.ActiveDocument.SaveAs dest, "", "", "", 10, "" taro.Quit End Sub
Where a0 represents the path of the jtd file, and a1 represents the path to the jtd file. The path of the generated txt format file, a2 represents the password of the jtd file, which is actually the process of passing parameters to call the function.
After the script is perfected, it is a question of using java to call the vbs script. I found the answer to this question on stackoverflow. The calling method is as follows:
public static void main(String[] args) { try { Runtime.getRuntime().exec( "wscript D:/Send_Mail_updated.vbs" ); } catch( IOException e ) { System.out.println(e); System.exit(0); } }
Through the above series of steps, you can succeed Convert jtd files into txt files, but there are several problems:
Calling the vbs script through the java program does not return a value indicating whether the txt file is actually generated. If the password The error is that the corresponding txt file cannot be generated. My processing method is to check whether the txt file has been generated every once in a while. After a certain number of times, it will be judged that the conversion failed. The number of times is based on the file size. For example, a 10M file will be Check every 5 seconds, 10 times in total. If the txt file is not generated, it will be judged as a failure. This is a waste of time when trying the password, and the file may be relatively large, or the machine configuration is not good enough. The txt file is generated, but after the check time has passed, it is directly determined that it cannot be converted correctly;
Every time you run the vbs script, the Ichitaro software will be opened, and when trying the password, if the password If an error occurs, a Windows error pop-up window will appear on the server where the application is deployed. Although Ichitaro's process will be killed in the end, the customer can clearly see the Itaro program and error prompts before it is killed. This is very Embarrassing things;
If the jtd file is too large, for example, when the file reaches 30M, the script conversion speed will be very slow. Question 2 also mentioned that during the file conversion process, the customer can If the Ichitaro program is seen on the server, if the client directly kills Itaro during this period, then the file conversion will definitely fail;
The above problems have not been solved yet, and there will be more later It depends on the usage after deployment at the customer's end. If the jtd format files at the customer's end are all under 10M, then there shouldn't be much of a problem. However, if the files exceed 30M, the conversion process will definitely be slow. And there is always the risk that the Ichitaro software will be killed during the conversion process. The specific situation depends on the customer's trial situation.
That’s all for now about file parsing in jtd format. As for the extraction of content after converting jtd format files into txt format files, I will write about it later.
The above is the detailed content of jtd format file conversion analysis. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



When deleting or decompressing a folder on your computer, sometimes a prompt dialog box "Error 0x80004005: Unspecified Error" will pop up. How should you solve this situation? There are actually many reasons why the error code 0x80004005 is prompted, but most of them are caused by viruses. We can re-register the dll to solve the problem. Below, the editor will explain to you the experience of handling the 0x80004005 error code. Some users are prompted with error code 0X80004005 when using their computers. The 0x80004005 error is mainly caused by the computer not correctly registering certain dynamic link library files, or by a firewall that does not allow HTTPS connections between the computer and the Internet. So how about

Recently, many netizens have asked the editor, what is the file hiberfil.sys? Can hiberfil.sys take up a lot of C drive space and be deleted? The editor can tell you that the hiberfil.sys file can be deleted. Let’s take a look at the details below. hiberfil.sys is a hidden file in the Windows system and also a system hibernation file. It is usually stored in the root directory of the C drive, and its size is equivalent to the size of the system's installed memory. This file is used when the computer is hibernated and contains the memory data of the current system so that it can be quickly restored to the previous state during recovery. Since its size is equal to the memory capacity, it may take up a larger amount of hard drive space. hiber

Practical tips for converting full-width English letters into half-width forms. In modern life, we often come into contact with English letters, and we often need to input English letters when using computers, mobile phones and other devices. However, sometimes we encounter full-width English letters, and we need to use the half-width form. So, how to convert full-width English letters to half-width form? Here are some practical tips for you. First of all, full-width English letters and numbers refer to characters that occupy a full-width position in the input method, while half-width English letters and numbers occupy a full-width position.

This article will introduce in detail how to convert months in PHP to English months, and give specific code examples. In PHP development, sometimes we need to convert digital months to English months, which is very practical in some date processing or data display scenarios. The implementation principles, specific code examples and precautions will be explained in detail below. 1. Implementation principle In PHP, you can convert digital months into English months by using the DateTime class and format method. Date

QQ Music allows everyone to enjoy watching movies and relieve boredom. You can use this software every day to easily satisfy your needs. A large number of high-quality songs are available for everyone to listen to. You can also download and save them. The next time you listen to them, you don’t need an Internet connection. The songs downloaded here are not in MP3 format and cannot be used on other platforms. After the membership songs expire, there is no way to listen to them again. Therefore, many friends want to convert the songs into MP3 format. Here, the editor explains You provide methods so that everyone can use them! 1. Open QQ Music on your computer, click the [Main Menu] button in the upper right corner, click [Audio Transcoding], select the [Add Song] option, and add the songs that need to be converted; 2. After adding the songs, click to select Convert to [mp3]

PHP Tutorial: How to Convert Int Type to String In PHP, converting integer data to string is a common operation. This tutorial will introduce how to use PHP's built-in functions to convert the int type to a string, while providing specific code examples. Use cast: In PHP, you can use cast to convert integer data into a string. This method is very simple. You only need to add (string) before the integer data to convert it into a string. Below is a simple sample code

How to convert full-width English letters into half-width letters In daily life and work, sometimes we encounter situations where we need to convert full-width English letters into half-width letters, such as when entering computer passwords, editing documents, or designing layouts. Full-width English letters and numbers refer to characters with the same width as Chinese characters, while half-width English letters refer to characters with a narrower width. In actual operation, we need to master some simple methods to convert full-width English letters into half-width letters so that we can process text and numbers more conveniently. 1. Full-width English letters and half-width English letters

[Analysis of the meaning and usage of midpoint in PHP] In PHP, midpoint (.) is a commonly used operator used to connect two strings or properties or methods of objects. In this article, we’ll take a deep dive into the meaning and usage of midpoints in PHP, illustrating them with concrete code examples. 1. Connect string midpoint operator. The most common usage in PHP is to connect two strings. By placing . between two strings, you can splice them together to form a new string. $string1=&qu
