Since 2016 I have been working on a project in which the module I am mainly responsible for is file parsing. I made all kinds of mistakes and ran into all kinds of trouble along the way, but it is finally finished, and all the parts of the project have now been put together. Let me summarize how these files are parsed, for future reference. The file types parsed in this project are Office files, pdf, csv, rtf, txt, jtd, emails in eml, msg and pst formats, and rar and zip archives. While unpacking archives we also came across files in the mlf format. Neither I nor the senior engineers at my company could work out how to parse that format, so we have given up on it for now; the rest of the parsing has been done. I will go through the formats one by one in later posts. For file parsing I use Apache Tika.
Today we start with parsing jtd files. Some people may not know what a jtd file is, so let me explain first:
A jtd file is the file format generated by Ichitaro (一太郎), a Japanese word-processing program.
You can think of a jtd file as the equivalent of the Word documents we normally use, except that it has to be opened and edited with Ichitaro.
I was quite taken aback when I first saw this requirement. How was I supposed to do this? It is Japanese software, so I could not understand the documentation even when I looked it up, and I found nothing on Baidu or Stack Overflow. Fortunately, a senior colleague at the company who reads Japanese found a solution on a Japanese website (http://d.hatena.ne.jp/satorufujimori/20070227/1172549793). The solution is to use a VBS script to convert the jtd file into a txt file, and then parse that txt file to obtain the content. The script on the website is as follows:
' taro2txt.vbs
Set taro = CreateObject("JXW.Application")
taro.Visible = True
taro.Documents.Open "c:\taro\a.jtd"
taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" ' ※1
taro.Quit
Note the 10: it is an identifier for the output format, and 10 means converting the jtd file to txt. To convert to another format you would replace 10 with a different identifier. Awkwardly, we never found any documentation stating which number stands for which format, so I tried every value from 0 to 100. That produced a lot of garbled formats, and the only useful one was 10, which means that in practice we can only convert jtd files to txt. All images in the original file are lost in the conversion. However, our business requirement is only to read the file content and index it into Solr for search, so losing the images is acceptable, and this is the method we adopted.
The script above converts jtd files that have no password into txt files. The awkward part is that our jtd files are password protected. Fortunately this was solved in the end. I no longer remember how we arrived at it, but the solution is as follows:
' taro2txt.vbs
Set taro = CreateObject("JXW.Application")
taro.Visible = True
taro.Documents.Open "c:\taro\a.jtd", password ' add the password here
taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" ' ※1
taro.Quit
Once the script is written, run it to convert the given jtd file into a txt file, then process the txt file and extract its content (content extraction for txt files will be covered in a separate article).
That solves the problem above, but another one remains: I cannot write a separate script file for every jtd file, and I do not know in advance which files the customer has. So I decided to pass parameters to the VBS script. I do not really know VBS syntax, but I wrote the following based on what I found online:
Option Explicit

' Command-line arguments: source jtd path, destination txt path, password
Dim a0 : a0 = WScript.Arguments(0)
Dim a1 : a1 = WScript.Arguments(1)
Dim a2 : a2 = WScript.Arguments(2)
Dim taro

ExchangeFile a0, a1, a2

Sub ExchangeFile(src, dest, password)
    Set taro = CreateObject("JXW.Application")
    taro.Visible = True
    taro.Documents.Open src, password
    taro.ActiveDocument.SaveAs dest, "", "", "", 10, ""
    taro.Quit
End Sub
Here a0 is the path of the jtd file, a1 is the path of the txt file to be generated, and a2 is the password of the jtd file. The script simply passes these arguments on to the function call.
With the script in place, the next question is how to call a VBS script from Java. I found the answer on Stack Overflow. The call looks like this:
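import java.io.IOException;

// The class name is illustrative; the original snippet only showed main().
public class RunVbs {
    public static void main(String[] args) {
        try {
            // Launch the script with the Windows Script Host
            Runtime.getRuntime().exec("wscript D:/Send_Mail_updated.vbs");
        } catch (IOException e) {
            System.out.println(e);
            System.exit(1); // non-zero status signals that the launch failed
        }
    }
}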
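To connect the Stack Overflow call with the parameterized script, here is a minimal sketch of how the three arguments might be passed from Java and how the caller could wait for the script host to exit. The class name `JtdConverter`, the method names and the 60 second timeout are hypothetical and not from the project. Using `cscript` instead of `wscript` is also a swap made for this sketch. Whether the exit code says anything about the conversion depends on the script (for example, on whether it calls `WScript.Quit` with a non-zero code), which the script above does not do.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the project's actual code.
final class JtdConverter {

    // Builds: cscript //nologo <script> <src> <dest> <password>
    // The three trailing values become WScript.Arguments(0..2) in the script.
    static List<String> buildCommand(String script, String src, String dest, String password) {
        return List.of("cscript", "//nologo", script, src, dest, password);
    }

    // Starts the command and waits up to timeoutSeconds for it to finish.
    // Returns the process exit code, or -1 if the timeout expired.
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Only the script host is killed here; any application the
            // script launched may still need to be cleaned up separately.
            process.destroyForcibly();
            return -1;
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("usage: JtdConverter <script.vbs> <src.jtd> <dest.txt> <password>");
            return;
        }
        int exit = run(buildCommand(args[0], args[1], args[2], args[3]), 60);
        System.out.println("script host exit code: " + exit);
    }
}
```

Passing the arguments as a list rather than one command string leaves the quoting of paths with spaces to `ProcessBuilder`.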
With the steps above you can successfully convert jtd files into txt files, but several problems remain:
1. Calling the VBS script from the Java program returns no value indicating whether the txt file was actually generated. If the password is wrong, the txt file is never generated. My approach is to check at intervals whether the txt file exists, and to treat the conversion as failed after a certain number of checks. The number of checks depends on the file size: for a 10 MB file, for example, I check every 5 seconds, 10 times in total, and if the txt file still has not appeared the conversion is judged to have failed. This wastes a lot of time when trying passwords. Also, if the file is relatively large or the machine is underpowered, the txt file may well be generated eventually, but only after the checking window has passed, by which point the file has already been judged unconvertible.
2. Every run of the VBS script opens Ichitaro, and if a wrong password is tried, a Windows error dialog pops up on the server where the application is deployed. Ichitaro's process is killed in the end, but before that the customer can clearly see the Ichitaro window and the error message, which is very awkward.
3. If the jtd file is too large, for example 30 MB, the script converts it very slowly. As mentioned in problem 2, the customer can see Ichitaro on the server during the conversion, and if the customer kills Ichitaro during that time the conversion will certainly fail.
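The polling check described above (looking at intervals for the txt file and giving up after a size-dependent number of checks) could be sketched in Java roughly as follows. The class and method names are hypothetical. The rule of one check per started megabyte is an assumption that merely reproduces the single data point given in the text (10 MB, 10 checks, 5 seconds apart); the project's real rule is not stated.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not the project's actual code.
final class ConversionPoller {

    // Assumption: one check per started megabyte, so a 10 MB file gets 10 checks.
    static int attemptsFor(long sizeBytes) {
        long oneMegabyte = 1L << 20;
        long megabytes = (sizeBytes + oneMegabyte - 1) / oneMegabyte;
        return (int) Math.max(1, Math.min(megabytes, Integer.MAX_VALUE));
    }

    // Sleeps intervalMillis, then evaluates the check; repeats up to `attempts` times.
    // Returns true as soon as the check passes, false if it never does.
    static boolean pollUntil(BooleanSupplier done, int attempts, long intervalMillis) {
        for (int i = 0; i < attempts; i++) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
            if (done.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    // Waits for the converted txt file, checking every 5 seconds.
    static boolean waitForTxt(Path txt, long jtdSizeBytes) {
        return pollUntil(() -> Files.isRegularFile(txt), attemptsFor(jtdSizeBytes), 5_000L);
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: ConversionPoller <dest.txt> <jtd size in bytes>");
            return;
        }
        boolean ok = waitForTxt(Paths.get(args[0]), Long.parseLong(args[1]));
        System.out.println(ok ? "txt file found" : "conversion judged failed");
    }
}
```

A file that exists is not necessarily fully written, so a real check might also want to confirm that the file size has stopped changing; that refinement is left out of this sketch.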
None of these problems has been solved yet, and more may turn up, depending on how things go once the system is deployed at the customer's site. If the customer's jtd files are all under 10 MB there should not be much of a problem. If files exceed 30 MB, however, conversion will certainly be slow, and there is always the risk that Ichitaro gets killed during the conversion. How it actually turns out depends on the customer's trial.
That is all for now about parsing jtd files. I will write later about extracting the content once a jtd file has been converted to txt.