PHPAnalysis is a widely used Chinese word segmentation class for PHP. It segments text using reverse matching, so it is compatible with a wider range of encodings. Its member variables and commonly used functions are explained in detail below:
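To make the idea of reverse matching concrete, here is a minimal sketch of reverse maximum matching against a toy dictionary. It is not PHPAnalysis's actual implementation; the function name reverseMaxMatch, the $maxLen parameter, and the four-word dictionary are all assumptions made for illustration.

```php
<?php
// Toy dictionary (hypothetical data); a real dictionary is far larger.
$dict = ['研究' => true, '研究生' => true, '生命' => true, '起源' => true];

/**
 * Reverse maximum matching: scan from the end of the text and, at each
 * position, take the longest dictionary word that ends there.
 */
function reverseMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    // Split into UTF-8 characters without relying on the mbstring extension.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        $matched = 1; // fall back to a single character if nothing matches
        for ($len = min($maxLen, $end); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($dict[$candidate])) {
                $matched = $len;
                break;
            }
        }
        // Prepend, because words are discovered from right to left.
        array_unshift($words, implode('', array_slice($chars, $end - $matched, $matched)));
        $end -= $matched;
    }
    return $words;
}

echo implode(' ', reverseMaxMatch('研究生命起源', $dict)), "\n"; // prints "研究 生命 起源"
```

Scanning from the right picks 起源, then 生命, then 研究. A left-to-right scan with the same dictionary would greedily take 研究生 first and leave 命 stranded, which is the usual argument for matching in reverse.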
1. Important member variables
$resultType = 1 Type of segmentation result to generate (1 is everything, 2 is dictionary words, single CJK characters (simplified or traditional), and English, 3 is dictionary words and English). A dedicated setter for this variable is described later in this document.
$notSplitLen = 5 Minimum length a sentence must have before it is split
$toLower = false Convert all English words to lowercase
$differMax = false Use the maximum splitting mode to disambiguate bigrams
$unitWord = true Try to merge words (i.e. new word recognition)
$differFreq = false Use hot word priority mode for disambiguation
2. Main member functions
public function __construct($source_charset='utf-8', $target_charset='utf-8', $load_all=true, $source='')
Function description: Constructor
Parameter list:
$source_charset Source string encoding
$target_charset Target string encoding
$load_all Whether to fully load the dictionary (this parameter is no longer used)
$source Source string
If both the input and the output are utf-8, you can initialize the class without any parameters and set the text to be processed through the SetSource method.
public function SetSource($source, $source_charset='utf-8', $target_charset='utf-8')
Function description: Set source string
Parameter list:
$source Source string
$source_charset Source string encoding
$target_charset Target string encoding
Return value: bool
public function StartAnalysis($optimize=true)
Function description: Start the word segmentation operation
Parameter list:
$optimize Whether to try to optimize the results after word segmentation
Return value: void
A basic word segmentation process:
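////////////////////////////////////////
$pa = new PhpAnalysis();
// Set the text to segment (see the SetSource method above)
$pa->SetSource('Text to be segmented');
// Set word segmentation attributes
$pa->resultType = 2;
$pa->differMax = true;
// Run the segmentation
$pa->StartAnalysis();
// Get the results you want
$result = $pa->GetFinallyIndex();
////////////////////////////////////////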
public function SetResultType($rstype)
Function description: Set the type of the returned result. This is simply an assignment to the member variable $resultType.
Values of the parameter $rstype:
1 is everything, 2 is dictionary words, single CJK characters (simplified or traditional), and English, 3 is dictionary words and English
Return value: void
public function GetFinallyKeywords($num = 10)
Function description: Get the specified number of most frequently occurring terms (usually used to extract document keywords)
Parameter list:
$num = 10 Number of entries to return
Return value: keyword list separated by ","
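As an illustration of the return format described here, the sketch below counts term frequencies and joins the most frequent ones with ",". The helper topKeywords and the sample term list are hypothetical and are not part of PHPAnalysis; the class's own ranking may apply additional filtering.

```php
<?php
/**
 * Return the $num most frequent terms joined by ",".
 * Hypothetical helper for illustration; not part of PHPAnalysis.
 */
function topKeywords(array $terms, int $num = 10): string
{
    $counts = array_count_values($terms); // term => number of occurrences
    arsort($counts);                      // highest count first, keys preserved
    return implode(',', array_slice(array_keys($counts), 0, $num));
}

$terms = ['分词', '词典', '分词', '编码', '分词', '词典'];
echo topKeywords($terms, 2), "\n"; // prints "分词,词典"
```

The intermediate $counts array has the same 'word' => count shape, sorted by frequency, that a frequency index of the segmented terms would have.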
public function GetFinallyResult($spword=' ')
Function description: Get the final word segmentation result
Parameter list:
$spword Separator between entries
Return value: string
public function GetSimpleResult()
Function description: Get the rough (coarse) segmentation result
Return value: array
public function GetSimpleResultAll()
Function description: Get the rough (coarse) segmentation result including attribute information
Attributes: 1 is Chinese words and sentences, 2 is ANSI words (including full-width), 3 is ANSI punctuation marks (including full-width), 4 is numbers (including full-width), 5 is Chinese punctuation or unrecognizable characters
Return value: array
public function GetFinallyIndex()
Function description: Get the hash index array
Return value: array('word'=>count,...), sorted by frequency of occurrence
public function MakeDict($source_file, $target_file='')
Function description: Compile a text-file word list into a dictionary
Parameter list:
$source_file Source text file
$target_file Target file (defaults to the current dictionary if not specified)
Return value: void
public function ExportDict($targetfile)
Function description: Export all entries of the current dictionary to a text file
Parameter list:
$targetfile Target file
Return value: void