Home Backend Development PHP Tutorial 基于preg_match_all采集后数据处理的一点心得笔记(编码转换和正_PHP

基于preg_match_all采集后数据处理的一点心得笔记(编码转换和正_PHP

Jun 01, 2016 am 11:56 AM
regular transcoding

1、使用curl实现站外采集

具体请参考我上一篇笔记:http://www.bitsCN.com/article/46432.htm

2、编码转换
首先通过查看源代码找到采集的网站使用的编码,通过mb_convert_encoding函数进行转码;

具体使用方法:

复制代码 代码如下:
//源字符是$str

//以下已知原编码为GBK,转换为utf-8
mb_convert_encoding($str, "UTF-8", "GBK");

//以下未知原编码,通过auto自动检测后,转换编码为utf-8
mb_convert_encoding($str, "UTF-8", "auto");

3、为更好地避开换行符和空格等不定因素的阻碍,有必要先清除采集到的源码中的换行符、空格符和制表符

复制代码 代码如下:
//方法一,使用str_replace进行替换
$contents = str_replace("\r\n", '', $contents); //清除换行符
$contents = str_replace("\n", '', $contents); //清除换行符
$contents = str_replace("\t", '', $contents); //清除制表符
$contents = str_replace(" ", '', $contents); //清除空格符

//方法二,使用正则表达式进行替换
$contents = preg_replace("/([\r\n|\n|\t| ]+)/",'',$contents);

4、通过正则表达式匹配找出需要获得的代码段,使用preg_match_all实现该匹配

复制代码 代码如下:
函数解释:
int preg_match_all ( string pattern, string subject, array matches [, int flags] )
pattern即正规表达式
subject即要进行查找的原文
matches是用于储存输出结果的数组
flags是储存的模式,包括:
    PREG_PATTERN_ORDER;  //整个数组是二维数组,$arr1[0]是包括边界所构成匹配字符串的数组,$arr1[1]除去边界所构成的匹配字符串的数组
    PREG_SET_ORDER;  //整个数组是二维数组,$arr2[0][0]是第一个包括边界所构成的匹配的字符串,$arr2[0][1]是第一个除去边界所构成的匹配的字符串,之后的数组以此类推
    PREG_OFFSET_CAPTURE;  //整个数组是三维数组,$arr3[0][0][0]是第一个包括边界所构成的匹配的字符串,$arr3[0][0][1]是到达第一个匹配字符串的边界的偏移量(边界不算在内),之后以此类推,$arr2[1][0][0]是第一个包括边界所构成的匹配的字符串,$arr3[1][0][1]是到达第一个匹配字符串的边界的偏移量(边界算在内);

//实际应用
preg_match_all('/(.*?)/',$contents, $out, PREG_SET_ORDER);
$out将获取到所有匹配的元素
$out[0][0]将是包括

在内的全段字符
$out[0][1]将是仅包括(.*?)括号内所匹配到的字符段

//如此类推,第n个匹配到的字段可以用以下方法取得
$out[n-1][1]

//若正则表达式中存大多个括号,则取得句中第m个匹配点的方法是
$out[n-1][m]

5、取得要找到字符后,若要去掉html标签,使用PHP自带的函数strip_tags即可方便地实现

复制代码 代码如下:
//例
$result=strip_tags($out[0][1]);

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
1 months ago By 尊渡假赌尊渡假赌尊渡假赌
Two Point Museum: All Exhibits And Where To Find Them
1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to type underline on computer How to type underline on computer Feb 19, 2024 pm 08:36 PM

How to underline on the computer When entering text on the computer, we often need to use underlines to highlight certain content or mark it. However, for some people who are not very familiar with computer input methods, typing underline can be a bit confusing. This article will introduce you to how to underline on your computer. In different computer operating systems and software, the way to enter the underscore may be slightly different. The following will introduce the common methods on Windows operating system and Mac operating system respectively. First, let’s take a look at the operation in Windows

How to replace a string starting with something with php regular expression How to replace a string starting with something with php regular expression Mar 24, 2023 pm 02:57 PM

PHP regular expressions are a powerful tool for text processing and conversion. It can effectively manage text information by parsing text content and replacing or intercepting it according to specific patterns. Among them, a common application of regular expressions is to replace strings starting with specific characters. We will explain this as follows

How to match multiple words or strings using Golang regular expression? How to match multiple words or strings using Golang regular expression? May 31, 2024 am 10:32 AM

Golang regular expressions use the pipe character | to match multiple words or strings, separating each option as a logical OR expression. For example: matches "fox" or "dog": fox|dog matches "quick", "brown" or "lazy": (quick|brown|lazy) matches "Go", "Python" or "Java": Go|Python |Java matches words or 4-digit zip codes: ([a-zA

How to use regular expressions to remove Chinese characters in php How to use regular expressions to remove Chinese characters in php Mar 03, 2023 am 10:12 AM

How to remove Chinese in PHP using regular expressions: 1. Create a PHP sample file; 2. Define a string containing Chinese and English; 3. Use "preg_replace('/([\x80-\xff]*)/i', '',$a);" The regular method can remove Chinese characters from the query results.

How to use regular matching to remove html tags in php How to use regular matching to remove html tags in php Mar 21, 2023 pm 05:17 PM

In this article, we will learn how to remove HTML tags and extract plain text content from HTML strings using PHP regular expressions. To demonstrate how to remove HTML tags, let's first define a string containing HTML tags.

Explore the implementation mechanism of golang encoding conversion Explore the implementation mechanism of golang encoding conversion Feb 19, 2024 pm 03:21 PM

As a powerful programming language, Golang has high performance and concurrency capabilities, and also provides rich standard library support, including support for encoding conversion. This article will deeply explore the implementation principles of encoding conversion in Golang and analyze it with specific code examples. What is transcoding? Encoding conversion refers to the process of converting a sequence of characters from one encoding to another. In actual development, we often need to handle conversions between different encodings, such as converting UTF-8 encoded strings.

A simple way to learn dedecms encoding conversion function A simple way to learn dedecms encoding conversion function Mar 14, 2024 pm 02:09 PM

Learning dedecms encoding conversion function is not complicated. Simple code examples can help you quickly master this skill. In dedecms, the encoding conversion function is usually used to deal with problems such as Chinese garbled characters and special characters to ensure the normal operation of the system and the accuracy of data. The following will introduce in detail how to use the encoding conversion function of dedecms, allowing you to easily cope with various encoding-related needs. 1.UTF-8 to GBK In dedecms, if you need to convert UTF-8 encoded string to G

How to deal with encoding conversion problems in C++ development How to deal with encoding conversion problems in C++ development Aug 22, 2023 am 11:07 AM

How to deal with encoding conversion issues in C++ development. During the C++ development process, we often encounter problems that require conversion between different encodings. Because there are differences between different encoding formats, you need to pay attention to some details when performing encoding conversion. This article will introduce how to deal with encoding conversion issues in C++ development. 1. Understand different encoding formats. Before dealing with encoding conversion issues, you first need to understand different encoding formats. Common encoding formats include ASCII, UTF-8, GBK, etc. ASCII is the earliest encoding format

See all articles