Matching URLs with Regular Expressions
Regular expressions can be daunting initially, but they offer powerful pattern-matching capabilities for diverse data types. In the context of extracting URLs, a flexible pattern is necessary to accommodate variations in URL formats.
One regular expression that matches URLs with or without a leading scheme (e.g., "http://www.example.com" or "www.example.com") is:
((https?|ftp)://)? // Optional SCHEME
([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)? // Optional User and Pass
([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3}))) // Host or IP address
(:[0-9]{2,5})? // Optional Port
(/([a-z0-9+$_%-]\.?)+)*/? // Path
(\?[a-z+&$_.-][a-z0-9;:@&%=+/$_.-]*)? // Optional GET Query
(#[a-z_.-][a-z0-9+$%_.-]*)? // Optional Anchor
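The fragments above can be assembled into a single PHP string. The following is a minimal sketch, not a definitive implementation: the `$parts` array and the use of `implode` are illustrative choices, and single quotes are used so that PHP does not try to interpolate the `$_` sequences inside the character classes.

```php
<?php
// Each fragment is single-quoted so "$" is never treated as a PHP variable.
// $parts is a hypothetical name used only for this sketch.
$parts = [
    '((https?|ftp)://)?',                                     // optional scheme
    '([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)?', // optional user and password
    '([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))', // host or IP address
    '(:[0-9]{2,5})?',                                         // optional port
    '(/([a-z0-9+$_%-]\.?)+)*/?',                              // path
    '(\?[a-z+&$_.-][a-z0-9;:@&%=+/$_.-]*)?',                  // optional query string
    '(#[a-z_.-][a-z0-9+$%_.-]*)?',                            // optional anchor
];

// Join the fragments into the pattern body (delimiters and anchors are added at match time).
$regex = implode('', $parts);
```

As written, the host fragment accepts only a final label of two to four letters, so a name such as example.museum would not match; widening `{2,4}` would be one way to relax that.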
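To use this expression in PHP, store it in a string variable ($regex in the examples here) and pass it to the preg_match function along with the URL you want to evaluate. If you build the string with double quotes, escape each $ in the pattern as \$ (or use single quotes instead), because PHP would otherwise read $_ as a variable. The examples wrap the pattern in ~ delimiters, anchor it with ^ and $ so that the whole string must match, and add the i modifier for case-insensitive matching. For example: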
<code class="php">$url = 'www.example.com/etcetc';
if (preg_match("~^$regex$~i", $url)) {
    echo 'Matched URL without protocol';
}</code>
Similarly, for URLs with protocols:
<code class="php">$url = 'http://www.example.com/etcetc';
if (preg_match("~^$regex$~i", $url)) {
    echo 'Matched URL with protocol';
}</code>
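This pattern should cover a wide range of common URL formats. Because the examples anchor it with ^ and $, the entire input must consist of permitted characters, so strings carrying anything else (spaces, quotes, or angle brackets, for instance) are rejected. Using ~ as the delimiter also means the "/" characters in the pattern need no escaping.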
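To illustrate why the ^ and $ anchors matter when checking untrusted input, here is a small sketch that compares an anchored match with an unanchored one. The helper names `isWholeUrl` and `containsUrl` are hypothetical and exist only for this example; the pattern is the one given earlier, written with single quotes.

```php
<?php
// The pattern from this article, single-quoted so "$" is not interpolated.
$regex = '((https?|ftp)://)?'
       . '([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)?'
       . '([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))'
       . '(:[0-9]{2,5})?'
       . '(/([a-z0-9+$_%-]\.?)+)*/?'
       . '(\?[a-z+&$_.-][a-z0-9;:@&%=+/$_.-]*)?'
       . '(#[a-z_.-][a-z0-9+$%_.-]*)?';

// Hypothetical helper: true only when the ENTIRE string fits the pattern.
function isWholeUrl(string $candidate, string $regex): bool
{
    return preg_match('~^' . $regex . '$~i', $candidate) === 1;
}

// Hypothetical helper: true when the pattern matches anywhere inside the string.
function containsUrl(string $candidate, string $regex): bool
{
    return preg_match('~' . $regex . '~i', $candidate) === 1;
}

$input = 'www.example.com/<script>alert(1)</script>';

var_dump(isWholeUrl($input, $regex));  // bool(false): "<" is not an allowed character
var_dump(containsUrl($input, $regex)); // bool(true): "www.example.com/" matches on its own
```

The anchored form is the one to use for validation; the unanchored form is better suited to locating URL-like substrings in longer text.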