[nycphp-talk] Regex for P Elements
justin
justin at justinhileman.info
Wed Jan 12 10:00:49 EST 2011
On Wed, Jan 12, 2011 at 8:24 AM, Randal Rust <randalrust at gmail.com> wrote:
> I am admittedly not very good with regular expressions. I am trying to
> pull all of the paragraphs out of an article, so that I can create
> inline links. Here is my script:
>
> $blockpattern='/<p*[^>]*>.*?<\/p>/';
> $blocks=preg_match_all($blockpattern, $txt, $blockmatches);
>
You really don't want the * after that first p, because this:
/<p*[^>]*>/
means, essentially, "Match a `<` character, then any number of `p`
(including 0), then a bunch of things that aren't `>`, then a `>`".
This regex will match anything of the form `<...>` -- i.e. every
opening and closing HTML tag in your document.
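A quick way to see this for yourself is to run the opening-tag part of the pattern on its own. This is only a sketch; the `$html` string is a made-up sample, not anything from your article:

```php
<?php
// Hypothetical sample markup, invented for illustration.
$html = '<div><p class="x">Hi</p><pre>code</pre></div>';

// The opening-tag half of the original pattern:
// `<`, zero or more `p`, any run of non-`>` characters, then `>`.
$count = preg_match_all('/<p*[^>]*>/', $html, $m);

// $m[0] holds every full match; here that is every tag in the string,
// opening and closing alike, not just the <p> tag.
print_r($m[0]);
echo $count, " tags matched\n";
```

Because `p*` is allowed to match nothing, the `p` contributes no restriction at all.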
Dropping the first * will get you closer:
/<p[^>]*>/
But that's still not right, as it'll get false positives on `<pre>`
and `<param>` tags. Instead use this:
/<p(\s+[^>]*)?>/
This only allows the "bunch of things that aren't `>`" part when
there's whitespace right after the `p`, so both a bare `<p>` and a
`<p>` with attributes match, while `<pre>` and `<param>` don't.
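To compare the two opening-tag patterns side by side, something like the following should do. The tag list and the `$loose` / `$strict` variable names are just illustrative choices of mine:

```php
<?php
// Hypothetical list of opening tags to test against.
$tags = ['<p>', '<p class="intro">', '<pre>', '<param name="x">'];

$loose  = '/<p[^>]*>/';        // no * after the p, but still too permissive
$strict = '/<p(\s+[^>]*)?>/';  // attributes only allowed after whitespace

// preg_grep() returns the array entries that match the pattern
// (keys are preserved, so reindex with array_values()).
$looseHits  = array_values(preg_grep($loose, $tags));
$strictHits = array_values(preg_grep($strict, $tags));

print_r($looseHits);   // includes <pre> and <param ...>
print_r($strictHits);  // only the two real <p> tags
```

Note that neither pattern is case-insensitive; if your markup might contain `<P>`, you would need to add the `i` modifier as well.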
The second half of your regex is right, but it does have the newline
problem you mentioned: by default `.` matches anything except a
newline. To get `.` to match newline characters too, use the `dotall`
flag by adding `s` after the final slash:
/<p(\s+[^>]*)?>.*?<\/p>/s
So that leaves us with:
$blockpattern = '/<p(\s+[^>]*)?>.*?<\/p>/s';
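Here is a rough sketch of the whole thing in use, with and without the `s` flag. The `$txt` contents are invented for the example, and `$withoutS` is a name I made up for the comparison:

```php
<?php
// Hypothetical article text; the second paragraph spans two lines.
$txt = "<p>First paragraph.</p>\n"
     . "<pre>not a paragraph</pre>\n"
     . "<p class=\"note\">Second\nparagraph.</p>";

$blockpattern = '/<p(\s+[^>]*)?>.*?<\/p>/s';

// preg_match_all() returns the NUMBER of matches (or false on error);
// the matched strings themselves land in $blockmatches[0].
$blocks = preg_match_all($blockpattern, $txt, $blockmatches);

// Same pattern without the `s` flag, for comparison: `.` stops at the
// newline, so the two-line paragraph is missed.
$withoutS = preg_match_all('/<p(\s+[^>]*)?>.*?<\/p>/', $txt, $ignored);

echo "with s: $blocks, without s: $withoutS\n";
print_r($blockmatches[0]);
```

One thing to watch in your original script: `$blocks` is the match count, so loop over `$blockmatches[0]` to get at the paragraphs.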
--
http://justinhileman.com