Monday, 12 September 2011

\s != &nbsp;

I use a lot of regex in Java because it's freaking handy. While walking through some HTML, I found something odd: the input String "<br>  </b>" wasn't matched by the regex "<br>[\\s]*</b>"! I found that strange, because a space is whitespace, right? WRONG! The original text on the web page was "<br> &nbsp;</b>", and the non-breaking space (U+00A0) isn't included in the \s whitespace class.
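To make the mismatch concrete, here is a small sketch of the situation described above. The class name NbspDemo and the helper matches are made up for illustration; only the pattern and the two inputs come from the post, and the default (non-Unicode) behaviour of \s is assumed.

```java
import java.util.regex.Pattern;

class NbspDemo {
    // The pattern from the post: <br>, any run of \s characters, then </b>.
    static final Pattern WHITESPACE_ONLY = Pattern.compile("<br>[\\s]*</b>");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean matches(String input) {
        return WHITESPACE_ONLY.matcher(input).matches();
    }

    public static void main(String[] args) {
        String plainSpaces = "<br>  </b>";    // two ordinary spaces (U+0020)
        String withNbsp = "<br> \u00A0</b>";  // a space followed by a no-break space (U+00A0)

        System.out.println(matches(plainSpaces)); // prints true
        System.out.println(matches(withNbsp));    // prints false
    }
}
```

Both strings look identical when printed, which is why this one is so easy to miss.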

So how did I solve it?
This problem can be solved by adding \u00A0 to the character class ("<br>[\\s\\u00A0]*</b>") or by using \p{javaSpaceChar}, which matches the characters accepted by Character.isSpaceChar ("<br>[\\p{javaSpaceChar}]*</b>").
I couldn't find \p{javaSpaceChar} in most of the Java documentation. Why? I don't know.
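A quick sketch of both fixes side by side, in case it helps. The class name NbspFix and the helper method names are invented for this example; the two patterns are the ones from the post.

```java
import java.util.regex.Pattern;

class NbspFix {
    // Fix 1: add U+00A0 to the character class next to \s.
    static final Pattern WITH_NBSP = Pattern.compile("<br>[\\s\\u00A0]*</b>");

    // Fix 2: \p{javaSpaceChar} matches whatever Character.isSpaceChar accepts.
    static final Pattern SPACE_CHAR = Pattern.compile("<br>[\\p{javaSpaceChar}]*</b>");

    // Hypothetical helpers: true when the whole input matches the pattern.
    static boolean matchesWithNbsp(String input) {
        return WITH_NBSP.matcher(input).matches();
    }

    static boolean matchesSpaceChar(String input) {
        return SPACE_CHAR.matcher(input).matches();
    }

    public static void main(String[] args) {
        String withNbsp = "<br> \u00A0</b>"; // a space followed by a no-break space
        String withTab = "<br>\t</b>";       // a tab between the tags

        System.out.println(matchesWithNbsp(withNbsp));  // prints true
        System.out.println(matchesSpaceChar(withNbsp)); // prints true
        System.out.println(matchesWithNbsp(withTab));   // prints true
        System.out.println(matchesSpaceChar(withTab));  // prints false
    }
}
```

One thing to watch: Character.isSpaceChar only covers Unicode space, line and paragraph separators, so \p{javaSpaceChar} on its own does not match tabs or newlines. If the HTML may contain those as well, the first fix, or a combined class such as "[\\s\\p{javaSpaceChar}]", is probably the safer choice.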