Microsoft Word converts certain characters into “smart characters”: curly double quotes, dashes (em dash and en dash), bullets and so on.
These characters break PHP’s XML handling (or at least they broke it for me, using simplexml_load_string).
How do you clean them?
There is an old post that suggests using ereg_replace on a set of characters, essentially converting them to HTML entities. (Note that ereg_replace was deprecated in PHP 5.3 and removed in PHP 7.)
Unfortunately, that did not work for me. Since the text is UTF-8, the replace logic replaced ordinary letters too, presumably because it works on single bytes and some of those bytes also occur inside multi-byte UTF-8 letters.
I tried hard to find a solution, but could not find anything that worked. Finally, I just stripped out all non-printing characters except line breaks and tabs.
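To illustrate the problem, here is a small sketch. It assumes the old approach replaced single Windows-1252 bytes such as chr(147) (the left curly double quote in that encoding); that detail is my assumption, not something I have checked against the old post. The variable names are made up for the example.

[php]
<?php
// "Óscar" in UTF-8: the letter Ó is stored as the two bytes 0xC3 0x93.
$text = "\u{00D3}scar";

// Assumed old-style cleanup: swap byte 0x93 (chr(147)) for a plain quote.
// str_replace works on bytes, so it hits the second byte of Ó.
$broken = str_replace(chr(147), '"', $text);

// preg_match with the /u modifier returns 1 for valid UTF-8
// and false when the subject is not valid UTF-8.
$validBefore = preg_match('//u', $text);
$validAfter  = preg_match('//u', $broken);
[/php]

After the replace, the string is no longer valid UTF-8: the lead byte 0xC3 is left behind without its continuation byte.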
[php]
private function cleanWordSpecialCharacters($body)
{
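// Without the /u modifier, [:print:] is matched byte by byte, so the bytes
// of multi-byte UTF-8 characters are stripped as well.
$body = preg_replace( '/[^[:print:]\n\r\t]/', '', $body );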
return $body;
}
[/php]
This too breaks with non-English characters. Any suggestions?
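One direction that might work, as a minimal sketch rather than a tested fix: replace the smart characters as whole UTF-8 sequences instead of single bytes, and then strip only control characters using a Unicode-aware regex. It assumes the input is valid UTF-8 and PHP 7 or later (for the \u{...} escapes). The function name replaceWordSmartCharacters and the choice of ASCII replacements are hypothetical, picked just for this example.

[php]
<?php
// Hypothetical helper: map common Word "smart" characters to plain ASCII.
// Assumes $body is valid UTF-8.
function replaceWordSmartCharacters(string $body): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{2022}" => '*',   // bullet
    ];

    // strtr replaces whole multi-byte sequences, so other UTF-8
    // letters (accents, non-Latin scripts) are left untouched.
    $body = strtr($body, $map);

    // Strip control characters except tab, LF and CR.
    // The /u modifier makes PCRE treat the subject as UTF-8.
    $clean = preg_replace('/[^\P{Cc}\t\n\r]/u', '', $body);

    // preg_replace returns null on invalid UTF-8; fall back to the
    // mapped string in that case.
    return $clean ?? $body;
}

echo replaceWordSmartCharacters("\u{201C}Ol\u{00E1}\u{201D} \u{2013} ok");
// prints: "Olá" - ok
[/php]

If you would rather keep the typography, the values in the map could be numeric entities such as &#8220; instead of ASCII; the important part is matching complete characters, not bytes.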