I am using the Text_Diff classes in PHP to generate the differences between two XML documents. The output is not always well-formed XML: the tag nesting is sometimes incorrect. This happens because my source files are XML and have their own tags. When Text_Diff inserts its own <ins> and <del> tags around the changed text, it sometimes breaks the tag hierarchy.
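To make the problem concrete, here is a minimal sketch of how the broken nesting can be detected with PHP's bundled DOM extension. The helper name `isWellFormedXml` and both sample strings are made up for illustration; they are not Text_Diff output or part of its API.

```php
<?php
// Hypothetical helper (not part of Text_Diff): reports whether a string
// parses as well-formed XML using the DOM extension.
function isWellFormedXml(string $xml): bool
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);

    // Discard the collected errors and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

// Made-up sample: the inserted text crosses an element boundary,
// so <ins> is closed while <b> is still open.
$broken = '<p>Some <ins>new <b>bold</ins> text</b></p>';

// The same change with <ins> split so that it nests correctly.
$fixed = '<p>Some <ins>new </ins><b><ins>bold</ins> text</b></p>';

var_dump(isWellFormedXml($broken)); // bool(false)
var_dump(isWellFormedXml($fixed));  // bool(true)
```

This only detects the problem; it does not repair it.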
I am looking for a clean, fast, and safe way to fix such invalid XML.
I have looked at Tidy, its PHP library, and htmLawed. I like htmLawed because it is a pure PHP implementation, but I don’t know how fast it is compared to Tidy. Moreover, I need an XML cleaner, not necessarily an XHTML cleaner, so even if I use one of these libraries, I will have to strip the HTML parts out of the output.
Do you have any suggestions / recommendations?