Be more explicit about encoding, especially with HtmlBlockEditor

I just spent a lot of time researching encoding because I had a problem (which turned out not to be encoding-related at all) placing content created with CKEditor and then saved into the database into an XML file. For a while, I was certain it was CKEditor’s fault. I use the terms HtmlBlockEditor and CKEditor interchangeably. Here are the issues I think need to be improved:

  • CKEditor does not output plain UTF-8 text by default. A lot of people set the entities_latin option to false to (partially?) resolve this. It seems to convert certain characters to HTML entities on its own, but that isn’t clear to anyone using EWL.
  • No attention has been paid to what type of text an HtmlBlockEditor expects when loading it, or what it produces when you retrieve its text. This is not responsible.
  • At a minimum, the documentation for HtmlBlockEditor should lay out very clear guidelines for whether you should HtmlEncode the content before you store it in your database, and whether the output is UTF-8. It should also probably outline typical usage scenarios and the best practices for making the storage, loading/editing, and display of HTML content safe and convenient for everyone involved.
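
For reference on the first point, the settings people usually mean are CKEditor 4’s entity options. Here is a minimal sketch of them, assuming CKEditor 4. The variable name and the element id are made up, and I don’t know whether HtmlBlockEditor exposes a way to pass this configuration through.

```javascript
// Sketch only, assuming CKEditor 4. The name editorConfig is hypothetical.
const editorConfig = {
  // Do not convert characters to named entities such as &eacute; or &alpha;.
  // As far as I can tell this is the main switch; the two options below are
  // listed because they are the ones people usually mention.
  entities: false,
  entities_latin: false,
  entities_greek: false,
  // Keep escaping the basic entities (&amp;, &lt;, &gt;, &nbsp;) so the
  // output is still valid markup.
  basicEntities: true
};

// In a page that loads CKEditor 4, this would be applied with something like:
// CKEDITOR.replace( 'htmlBlockEditor', editorConfig ); // 'htmlBlockEditor' is a made-up id
```

With settings like these the editor should emit characters such as é directly rather than as named entities, which matters for XML because named entities other than the five predefined ones are not valid there without a DTD.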

To a lesser extent, this information should be explicitly specified for other controls such as EwfTextBox. There is nothing stopping markup from getting in there. (Or is there? ASP.NET can do it, but I think EWL disables that. EWL should say so, and it should also give guidelines on how to safely store, display, and load/edit this content.)

I am all in favor of being super-opinionated on this stuff. On the server-side, in memory, I think we may already be ok since everything on the page is just represented with strings, and ASP.NET is responsible for encoding the HTTP response before it sends it. We can definitely specify the encoding it uses if we’d like.

Doesn’t the HtmlBlockEditor already handle the database storage for you? I think all you need to do is give it a table with a varchar(max) or CLOB column and it’ll do the rest. I think the database dictates the text encoding; that’s the difference between varchar and varbinary, or CLOB and BLOB.

As for EwfTextBox, in textarea mode we definitely prevent markup from getting in there:
https://enduracode.kilnhg.com/Code/Ewl/Group/Canonical/Files/Standard%20Library/EnterpriseWebFramework/Form%20Controls/Text%20Box/EwfTextBox.cs#40

In single-line mode, it comes down to this line:
https://enduracode.kilnhg.com/Code/Ewl/Group/Canonical/Files/Standard%20Library/EnterpriseWebFramework/Form%20Controls/Text%20Box/EwfTextBox.cs#160

I suspect that Control.Attributes.Add performs some kind of encoding, but I am not 100% sure.
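
To make concrete what “some kind of encoding” would have to do there, here is a rough sketch of attribute-value escaping. It is written in JavaScript only for illustration; the helper name is hypothetical and this is not ASP.NET’s implementation, just the minimum an encoder needs to do for a double-quoted attribute.

```javascript
// Hypothetical helper, not ASP.NET's code: escape text so it can be placed
// inside a double-quoted HTML attribute value.
function encodeAttributeValue(text) {
  return String(text)
    .replace(/&/g, '&amp;')   // must run first so later entities are not re-escaped
    .replace(/"/g, '&quot;')  // a raw quote would end the attribute value
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Input that tries to break out of the value attribute.
const userInput = '"><script>alert(1)</script>';
const html = '<input type="text" value="' + encodeAttributeValue(userInput) + '">';

console.log(html);
// <input type="text" value="&quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;">
```

If the value were written out without this step, the quote in the input would close the attribute and the script tag would land in the page as markup. That is the behavior we would want to confirm or rule out for the single-line case.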