Use Jinja2 autoescaping #1921

haxtibal · 2024-07-18T19:47:44Z

This enables Jinja2 autoescaping everywhere, except for the diff screen which relies on pre generated HTML. The MR is not meant to be merged as-is.

It fixes the issues observed in #1920. Most integration tests pass. Tests fail for UI forms, presumably because they call html.escape manually and now are escaped twice.

Jinja2 recommneds autoescaping:

In future versions of Jinja we might enable autoescaping by default for security reasons.
As such you are encouraged to explicitly configure autoescaping now instead of relying on the default.

Templates access way more functions and variables than just enumerate_all_fields(). All need to be escaped. This does it automatically without risk of forgetting some places.

This MR changes some formerly explicit escaping:

enumerate_all_fields_escaped was the html.escape() variant of enumerate_all_fields. It's only evaluated in Jinja2 templates, therefore completely covered by autoescaping now. This also eliminates the need for get_text_value_escaped, which was only used by enumerate_all_fields_escaped.
RequirementFormField.field_escaped_value was the html.escape() variant of field_unescaped_value. field_escaped_value is only evaluated in Jinja2 templates, therefore completely covered by autoescaping now.
SearchScreenViewObject.search_value was explictly html.escape()ed. search_value is only read by Jinja2 templates, therefore completely covered by autoescaping now.
SourceFileViewHTMLGenerator changes some pygments output lines. pygements output is escaped, but we still have to escape our own changes. Changed html.escape to markupsafe.escape because that's what Jinja2 uses. It's better to use the same escaping function everywhere.
TextToHtmlWriter escaped plain text from multiline statements. We still need to do that. Changed html.escape to markupsafe.escape for same reason as above.

Considerations about what to escape

render_*_link make strings used as URL in href="xyz" . It's derived from filenames and UIDs. UIDs are restricted by a grammar regex. But Unix files can have weird names like document"<injected>.sdoc. So generated links should be URL encoded. HTML encoding is then not strictly necessary (wouldn't hurt either; could be considered as safety net). Suggestion: Do proper URL encoding in another PR; for now autoescaping prevents at least injection attacks.

TODO:

investigate failed ruff checks: Unrelated, new or changed ruff rule PLR1714
investigate failed integration tests (works locally): unrelated, we need either old filecheck 0.0.24 or fixed filecheck > 1.0.0
investigate failed end2end tests
try to replace some | safe with Markup() in code
reconsider if autoescaping could be used for the diff screen
Templates often render a variable that contains the output of calling a nested Template.render() programmatically. This output can be considered safe, yet we have to wrap it in Markup. Is there a way to automatically flag Template.render() as safe? Should already be that way: "Jinja functions (macros, super, self.BLOCKNAME) always return template data that is marked as safe."

stanislaw

This looks reasonable. Please proceed with the Jinja2 autoescape approach 👍

haxtibal · 2024-07-20T14:13:50Z

Integration tests fail because the recently released filecheck 1.0.0 doesn't support --dump-input. This is a regression unrelated to this PR.

stanislaw · 2024-07-20T17:32:42Z

Integration tests fail because the recently released filecheck 1.0.0 AntonLydike/filecheck#19. This is a regression unrelated to this PR.

Sorry for the inconvenience.

We are in the process of transferring the filecheck project to a group of developers who want to actively maintain and develop their own version of filecheck instead of FileCheck.py that was previously developed by me. We were too quick to release a 1.0.0, and it doesn't have the dump-input implemented yet.

I have reported this here: AntonLydike/filecheck#19 (comment).

strictdoc/core/document_iterator.py

strictdoc/core/file_tree.py

haxtibal · 2024-07-20T22:41:53Z

The latest commit "Mark all template.render() output as Markup" is to be discussed: We have to flag every programmatically generated HTML string as "safe", otherwise autoescaping would wrongly process them. This notably applies to all template.render() calls. I hoped Jinja2 would automatically mark the render output as safe, but seemingly not. So I tried to centralize Markup wrapping a bit. The result is a systematic but big diff. Would it be ok for you?

I asked Jinja2 folks whether such a feature could be implemented, see pallets/jinja#2003.

strictdoc/export/html/html_templates.py

strictdoc/backend/reqif/p01_sdoc/sdoc_to_reqif_converter.py

strictdoc/server/helpers/turbo.py

...eqif/UC55_G1_validations/UC55_G1_T01_not_a_reqif_format/test_UC55_G1_T01_not_a_reqif_file.py

stanislaw · 2024-07-21T12:42:42Z

The latest commit "Mark all template.render() output as Markup" is to be discussed: We have to flag every programmatically generated HTML string as "safe", otherwise autoescaping would wrongly process them. This notably applies to all template.render() calls. I hoped Jinja2 would automatically mark the render output as safe, but seemingly not. So I tried to centralize Markup wrapping a bit. The result is a systematic but big diff. Would it be ok for you?

I like the idea of render_template_as_markup because it also reduces the boilerplate everywhere.

I commented two times, asking about the possible redundancy of Markup / autoescape=True. I would like you to confirm that both mechanisms have to be used by us (Markup and autospace=True).

I asked Jinja2 folks whether such a feature could be implemented, see pallets/jinja#2003.

I could not understand what you are referring to in your example there:

The Template knows inner is created with autoescape=True. Couldn't it mark the result of render automatically as safe?

How can it make the result of render automatically if the result is a simple str object?

>>> inner = Template('<b>{{ first_name }}</b> {{ last_name }}', autoescape=True).render(first_name='John', last_name="Doe")
>>> print(type(inner))
<class 'str'>

While reviewing your PR, I considered an alternative approach that could be possible instead of doing the escaping with Jinja.

I recently encountered several implicit conversions with Jinja, when an object is implicitly called __str__ and so something incorrect is rendered to Jinja without me or the integration tests noticing this magic stuff.

An alternative could be that the whole StrictDoc uses a wrapper of a str class, e.g. sdoc_str, that would disable the __str__ method with an explicit assert. Instead the safe version of a string would be implemented as Markup(self) and only this method would be called in each and every Jinja template.

With this approach, the diff could be even bigger but in return we would get a very good control over the strings that we really want to use everywhere. It is in the same direction like the code of field_escaped_value/field_unescaped_value that you are removing now but this escaping stuff would be inside the sdoc_str.

I have confidence that this approach could work just fine but I don't have enough data to compare the advantages/disadvantages of using Jinja's escaping magic vs packing that explicitly in sdoc_str.

The sdoc_str approach would fit the current philosophy of SDoc of doing as many things explicit as possible. The Jinja autoescaping stuff is a different approach but also a capable one.

Another argument for maintaining sdoc_str would be that the escaping business is really core to what StrictDoc does, so keeping this function at the core of all string types could make sense.

And finally there was another source for me thinking in the direction of custom string wrappers. StrictDoc has many cases where we want to deal with a non-empty string and we never want an empty string. Python doesnot support this out of the box, so having a class such as sdoc_ne_str or something like that would simplify code in many places. Maybe it is not too bad of an idea to combine the "non-emptiness" and "markup safety" in one or in a family of str-based wrappers.

EDITED:

These are only thoughts. Let's move forward with Jinja's autoescape, safe, Markup. The alternative approach could work but let's keep it aside.

strictdoc/export/html/renderers/markup_renderer.py

docs/strictdoc_25_design.sdoc

haxtibal · 2024-07-23T21:51:54Z

EDITED:

These are only thoughts. Let's move forward with Jinja's autoescape, safe, Markup. The alternative approach could work but let's keep it aside.

Sorry for not commenting on your alternative approach yet. It sounds interesting. I was quite focused here and wanted to get autoescaping in a workable state first. Will comment soon and suggest to hop back to #1920 for discussion to not interleave the autoescaping approach with the sdoc_str approach too much.

strictdoc/export/html/generators/view_objects/document_screen_view_object.py

stanislaw · 2024-07-24T17:42:42Z

Thanks a lot for contributing this!

SourceFileViewHTMLGenerator escaping code had multiple issues: - When marking each line from range_start_pragma_processor as safe, we wrapped not only strings but also whole tuples in Markup(). - For tuple elements we didn't differentiate whether they from pygments (already escaped) or from source file (not yet escaped). Only the later must be explicitly escaped. Relates to strictdoc-project#1921.

SourceFileViewHTMLGenerator escaping code had multiple issues: - When marking each line from range_start_pragma_processor as safe, we wrapped not only strings but also whole tuples in Markup(). - For tuple elements we didn't differentiate whether they from pygments (already escaped) or from source file (not yet escaped). Only the later must be explicitly escaped. Some changes are there to make mypy happy. Relates to strictdoc-project#1921.

haxtibal mentioned this pull request Jul 18, 2024

Epic: Correct escaping in HTML output #1920

Closed

8 tasks

stanislaw reviewed Jul 18, 2024

View reviewed changes

haxtibal force-pushed the tdmg/escape_singleline_html branch from 6039713 to 3757368 Compare July 20, 2024 11:46

haxtibal force-pushed the tdmg/escape_singleline_html branch from 3757368 to d1a7be8 Compare July 20, 2024 14:52

stanislaw mentioned this pull request Jul 20, 2024

Missing --dump-input flags AntonLydike/filecheck#19

Closed

stanislaw reviewed Jul 20, 2024

View reviewed changes

strictdoc/core/document_iterator.py Outdated Show resolved Hide resolved

stanislaw reviewed Jul 20, 2024

View reviewed changes

strictdoc/core/file_tree.py Outdated Show resolved Hide resolved

stanislaw added this to the 2024-Q3 milestone Jul 20, 2024

stanislaw added the Second roadmap label Jul 20, 2024

haxtibal force-pushed the tdmg/escape_singleline_html branch from d1a7be8 to 9b71ab0 Compare July 20, 2024 22:21

haxtibal force-pushed the tdmg/escape_singleline_html branch from 9b71ab0 to 92f45e1 Compare July 21, 2024 09:00