Skip to content

Conversation

naitoh
Copy link
Contributor

@naitoh naitoh commented Sep 6, 2025

Why?

GitHub: fix #289

We must reject all well-formed XMLs:

https://www.w3.org/TR/2006/REC-xml11-20060816/#proc-types

Validating and non-validating processors alike MUST report violations of this specification's well-formedness constraints in the content of the document entity and any other parsed entities that they read.

No root element XML is not well-formed because document requires one element:

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-well-formed

[1] document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

@naitoh naitoh marked this pull request as draft September 6, 2025 08:46
@naitoh naitoh marked this pull request as ready for review September 6, 2025 09:36
@naitoh naitoh requested a review from kou September 6, 2025 09:36
@naitoh naitoh force-pushed the add_root_tag_check branch from a36267b to 706c471 Compare September 6, 2025 14:34
@kou kou changed the title Added a check for the presence of the root tag Reject no root element XML as an invalid XML Sep 6, 2025
@naitoh naitoh requested a review from kou September 6, 2025 23:01
@naitoh naitoh force-pushed the add_root_tag_check branch from 706c471 to b89d656 Compare September 7, 2025 07:13
@naitoh naitoh force-pushed the add_root_tag_check branch from b89d656 to c0f1cd5 Compare September 7, 2025 07:38
@naitoh naitoh requested a review from kou September 7, 2025 07:42
@kou kou merged commit 6ba286c into ruby:master Sep 7, 2025
67 checks passed
@naitoh naitoh deleted the add_root_tag_check branch September 7, 2025 07:59
@jrafanie
Copy link

jrafanie commented Sep 8, 2025

First of all, thanks for the work on this project. 🙇

I'm noticing this handles empty string but not nil input.

In 3.4.2, they had the same behavior:

irb(main):003> REXML::Document.new("")
=> <UNDEFINED/>
irb(main):004> REXML::Document.new(nil)
=> <UNDEFINED/>

Now, in 3.4.3, the behavior is changed:

irb(main):002> REXML::Document.new("")
../.gem/ruby/3.3.9/gems/rexml-3.4.3/lib/rexml/parsers/baseparser.rb:271:in `pull_event': Malformed XML: No root element (REXML::ParseException)
Line: 0
Position: 0
Last 80 unconsumed characters:

	from ../.gem/ruby/3.3.9/gems/rexml-3.4.3/lib/rexml/parsers/baseparser.rb:249:in `pull'
	from ../.gem/ruby/3.3.9/gems/rexml-3.4.3/lib/rexml/parsers/treeparser.rb:21:in `parse'
	from ../.gem/ruby/3.3.9/gems/rexml-3.4.3/lib/rexml/document.rb:466:in `build'
	from ../.gem/ruby/3.3.9/gems/rexml-3.4.3/lib/rexml/document.rb:103:in `initialize'
	from (irb):2:in `new'
	from (irb):2:in `<main>'
	from <internal:kernel>:187:in `loop'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/cli/console.rb:16:in `run'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/cli.rb:482:in `console'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/vendor/thor/lib/thor/command.rb:28:in `run'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/vendor/thor/lib/thor.rb:538:in `dispatch'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/cli.rb:35:in `dispatch'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/vendor/thor/lib/thor/base.rb:584:in `start'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/lib/bundler/cli.rb:29:in `start'
	from ../.gem/ruby/3.3.9/gems/bundler-2.6.2/exe/bundle:28:in `block in <top (required)>'
	... 4 levels...
irb(main):003> REXML::Document.new(nil)
=> <UNDEFINED/>

Finally, should this print a warning in a patch release (3.4.x) and raise in the next minor release (3.5.x)? I noticed this because rexml hasn't validated this forever so there's possibly other code that was creating a document and adding elements later, see: https://github.com/search?q=REXML%3A%3ADocument.new%28%22%22%29&type=code

Thanks!

@kou
Copy link
Member

kou commented Sep 8, 2025

Thanks for sharing the information.

How about treating REXML::Document.new("") as REXML::Document.new(nil) for keeping backward compatibility?
(Note that REXML::Document.new(" ") (not an empty input) still rejects the input.)

@jrafanie
Copy link

jrafanie commented Sep 8, 2025

How about treating REXML::Document.new("") as REXML::Document.new(nil) for keeping backward compatibility?

If REXML::Document.new("") is invalid because we have no root element, I would think that REXML::Document.new(nil) is also invalid. Is that correct?

If this is true:
For 3.4.x, I would suggest we print a deprecation warning.
For 3.5.x, I would say we can raise an error for both empty string and nil arguments.

Note, I noticed this because I had some old code that was creating a new rexml document with an empty string and then adding elements afterwards. Maybe it's incorrect but it's always worked. I'd like to fix our code going forward to match what the correct behavior should be for the future.

(Note that REXML::Document.new(" ") (not an empty input) still rejects the input.)

Yes, that makes sense to me.

Thanks again for your quick feedback!

@kou
Copy link
Member

kou commented Sep 8, 2025

If REXML::Document.new("") is invalid because we have no root element, I would think that REXML::Document.new(nil) is also invalid. Is that correct?

No. It's not correct.

REXML::Document.new(nil) (== REXML::Document.new) is the correct way to create an empty document:

# When no arguments are given,
# returns an empty document:
#
# d = REXML::Document.new
# d.to_s # => ""

REXML::Document.new("") is an application bug:

# When argument +string+ is given, it must be a string
# containing a valid XML document:

If this is true: For 3.4.x, I would suggest we print a deprecation warning. For 3.5.x, I would say we can raise an error for both empty string and nil arguments.

In general, I'm negative about runtime deprecation warnings. They may be shown in application/indirect user environment. If they are shown in direct developer's environment, the developer can fix their code. But application/indirect users can't. These deprecation warnings are just annoying...

@jrafanie
Copy link

jrafanie commented Sep 9, 2025

@kou Thank you for the clarification and details on the difference between .new(nil) vs .new(""). It's quite helpful.

In general, I'm negative about runtime deprecation warnings. They may be shown in application/indirect user environment. If they are shown in direct developer's environment, the developer can fix their code. But application/indirect users can't. These deprecation warnings are just annoying...

I'm in agreement there but am concerned about breaking the interface of .new in a patch release, 3.4.3. I'm not expecting breakages upgrading from 3.4.2 to 3.4.3. That was surprising.

My suggestion would be your suggestion:

How about treating REXML::Document.new("") as REXML::Document.new(nil) for keeping backward compatibility?

In 3.4.4:

I suggest .new("") now prints a deprecation warning and forwards to .new(nil). This prevents breaking the previous behavior that existed for several years.

In 3.5.x, you can drop the deprecation warning on .new("") and raise the exception instead.

Thanks again for the feedback and understanding my concerns about the breakages I'm seeing.

@jrafanie
Copy link

jrafanie commented Sep 9, 2025

Also, please let me know if you'd prefer I open an issue with the breakage for 3.4.3. I wasn't sure if it was intentional to break from 3.4.2 to 3.4.3 if you used .new("")

@Fryguy
Copy link

Fryguy commented Sep 9, 2025

How about treating REXML::Document.new("") as REXML::Document.new(nil) for keeping backward compatibility?

This seems very reasonable to me, and probably would alleviate @jrafanie's concern as the patch release for it would put this back to not failing.

@kou
Copy link
Member

kou commented Sep 9, 2025

please let me know if you'd prefer I open an issue with the breakage for 3.4.3

Yes please for better visibility.

@kou
Copy link
Member

kou commented Sep 10, 2025

Opened: #296

naitoh added a commit to naitoh/rexml that referenced this pull request Sep 22, 2025
kou pushed a commit that referenced this pull request Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REXML::Document.new accepts no root input
4 participants