Skip to content

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Oct 1, 2025

This PR adds debug tags, a kind of metadata that can attach to CLIF instructions and be lowered to VCode instructions and as metadata on the produced compiled code. It also adds opaque descriptor blobs carried with stackslots. Together, these two features allow decorating IR with first-class debug instrumentation that is properly preserved by the compiler, including across optimizations and inlining. (Wasmtime's use of these features will come in followup PRs.)

The key idea of a "debug tag" is to allow the Cranelift embedder to express whatever information it needs to, in a format that is opaque to Cranelift itself, except for the parts that need translation during lowering. In particular, the DebugTag::StackSlot variant gets translated to a physical offset into the stackframe in the compiled metadata output. So, for example, the embedder can emit a tag referring to a stackslot, and another describing an offset in that stackslot.

The debug tags exist as a sequence on any given instruction; the meaning of the sequence is known only to the embedder, except that during inlining, the tags for the inlining call instruction are prepended to the tags of inlined instructions. In this way, a canonical use-case of tags as describing original source-language frames can preserve the source-language view even when multiple functions are inlined into one.

The descriptor on a stackslot may look a little odd at first, but its purpose is to allow serializing some description of stackslot-contained runtime user-program data, in a way that is firmly attached to the stackslot. In particular, in the face of inlining, this descriptor is copied into the inlining (parent) function from the inlined function when the stackslot entity is copied; no other metadata outside Cranelift needs to track the identity of stackslots and know about that motion. This fits nicely with the ability of tags to refer to stackslots; together, the embedder can annotate instructions as having certain state in stackslots, and describe the format of that state per stackslot.

This infrastructure is tested with some compile-tests now; testing of the interpretation of the metadata output will come with end-to-end debug instrumentation tests in a followup PR.

@cfallin cfallin requested a review from a team as a code owner October 1, 2025 03:36
@cfallin cfallin requested review from alexcrichton and removed request for a team October 1, 2025 03:36
@cfallin cfallin force-pushed the cranelift-debug-tags branch 4 times, most recently from e2771d9 to d2d8bb4 Compare October 1, 2025 04:05
This PR adds *debug tags*, a kind of metadata that can attach to CLIF
instructions and be lowered to VCode instructions and as metadata on
the produced compiled code. It also adds opaque descriptor blobs
carried with stackslots. Together, these two features allow decorating
IR with first-class debug instrumentation that is properly preserved
by the compiler, including across optimizations and
inlining. (Wasmtime's use of these features will come in followup
PRs.)

The key idea of a "debug tag" is to allow the Cranelift embedder to
express whatever information it needs to, in a format that is opaque
to Cranelift itself, except for the parts that need translation during
lowering. In particular, the `DebugTag::StackSlot` variant gets
translated to a physical offset into the stackframe in the compiled
metadata output. So, for example, the embedder can emit a tag
referring to a stackslot, and another describing an offset in that
stackslot.

The debug tags exist as a *sequence* on any given instruction; the
meaning of the sequence is known only to the embedder, *except* that
during inlining, the tags for the inlining call instruction are
prepended to the tags of inlined instructions. In this way, a
canonical use-case of tags as describing original source-language
frames can preserve the source-language view even when multiple
functions are inlined into one.

The descriptor on a stackslot may look a little odd at first, but its
purpose is to allow serializing some description of
stackslot-contained runtime user-program data, in a way that is firmly
attached to the stackslot. In particular, in the face of inlining,
this descriptor is copied into the inlining (parent) function from the
inlined function when the stackslot entity is copied; no other
metadata outside Cranelift needs to track the identity of stackslots
and know about that motion. This fits nicely with the ability of tags
to refer to stackslots; together, the embedder can annotate
instructions as having certain state in stackslots, and describe the
format of that state per stackslot.

This infrastructure is tested with some compile-tests now;
testing of the interpretation of the metadata output will come with
end-to-end debug instrumentation tests in a followup PR.
@cfallin cfallin force-pushed the cranelift-debug-tags branch from d2d8bb4 to 9409a84 Compare October 1, 2025 04:17
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen labels Oct 1, 2025
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some assorted thoughts below, but I'd second @bjorn3's question a bit too. Overall I'm at least not personally quite following why we need to plumb so much through Cranelift vs reading the result of Cranelift's compilation. I mentioned below that it felt like we should be using a newtype-u32 rather than Vec<DebugTag> in places, but additionally if we're already plumbing around StackSlot-to-offset as the result of compilation that seems like it's a big part of what we want already? The only other major missing piece would be Inst-to-offset-in-output which looks related to MachDebugTagPos logic here for example.

I may not be the best reviewer for this though perhaps. Or perhaps I should read the next PR to see how this is all used.

@alexcrichton alexcrichton requested review from alexcrichton and fitzgen and removed request for alexcrichton October 1, 2025 16:49
Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris!

Also from the discussion in the cranelift meeting today:

  • we should re-add the sequence_point CLIF instruction
  • we should assert that the only instructions with debug tags are sequence_point intstructions

@cfallin
Copy link
Member Author

cfallin commented Oct 1, 2025

I have some assorted thoughts below, but I'd second @bjorn3's question a bit too. Overall I'm at least not personally quite following why we need to plumb so much through Cranelift vs reading the result of Cranelift's compilation. I mentioned below that it felt like we should be using a newtype-u32 rather than Vec<DebugTag> in places, but additionally if we're already plumbing around StackSlot-to-offset as the result of compilation that seems like it's a big part of what we want already? The only other major missing piece would be Inst-to-offset-in-output which looks related to MachDebugTagPos logic here for example.

To recap a bit what we discussed in today's Cranelift meeting for any other readers (or posterity):

  • The design here is more or less necessitated by supporting preservation across inlining (and we fully preserve debugging correctness across other optimizations too, so handling inlining as well doesn't seem out of place; moreover, excluding it seemed to me to be an unnecessary limit). In particular, inlining concatenates tags from the inlining call instruction to those of the inlined instructions (see this doc comment); this is what means we need a list of tags, rather than a single tag, and the need for these tags to stand alone (rather than be an indirect index into some other data) is forced by inlining as well.
  • The correctness across optimizations is ensured by placing tags on side-effectful instructions, which don't get removed. In the sibling PR for Wasmtime (Wasmtime: implement debug instrumentation and basic host API to examine runtime state. #11769) I add them on all trapping instructions, and on callsites; an earlier iteration of this work put them on sequence-point instructions that acted as explicit immovable markers instead. I explained the appeal of that latter design for conceptual clarity in the meeting and we agreed to go back to sequence-point instructions (calls will also need tags, so the inlining mechanism works).

@cfallin
Copy link
Member Author

cfallin commented Oct 1, 2025

Also:

  • Attaching tags to instruction indices and then recovering information at the output is error-prone for a number of reasons, but inlining again is what forces us to have tags as a first-class object that flow through the IR rather than live alongside it.
  • Cranelift needs to handle the translation of stackslot references from the IR level to the frame-layout level in the compilation metadata, so this part of the mechanism also needs to be first-class in the IR, and can't be indirected to the outside (wasmtime-cranelift or other embedder).

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

I've re-added sequence points in the latest push, and I'll remove the built-in stackslot descriptor blobs (in favor of a u64 that in Wasmtime's case can hold a FuncKey) next.

@github-actions github-actions bot added the cranelift:meta Everything related to the meta-language. label Oct 2, 2025
@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

All comments are addressed now, I think -- thanks!

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

Alright, actually, now that I'm working through rebasing the Wasmtime half on this, I'm not convinced that removing the "blob of bytes descriptor" from the stackslot is actually the right move:

  • It's horribly complex to keep the side table and plumb it to the right places on the wasmtime-cranelift side, in particular around inlining. Inlining "just works" when the StackSlotData is self-describing. Now we need to carry the side-table on the wasmtime-cranelift side, let cranelift-codegen query callees and decide to inline via the Inline trait, but insert a side-effect somehow enacted by that trait's implementation to move descriptors behind the scenes back on the wasmtime-cranelift side. That seems very error-prone in comparison. (Note that because we implement a per-function compiler, there is no place for a single module-global side-table to exist and be updated by the parallel compilation.)
  • It's not any more efficient (we still are copying data from FuncEnvironment into the side-table, then between functions' side-tables when inlining). IIRC this was the original concern -- "too Vec-heavy" in some sense; but the descriptors still exist, just at arm's length and in a remote place that has to remain synchronized.

My wip branch is here for anyone curious but I didn't finish wiring up the part that actually queries/uses the descriptors; stopped when I realized what I would have to do in the InliningCompiler.

I pretty strongly feel that the way it worked before was simpler, more self-contained, and more maintainable; this approach is pushed in a bad direction by high-level feedback that may not have foreseen all the implications. Also, handling inlining is becoming the crux of the complexity, but only because of the review feedback (a sort of self-fulfilling prophecy) -- the original works fine.

@alexcrichton @fitzgen let me know what you think! I'll revert the last commit if you're OK with the above.

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

(To add a little more reasoning by analogy: everything else that goes into metadata in the compiled artifact comes directly from Cranelift, e.g. exception tables, source locations, unwind info, and the like; even the name_map, which is carried separately in CompiledFunction, comes directly from the CLIF, and the CLIF is updated by inlining and any other transforms. Following the precedent here, we should keep everything that describes what we'll emit in the IR itself; doing otherwise is constructing a parallel universe of info on the Wasmtime side that has to be maintained and updated carefully whenever we do interprocedural opts. The spirit of our RFC is to work with the compiler, not against it, and let it manage and translate our debugging instrumentation + metadata like any other part of the program.)

@alexcrichton
Copy link
Member

At a high level I'm not trying to make life more difficult here, but rather I'm trying to maintain a separation between Wasmtime and Cranelift to avoid having the "wasm mode" in Cranelift with a bunch of custom features just for Wasmtime. For example you say:

simpler, more self-contained, and more maintainable

but following this to the extreme means that we throw away the ability to embed Cranelift anywhere else and fuse it to Wasmtime specifically. That will objectively simplify Cranelift and make it more self-contained for Wasmtime's use case for example. Despite this though I don't think it makes sense to do, so this is part of a gray area of sorts where I thinks it's worth exploring solutions that avoid tacking on Wasmtime-specific features into Cranelift.

The reason I feel this is Wasmtime-specific is how this relates to other Cranelift features feels different. For example UserExternalName is heavily used by Wasmtime and is just an opaque 64-bit integer effectively from Cranelift's point of view. It would objectively be simpler if we didn't have to deal with this and just moved FuncKey directly into Cranelift, but I view it as valuable that we don't do that. Having a tight API boundary is also nice for future compatibility/extensibility in that it makes working on Cranelift, outside the context of Wasmtime, easier.

This is my high-level motivation at least which I wanted to explain. I'm not intentionally trying to make this harder to work with but basically trying to maintain a separation between Wasmtime and Cranelift. This is of course, as with everything, a balance to figure out what works and what doesn't. I don't want to over-rotate on "they must be separate" but I also don't want to over-rotate on "this works let's just ship it".


For this problem specifically, I could very well be missing details, but I was under the impression that the idea was to tack a 64-bit integer/value/etc onto a Descriptor. That'd be the serialization of FuncKey which means, at the end of compilation, we know what origin function a stack slot corresponds to. What I was then thinking would happen is that CompiledFunction in Cranelift would retain metadata describing the stack slot (locals, etc). During the linking phase we'd need to build the metadata section in the ELF, and we'd use Cranelift's MachBuffer output to find what stack slots a function has. We'd then use this FuncKey to look up in the map of all functions to go to the metadata that the function wants, and then the section would be emitted.

It's horribly complex to keep the side table and plumb it to the right places on the wasmtime-cranelift side

Along the lines of my initial comment here, I want to reiterate that I'm not trying to make things horribly complex. If they are then they are and we should go a different route.

For now though I'm not sure myself at least why inlining would come into play given that there's a stack-slot-per-function that's significant so the descriptor is the FuncKey for that function. I'm also not sure why carrying around the side table would be difficult if the per-function metadata were placed into a CompiledFunction and then looked up during the final linking phase.

It's not any more efficient

In some sense we've long since given up on this level of efficiency. There's a ton of inefficiencies in the "don't clone this" or "don't move this" category related to Cranelift compilation and it's something we've never really focused on. I don't mean to claim "if we remove this Vec we're retain our lightning-fast compilation" but moreso "if we can I think it'd be good to avoid taking this debt on". For example in the old iteration it looked like whenever a Callee is made it would clone the descriptor field of stack slots. That naively seems like it would clone per-call within a function. Now I have no idea how big the metadata is and it's always solvable by also slapping a few Arcs around.

Basically efficiency isn't a huge point for me, it's just something I noticed. The higher-level point to me is that it feels like a wart in Cranelift's design where everywhere else embedder-related things are integers but here it happens to be a Vec<u8> because that happens to be what we want for this exact use case in Wasmtime

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

So, a few things:

  • I don't think I'm arguing to make Cranelift Wasmtime-specific at all. I agree that's a bad end-goal; I chose Vec<u8> for the stack-slot descriptor specifically because it was an opaque representation.
  • I do think this case is a bit different. Your mention of the UserExternalName example is illustrative: CLIF itself contains a list of all external names referenced; this list is legible to and kept updated by the inliner. That's a good example of "IR contains all relevant info", and we're assured that we'll get it right. In contrast, here, externalizing the information about the frame slot causes some really subtle and surprising possible footguns. To dig into your sketch a bit:

For this problem specifically, I could very well be missing details, but I was under the impression that the idea was to tack a 64-bit integer/value/etc onto a Descriptor. That'd be the serialization of FuncKey which means, at the end of compilation, we know what origin function a stack slot corresponds to. What I was then thinking would happen is that CompiledFunction in Cranelift would retain metadata describing the stack slot (locals, etc). During the linking phase we'd need to build the metadata section in the ELF, and we'd use Cranelift's MachBuffer output to find what stack slots a function has. We'd then use this FuncKey to look up in the map of all functions to go to the metadata that the function wants, and then the section would be emitted.

That's fine until we get to the point of looking up the metadata. Note that the metadata (the stackslot format) is an artifact of the compilation of the original function; it's not computed separately or just part of the source or whatnot. So we need a merge step. The simple answer here is to let each function emit only its own original source-level-function metadata as it's appended to the text section.

But there are at least two problems with this:

  • What happens if we do function-level DCE? (We can't; or we need a special "we still need your metadata as a donation to the cause, but you don't get to go in .text" mode.)
  • We actually need another pre-pass to create the mapping from FuncKeys to frame-table descriptor indices. (In today's setup all descriptors are local to the function so we just lazily create them as needed during translation of progpoints.)
    • And this refactors the whole trait setup in Compiler -- right now it's explicitly one pass with an "append to text section" method.

This is part of what I mean by new complexity -- the need to keep things in sync, and rely on the invariant that someone else will later emit that metadata, and the need to compute the mapping that the layer of indirection introduces, makes me a bit squeamish. In contrast, IR that is fully self-describing gives us solutions to all of the above for free.

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

For example in the old iteration it looked like whenever a Callee is made it would clone the descriptor field of stack slots. That naively seems like it would clone per-call within a function.

I don't think that's true -- descriptors were copied as part of the compilation pipeline (same as serialized data would be in a hypothetical refactor), but callsites only need to create tags to refer to them.

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 2, 2025

We actually need another pre-pass to create the mapping from FuncKeys to frame-table descriptor indices.

This could be derived from the FuncKey the same way ExternalName is, can't it?

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

We actually need another pre-pass to create the mapping from FuncKeys to frame-table descriptor indices.

This could be derived from the FuncKey the same way ExternalName is, can't it?

Unfortunately not: ExternalName is built by directly embedding the FuncKey (here), which is a bitpacked u64 not necessarily in a contiguous sequence; whereas we need indices into a descriptor table in the frame-table section that are contiguous.

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

I've pushed this a little further still and the next impedance mismatch is that frame slots can be at different FP offsets when inlined; so one can build a global table from FuncKey to descriptor, but part of the descriptor varies depending on the use-site, so one needs another layer of descriptors to reference those descriptors and add an FP offset.

Lest this be accounted under "complexity of inlining", I'll note again that everything just works when the IR is self-describing; trying to spread the concept of a descriptor across two layers, when the thing it is attached to is a builtin IR entity that is freely manipulated by the compiler, is the source of the problems...

@alexcrichton
Copy link
Member

Well at this point it seems like the way forward is Vec<u8> as a descriptor entry in stack slots. That being said I still feel like it's an unresolved point to me. It's not the end of the world to paper over and just see how it works out, so I don't feel the need to block anything. That being said I'll still put my thoughts down:

... chose Vec for the stack-slot descriptor specifically because it was an opaque representation.

... Your mention of the UserExternalName example is illustrative: ...

Personally I don't know how to grapple with these two things. On one hand it's ok for external symbols to be two 32-bit integers when they're more naturally represented in many other places as strings. This then works well enough during inlining that there's no reason to change this. For stack slots, though, you're saying it makes more sense to use a bag-of-bytes descriptor despite symbols, for example, intentionally (I assume?) not using that. I also don't understand why a bag-of-bytes or integer-on-stack-slot is any different w.r.t. inlining, both strategies simply copy whatever's there to somewhere else. The only difference is the interpretation later on and then it's either the bytes are there or a lookup table is needed.

Now I agree that looking up the metadata by key is slightly more tricky. I don't understand why it's so tricky it's not worth pursuing, though. The merge step already exists in the form of linking everything together and we already have to have mapping logic related to handling relocations. This is certainly a tricky step, but I don't understand why building a map at the beginning would make it any more meaningfully trickier.

This is part of what I mean by new complexity -- the need to keep things in sync, and rely on the invariant that someone else will later emit that metadata, and the need to compute the mapping that the layer of indirection introduces,

I don't fully grasp the increase in complexity here. My assumption is that a descriptor for a stack slot is more-or-less a function of a FuncKey. Given that there's not really anything to keep in sync other than something needs to produce a map of FuncKey to descriptor and each function compilation would produce its descriptor as part of the CompiledFunction. While DCE seems like something to keep in mind that doesn't seem like an insurmountable amount of complexity (e.g. the function is passed in with a flag saying "don't put this in the text section").

Overall I agree it's different than what's already here today, but I again don't fully understand the rationale about the drastic increase in complexity.

In contrast, IR that is fully self-describing gives us solutions to all of the above for free.

...

For example in the old iteration it looked like whenever a Callee is made it would clone the descriptor field of stack slots.

I don't think that's true ...

I meant this line from the previous version of this PR.

@cfallin
Copy link
Member Author

cfallin commented Oct 2, 2025

Alright, I managed to cobble a solution together (update after rebase on the Wasmtime side here). Apologies for the pushback -- it's a subjective design thing and I'm still not super-happy with the coupling, versus fully self-describing IR, but at least this is possible. It did require some small changes to the signatures in the compiler traits and you can see the extra pass over functions but at least it's enabled only when the tunable is set.

(I think that by redesigning the serialization format a bit to separate out the slot-offset-from-FP from the offset-within-slot, I can share descriptors between inlined copies of functions rather than emitting them separately for each function that uses them; I'll poke at that on the Wasmtime side.)

@alexcrichton
Copy link
Member

This all looks reasonable to me, thanks! A skim over #11769 all looks pretty reasonable to me too. I'll defer to @fitzgen for final approval on this though

/// the identity of stackslots to be carried along with the IR
/// entities. This opaque `StackSlotKey` allows the embedder to do
/// so.
pub key: Option<StackSlotKey>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use PackedOption?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants