-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Cranelift: add debug tag infrastructure. #11768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
e2771d9
to
d2d8bb4
Compare
This PR adds *debug tags*, a kind of metadata that can attach to CLIF instructions and be lowered to VCode instructions and as metadata on the produced compiled code. It also adds opaque descriptor blobs carried with stackslots. Together, these two features allow decorating IR with first-class debug instrumentation that is properly preserved by the compiler, including across optimizations and inlining. (Wasmtime's use of these features will come in followup PRs.) The key idea of a "debug tag" is to allow the Cranelift embedder to express whatever information it needs to, in a format that is opaque to Cranelift itself, except for the parts that need translation during lowering. In particular, the `DebugTag::StackSlot` variant gets translated to a physical offset into the stackframe in the compiled metadata output. So, for example, the embedder can emit a tag referring to a stackslot, and another describing an offset in that stackslot. The debug tags exist as a *sequence* on any given instruction; the meaning of the sequence is known only to the embedder, *except* that during inlining, the tags for the inlining call instruction are prepended to the tags of inlined instructions. In this way, a canonical use-case of tags as describing original source-language frames can preserve the source-language view even when multiple functions are inlined into one. The descriptor on a stackslot may look a little odd at first, but its purpose is to allow serializing some description of stackslot-contained runtime user-program data, in a way that is firmly attached to the stackslot. In particular, in the face of inlining, this descriptor is copied into the inlining (parent) function from the inlined function when the stackslot entity is copied; no other metadata outside Cranelift needs to track the identity of stackslots and know about that motion. This fits nicely with the ability of tags to refer to stackslots; together, the embedder can annotate instructions as having certain state in stackslots, and describe the format of that state per stackslot. This infrastructure is tested with some compile-tests now; testing of the interpretation of the metadata output will come with end-to-end debug instrumentation tests in a followup PR.
d2d8bb4
to
9409a84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some assorted thoughts below, but I'd second @bjorn3's question a bit too. Overall I'm at least not personally quite following why we need to plumb so much through Cranelift vs reading the result of Cranelift's compilation. I mentioned below that it felt like we should be using a newtype-u32 rather than Vec<DebugTag>
in places, but additionally if we're already plumbing around StackSlot
-to-offset as the result of compilation that seems like it's a big part of what we want already? The only other major missing piece would be Inst
-to-offset-in-output which looks related to MachDebugTagPos
logic here for example.
I may not be the best reviewer for this though perhaps. Or perhaps I should read the next PR to see how this is all used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Chris!
Also from the discussion in the cranelift meeting today:
- we should re-add the
sequence_point
CLIF instruction - we should assert that the only instructions with debug tags are
sequence_point
intstructions
To recap a bit what we discussed in today's Cranelift meeting for any other readers (or posterity):
|
Also:
|
…quence points or calls.
I've re-added sequence points in the latest push, and I'll remove the built-in stackslot descriptor blobs (in favor of a |
…lty in not-used case.
All comments are addressed now, I think -- thanks! |
Alright, actually, now that I'm working through rebasing the Wasmtime half on this, I'm not convinced that removing the "blob of bytes descriptor" from the stackslot is actually the right move:
My wip branch is here for anyone curious but I didn't finish wiring up the part that actually queries/uses the descriptors; stopped when I realized what I would have to do in the I pretty strongly feel that the way it worked before was simpler, more self-contained, and more maintainable; this approach is pushed in a bad direction by high-level feedback that may not have foreseen all the implications. Also, handling inlining is becoming the crux of the complexity, but only because of the review feedback (a sort of self-fulfilling prophecy) -- the original works fine. @alexcrichton @fitzgen let me know what you think! I'll revert the last commit if you're OK with the above. |
(To add a little more reasoning by analogy: everything else that goes into metadata in the compiled artifact comes directly from Cranelift, e.g. exception tables, source locations, unwind info, and the like; even the |
At a high level I'm not trying to make life more difficult here, but rather I'm trying to maintain a separation between Wasmtime and Cranelift to avoid having the "wasm mode" in Cranelift with a bunch of custom features just for Wasmtime. For example you say:
but following this to the extreme means that we throw away the ability to embed Cranelift anywhere else and fuse it to Wasmtime specifically. That will objectively simplify Cranelift and make it more self-contained for Wasmtime's use case for example. Despite this though I don't think it makes sense to do, so this is part of a gray area of sorts where I thinks it's worth exploring solutions that avoid tacking on Wasmtime-specific features into Cranelift. The reason I feel this is Wasmtime-specific is how this relates to other Cranelift features feels different. For example This is my high-level motivation at least which I wanted to explain. I'm not intentionally trying to make this harder to work with but basically trying to maintain a separation between Wasmtime and Cranelift. This is of course, as with everything, a balance to figure out what works and what doesn't. I don't want to over-rotate on "they must be separate" but I also don't want to over-rotate on "this works let's just ship it". For this problem specifically, I could very well be missing details, but I was under the impression that the idea was to tack a 64-bit integer/value/etc onto a
Along the lines of my initial comment here, I want to reiterate that I'm not trying to make things horribly complex. If they are then they are and we should go a different route. For now though I'm not sure myself at least why inlining would come into play given that there's a stack-slot-per-function that's significant so the descriptor is the
In some sense we've long since given up on this level of efficiency. There's a ton of inefficiencies in the "don't clone this" or "don't move this" category related to Cranelift compilation and it's something we've never really focused on. I don't mean to claim "if we remove this Basically efficiency isn't a huge point for me, it's just something I noticed. The higher-level point to me is that it feels like a wart in Cranelift's design where everywhere else embedder-related things are integers but here it happens to be a |
So, a few things:
That's fine until we get to the point of looking up the metadata. Note that the metadata (the stackslot format) is an artifact of the compilation of the original function; it's not computed separately or just part of the source or whatnot. So we need a merge step. The simple answer here is to let each function emit only its own original source-level-function metadata as it's appended to the text section. But there are at least two problems with this:
This is part of what I mean by new complexity -- the need to keep things in sync, and rely on the invariant that someone else will later emit that metadata, and the need to compute the mapping that the layer of indirection introduces, makes me a bit squeamish. In contrast, IR that is fully self-describing gives us solutions to all of the above for free. |
I don't think that's true -- descriptors were copied as part of the compilation pipeline (same as serialized data would be in a hypothetical refactor), but callsites only need to create tags to refer to them. |
This could be derived from the |
Unfortunately not: |
I've pushed this a little further still and the next impedance mismatch is that frame slots can be at different FP offsets when inlined; so one can build a global table from FuncKey to descriptor, but part of the descriptor varies depending on the use-site, so one needs another layer of descriptors to reference those descriptors and add an FP offset. Lest this be accounted under "complexity of inlining", I'll note again that everything just works when the IR is self-describing; trying to spread the concept of a descriptor across two layers, when the thing it is attached to is a builtin IR entity that is freely manipulated by the compiler, is the source of the problems... |
Well at this point it seems like the way forward is
Personally I don't know how to grapple with these two things. On one hand it's ok for external symbols to be two 32-bit integers when they're more naturally represented in many other places as strings. This then works well enough during inlining that there's no reason to change this. For stack slots, though, you're saying it makes more sense to use a bag-of-bytes descriptor despite symbols, for example, intentionally (I assume?) not using that. I also don't understand why a bag-of-bytes or integer-on-stack-slot is any different w.r.t. inlining, both strategies simply copy whatever's there to somewhere else. The only difference is the interpretation later on and then it's either the bytes are there or a lookup table is needed. Now I agree that looking up the metadata by key is slightly more tricky. I don't understand why it's so tricky it's not worth pursuing, though. The merge step already exists in the form of linking everything together and we already have to have mapping logic related to handling relocations. This is certainly a tricky step, but I don't understand why building a map at the beginning would make it any more meaningfully trickier.
I don't fully grasp the increase in complexity here. My assumption is that a descriptor for a stack slot is more-or-less a function of a Overall I agree it's different than what's already here today, but I again don't fully understand the rationale about the drastic increase in complexity.
I meant this line from the previous version of this PR. |
Alright, I managed to cobble a solution together (update after rebase on the Wasmtime side here). Apologies for the pushback -- it's a subjective design thing and I'm still not super-happy with the coupling, versus fully self-describing IR, but at least this is possible. It did require some small changes to the signatures in the compiler traits and you can see the extra pass over functions but at least it's enabled only when the tunable is set. (I think that by redesigning the serialization format a bit to separate out the slot-offset-from-FP from the offset-within-slot, I can share descriptors between inlined copies of functions rather than emitting them separately for each function that uses them; I'll poke at that on the Wasmtime side.) |
/// the identity of stackslots to be carried along with the IR | ||
/// entities. This opaque `StackSlotKey` allows the embedder to do | ||
/// so. | ||
pub key: Option<StackSlotKey>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use PackedOption?
This PR adds debug tags, a kind of metadata that can attach to CLIF instructions and be lowered to VCode instructions and as metadata on the produced compiled code. It also adds opaque descriptor blobs carried with stackslots. Together, these two features allow decorating IR with first-class debug instrumentation that is properly preserved by the compiler, including across optimizations and inlining. (Wasmtime's use of these features will come in followup PRs.)
The key idea of a "debug tag" is to allow the Cranelift embedder to express whatever information it needs to, in a format that is opaque to Cranelift itself, except for the parts that need translation during lowering. In particular, the
DebugTag::StackSlot
variant gets translated to a physical offset into the stackframe in the compiled metadata output. So, for example, the embedder can emit a tag referring to a stackslot, and another describing an offset in that stackslot.The debug tags exist as a sequence on any given instruction; the meaning of the sequence is known only to the embedder, except that during inlining, the tags for the inlining call instruction are prepended to the tags of inlined instructions. In this way, a canonical use-case of tags as describing original source-language frames can preserve the source-language view even when multiple functions are inlined into one.
The descriptor on a stackslot may look a little odd at first, but its purpose is to allow serializing some description of stackslot-contained runtime user-program data, in a way that is firmly attached to the stackslot. In particular, in the face of inlining, this descriptor is copied into the inlining (parent) function from the inlined function when the stackslot entity is copied; no other metadata outside Cranelift needs to track the identity of stackslots and know about that motion. This fits nicely with the ability of tags to refer to stackslots; together, the embedder can annotate instructions as having certain state in stackslots, and describe the format of that state per stackslot.
This infrastructure is tested with some compile-tests now; testing of the interpretation of the metadata output will come with end-to-end debug instrumentation tests in a followup PR.