[libc++] Mark __{emplace,push}_back_slow_path as noinline #94379
@llvm/pr-subscribers-libcxx

Author: Nikolas Klauser (philnik777)

Changes: These are almost certainly intended to not be inlined. This significantly reduces code size when push_back and emplace_back are used heavily.

Fixes #94360

Full diff: https://github.com/llvm/llvm-project/pull/94379.diff

1 file affected:
diff --git a/libcxx/include/vector b/libcxx/include/vector
index cbfc2cefa1fd9..c542b3e70cdd1 100644
--- a/libcxx/include/vector
+++ b/libcxx/include/vector
@@ -844,10 +844,11 @@ private:
}
template <class _Up>
- _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __push_back_slow_path(_Up&& __x);
+ _LIBCPP_NOINLINE _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __push_back_slow_path(_Up&& __x);
template <class... _Args>
- _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __emplace_back_slow_path(_Args&&... __args);
+ _LIBCPP_NOINLINE _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer
+ __emplace_back_slow_path(_Args&&... __args);
// The following functions are no-ops outside of AddressSanitizer mode.
// We call annotations for every allocator, unless explicitly disabled.
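For context: a noinline marker like the one added above usually expands to a compiler attribute along the following lines. This is an illustrative macro with a made-up name, not the actual _LIBCPP_NOINLINE definition from libc++.

// Hypothetical stand-in for a noinline annotation macro (GCC/Clang attribute syntax).
#if defined(__clang__) || defined(__GNUC__)
#  define MY_NOINLINE __attribute__((__noinline__))
#else
#  define MY_NOINLINE
#endif

// The attribute forbids the inliner from expanding the function into its callers,
// keeping each call site small at the cost of an out-of-line call.
MY_NOINLINE int slow_path(int x) { return 2 * x; }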
This unfortunately will hurt performance greatly (think 5x) on a simple push_back loop.
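A representative sketch of the kind of loop being discussed (illustrative only, not the commenter's original code; it assumes the slow path is forced out of line as in this patch):

#include <vector>

// With __push_back_slow_path outlined, the address of `out` escapes into the call,
// so the compiler may have to store and reload the vector's end pointer on every
// iteration instead of keeping it in a register.
void copy_all(const int* src, int n, std::vector<int>& out) {
    for (int i = 0; i < n; ++i)
        out.push_back(src[i]);
}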
Furthermore, I don't understand the philosophy of push_back_slow_path in this case. The act of constructing a new element at the end of the vector happens regardless of whether the buffer needs to grow or not, so why duplicate the construction code? You just want to grow the buffer, which is the same operation for emplace_back and push_back. This way of tail-calling slow paths only makes sense if the function is outlined in the first place, where it potentially avoids stack frames. But push_back is meant to be inlined.
My tendency here is to trust the compiler. The example output on the code presented in the bug may seem silly, but why wouldn't the compiler inline there? It's the only function in the translation unit, and it contains a couple of lines. I wonder under what circumstances the compiler chooses not to inline the code. However, when we prevent the compiler from inlining, we prevent it from optimizing away dead code. Before your change, this silly example is optimized away:

void foo() {
    std::vector<int> v;
    v.push_back(42);
}

// foo():                                # @foo()
//         ret

Now it generates:

foo():                                # @foo()
push rbx
sub rsp, 48
xorps xmm0, xmm0
movaps xmmword ptr [rsp + 16], xmm0
mov qword ptr [rsp + 32], 0
mov dword ptr [rsp + 12], 42
lea rdi, [rsp + 16]
lea rsi, [rsp + 12]
call int* std::__2::vector<int, std::__2::allocator<int> >::__push_back_slow_path<int>(int&&)
mov rdi, qword ptr [rsp + 16]
test rdi, rdi
je .LBB0_3
mov qword ptr [rsp + 24], rdi
call operator delete(void*)@PLT
.LBB0_3:
add rsp, 48
pop rbx
ret
mov rbx, rax
mov rdi, qword ptr [rsp + 16]
test rdi, rdi
je .LBB0_6
mov qword ptr [rsp + 24], rdi
call operator delete(void*)@PLT
.LBB0_6:
mov rdi, rbx
call _Unwind_Resume@PLT
.L.str:
.asciz "vector"

Use your discretion here. Forbidding the optimizer from inlining the slow path may be a good choice.
My point is that the code is wrongheaded. Either the compiler inlines everything and causes enormous bloat, or it decides not to inline and performance in useful cases drops like a rock. The code should be written so that the performance of the fast path is superior independent of the inline/outline decision.
Yes, due to the special nature of the new/delete functions the compiler could do this with full inlining. I don't think the side-effect-free no-ops are that important. I think for stdlib cases, outlined functions could be annotated with some kind of alloc/realloc/free attributes so that similar transformations could still happen, but in the end that is not all that important for actual code.
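A limited form of such an annotation already exists; the following sketch uses GCC/Clang's malloc function attribute as an illustration, with made-up function names.

// The attribute tells the optimizer that the returned pointer does not alias any
// other live pointer, which allows some dead-allocation reasoning even when the
// allocator itself stays out of line.
__attribute__((malloc)) void* arena_alloc(unsigned long n);
void arena_free(void* p);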
I agree, it would be nice if we could have it all. Could you talk about the concrete problem that led you here? This conversation is lacking grounding. It's easy to point at the "bloat", but why does "bloat" matter concretely to you?
I think code like this appears in less contrived forms more often than one might expect. The point of the example is to show how little the optimizer can do with such trivial code. If it can't optimize around that, then I expect it to do worse elsewhere. Here's another, less contrived example that compiles away:

static std::vector<int> foo(std::vector<int> const& LHS, std::vector<int> const& RHS)
{
// Yes, this was originally a bug, but for some reason it causes the code to optimize better?
std::vector<int> v(3);
for (int i=0; i < 3; ++i)
v.push_back(LHS[i] + RHS[i]);
return v;
}
int main() {
int sum = 42;
for (auto x : foo({1, 2, 3}, {1, 2, 3})) {
sum += x;
}
return sum;
}
When someone optimizes some code, one can equivalently ask "The code is 20% too slow, but how does that matter concretely to you?" Well, I can confidently say I'm not suffering terribly personally. I'm not seeing it as you pasted it, but "return std::move(v);" does indeed do the trick. Never do RVO for vector; always std::move a vector. https://godbolt.org/z/j4q5EcqK9

Anyway, I'm just pointing out what, to me at least, is obviously too much code generated for what push_back does. We do suffer from massive binaries, and it's often death by a thousand cuts with a lot of code that is just too bloated. The C++ standard library, due to its templated nature, often contributes significantly. Given its fundamental status in the ecosystem, I think it's good to give the code generation of standard library functions a good inspection.
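For reference, the one-line change the comment describes, applied to the example above. This reflects the observation in the linked godbolt output, not general guidance: returning with std::move normally inhibits NRVO, and here it merely happened to let the optimizer fold the whole call away.

#include <utility>
#include <vector>

static std::vector<int> foo(std::vector<int> const& LHS, std::vector<int> const& RHS)
{
    std::vector<int> v(3);
    for (int i = 0; i < 3; ++i)
        v.push_back(LHS[i] + RHS[i]);
    return std::move(v);  // instead of `return v;`
}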
libcxx/include/vector (review comment on an outdated diff):

  template <class _Up>
-   _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __push_back_slow_path(_Up&& __x);
+   _LIBCPP_NOINLINE _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __push_back_slow_path(_Up&& __x);
@philnik777 Would you be open to simply removing the inline keyword here, so that with -Os the compiler doesn't inline it? That way we don't have to add _LIBCPP_NOINLINE and pessimize other code. I would LGTM that change in a heartbeat.

Otherwise, I think we should try to get some concrete performance numbers here.
The reason I asked for a concrete 1st-order problem caused by the codegen is that I don't think it is "obvious". To take your example about code "being 20% too slow", I would have to refer to something like "it means my program handles 5% fewer queries a second" as my 1st-order concern: it's too slow because it affects my program's throughput measurably. Intuition has been a very poor tool in my experience benchmarking code, and so I'm asking for something more in order to understand and help resolve your concern. How did you first come to inspect the assembly for push_back?
I don't think of this as "a problem of the inliner". The code, as is, is written in a way that on a very fundamental level provides no good option for the compiler: either (1) inline everything and get good performance on the fast path, or (2) outline the slow path and get terrible performance. A simple push_back loop will be much slower, because the fast path suddenly degrades: the pointer to the vector escapes, so the compiler will store and reload the size, since it can't prove that the store of the element doesn't overwrite the size field.

All I'm suggesting is a redesign of the library such that the potential non-inlined fallback functions are static member functions that take the data by value and return it by value. Then the compiler can choose to inline or outline functions at will, and it won't affect performance at all.
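A minimal sketch of the kind of restructuring being proposed here; this is not libc++ code, and the names and buffer representation are made up for illustration (the noinline attribute assumes GCC/Clang). The slow path takes the current buffer state by value and returns the new state by value, so the vector object itself never escapes and the fast path can keep its fields in registers.

#include <cstddef>
#include <new>

struct BufState {
    int*        data;
    std::size_t size;
    std::size_t cap;
};

// Out-of-line slow path: state in by value, new state out by value.
__attribute__((noinline)) static BufState grow_and_push(BufState s, int value) {
    std::size_t new_cap = s.cap ? 2 * s.cap : 1;
    int* new_data = static_cast<int*>(::operator new(new_cap * sizeof(int)));
    for (std::size_t i = 0; i < s.size; ++i)
        new_data[i] = s.data[i];
    ::operator delete(s.data);
    new_data[s.size] = value;
    return BufState{new_data, s.size + 1, new_cap};
}

struct ToyVec {
    BufState s{nullptr, 0, 0};
    void push_back(int value) {
        if (s.size < s.cap)
            s.data[s.size++] = value;        // fast path: no call, nothing escapes
        else
            s = grow_and_push(s, value);     // slow path: inlined or not, the fast path is unaffected
    }
    ~ToyVec() { ::operator delete(s.data); }
};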
See the above analysis of push_back performance concerns for why I inspect push_back. I saw a 5x degradation (with different versions of clang, and -O2 vs -O3) due to not inlining the throw_bad_length_error path: the this-pointer escapes, hence the function is 5x too slow.
I feel like we're talking past each other. Best of luck.
I did some code archeology, I see
I'm running through some internal benchmarks right now.
We just discussed this issue with @philnik777 and we went through several attempted solutions.

First, we wanted to make the code look like what follows, which is the simplest from a readability point of view. Unfortunately, this rewrite doesn't work because it doesn't support a case the current code handles:

diff --git a/libcxx/include/vector b/libcxx/include/vector
index bfbf1ea1cfc9..8d4e32866675 100644
--- a/libcxx/include/vector
+++ b/libcxx/include/vector
@@ -656,6 +656,7 @@ public:
return this->__begin_ == this->__end_;
}
_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI size_type max_size() const _NOEXCEPT;
+ _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI void __reserve_for_real(size_type __n);
_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI void reserve(size_type __n);
_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI void shrink_to_fit() _NOEXCEPT;
@@ -901,9 +902,6 @@ private:
template <class _Up>
_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __push_back_slow_path(_Up&& __x);
- template <class... _Args>
- _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI inline pointer __emplace_back_slow_path(_Args&&... __args);
-
// The following functions are no-ops outside of AddressSanitizer mode.
// We call annotations for every allocator, unless explicitly disabled.
//
@@ -1460,14 +1458,21 @@ vector<_Tp, _Allocator>::at(size_type __n) const {
return this->__begin_[__n];
}
+// This one assumes that we don't have enough capacity
+// It allows callers to inline the capacity check
+template <class _Tp, class _Allocator>
+_LIBCPP_CONSTEXPR_SINCE_CXX20 void vector<_Tp, _Allocator>::__reserve_for_real(size_type __n) {
+ if (__n > max_size())
+ this->__throw_length_error();
+ allocator_type& __a = this->__alloc();
+ __split_buffer<value_type, allocator_type&> __v(__n, size(), __a);
+ __swap_out_circular_buffer(__v);
+}
+
template <class _Tp, class _Allocator>
_LIBCPP_CONSTEXPR_SINCE_CXX20 void vector<_Tp, _Allocator>::reserve(size_type __n) {
if (__n > capacity()) {
- if (__n > max_size())
- this->__throw_length_error();
- allocator_type& __a = this->__alloc();
- __split_buffer<value_type, allocator_type&> __v(__n, size(), __a);
- __swap_out_circular_buffer(__v);
+ __reserve_for_real(__n);
}
}
@@ -1529,19 +1534,6 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 inline _LIBCPP_HIDE_FROM_ABI void vector<_Tp, _All
this->__end_ = __end;
}
-template <class _Tp, class _Allocator>
-template <class... _Args>
-_LIBCPP_CONSTEXPR_SINCE_CXX20 typename vector<_Tp, _Allocator>::pointer
-vector<_Tp, _Allocator>::__emplace_back_slow_path(_Args&&... __args) {
- allocator_type& __a = this->__alloc();
- __split_buffer<value_type, allocator_type&> __v(__recommend(size() + 1), size(), __a);
- // __v.emplace_back(std::forward<_Args>(__args)...);
- __alloc_traits::construct(__a, std::__to_address(__v.__end_), std::forward<_Args>(__args)...);
- __v.__end_++;
- __swap_out_circular_buffer(__v);
- return this->__end_;
-}
-
template <class _Tp, class _Allocator>
template <class... _Args>
_LIBCPP_CONSTEXPR_SINCE_CXX20 inline
@@ -1551,16 +1543,14 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 inline
void
#endif
vector<_Tp, _Allocator>::emplace_back(_Args&&... __args) {
- pointer __end = this->__end_;
- if (__end < this->__end_cap()) {
- __construct_one_at_end(std::forward<_Args>(__args)...);
- ++__end;
- } else {
- __end = __emplace_back_slow_path(std::forward<_Args>(__args)...);
- }
- this->__end_ = __end;
+ if (size() == capacity()) [[__unlikely__]] {
+ __reserve_for_real(__recommend(size() + 1));
+ }
+
+ // Now we're in the fast path, and we can assume that we have enough capacity.
+ __construct_one_at_end(std::forward<_Args>(__args)...);
#if _LIBCPP_STD_VER >= 17
- return *(__end - 1);
+ return *(this->__end_ - 1);
#endif
}

The other thing we investigated was to rewrite the code like this:

diff --git a/libcxx/include/vector b/libcxx/include/vector
index bfbf1ea1cfc9..3b62dc029b76 100644
--- a/libcxx/include/vector
+++ b/libcxx/include/vector
@@ -1542,6 +1542,32 @@ vector<_Tp, _Allocator>::__emplace_back_slow_path(_Args&&... __args) {
return this->__end_;
}
+// This function is basically doing the following code:
+// if (cond) [[likely]]
+// __if();
+// else
+// __else();
+//
+// However, it does so without preventing the optimizer from inlining
+// the `else` branch in the case where the condition is known at compile-time.
+// This should really be fixed in the optimizer instead, see <LLVM BUG>.
+// Once that bug is fixed, there is no reason to keep this utility around.
+template <class _If, class _Else>
+_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14
+void __if_likely_else(bool __cond, _If __if, _Else __else) {
+ if (__builtin_constant_p(__cond)) {
+ if (__cond)
+ __if();
+ else
+ __else();
+ } else {
+ if (__cond) [[__likely__]]
+ __if();
+ else
+ __else();
+ }
+}
+
template <class _Tp, class _Allocator>
template <class... _Args>
_LIBCPP_CONSTEXPR_SINCE_CXX20 inline
@@ -1552,12 +1578,17 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 inline
#endif
vector<_Tp, _Allocator>::emplace_back(_Args&&... __args) {
pointer __end = this->__end_;
- if (__end < this->__end_cap()) {
- __construct_one_at_end(std::forward<_Args>(__args)...);
- ++__end;
- } else {
- __end = __emplace_back_slow_path(std::forward<_Args>(__args)...);
- }
+
+ bool __has_capacity = __end < this->__end_cap();
+ std::__if_likely_else(__has_capacity,
+ [&] {
+ __construct_one_at_end(std::forward<_Args>(__args)...);
+ ++__end;
+ },
+ [&] {
+ __end = __emplace_back_slow_path(std::forward<_Args>(__args)...);
+ });
+
this->__end_ = __end;
#if _LIBCPP_STD_VER >= 17
return *(__end - 1);
Basically, this applies the likely hint to the fast path only when the condition isn't a compile-time constant, so the optimizer remains free to inline and fold the slow path when it can prove which branch is taken.
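A standalone illustration of the property the helper above relies on, assuming Clang or GCC for __builtin_constant_p: the builtin reports whether a value folds to a compile-time constant at the point of use, which is what lets the helper drop the likely hint exactly when the branch direction is already known. Function names are made up.

#include <cstdio>

static int folds_to_constant(int x) {
    // 1 if, after inlining and constant propagation, the comparison is a known constant.
    return __builtin_constant_p(x > 0);
}

int main(int argc, char**) {
    std::printf("literal argument: %d\n", folds_to_constant(42));    // typically 1 with -O1 or higher
    std::printf("runtime argument: %d\n", folds_to_constant(argc));  // typically 0
    return 0;
}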
LGTM, but can you please run the SPEC benchmarks and provide the before/after comparison? I wouldn't be surprised if there was some sort of change.
// This makes the compiler inline `__else()` if `__cond` is known to be false. Currently LLVM doesn't do that without
// the `__builtin_constant_p`, since it considers `__else` unlikely even though it's known to be run.
// See https://llvm.org/PR154292
Suggested change:

// This function performs an if-else on the given condition, calling either `__if()` or `__else()`. The `__if()` branch is annotated as being likely.
//
// However, it increases the likelihood that the `__else()` branch will be inlined if `__cond` is known to be a constant by
// the optimizer, which LLVM currently doesn't do when naively applying the [[likely]] attribute.
// See https://llvm.org/PR154292
This is nice. https://godbolt.org/z/rK6EjTehG indeed shows that the codegen is a lot better.
I've run the benchmarks and the difference seems to be within the margin of error, possibly a tiny bit better.
Please note that this patch has broken lldb-remote-linux-ubuntu.
This also broke premerge. We should probably add LLDB data formatters coverage here, given that it looks like the CI all passed.
I've reverted this in 7f2e9b1 due to the lack of response here. Hopefully that doesn't cause too much trouble.
I'll take a look at why the tests were affected by this next week.
For what it's worth, this patch regressed the binary size of our Chromium Android builds a bit. With this patch: https://chromium-review.googlesource.com/c/chromium/src/+/6943035 ("APK Size (+50.9 KiB)"). It's a fairly small regression, so it doesn't affect us much, but it seemed worth mentioning. The bot uses -O2, ThinLTO, PGO, ICF, and -ffunction-sections / -fdata-sections.
…lvm#94379)"
This reverts commit 7f2e9b1.
These are almost certainly intended to not be inlined. This significantly reduces code size when push_back and emplace_back are used heavily.

Fixes #94360