-
Notifications
You must be signed in to change notification settings - Fork 233
Fix for outstanding riak_kv/679 dataloss cases #1643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for outstanding riak_kv/679 dataloss cases #1643
Conversation
new_key_epoch was called for all puts where the local get returned a notfound (no_old_object in the parlance of the recent refactor.) A new epoch is only needed in certain situtions. This change means a new epoch is only created when needed on no_old_object puts. This is an intermediate step toward fixing the outstanding kv679 case where a vnode has a stale/old/bad local vclock on disk. See the test https://github.com/russelldb/riak_test/blob/rdb/gh-kv679-fb-byz/tests/kv679_dataloss_fb.erl for more details. This commit also simplifies the maybe_new_key_epoch function.
And to dodge this weird erlang bug http://erlang.org/pipermail/erlang-questions/2017-March/091844.html
In this test all writes are coordinated by the head of the preflist, pfhead, and replicated to at least one other node chosen from 2 further primaries and 3 fallbacks. Every write is there for w=2 and acked on pfhead one other node. Every now and again one node may replicate to another. Every now and again pfhead may "forget" it's local values (this is like a disk error or partially successful delete.) In this commit the test catches the lost acked write bug pretty quickly. Next up is to add the vnode counter and see if the test can catch the case as yet unfixed. Also added a convenience function to riak_object.
This commit adds the epoch counter to the coordinator. It's a WIP commit that fixes the counter example from the previous commit (which showed dataloss.)
This counter example shows the same bug as the riak_test https://github.com/basho/riak_test/blob/rdb/gh-kv679-fb-byz/tests/kv679_dataloss_fb.erl Yay!
This commit adds a no-op/no-dot VV increment with a new epoch actor to an object if it is needed. If the incoming object contains an entry for the vnode greater than the local one then that indicates that the local vnode has suffered amnesia and needs to ensure it uses a new actor id next time it coordinates a write on this key. The simplest way to do that is add a VV entry of Actor:Epoch+1 with a counter of 1. Note the new function on riak_object as this increment must _NEVER_ create a dot.
Contrary to previous commit, there is a variation on the bug that requires new actor epoch even if the local object is found.
An amnesiac vnode may recieve a replica/handoff put with it's own vnode ID in, this is a strong hint that something is rotten in Denmark, add a new vnode epoch to the clock as a non-dot/no-op entry
I think this all needs a refactor.
is_local_amnesia seems a better function name than is_greater
Thanks @russelldb! Settings---
minimum_reviewers: 2
merge: false
merge_method: merge
build_steps:
- make clean
- make deps
- make compile
- make test
- make xref
- make dialyzer
org_mode: true
timeout: 1800 |
There seems to be an issue with build step **make_test,make_dialyzer** ! ☁️✅ MERGE
✅ MAKE_CLEAN
✅ MAKE_DEPS
✅ MAKE_COMPILE
⛔ MAKE_TEST
✅ MAKE_XREF
⛔ MAKE_DIALYZER
⬜ 0 of 2 Code reviews from organization basho |
@russelldb if you can start by addressing the dialyzer warnings while Nick and I are reviewing I'd appreciate it. On first pass, makes sense to me and generally looks good, but I want to think a bit more about it. There are a few stylistic things we're trying to avoid in here, but nothing we can't clean up post-merge. |
Happy to address the stylistic concerns if you air them. Did I introduce new dialyzer warnings? |
Ah, I see it, thumbs is pretty useful/helpful. |
Yea - we ❤️ Thumbs these days... only real issues are where we may have set some of the settings in a way that we later figured out we didn't like... I was just running dialyzer on |
src/riak_kv_vnode.erl
Outdated
{oldobj, CurObj}; | ||
false -> | ||
{newobj, ResObj} | ||
if NewEpoch -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please replace this with a case NewEpoch
- trying to kill if
statements with 🔥 these days, and would like to get rid of the ?ELSE macro as well (but understand that the other usage isn't currently part of this PR, so it's OK to leave it for now).
src/riak_kv_vnode.erl
Outdated
[] | ||
end, | ||
{EpochId, State2} = new_key_epoch(State), | ||
{EpochId, State2} = maybe_new_key_epoch(Coord, State, undefined, RObj), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using undefined
here, could we use a more distinctive atom like no_object
or something along those lines? Would make the code a bit clearer, and also help guard against any future bugs where an unrelated use of undefined
might leak into this code. Admittedly this is somewhat of a personal style preference though, so I won't argue if you'd strongly prefer leave it as-is. Looking through the code, we'd have to update maybe_new_key_epoch
and also highest_actor
, but I think it should be a safe and easy change.
(Commenting on the first commit where this appears, but it does look like this code later got moved into maybe_update_vclock
.)
src/riak_kv_vnode.erl
Outdated
-spec maybe_new_key_epoch(Coord::boolean(), State::#state{}, | ||
LocalObject::riak_object:riak_object(), | ||
IncomingObject::riak_object:riak_object()) -> | ||
{ActorId:: binary(), #state{}}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the variable names to the type specs - definitely improves readability in many places.
In the case of ActorId
I see we specify ActorId :: binary()
in a couple places. Maybe worth defining a local actor_id()
type in this module to avoid redundancy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I found an existing type vnodeid()
but I'm not sure if it muddies the water since in some places that means the one generated by the vnode_status_mgr, and some cases it means the actor_id generated by smushing together the vnodeid and epoch. Maybe we need both, using the bitstring <<_:Length>> type spec?
%% `ActorBase' that has acted on this key. `KeyEpoch' is the highest | ||
%% epoch for the `ActorBase'. `Counter' is the greatest event seen by | ||
%% the `VnodeId'. | ||
-spec highest_actor(binary(), riak_object:riak_object()) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are now allowed to pass undefined
into this function, we should add it to the type spec.
@russelldb what happened? I was about to continue my review, and was excited to get this stuff merged. |
I have no idea, I didn't do anything! |
There is one still outstanding, but it is baffling, and, I suspect predates these changes.
Many thanks for the review. I'm working on it. One dialyzer warning remains, I have had to post to the erlang ML for help. |
Re-running the eqc I've found some new counter examples. Hold off on this until I figure them out, please. |
The generator for put has to ensure that W=2 and that the preflist is distinct nodes. Also, for some reason the epoch update on non-coordinating writes was commented out, so I re-instated that, as it should be there.
@russelldb Was chatting with @JeetKunDoug and he mentioned a test failure during the Giddyup run for |
@nickelization looking at it. So far a bit baffled. All the |
@russelldb been a busy day and I haven't had much chance to look into this any further yet (I probably should've mentioned earlier, but today is my last day working for Basho, so I've had some loose ends to tie up). The one potential lead I have for you via Doug is that he noticed those new "amnesia" log messages showing up during repairs. I guess that might not be unexpected, due to the nature of the test...but again, I haven't had a chance to look into it any further yet, and might not get to it before my time here is up :( |
With thanks to Dough Rohrer. `new_key_epoch/1` is only ever called if `#counter_state.use == true`. Also address stylistic complaints about use of `if ?ELSE`
There seems to be an issue with build step **make_test,make_dialyzer** ! ☁️✅ MERGE
✅ MAKE_CLEAN
✅ MAKE_DEPS
✅ MAKE_COMPILE
⛔ MAKE_TEST
✅ MAKE_XREF
⛔ MAKE_DIALYZER
⬜ 0 of 2 Code reviews from organization basho |
@nickelization wow. Well thanks for all the time you put into this PR up to now. All the best in whatever you're doing next. Stay in touch, please! Hope we can work together again in the future. |
@JeetKunDoug Finally figured it out. https://github.com/basho/riak_test/blob/develop/tests/partition_repair.erl#L223 is the problem The data in the vnode partition no longer matches the data in the stash since it now has the new epoch actor entry added to the vclock. I guess I need to makes this |
@russelldb yea - that looks about right to me.. a more intelligent comparison should do it. LMK if you need anything else, or when you've got changes done, and I'll take another look. |
The partition repair test deletes all the data at a partition, and then repairs it from neighbouring partitions. The subset of repaired data that was originally coordinated by the deleted partition's vnode showed up as `notfound` since the latest kv679 changes here basho/riak_kv#1643. The reason is that the fix in the KV repo adds a new actor to the repaired key's vclock. Prior to this change `verify` in partition_repair.erl did a simple equality check on the binary encoded riak_object values. This change takes into account that a new actor may be in the vclock at the repaired vnode, and uses a semantic equality check based on riak_object merge and riak object equal.
@JeetKunDoug I pushed a fix to the riak-test at basho/riak_test#1296 |
@russelldb thanks - it seemed like yo had solutions for the dialyzer errors - any clues on the eunit failure in the Thumbs output? Looks potentially unrelated but curious if you have taken a look yet? If not I can try to see if it's reproducible locally on develop or not and we can go from there. |
@russelldb you're welcome, and thanks! Same to you, on all counts :) |
@JeetKunDoug I used your solutions for the dialyzer errors! Will look at the eunit |
riak_kv_vnode:put_merge is exported for testing, and the kv679 changes change the function signature from `vnodeid` to a two-tuple `{IsNewEpoch::boolean(), VnodeId::binary()}` this commit reflects that change in the test code.
There seems to be an issue with build step **make_dialyzer** ! ☁️✅ MERGE
✅ MAKE_CLEAN
✅ MAKE_DEPS
✅ MAKE_COMPILE
✅ MAKE_TEST
✅ MAKE_XREF
⛔ MAKE_DIALYZER
⬜ 0 of 2 Code reviews from organization basho |
@JeetKunDoug Fixed the failing eunit.
is looking more and more pertinent as time goes on :D |
Thanks, again, to @JeetKunDoug
Looks good! 👍✅ MERGE
✅ MAKE_CLEAN
✅ MAKE_DEPS
✅ MAKE_COMPILE
✅ MAKE_TEST
✅ MAKE_XREF
✅ MAKE_DIALYZER
⬜ 0 of 2 Code reviews from organization basho |
Now that all the unit test/dialyzer issues are taken care of, going to kick off another integration test run... should have results in the morning. |
+1 - integration tests run and triaged. Looks good. Thanks @russelldb for the great work, as always. |
+1 |
@thumbot merge |
Wooo! Thanks @JeetKunDoug and @nickelization for the review. |
…icate Fix for outstanding riak_kv/679 dataloss cases - manually merging because it seems Thumbot is a bit slow today.
These commits fix the remaining kv679 dataloss edge cases. They include a simple quickcheck model/simulator. The riak-tests for these changes can be found here basho/riak_test#1296.
The fix is a slight change to the per-actor-epoch code that shipped in 2.1, whereas before we only updated the epoch on coordinating writes, now we update it on non-coordinating writes where we detect some local amnesia. The fix is to add a no-op/no-dot entry to the version vector with a new actor-epoch and a counter of one. This ensures that the next write coordinated by the vnode uses the new epoch, and is not instead a repeated event.