fix(service): follow event schedule interval #5848
Conversation
* Update controller throttle so events schedule at `max(lastRun+interval, now+minSync)`
* Rework `TestShouldRunOnce` flow to cover the stricter scheduling contract
* Add per-source tracking of when node data is required
* Add lazy node handler registration
* Introduce a node-address diffing handler to prevent heartbeat-only triggers
* Adjust service informer tests for the new detection path and explicit node-event opt-in

Signed-off-by: Tobias Harnickell <[email protected]>
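For reference, a minimal sketch of the scheduling rule from the first bullet; `nextRun`, `lastRun`, `interval`, and `minSync` are illustrative names, not the controller's actual fields:

```go
package main

import (
	"fmt"
	"time"
)

// nextRun returns the earliest time the controller may reconcile again
// under the contract max(lastRun+interval, now+minSync). Illustrative
// only; this is not the code from this PR.
func nextRun(lastRun, now time.Time, interval, minSync time.Duration) time.Time {
	byInterval := lastRun.Add(interval)
	byMinSync := now.Add(minSync)
	if byInterval.After(byMinSync) {
		return byInterval
	}
	return byMinSync
}

func main() {
	now := time.Now()
	lastRun := now.Add(-2 * time.Minute)
	// With interval=10m and minSync=30s, an event arriving now cannot
	// pull the next run earlier than lastRun+interval.
	fmt.Println(nextRun(lastRun, now, 10*time.Minute, 30*time.Second))
}
```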
As mentioned in the PR message, here's the test setup and the tests.

Setup: Prepare the cluster:

```sh
kubectl create ns dns
helm repo add bitnami https://charts.bitnami.com/bitnami
helm -n dns install etcd bitnami/etcd --set auth.rbac.create=false
```

Create the CoreDNS values file:

```yaml
isClusterService: false
serviceType: ClusterIP
servers:
- zones:
  - zone: example.test.
  port: 53
  plugins:
  - name: errors
  - name: log
  - name: etcd
    parameters: example.test
    configBlock: |
      stubzones
      path /skydns
      endpoint http://etcd.dns.svc.cluster.local:2379
  - name: forward
    parameters: . /etc/resolv.conf
  - name: cache
    parameters: 30
  - name: health
  - name: ready
```

Create the CoreDNS instance with these values, then create and deploy the external-dns manifests:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services","endpoints","pods","nodes"]
  verbs: ["get","watch","list"]
- apiGroups: ["networking.k8s.io","discovery.k8s.io","extensions"]
  resources: ["ingresses","endpointslices"]
  verbs: ["get","watch","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: extdns:dev
        imagePullPolicy: Never
        args:
        - --provider=coredns
        - --source=ingress
        - --source=service
        - --interval=10m
        - --events
        - --domain-filter=example.test
        - --log-level=debug
        env:
        - name: ETCD_URLS
          value: http://etcd.dns.svc.cluster.local:2379
```

Create and deploy a noise-free service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello
  annotations:
    external-dns.alpha.kubernetes.io/hostname: hello.example.test
spec:
  type: ClusterIP
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: web
        image: nginx:alpine
        ports:
        - containerPort: 8080
```

Generate some churn:

```sh
while true; do
  for n in $(kubectl get nodes -o name); do
    kubectl label $n edns-churn=$(date +%s%N) --overwrite >/dev/null 2>&1 || true
  done
  sleep 2
done
```

Watch logs:

```sh
kubectl logs -n dns deploy/external-dns -f
```

Result:
/ok-to-test
From first look, the solution in the service.go file looks overcomplicated. Maybe we just remove:

```go
if sc.serviceTypeFilter.isRequired(v1.ServiceTypeNodePort) {
    _, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler))
}
```

Basically, there is a relevant snippet at external-dns/source/service.go, line 185 in 413015e.

The changes in controller.go are not obviously related, so they could be added in a refactoring PR.
I agree that it's a large file, though I did not take the time to evaluate what is overall "clutter" or unnecessary complexity. To the relevant lines in `service.go`: what I could do, though, is simplify the code by dropping the `controller` changes. Let me know what you think, or if I misunderstood anything from your feedback.

UPDATE: I gave the feedback some more thought, especially regarding the bulk of my changes. I think I can still drop lazy detection & flags, and unconditionally register a diffing node handler for when a feature depends on node data. I'll push a new commit for that, along with new test results.

UPDATE2: I just understood that with your last comment, you likely wanted to bring up the mix of concerns I added in this one PR. I agree and will split this PR so the commit history in the repo remains clean.
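To illustrate the "diffing node handler" idea mentioned above, here is a minimal sketch (not the PR's actual code) of an informer handler that only fires when a node's addresses change, so status-only heartbeats are ignored; `triggerResync` is a hypothetical stand-in callback:

```go
package source

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// nodeAddressDiffHandler returns an informer handler that invokes
// triggerResync only when a node's address list actually changes.
// Heartbeat updates (condition timestamps only) are dropped.
func nodeAddressDiffHandler(triggerResync func()) cache.ResourceEventHandler {
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    func(_ interface{}) { triggerResync() },
		DeleteFunc: func(_ interface{}) { triggerResync() },
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, okOld := oldObj.(*v1.Node)
			newNode, okNew := newObj.(*v1.Node)
			if !okOld || !okNew {
				return
			}
			// Only the address list matters for NodePort targets;
			// everything else in the node status is noise here.
			if !reflect.DeepEqual(oldNode.Status.Addresses, newNode.Status.Addresses) {
				triggerResync()
			}
		},
	}
}
```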
Reverts controller changes from ccd3cb9 to separate the scheduling fix from the service logic updates. Introduced new service logic:

* Lazily register the node informer only when needed
* Add a node address change handler with minimal comparisons
* Support detection of node-dependent services before enabling node events
* Update service tests

Signed-off-by: Tobias Harnickell <[email protected]>
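To illustrate the lazy registration mentioned in the commit above, a minimal sketch under assumed names (`lazyNodeHandler` and `observeService` are hypothetical, not the PR's actual identifiers): the node event handler is attached only once a node-dependent service is observed:

```go
package source

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// lazyNodeHandler attaches a handler to the node informer the first
// time a node-dependent service is seen, instead of unconditionally
// at startup. Names are illustrative only.
type lazyNodeHandler struct {
	once         sync.Once
	nodeInformer cache.SharedIndexInformer
	handler      cache.ResourceEventHandler
}

// observeService is called for every service event; it registers the
// node handler only once a NodePort (node-dependent) service shows up.
func (l *lazyNodeHandler) observeService(svc *v1.Service) {
	if svc.Spec.Type != v1.ServiceTypeNodePort {
		return
	}
	l.once.Do(func() {
		_, _ = l.nodeInformer.AddEventHandler(l.handler)
	})
}
```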
New test results show the behaviour as expected. It does not currently measure any min vs. interval contradiction under genuine event pressure, though.

Interval in a silent cluster (no service or ingress changes):

Interval in a noisy cluster (see script below):

Churn script:

```sh
SVC=test-extdns
kubectl create service clusterip $SVC --tcp=80:80 2>/dev/null || true
while true; do
  kubectl annotate service $SVC external-dns.alpha.kubernetes.io/ttl=$(shuf -i 30-300 -n1) --overwrite
  sleep 2
done
```
I know there is effort behind this solution. At the same time, I'm on the fence.

The bug/behaviour change was introduced unintentionally here: https://github.com/kubernetes-sigs/external-dns/pull/5613/files#diff-cf68f602fa7c20e5341f3b83054df68ade1586a144b1eae5347e0ac47096d3aa, i.e. an event handler was added without any explicit requirement. Usually we split refactoring and behaviour changes, but as it triggers reconciles more often than required, it's probably not worth the benefits.

Pros/cons if we simply remove it:

Pros

Cons

What is questioned in the current implementation

The other solution is to check whether the node source is enabled:

```go
if sc.serviceTypeFilter.isRequired(v1.ServiceTypeNodePort) && sc.isNodeSourceEnabled {
    _, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler))
}
```

From a high-level view, there is nothing wrong, as the external-dns arguments are:

```
--interval=10m
--events
```

In theory they are mutually exclusive: you could not have 10 minutes between reconciliations and also rely on Kubernetes events, as described here: https://github.com/kubernetes-sigs/external-dns/blob/master/docs/advanced/rate-limits.md#introduction

So let's wait for other reviewers. If we decide to add this fix to the codebase, it's most likely worth moving the changes.
What does it do?

This PR introduces changes to cover the scheduling contract for events. Partial fix for ticket #5796.
What is NOT implemented here is a proper controller throttle so that events schedule at `max(lastRun+interval, now+minSync)`. To prevent a mix of concerns, this will be added in a separate PR. As a direct consequence, `TestShouldRunOnce` has also not been changed to cover the recommended, stricter scheduling contract.

Motivation
With 0.19.0 the service source started wiring the `--events` handler to the shared node informer. Since the default service type filter is empty, every node is observed, which means the handler now fires on every node status heartbeat. The motivation is to re-enforce the scheduler contract and ensure predictable behaviour for event schedule intervals.
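As a self-contained illustration (not code from this PR) of why heartbeat updates are noisy: two node objects that differ only in a heartbeat timestamp compare unequal as whole statuses, yet equal on the address list the service source actually needs:

```go
package main

import (
	"fmt"
	"reflect"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	old := v1.Node{Status: v1.NodeStatus{
		Addresses:  []v1.NodeAddress{{Type: v1.NodeInternalIP, Address: "10.0.0.1"}},
		Conditions: []v1.NodeCondition{{Type: v1.NodeReady, LastHeartbeatTime: metav1.NewTime(time.Now())}},
	}}
	updated := old.DeepCopy()
	// Simulate a heartbeat-only update: only the condition timestamp moves.
	updated.Status.Conditions[0].LastHeartbeatTime = metav1.NewTime(time.Now().Add(10 * time.Second))

	// false: the heartbeat changed, so a naive handler would fire.
	fmt.Println("whole status equal:", reflect.DeepEqual(old.Status, updated.Status))
	// true: nothing DNS-relevant changed, so a diffing handler stays quiet.
	fmt.Println("addresses equal:   ", reflect.DeepEqual(old.Status.Addresses, updated.Status.Addresses))
}
```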
More
This PR was tested in a virtual environment. For brevity, I post the setup and results in a comment.