[06:53:50] gehel: o/
[06:54:12] there are two crits for PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled
[06:54:46] just triple checking the status (if we need to act on it or if it is waiting for maintenance etc..)
[06:59:00] Damn, there was the same thing yesterday, by the time I checked, it was back to normal
[06:59:24] On the way back from daycare, I'll check as soon as I'm back
[06:59:52] elukey: can you depool that server in the meantime pls?
[07:00:49] gehel: sure
[07:00:55] Thanks !
[07:01:07] elukey@wdqs2004:~$ sudo -i depool
[07:01:07] Depooling all services on wdqs2004.codfw.wmnet
[07:01:07] codfw/wdqs-internal/wdqs/wdqs2004.codfw.wmnet: pooled changed yes => no
[07:01:10] gehel: --^
[07:21:29] Thanks!
[07:37:27] elukey: strange, the alert still says "wdqs2004.codfw.wmnet are marked down but pooled", but wdqs2004 is depooled
[07:37:47] yeah
[07:38:03] https://config-master.wikimedia.org/pybal/codfw/wdqs-internal looks sane
[07:38:18] I checked the readiness-probe and it does return 200 for wdqs2004
[07:38:27] and checked that this is the URL configured in the LVS check
[07:39:04] (http://localhost/readiness-probe)
[07:40:11] ema: around by any chance? There is some pybal magic that I don't understand here...
[07:40:38] or maybe vgutierrez ?
[07:40:52] I'm around
[07:41:24] vgutierrez: for context: there is an alert in icinga "PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled" for LVS200[36]
[07:41:49] checking wdqs2004, it looks OK, and the URL used for the LVS check is returning HTTP/200
[07:42:12] checking...
[07:42:15] wdqs2004 is depooled (thanks to Luca) but the alert is still there
[07:43:25] I'm suspecting the issue is around pybal / etcd, but mostly because that's the part that I don't understand much
[07:45:02] hmm pybal depooled wdqs2004 effectively in lvs2006 and lvs2003
[07:45:09] TCP 10.2.1.41:80 wrr
[07:45:09] -> 10.192.16.4:80 Route 10 0 0
[07:45:09] -> 10.192.48.92:80 Route 10 0 1
[07:45:17] that's how wdqs-internal looks in ipvs
[07:45:22] .4 is 2005 and .92 is 2006
[07:45:36] the logs report something similar
[07:45:40] Jul 19 07:01:01 lvs2003 pybal[38544]: [wdqs-internal_80] INFO: Merged enabled server wdqs2005.codfw.wmnet, weight 10
[07:45:40] Jul 19 07:01:01 lvs2003 pybal[38544]: [wdqs-internal_80] INFO: Merged enabled server wdqs2006.codfw.wmnet, weight 10
[07:45:40] Jul 19 07:01:01 lvs2003 pybal[38544]: [wdqs-internal_80] INFO: Merged disabled server wdqs2004.codfw.wmnet, weight 10
[07:46:18] that part, I expect, since Luca depooled it
[07:46:32] but why do we still have an alert saying that it is pooled
[07:46:49] and why do we have an alert at all since running the check manually seems fine
[07:48:17] is /usr/local/lib/nagios/plugins/check_pybal on the lvs hosts still returning the same result?
[07:49:15] the check should be /usr/local/lib/nagios/plugins/check_pybal --url http://localhost:9090/alerts
[07:49:15] apparently yes
[07:49:22] it is screaming on 2003 and 2006
[07:50:32] that is pybal itself right?
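
A minimal sketch of the check pattern being discussed here, assuming a standard Nagios-style plugin layout; this is not the actual check_pybal source. It illustrates the point confirmed just below: the plugin only fetches PyBal's own /alerts endpoint (the URL quoted above) and republishes whatever PyBal reports, without doing any analysis of its own.

#!/usr/bin/env python3
# Sketch only, not the real check_pybal plugin. Assumes the usual
# Nagios exit codes and the instrumentation URL quoted in the log.
import sys
import urllib.request

OK, CRITICAL, UNKNOWN = 0, 2, 3

def check_pybal(url="http://localhost:9090/alerts", timeout=5):
    try:
        body = urllib.request.urlopen(url, timeout=timeout).read().decode().strip()
    except OSError as exc:
        print(f"UNKNOWN - could not reach PyBal instrumentation: {exc}")
        return UNKNOWN
    # No analysis happens here: PyBal's verdict is republished verbatim.
    print(body)
    return OK if body.startswith("OK") else CRITICAL

if __name__ == "__main__":
    sys.exit(check_pybal())
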
[07:50:38] (checking on the host)
[07:50:53] yep seems so
[07:51:51] Oh, the alert comes from pybal itself, the icinga check only publishes it, without any analysis
[07:52:19] so it looks like pybal on lvs2003/6 has some internal incoherence
[07:52:34] maybe it is indeed some state on conf2XXX that is messed up
[07:53:20] weird
[07:53:22] I'd say in pybal itself, since it has depooled wdqs2004 (as vgutierrez pointed out above) but is still reporting that error
[07:53:30] seems a pybal issue TBH
[07:53:40] so pybal is seeing wdqs2004 as up right now
[07:53:43] pybal_monitor_status{host="wdqs2004.codfw.wmnet",monitor="ProxyFetch",service="wdqs-internal_80"} 1.0
[07:53:44] $ confctl --quiet select name="wdqs2.*" get
[07:53:44] {"wdqs2004.codfw.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=codfw,cluster=wdqs-internal,service=wdqs"}
[07:53:47] {"wdqs2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs-internal,service=wdqs"}
[07:53:50] {"wdqs2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs-internal,service=wdqs"}
[07:53:52] we can always repool wdqs2004, it looks fine as far as I can tell on my side
[07:53:53] {"wdqs2001.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs"}
[07:53:56] {"wdqs2002.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs"}
[07:53:59] {"wdqs2003.codfw.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs"}
[07:54:02] there are 2 depooled
[07:54:04] what's the threshold in pybal for this cluster?
[07:54:17] but
[07:54:17] that's as far as the etcd goes
[07:54:20] vgutierrez@lvs2003:~$ curl http://localhost:9090/alerts
[07:54:20] CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled
[07:54:30] those are actually 2 different clusters
[07:55:01] volans: depool-threshold = .5
[07:55:09] yeah was looking right now, thx :)
[07:55:32] let me restart pybal on lvs2006
[07:55:48] vgutierrez: wait
[07:55:57] * vgutierrez waiting
[07:56:04] I also need to repool wdqs2003 (data transfer is now completed), but I'll wait until we understand that issue a little bit better before doing it
[07:56:04] so we have 3 hosts in cluster wdqs and 3 in cluster wdqs-internal
[07:56:05] better wait for volans
[07:56:09] with depool-threshold = .5
[07:56:12] so if one is depooled
[07:56:17] as soon as another has a hiccup
[07:56:20] pybal doesn't depool it
[07:56:22] and complains
[07:56:31] but that's not the case right now
[07:56:50] what do you mean? there is one depooled per cluster already
[07:57:03] 2004 in internal and 2003 in public
[07:57:10] yes, but pybal complains about the node that is already depooled
[07:57:18] not about another node
[07:57:45] * volans wonders if at that point pybal treats any host in the pool as poolable
[07:57:58] although IIRC it should not depool the second one at all in the first place
[07:58:12] CRITICAL - wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled
[07:58:18] this is the alert right now, coming from pybal itself
[07:58:34] and that's simply not true, because wdqs2004 is currently depooled (in etcd AND pybal)
[07:59:13] before restarting, can we check in the code the bit that triggers this? So it will be easy to see if anything is in a messy state.. (and possibly fix the bug)
[08:00:27] should be https://github.com/wikimedia/PyBal/blob/1.15/pybal/instrumentation.py#L79 right?
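
A worked sketch of the depool-threshold arithmetic spelled out above (3 backends per cluster, depool-threshold = .5). This is illustrative only and not PyBal's actual implementation; the function name and signature are invented for the example.

import math

def can_depool(total_servers: int, currently_pooled: int, threshold: float = 0.5) -> bool:
    # Depooling one more server must not drop the pool below the threshold.
    minimum_pooled = math.ceil(total_servers * threshold)
    return currently_pooled - 1 >= minimum_pooled

# wdqs-internal codfw: 3 backends, threshold .5
print(can_depool(3, 3))  # True  -> the first failing server can be depooled
print(can_depool(3, 2))  # False -> a second failing server is kept pooled, and
                         #          pybal complains "marked down but pooled"
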
[08:00:55] hmmm I'd say 1.15-stretch, but yeah
[08:02:25] If I had to guess, somebody has been nerd sniped and is already looking at the Python code
[08:02:33] will not say the name of the person though
[08:03:59] "looking at the Python code and preparing a 13 patch series with small refactorings" you meant
[08:04:28] ahhaahahhahahah
[08:04:48] moritzm: you are giving away the name of the person in this way
[08:05:13] and that's how you get a new pybal version from scratch for free
[08:05:28] * vgutierrez runs to another continent
[08:06:34] this test is interesting
[08:06:34] https://github.com/wikimedia/PyBal/blob/942a31290326a635294a905e19d8cef2e7cc6181/pybal/test/test_instrumentation.py#L108-L112
[08:06:55] gehel: just to be sure, all the other nodes in wdqs-internal are healthy, right?
[08:07:08] in codfw, yes
[08:07:23] gehel: BTW, pybal began to consider wdqs2004 not healthy yesterday
[08:07:28] in eqiad I have some doubts about 1008 (it's inactive at the moment)
[08:07:29] Jul 18 14:13:02 lvs2006 pybal[1176]: [wdqs-internal_80] ERROR: Monitoring instance ProxyFetch reports server wdqs2004.codfw.wmnet (disabled/up/not pooled) down: 502 Bad Gateway
[08:07:29] Jul 18 14:13:17 lvs2006 pybal[1176]: [wdqs-internal_80 ProxyFetch] WARN: wdqs2004.codfw.wmnet (disabled/partially up/not pooled): Fetch failed (http://localhost/readiness-probe), 1.713 s
[08:07:29] Jul 18 14:13:27 lvs2006 pybal[1176]: [wdqs-internal_80 ProxyFetch] WARN: wdqs2004.codfw.wmnet (disabled/partially up/not pooled): Fetch failed (http://localhost/readiness-probe), 0.004 s
[08:08:13] vgutierrez: I had a timing issue in my data transfer procedure, 2004 was shut down before being depooled
[08:08:25] ack
[08:08:30] but before it was wdqs2006.codfw.wmnet (around 9am yesterday)
[08:08:43] but nothing in today's logs
[08:08:58] nope
[08:09:05] pybal is considering the 3 nodes healthy right now
[08:09:10] I have not started on those data transfers yet for today (but need to finish them before I leave)
[08:09:55] the probe on wdqs1008 is healthy, but there was a hiccup during the data transfer, so we might be missing some data on that node
[08:10:15] vgutierrez: Last State Change: 2019-07-18 12:58:09
[08:10:20] for the icinga alert
[08:10:56] I think that's what messed up pybal
[08:10:59] Jul 18 12:58:00 lvs2006 pybal[1176]: [wdqs-internal_80] ERROR: Could not depool server wdqs2004.codfw.wmnet because of too many down!
[08:11:42] could be, I was looking at the same logs and thinking the same
[08:11:58] * gehel needs to grab a coffee, not yet awake :(
[08:12:03] biab
[08:12:04] although at Jul 18 12:58:45 it reported all 3 healthy
[08:12:07] it is the bit of the code that does self.pooledDownServers.add(server)
[08:12:32] in theory, there should be a code path for its removal, right?
[08:12:47] if not, it may stay there even if an explicit depool is done
[08:12:57] (trying to think out loud)
[08:13:16] elukey: also that part was mostly rewritten in master, just not released ;)
[08:14:52] lovely
[08:15:20] so.. may I restart pybal on lvs2006? :)
[08:15:56] +1
[08:16:12] +1
[08:17:28] vgutierrez@lvs2006:~$ curl http://localhost:9090/alerts
[08:17:28] OK - All pools are healthy
[08:18:34] shall I continue with lvs2003?
[08:19:51] seems reasonable to me at this point
[08:21:25] vgutierrez@lvs2003:~$ curl http://localhost:9090/alerts
[08:21:25] OK - All pools are healthy
[08:21:32] gehel: sorry for the pybal glitch <3
[08:21:37] thanks vgutierrez !
[08:22:09] thanks!
[08:22:23] vgutierrez, elukey, volans: thanks a lot for the analysis!
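
A toy model of the stale-state hypothesis raised above (the self.pooledDownServers.add(server) bit with no matching removal when an operator depools the host via etcd). This is not PyBal's code: the class, method names and alert string are invented to mirror the behaviour seen in this incident, where the alert outlived the manual depool and only cleared after the daemon restart.

class CoordinatorSketch:
    """Toy model, not PyBal: tracks servers that are down but kept pooled."""

    def __init__(self, servers):
        self.pooled = set(servers)
        self.pooledDownServers = set()  # attribute name taken from the log

    def monitor_reports_down(self, server, can_depool):
        if can_depool:
            self.pooled.discard(server)
        else:
            # Too many already down: keep it pooled and remember it for /alerts.
            self.pooledDownServers.add(server)

    def operator_depool(self, server):
        # etcd/confctl path: the server leaves the pool...
        self.pooled.discard(server)
        # ...but nothing here clears pooledDownServers, so the stale entry
        # keeps the "marked down but pooled" alert alive until a restart.

    def alerts(self):
        if self.pooledDownServers:
            names = ", ".join(sorted(self.pooledDownServers))
            return f"CRITICAL - Servers {names} are marked down but pooled"
        return "OK - All pools are healthy"


pool = CoordinatorSketch({"wdqs2004", "wdqs2005", "wdqs2006"})
pool.monitor_reports_down("wdqs2004", can_depool=False)  # the Jul 18 12:58 "too many down" event
pool.operator_depool("wdqs2004")                         # the manual depool on Jul 19
print(pool.alerts())  # still CRITICAL, matching what was observed until the restart
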
[08:23:04] I just repooled 2003
[08:23:54] and 2004
[08:25:00] and now time to break a few more of those wdqs servers...
[08:25:46] 2004 has been pooled as expected
[08:25:49] TCP 10.2.1.41:80 wrr
[08:25:50] -> 10.192.0.20:80 Route 10 0 1
[08:25:50] -> 10.192.16.4:80 Route 10 0 0
[08:25:50] -> 10.192.48.92:80 Route 10 0 1
[08:25:53] (lvs2003)
[08:27:16] good!
[17:57:50] ema, if you're still around, could you join us in #wikimedia-releng? We're having some acme-chief problems in deployment-prep
[17:58:01] (or mutante might also be able to help with this)
[18:00:28] what's the issue?
[19:56:22] Question for the Debian people: we currently have fastnetmon 1.1.3 in buster, and 1.1.4 in testing/unstable. If I want to try to have 1.1.4 in buster, I need to do this: https://wikitech.wikimedia.org/wiki/Debian_Packaging#Rebuilding_a_package ?
[20:09:29] XioNoX: that's about right, yeah
[20:10:26] XioNoX: if you download the source packages from unstable, it will probably compile for buster easily
[20:10:38] not much skew between the two at present :)
[20:12:38] thx! not worth doing for now but it's good to know
[20:13:52] if we end up using fastnetmon in the long run, there's also the option to build the latest version of it for buster-backports
[20:14:00] it'd then be available in a day or two
[21:13:08] https://puppetboard.wikimedia.org/nodes?status=failed is clean again. The last puppet run fixed it.
[21:23:40] nice!
[22:00:29] https://help.github.com/en/articles/navigating-code-on-github
[22:43:12] XioNoX: that is great