[04:37:08] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11797531 (RobH) I'll put a more detailed timeline and update tomorrow but as it stands now: * unisys engineer showed up at 10am singapore time * swapped mainboard, damaged the CPU bracket and mainb...
[07:21:15] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11797798 (matej_suchanek) >>! In T421642#11785461, @Xqt wrote: > The problems began on March 25th: > {F74901675} Please (re)attach the file, so that it's visible if i...
[07:37:39] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11797856 (ABran-WMF) Connection reuse has been [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268557...
[07:53:14] I've just merged the realserver parts for the dumps lvs services. when would be a good time for you to enable the load balancers? would sometime this afternoon work?
[07:54:02] taavi: yep, sounds good to me!
[07:54:29] excellent, thank you, will ping you later then
[08:14:12] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11797902 (Fabfur) >>! In T422030#11794907, @Vgutierrez wrote: > It looks like the root cause is [[ https://github.com/haproxy/haproxy/commit/0b7a5a64eb51ce4b22866...
[08:17:59] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11797907 (JAllemandou) Thanks for confirming the invalid-events change @Vgutierrez. There still is something I don't understand: * The pattern we see in v3.0 seem...
[08:26:40] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11797939 (Vgutierrez) Yes, sequence numbers are generated by haproxy itself, even if it results in a SSL handshake error where the sequence number doesn't reach ha...
[08:47:07] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11798013 (JAllemandou) >>! In T422030#11797939, @Vgutierrez wrote: > Yes, sequence numbers are generated by haproxy itself, even if it results in a SSL handshake e...
[09:12:24] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11798075 (Xqt) >>! In T421642#11797798, @matej_suchanek wrote: > > Please (re)attach the file, so that it's visible if it's important ([[ https://www.mediawiki.org/wi...
[09:37:52] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11798111 (Vgutierrez) I've replicated locally a SSL handshake failure using haproxy with `log-format-sd %{+E}o\ [haproxykafka@0\ %[capture.req.hdr(0),json(ascii)]...
[10:27:47] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11798247 (ArthurTaylor) Got this just now: https://integration.wikimedia.org/ci/job/quibble-with-gated-extensi...
[11:24:44] fabfur: I am ready whenever you are. patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268506
[12:03:03] taavi: right after lunch! :)
[12:11:40] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11798726 (ABran-WMF) >>! In T421827#11798247, @ArthurTaylor wrote: > Got this just now: https://integration.wi...
[12:36:38] taavi: you do the pybal restart ?
[12:38:08] fabfur: yeah, I can do it, the instructions just say to have someone from your team around
[12:38:13] disabling puppet on A:lvs-eqiad
[12:38:28] I'm around! :)
[12:40:06] merging the patch
[12:40:20] ack
[12:41:55] running puppet on the secondary (lvs1020)
[12:43:29] restarting pybal on lvs1020
[12:44:10] `ipvsadm -L -n` looks good
[12:44:19] are the services pooled ?
[12:44:43] I pooled the backends in conftool, yes
[12:44:49] ack!
[12:45:43] I can see the routes on the routers, and can curl the new service IP manually
[12:46:33] everything looks good, moving on to the active high-traffic2 lb (lvs1018)
[12:46:53] running puppet there
[12:48:10] hmm I see an alert for 'dumps-lb_873: Servers clouddumps1001.wikimedia.org are marked down but pooled'
[12:48:36] healthcheck?
[12:49:31] seems like the healthchecks are not making it past ferm?
[12:50:03] I added https://gerrit.wikimedia.org/g/operations/puppet/+/f597ab248d56612002e6fcab7ffba5099a81667f/modules/profile/manifests/dumps/distribution/rsync.pp#20 yesterday though
[12:52:21] aha, that comes from `haproxy_allowed_healthcheck_sources` which does not include the LVS interfaces in the public VLANs
[12:52:51] ok
[12:54:05] proposed hotpatch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268942
[12:59:31] updated fw rules have been deployed, now waiting for 1020 to see those as up
[12:59:38] Apr 08 12:59:21 lvs1020 pybal[2525934]: [dumps-lb_873] INFO: Server clouddumps1002.wikimedia.org (enabled/partially up/not pooled) is up
[13:00:27] 👍
[13:01:29] fabfur: I still see a 'Services known to PyBal but not to IPVS: set(['10.2.2.94:443'])' alert for 1020, with that IP being wdqs-internal-scholarly.svc.eqiad.wmnet, is that something I need to worry about before moving on?
[13:02:04] it shouldn't be related to your changes, isn't?
[13:02:43] indeed should not
[13:03:29] I'd say don't worry and proceed
[13:04:17] ack
[13:05:05] restarting pybal on 1018
[13:09:49] 1018 looks good, and all alerts for it cleared
[13:10:49] so next up is changing state: to production? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268507
[13:12:37] ^^ if I can help clean up the WDQS stuff LMK
[13:14:10] taavi: GO for me
[13:15:20] merging
[13:17:20] inflatador: it seems like the only wdqs-internal-scholarly host in eqiad is pooled=no
[13:19:44] fabfur: I think this is all done (unless I'm missing something?), thank you again
[13:20:01] if it works it works :)
[13:33:31] taavi thanks, I believe we forgot to repool after a recent reimage. Will correct
[13:42:33] inflatador: hmm, that check still hasn't recovered
[13:58:35] ^^ If anyone knows what the `Services known to PyBal but not to IPVS: set(['10.2.2.94:443'])'` alert means LMK. I'll check wikitech, but the LVS pool is new-ish so maybe we missed something when deploying it?
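(annotation) The alert text quoted above reads like a set difference: services PyBal has configured, minus the virtual services actually present in the kernel IPVS table. The sketch below illustrates that idea only. It is not the real check, whose implementation is not shown in this log. The function names and the sample `ipvsadm -L -n` output are made up for illustration; only 10.2.2.94:443 comes from the alert, and 192.0.2.x addresses are documentation placeholders.

```python
# Hypothetical sample of `ipvsadm -L -n` output (addresses are placeholders).
SAMPLE_IPVSADM = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:443 wrr
  -> 192.0.2.21:443               Route   1      0          0
  -> 192.0.2.22:443               Route   1      0          0
"""


def ipvs_services(ipvsadm_output):
    """Return the virtual services (ip:port) listed in ipvsadm text output."""
    services = set()
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Virtual service lines start with the protocol; real servers start with "->".
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            services.add(fields[1])
    return services


def missing_from_ipvs(pybal_services, ipvsadm_output):
    """Services configured in PyBal (assumed input) that IPVS does not list."""
    return set(pybal_services) - ipvs_services(ipvsadm_output)


# Assumed PyBal-side view: one service present in IPVS, one absent.
configured = {"192.0.2.10:443", "10.2.2.94:443"}
missing = missing_from_ipvs(configured, SAMPLE_IPVSADM)
print("Services known to PyBal but not to IPVS:", sorted(missing))
```

Under that reading, a non-empty difference means the service exists in PyBal's configuration but has no matching entry in the IPVS table on that load balancer.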
[14:03:10] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st): Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11799759 (JAllemandou) Summarizing here a talk we had on slack with @Vgutierrez and @Fabfur : * In v3.0 we were experiencing unexpected sequence-id increment. Thi...
[14:06:09] inflatador: in the past I've seen such alerts when all the backends were accidentally depooled only recover with a full pybal restart, unfortunately
[14:06:50] https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS [meeting]
[14:06:54] restart pybal
[14:19:48] sukhe ACK, I'll leave that up to your team unless you really want me in there ;P
[14:32:03] I will take it shortly in case someone else doesn't :>
[15:23:12] inflatador: has anything changed with wdqs-internal-scholarly today?
[15:27:00] sukhe no, we reimaged one of its hosts (wdqs1027) on Apr 1 (ref T421714 ) and I must've forgotten to repool it after that
[15:27:03] T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
[15:45:28] Traffic, Patch-For-Review: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402#11800304 (Fabfur)
[16:14:53] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11800433 (cmooney) Ok Juniper came back with the following: ` I found that your version 23.4R2-S7.4 is hitting the PR1933049. Unfortunately, this is a confidential PR, but in order to get thi...
[16:39:33] Traffic, Patch-For-Review: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402#11800568 (Fabfur)
[16:58:05] Traffic, Infrastructure-Foundations: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot - https://phabricator.wikimedia.org/T422261#11800634 (BCornwall) This highlights the larger problem of the opacity of cookbooks, particularly those that purport to be generalized. Removing dow...
[17:12:55] Traffic, Infrastructure-Foundations: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot - https://phabricator.wikimedia.org/T422261#11800754 (Volans) I don't think that's possible in Icinga due to Icinga "APIs", for the Alertmanager downtime it already removes only the downtime cr...