[00:02:11] (03PS4) 10BBlack: Add 'cdn' conftool service to all caches [puppet] - 10https://gerrit.wikimedia.org/r/863336 (https://phabricator.wikimedia.org/T324336) [00:04:34] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1012.eqiad.wmnet with reason: host reimage [00:07:40] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1012.eqiad.wmnet with reason: host reimage [00:11:02] (03CR) 10BBlack: [C: 03+2] Add 'cdn' conftool service to all caches [puppet] - 10https://gerrit.wikimedia.org/r/863336 (https://phabricator.wikimedia.org/T324336) (owner: 10BBlack) [00:12:29] !log bblack@cumin1001 conftool action : set/weight=1; selector: service=cdn [00:12:45] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: service=cdn [00:15:17] (03PS4) 10BBlack: Switch pybal + scripts to 'cdn' service [puppet] - 10https://gerrit.wikimedia.org/r/863337 (https://phabricator.wikimedia.org/T324336) [00:16:10] !log disabling puppet on all cp and lvs hosts for conftool key changes. Please coordinate if any lvs/pybal/cpNNNN depooling/work is needed during this transition! 
[00:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:10] (03CR) 10BBlack: [C: 03+2] Switch pybal + scripts to 'cdn' service [puppet] - 10https://gerrit.wikimedia.org/r/863337 (https://phabricator.wikimedia.org/T324336) (owner: 10BBlack) [00:29:28] !log lvs4010: restart pybal to test etcd key changes - T324336 [00:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:32] T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' - https://phabricator.wikimedia.org/T324336 [00:32:03] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:34] (03CR) 10BBlack: [C: 03+2] Switch roll-restart-varnish to 'cdn' service [cookbooks] - 10https://gerrit.wikimedia.org/r/863339 (https://phabricator.wikimedia.org/T324336) (owner: 10BBlack) [00:34:43] (03PS2) 10BBlack: Switch roll-restart-varnish to 'cdn' service [cookbooks] - 10https://gerrit.wikimedia.org/r/863339 (https://phabricator.wikimedia.org/T324336) [00:34:51] (03PS1) 10Ssingh: hiera: unify eqsin LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) [00:35:59] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38640/console" [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [00:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [00:45:08] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1012.eqiad.wmnet with OS bullseye [00:46:55] PROBLEM - Check systemd state on 
logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:06] !log lvsNNNN: restart pybal to apply etcd key changes on all "secondary" lvs at all sites - T324336 (5 hosts, ulsfo completed previously) [00:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:11] T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' - https://phabricator.wikimedia.org/T324336 [00:52:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:53:48] bblack: ^ expected, right? [00:54:25] I'm not sure, digging around [00:54:32] I'm not even sure what it is or means, to be honest [00:56:05] we have had these for a day or so now and I am guessing they are related to the prom file staleness that we were talking about earlier [00:56:16] https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:56:44] yeah ok [00:56:51] so not related to the current work. god.og was made aware since we (at least I for sure) don't know how to fix this [00:57:01] ahh, sorry for the misdirect :) [00:57:07] np, always good to check :) [00:57:07] np, thanks for checking! 
[01:00:41] !log lvsNNNN: restart pybal to apply etcd key changes on all "high-traffic2" lvs at all sites - T324336 [01:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:45] T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' - https://phabricator.wikimedia.org/T324336 [01:01:53] (03CR) 10Cwhite: [C: 03+1] prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [01:02:34] (03CR) 10Cwhite: [C: 03+1] netmon: Remove netmon2001 from the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [01:04:15] (03CR) 10Cwhite: [C: 03+1] netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [01:04:38] (03CR) 10Cwhite: [C: 03+1] netmon: Add the netmon2002 instance as a ganeti rapi node. 
[puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [01:04:50] (03CR) 10Cwhite: [C: 03+1] netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [01:05:24] !log lvsNNNN: restart pybal to apply etcd key changes on all "high-traffic1" lvs at all sites - T324336 [01:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:53] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:18] (ProbeDown) firing: (5) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:34] here [01:08:34] looking [01:08:35] PROBLEM - Host en.wikipedia.org is DOWN: CRITICAL - Destination Unreachable (en.wikipedia.org) [01:08:50] shout if I can help [01:09:02] here [01:09:05] (acked) [01:09:13] PROBLEM - Host policy.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (policy.wikimedia.org) [01:09:18] (ProbeDown) firing: (5) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:09:20] I don't think it's actually down? 
[01:09:29] PROBLEM - Host phab.wmfusercontent.org is DOWN: CRITICAL - Destination Unreachable (phab.wmfusercontent.org) [01:09:41] PROBLEM - Host en.planet.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (en.planet.wikimedia.org) [01:09:55] no, not down for me [01:10:28] activeconns in lvs looks sane in eqiad too, checking others [01:10:33] PROBLEM - Host debmonitor.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (debmonitor.wikimedia.org) [01:10:37] maybe I've caused a monitoring problem? [01:11:01] PROBLEM - Host en.wikibooks.org is DOWN: CRITICAL - Destination Unreachable (en.wikibooks.org) [01:11:01] PROBLEM - Host en.m.wikipedia.org is DOWN: CRITICAL - Destination Unreachable (en.m.wikipedia.org) [01:11:29] I can curl en.wikipedia.org from alert1001 but I can't ping it [01:11:49] ok [01:12:02] but not 100% sure if that was already true, due to firewall policy or anything [01:12:33] I'm confident the site's up though [01:13:01] yeah, seems to be monitoring but can't find the smoking gun [01:13:18] (ProbeDown) firing: (6) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:13:53] I didn't get a page for that second one, FYI (and I have the app open) [01:14:02] same here [01:15:08] they are icinga-level "host down" alerts, so I'm sure they must be pings. those non-paging spam ones above anyways. 
[01:15:20] ugh, if anyone gets confused by the message I just posted on the VictorOps timeline, it's because I was trying to search :) and there's no delete button [01:16:07] there was a huge diff in naggen-generated config, can see it in syslog, hmmmm [01:16:12] from a puppet run I assume [01:16:43] Dec 8 00:06:07 alert1001 puppet-agent[25916]: (/Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_hosts.cfg]/content) --- /etc/icinga/obj [01:16:45] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:16:46] ects/puppet_hosts.cfg#0112022-12-07 23:58:08.580150543 +0000 [01:16:46] [.....] [01:17:02] hmm ping -4 works [01:17:13] and the alerts are for IPv6 above [01:17:18] ok [01:17:22] oh, so it does [01:18:10] that doesn't even look right, the ipv6 [01:18:13] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:16] is the site up via ipv6? I'm not sure if I have v6 connectivity at home [01:18:39] let's check, I have via phone [01:20:03] works for me, https fetch of enwiki over ipv6 from external (Linode) host [01:20:21] 👍 [01:20:45] it hit codfw though, let me try manual IPs for the others [01:21:18] yep [01:22:25] eqiad ipv6 might be broken [01:23:31] yep, nothing for PING en.wikipedia.org(text-lb.eqiad.wikimedia.org (2620:0:861:ed1a::1)) 56 data bytes [01:24:49] no conns for it in ipvsadm either (ipv6 for text@eqiad) [01:25:33] upload@eqiad seems fine for ipv6 though... [01:27:05] I'm going to try a pybal restart on lvs1017 (but with a slower stop->start cycle by a couple seconds) in case it's some pybal one-off flakiness. 
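[editor's note] The "ping -4 works, but the alerts are for IPv6" split above is the kind of check that is easy to script per address family. A minimal Python sketch, assuming nothing about the tooling actually used in the channel: `split_by_family` and `reachable` are hypothetical helper names, and the `resolver` parameter exists only so the grouping logic can be exercised without network access.

```python
import socket


def split_by_family(host, port, resolver=socket.getaddrinfo):
    """Group the addresses resolved for host:port into IPv4 and IPv6 lists."""
    result = {"ipv4": [], "ipv6": []}
    for family, _type, _proto, _canon, sockaddr in resolver(
        host, port, type=socket.SOCK_STREAM
    ):
        if family == socket.AF_INET:
            result["ipv4"].append(sockaddr[0])
        elif family == socket.AF_INET6:
            result["ipv6"].append(sockaddr[0])
    return result


def reachable(address, port, timeout=3.0):
    """Return True if a TCP connection to a literal address succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `reachable()` on each address returned by `split_by_family("en.wikipedia.org", 443)` would show whether only one family is failing. It tests TCP rather than ICMP, so it is closer to the curl checks in the log than to the ping-based host alerts.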
[01:27:19] !log lvs1017: restart pybal, attempt to fix text-ipv6 service [01:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:29] RECOVERY - Host en.planet.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [01:27:34] ha [01:27:51] that recovery is from the temporary flip to the secondary lvs (1020), most likely [01:28:18] (ProbeDown) firing: (6) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:28:22] ok. so I did a diff between LVSes not on eqiad from Puppetboard. nothing stands out in the agent run [01:29:17] PROBLEM - Host en.planet.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (en.planet.wikimedia.org) [01:29:21] I'm gonna just stop pybal on lvs1017 for now (with puppet disabled), because the secondary lvs seemed to pick up the traffic just fine (briefly) [01:29:28] this will at least end the impact while we look around more [01:29:45] !log lvs1017 - disable puppet and stop pybal to fix ipv6 for now [01:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:21] RECOVERY - Host debmonitor.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms [01:30:27] RECOVERY - Host en.planet.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [01:30:43] eqiad ipv6 back for me [01:30:57] RECOVERY - Host policy.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [01:31:33] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:31:45] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:31:48] ^ this 
would be expected [01:31:58] [the BGP crX-eqiad alert, he means] [01:31:59] RECOVERY - Host en.wikipedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [01:32:03] yes, sorry, that [01:32:18] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:32:28] nothing makes intuitive sense about this, will need to dig some more [01:32:49] nothing in the pybal journal on lvs1017 [01:32:51] RECOVERY - Host phab.wmfusercontent.org is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [01:32:52] it's pretty weird indeed [01:33:18] (ProbeDown) resolved: (6) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:09] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [01:34:09] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:34:18] (ProbeDown) resolved: (5) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:27] RECOVERY - Host en.wikibooks.org is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [01:34:27] RECOVERY - Host en.m.wikipedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [01:35:05] it was the same process for the changes on both lvs1017 (high-traffic1 primary) and lvs1020 (secondary). lvs1020 was done first. 
[01:35:12] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [01:35:20] lvs1020 (and the other sites) was fine, but lvs1017... [01:35:36] you can see in the logs it's configuring ipv6 nodes, you can see in ipvsadm that it looks fine too. the IP is defined on the loopback [01:35:47] just no conns show up, like it failed to make the BGP advert for that IP to the routers [01:35:57] (03PS4) 10Ryan Kemper: wdqs data-reload.py: fix usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/865788 (owner: 10Bking) [01:36:13] but if it actually had failed to do so, the secondary would've gotten the traffic and we wouldn't have seen the failures [01:36:31] so it (lvs1017) clearly "stole" the traffic in the BGP sense, then failed to handle it [01:40:59] no failures from your cumin runs too I am assuming [01:41:00] ? [01:41:04] nope [01:41:39] right now I'm just double-checking that everything's working everywhere (all dcs, both ip versions, etc) in case monitoring is missing any other isolated case other than lvs1017 [01:41:45] (JobUnavailable) firing: (10) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:29] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:04] everything else looks sane - checked both text+upload, both ipv4+ipv6, at all 6 sites. they're all showing active lvs connections like normal. 
lvs1017 ipv6 text@eqiad is the only oddball case [01:47:18] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:47:42] nothing really stands out that makes lvs1017 a problem, really [01:48:44] FWIW it affected both :80 and :443 too [01:49:01] so it's not even like it's just one "service" or "port", it was both of them for that ipv6 [01:49:11] peeking at router config... [01:49:24] Dec 8 01:05:29 lvs1017 kernel: [21704132.613267] IPVS: sh: TCP 208.80.154.224:80 - no destination available [01:49:27] Dec 8 01:05:29 lvs1017 kernel: [21704132.614697] IPVS: sh: TCP 208.80.154.224:80 - no destination available [01:49:30] ouch [01:49:59] that's normalish [01:50:12] during the window pybal is down, some stray packets land there with nowhere to go [01:50:35] does the timing line up? I know you have things in UTC [01:50:57] yes [01:51:08] more properly, it's when pybal is coming back up [01:51:11] ok [01:51:18] it wipes out the ipvs state of those pools while reconfiguring them [01:51:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:53] it's hard to see what might be amiss with pybal down on 1017. so... while puppet remains disabled, I'm gonna manually edit its pybal.conf to set the MED to 101 (so it loses to the working secondary) and then fire it back up to poke around. 
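[editor's note] The plan above relies on BGP's MED (multi-exit discriminator): when two otherwise-equal adverts for the same prefix reach a router, the one with the lower MED wins. A minimal Python sketch of that single tie-break, assuming the secondary's MED is 100. That value is hypothetical; the log only implies it sits above 0 and below 101. Every other step of BGP best-path selection is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advert:
    speaker: str  # host announcing the service IP
    med: int      # BGP multi-exit discriminator; lower is preferred


def preferred(adverts):
    """Pick the advert a router would choose if MED were the only tie-break."""
    return min(adverts, key=lambda a: a.med)


# Hypothetical value: the log never states lvs1020's actual MED.
SECONDARY = Advert("lvs1020", 100)

# With MED 0 the primary takes the traffic; with MED 101 it loses to the secondary.
print(preferred([Advert("lvs1017", 0), SECONDARY]).speaker)
print(preferred([Advert("lvs1017", 101), SECONDARY]).speaker)
```

This is why starting pybal on lvs1017 with MED 101 lets it speak BGP for inspection without drawing production traffic away from lvs1020.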
[01:56:10] I guess that would've been better in log-form :) [01:56:43] !log lvs1017 - manually setting BGP MED to 101 and starting pybal (should come back and speak BGP, but not steal traffic from lvs1020) [01:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:39] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:57:41] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:58:21] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [02:05:16] [still digging into root cause here, it may take a while!] [02:05:54] !log sretest1001 - puppet disabled, manipulating routing on this host to conduct tests... [02:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:02] so, I added a manual ipv6 route on sretest1001: [02:15:26] 2620:0:861:ed1a::1 via 2620:0:861:107:e63d:1aff:fe7a:cbe1 dev eno1 metric 1024 pref medium [02:16:01] that "via" being the IPv6 address of lvs1017 on the correct vlan interface (which faces sretest1001's home vlan) [02:16:21] and I can "curl -6 https://en.wikipedia.org/" from there still, and it seems to be flowing through 
lvs1017 [02:17:00] root@lvs1017:~# tcpdump -vvvnpi ens1f1np1.1020 'dst port 443 and dst host 2620:0:861:ed1a::1' [02:17:11] 02:16:34.974710 IP6 (flowlabel 0xbba9c, hlim 64, next-header TCP (6) payload length: 40) 2620:0:861:107:10:64:48:138.53416 > 2620:0:861:ed1a::1.443: Flags [S], cksum 0x8351 (correct), seq 2487128472, win 43200, options [mss 1440,sackOK,TS val 365740256 ecr 0,nop,wscale 9], length 0 [02:17:17] [... many more packets...] [02:17:37] so I can see the packets traversing lvs1017 for this test case, successfully, and resulting in sretest1001 getting an http-level response [02:19:41] I can even see those test connections stacking up in lvs1017's "InActConn" afterwards, so it's really going through IPVS [02:19:47] ok. so should we try resetting the MED and seeing if it picks up traffic? :) [02:20:12] I can try again briefly, yeah. Although that would be super disappointing, as we'd still have no explanation [02:20:44] yeah [02:21:29] !log restarting pybal on lvs1017 manually again with bgp_med=0 (should take traffic, may or may not do so very usefully!) [02:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:36] routers show the active route towards lvs1017 as expected [02:22:57] PROBLEM - Host debmonitor.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (debmonitor.wikimedia.org) [02:22:58] v6 is now failing for me [02:23:00] I can still do my tests from sretest1001 with the manual route [02:23:01] there it is [02:23:08] but yeah, doesn't work via the real routers... 
[02:23:49] PROBLEM - Host en.planet.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (en.planet.wikimedia.org) [02:23:51] (deleted the manual route on sretest1001 to make it go via crX, then it times out and fails like everything else) [02:24:18] (ProbeDown) firing: (5) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:24:24] !log lvs1017 - restart pybal manually again, back on bgp_med=101 (traffic goes back to lvs1020) [02:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:33] back [02:24:47] RECOVERY - Host debmonitor.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [02:24:57] RECOVERY - Host en.planet.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms [02:25:18] (ProbeDown) firing: (3) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:25:27] so, all the sretest thing has done has narrowed down the scenario [02:26:00] ipvs is working on lvs1017 if you can get the packets there via manual routing. but they don't get there via crX-eqiad... [02:27:17] I'm gonna do one more check like that, hopefully fast enough to not cause as much spam, because I didn't have a sniffer running last time. [02:27:51] [done] [02:27:57] no packets [02:28:31] so, when lvs1017 is the active BGP route for text ipv6, the packets aren't even arriving at lvs1017 from the routers [02:28:44] yet this was all working fine earlier. this doesn't make any sense. 
[02:29:18] (ProbeDown) resolved: (6) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:30:18] (ProbeDown) resolved: (3) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:30] it can't be that pybal picked up any changes in the config restart so it's not that [02:31:39] even on the routers, nothing stands out between lvs1017 and lvs1020 [02:32:35] have you done logging/tracing on JunOS? we may see something there [02:32:44] not sure, just spitballing [02:32:49] yeah [02:32:59] the only change was the etcd key, which seems to be working fine in general [02:33:12] I suspect whatever this is, it was just lying in wait for the next lvs1017 pybal restart somehow [02:33:46] checking to see when it was last restarted before mine... [02:33:49] but if there was a config change, that would apply to other hosts too? 
and there is nothing in the per-host override as well [02:35:10] looks like November 3 [02:35:21] (last pybal restart) [02:35:52] syslog doesn't go back that far, but pybal.log does [02:39:40] * urandom settles in to a lengthy back-scroll [02:40:02] logstash has it [02:42:46] which was the date of the confd disk space thing [02:42:53] hmmm [02:43:39] anyways, that puts some kind of time boundary on - it was since Nov 3, since we didn't have this problem at that pybal restart [02:45:07] I guess it helps but if you look at what changed, it's still hard to pin on a particular event or commit [02:45:26] we haven't really done anything on the LVSes themselves in eqiad at least [02:45:52] that still doesn't explain why just lvs1017 and not some other high-traffic1 LVS [02:46:01] I'm thinking more like some router config thing [02:46:10] not that I have any smoking gun on anything, but it seems more-likely [02:46:32] there was the stuff lately in eqiad with VRFs and IPv6 TTL messages that was kind of mysterious [02:47:08] on lvs1017, there is stuff like (in syslog): [02:47:08] although if it was working fine before pybal restarted... I can't think of a type of routing config problem that would've let it work fine earlier, but break it just because pybal disconnected its BGP session and then started it up again. [02:47:13] > IPv6 header not found [02:47:19] I really can't seem to make anything of it [02:47:40] where? [02:47:55] sukhe@lvs1017:~$ sudo dmesg -e | grep -i ipv6 [02:48:01] but then that's been there for quite some time [02:48:05] oh yeah [02:48:29] that goes back to Nov 29 apparently, hmmm [02:49:08] I tried searching for it since it stood out [02:49:09] and Nov 29 is, coincidentally, the date of the IPv6 VRF-related homer commit I was mentioning above... 
[02:49:28] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/861896 [02:49:44] better context in phab: https://phabricator.wikimedia.org/T324033 [02:50:22] still [02:50:45] I could see maybe something like that causing BGP to be unable to reconnect (but letting the old BGP session stay alive until the next pybal restart) [02:51:12] but BGP is still connecting fine. it's hard to imagine if that was the time this started, that it wouldn't have affected new client IPv6 sessions immediately. [02:51:43] but then again, I should probably be careful to expand my imagination when thinking of potential juniper-related issues :P [02:52:49] I think in all the possible theories so far, this one fits the best [02:52:53] but this does explain why upload was fine? [02:53:06] does this [02:54:04] not really, no [02:54:20] but they are distinct in some subtle ways, aside from the separate IP addresses themselves [02:54:25] that's an oddity too [02:54:28] they live in different rows, on different VLANs, etc [02:54:49] (different native host vlans I mean, which is where BGP advertises from) [02:55:44] at this point, netopsen would probably be 10x (at least) more efficient than me at debugging this [02:56:00] but also, we're "stable" other than the lack of LVS redundancy in eqiad, so I don't want to wake anyone up [02:56:04] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) p:05Medium→03High [02:56:31] I'm inclined to just shift gears and make task and incident report, etc, and debug this with them in the morning [02:56:38] +1 from me on that [02:56:40] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) @Dzahn I agree, sorry for letting this one slip out of my todo list, I'll take care of it tomorrow [02:56:41] fwiw :) [02:56:57] do you want me to submit a patch for setting the MED so that we can enable Puppet? 
[02:57:01] or do you want it to be disabled? [02:57:10] yeah that's a good idea, thanks [02:57:12] on it [02:57:14] just set it to 101 for that host [02:57:17] cool [02:58:36] I'm sure I missed some other things we should've done earlier too, re: status updates, since it was technically a partial outage [02:59:48] (03PS1) 10Ssingh: hiera: temporary set bgp-med to 101 for lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/865823 [02:59:49] yeah we can do that tomorrow from the logs [03:00:31] technically this needs a task too. but a 10PM task is a bad idea so tomorrow :P [03:00:32] basic impact was 01:05 -> 01:29, loss of text ipv6 service for . [03:01:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38642/console" [puppet] - 10https://gerrit.wikimedia.org/r/865823 (owner: 10Ssingh) [03:01:19] and again around 01:21 -> 01:24 for a testing attempt [03:01:25] noted [03:01:30] err sorry [03:01:35] and again around 02:21 -> 02:24 for a testing attempt [03:02:24] bblack: https://gerrit.wikimedia.org/r/c/operations/puppet/+/865823 [03:02:41] (03CR) 10BBlack: [C: 03+2] hiera: temporary set bgp-med to 101 for lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/865823 (owner: 10Ssingh) [03:02:46] thanks! [03:02:59] we can update the yaml with the comment to the task tomorrow [03:04:51] agent ran with no changes, so we're all good there [03:05:00] cool [03:05:33] I'm gonna take a break and grab some dinner, and then I'll make at least some basic-level incident doc / task stuff later. 
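[editor's note] For the incident write-up, the two impact windows quoted above (01:05 -> 01:29, and the corrected 02:21 -> 02:24 testing window) can be totalled mechanically. A minimal Python sketch, assuming both windows fall on the same UTC day and using only the times stated in the log; `window_minutes` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def window_minutes(start, end):
    """Length in whole minutes of an HH:MM -> HH:MM window on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta / timedelta(minutes=1))


# Windows as stated in the channel (UTC).
windows = [("01:05", "01:29"), ("02:21", "02:24")]

for start, end in windows:
    print(f"{start} -> {end}: {window_minutes(start, end)} min")

# 24 min for the initial outage plus 3 min for the test.
print("total:", sum(window_minutes(s, e) for s, e in windows), "min")
```

The total comes to 27 minutes of degraded text IPv6 service, going by the channel's own estimates rather than measured probe data.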
[03:05:40] please do [03:05:43] (dinner) [03:05:55] we can do the doc tomorrow as well but up to you :P [03:06:21] I also still had one patch left pending on my etcd changes, which was to remove the old keys to reduce any confusion and/or ferret out any missed references to them somewhere I didn't know about by seeing what that affects, which was https://gerrit.wikimedia.org/r/c/operations/puppet/+/863338 [03:06:56] but I'm gonna leave that on hold for now. the state of that effort right now is that the conftool key 'cdn' is in effect, and the old ones 'ats-tls' and 'varnish-fe' still exist, but all the tooling isn't using them. [03:07:18] (all the tooling I know of, at least the critical stuff in the production path and anything I could find with git grep) [03:08:48] I technically can't prove that the etcd changes aren't directly causative, so it doesn't make sense to make it even harder if we end up having to revert back through it. [03:09:04] yeah fair enough [03:10:45] I'm betting on some subtle ipv6 routing issue that was lying in wait and was triggered into effect by restarting pybal's BGP sessions. But if I was good at betting, I'd be much richer, so take that with a grain of salt! 
:) [03:11:33] wouldn't we all be :P [03:11:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [03:12:32] ^ not us, this has been happening before [03:12:41] ack, yeah [03:12:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:13:35] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:36:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [03:37:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [04:01:28] (03PS1) 10Ladsgroup: Set externallinks migration to WRITE_BOTH in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865828 (https://phabricator.wikimedia.org/T321662) [04:34:53] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:36:03] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [04:45:51] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet 
is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:46:59] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:06:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:06:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:14:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:14:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:14:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:17:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:17:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:19:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:19:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2112.codfw.wmnet with reason: Maintenance [05:20:03] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] netmon: Add the netmon2002 instance as a ganeti rapi node. 
[puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [05:20:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P42546 and previous config saved to /var/cache/conftool/dbconfig/20221208-052036-ladsgroup.json [05:21:15] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] netmon: Remove netmon2001 from the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [05:22:22] (03PS2) 10Andrea Denisse: netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) [05:24:12] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38643/console" [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [05:25:18] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [05:26:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:26:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:26:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:26:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 
(T322618)', diff saved to https://phabricator.wikimedia.org/P42547 and previous config saved to /var/cache/conftool/dbconfig/20221208-052705-ladsgroup.json [05:27:09] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [05:29:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P42548 and previous config saved to /var/cache/conftool/dbconfig/20221208-052917-ladsgroup.json [05:31:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [05:31:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [05:31:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [05:32:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [05:32:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [05:32:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [05:32:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [05:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42549 and previous config saved to /var/cache/conftool/dbconfig/20221208-053236-ladsgroup.json [05:32:44] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [05:32:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet 
with reason: Maintenance [05:32:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P42550 and previous config saved to /var/cache/conftool/dbconfig/20221208-053253-ladsgroup.json [05:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42551 and previous config saved to /var/cache/conftool/dbconfig/20221208-053447-ladsgroup.json [05:35:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P42552 and previous config saved to /var/cache/conftool/dbconfig/20221208-053509-ladsgroup.json [05:35:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P42553 and previous config saved to /var/cache/conftool/dbconfig/20221208-053541-ladsgroup.json [05:44:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P42554 and previous config saved to /var/cache/conftool/dbconfig/20221208-054423-ladsgroup.json [05:49:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:49:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P42555 and previous config saved to /var/cache/conftool/dbconfig/20221208-054953-ladsgroup.json [05:50:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P42556 and previous config saved to 
/var/cache/conftool/dbconfig/20221208-055015-ladsgroup.json [05:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P42557 and previous config saved to /var/cache/conftool/dbconfig/20221208-055046-ladsgroup.json [05:59:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P42558 and previous config saved to /var/cache/conftool/dbconfig/20221208-055930-ladsgroup.json [06:05:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P42559 and previous config saved to /var/cache/conftool/dbconfig/20221208-060500-ladsgroup.json [06:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P42560 and previous config saved to /var/cache/conftool/dbconfig/20221208-060522-ladsgroup.json [06:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P42561 and previous config saved to /var/cache/conftool/dbconfig/20221208-060551-ladsgroup.json [06:09:13] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P42562 and previous config saved to /var/cache/conftool/dbconfig/20221208-061436-ladsgroup.json [06:14:41] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [06:20:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42563 and previous config saved to /var/cache/conftool/dbconfig/20221208-062006-ladsgroup.json [06:20:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [06:20:11] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [06:20:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [06:20:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42564 and previous config saved to /var/cache/conftool/dbconfig/20221208-062028-ladsgroup.json [06:20:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P42565 and previous config saved to /var/cache/conftool/dbconfig/20221208-062028-ladsgroup.json [06:20:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [06:20:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [06:20:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P42566 and previous config saved to /var/cache/conftool/dbconfig/20221208-062050-ladsgroup.json [06:21:46] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:23:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P42567 and previous config saved to /var/cache/conftool/dbconfig/20221208-062306-ladsgroup.json [06:36:03] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [06:38:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P42568 and previous config saved to /var/cache/conftool/dbconfig/20221208-063813-ladsgroup.json [06:39:41] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [06:45:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42569 and previous config saved to /var/cache/conftool/dbconfig/20221208-064541-ladsgroup.json [06:45:45] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [06:51:14] (03PS1) 10BBlack: Revert "hiera: temporary set bgp-med to 101 for lvs1017" [puppet] - 10https://gerrit.wikimedia.org/r/866145 [06:53:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P42570 and previous config saved to 
/var/cache/conftool/dbconfig/20221208-065319-ladsgroup.json [06:53:32] (03CR) 10BBlack: [C: 03+2] Revert "hiera: temporary set bgp-med to 101 for lvs1017" [puppet] - 10https://gerrit.wikimedia.org/r/866145 (owner: 10BBlack) [06:55:30] !log lvs1017: restarting pybal to take back text traffic (med reverted to normal, underlying problem w/ ipv6 addressed) [06:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 31800 [06:57:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 31800 [07:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T0700). [07:00:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P42571 and previous config saved to /var/cache/conftool/dbconfig/20221208-070048-ladsgroup.json [07:08:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P42572 and previous config saved to /var/cache/conftool/dbconfig/20221208-070825-ladsgroup.json [07:08:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [07:08:30] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [07:08:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [07:08:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P42573 and previous config saved to 
/var/cache/conftool/dbconfig/20221208-070847-ladsgroup.json [07:11:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P42574 and previous config saved to /var/cache/conftool/dbconfig/20221208-071104-ladsgroup.json [07:12:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:15:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P42575 and previous config saved to /var/cache/conftool/dbconfig/20221208-071554-ladsgroup.json [07:26:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P42576 and previous config saved to /var/cache/conftool/dbconfig/20221208-072611-ladsgroup.json [07:31:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42577 and previous config saved to /var/cache/conftool/dbconfig/20221208-073101-ladsgroup.json [07:31:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:31:05] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [07:31:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P42578 and previous config saved to /var/cache/conftool/dbconfig/20221208-073122-ladsgroup.json [07:41:18] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P42579 and previous config saved to /var/cache/conftool/dbconfig/20221208-074117-ladsgroup.json [07:56:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P42580 and previous config saved to /var/cache/conftool/dbconfig/20221208-075624-ladsgroup.json [07:56:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [07:56:29] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [07:56:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [07:56:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P42581 and previous config saved to /var/cache/conftool/dbconfig/20221208-075645-ladsgroup.json [07:57:59] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1108.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P42582 and previous config saved to /var/cache/conftool/dbconfig/20221208-075901-ladsgroup.json [08:00:05] Amir1, apergos, and jnuche: That opportune time is upon us again. Time for a UTC morning backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T0800). [08:00:14] morning! I bet you can predict what I'm about to say: no trainees signed up for the window and no patches scheduled for deployment either. have a great day, everybody! 
[08:11:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P42583 and previous config saved to /var/cache/conftool/dbconfig/20221208-081138-ladsgroup.json [08:11:43] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:14:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P42584 and previous config saved to /var/cache/conftool/dbconfig/20221208-081408-ladsgroup.json [08:18:11] (03PS10) 10Awight: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [08:18:26] (03CR) 10Awight: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/865654 (owner: 10Awight) [08:18:40] (03CR) 10Awight: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/865653 (https://phabricator.wikimedia.org/T323360) (owner: 10Awight) [08:24:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585) (owner: 10JHathaway) [08:26:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P42585 and previous config saved to /var/cache/conftool/dbconfig/20221208-082644-ladsgroup.json [08:29:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P42586 and previous config saved to /var/cache/conftool/dbconfig/20221208-082914-ladsgroup.json [08:30:48] (03PS1) 10Muehlenhoff: Make ganeti5005 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866255 (https://phabricator.wikimedia.org/T324610) [08:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:36:52] (03CR) 10Filippo Giunchedi: [C: 03+1] netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [08:37:24] (03PS5) 10Clément Goubert: P:mediawiki::php:monitoring: Retry opcache probe [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) [08:39:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!
See nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/865631 (https://phabricator.wikimedia.org/T324684) (owner: 10Cwhite) [08:39:20] (03PS6) 10Clément Goubert: P:mediawiki::php:monitoring: Retry opcache probe [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) [08:40:11] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti5005 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866255 (https://phabricator.wikimedia.org/T324610) (owner: 10Muehlenhoff) [08:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P42587 and previous config saved to /var/cache/conftool/dbconfig/20221208-084151-ladsgroup.json [08:44:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P42588 and previous config saved to /var/cache/conftool/dbconfig/20221208-084421-ladsgroup.json [08:44:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [08:44:25] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:44:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [08:44:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P42589 and previous config saved to /var/cache/conftool/dbconfig/20221208-084442-ladsgroup.json [08:46:25] good morning, I will have a few puppet patches to get merged today in order to bring up contint1002 (replacement for contint1001 which is faulty).
They should be straightforward but I am not sure whom I should bother [08:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P42590 and previous config saved to /var/cache/conftool/dbconfig/20221208-084659-ladsgroup.json [08:47:22] should I poke the clinic duty SRE jhathaway or should I poke claime who is on call and in service ops ;) [08:47:40] I can check em out [08:47:52] :] [08:47:53] DM me, I'll go get a coffee [08:48:04] {solved} thank you ! [08:53:38] (03PS1) 10Filippo Giunchedi: sre: exclude confd-reload-vcl from textfile staleness [alerts] - 10https://gerrit.wikimedia.org/r/866264 (https://phabricator.wikimedia.org/T314118) [08:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P42591 and previous config saved to /var/cache/conftool/dbconfig/20221208-085657-ladsgroup.json [08:57:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:57:03] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:57:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:57:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:57:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:57:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42592 and previous config saved to /var/cache/conftool/dbconfig/20221208-085724-ladsgroup.json [09:02:06] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P42593 and previous config saved to /var/cache/conftool/dbconfig/20221208-090205-ladsgroup.json [09:09:37] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38644/console" [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [09:11:15] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] contint: add contint1002 as a scap target [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [09:13:48] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) [09:14:05] hashar: merged and deployed on deploy1002 [09:14:54] (03CR) 10Clément Goubert: P:mediawiki::php:monitoring: Retry opcache probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [09:16:27] (03CR) 10Hashar: "For the context, Daniel has enabled the role on the server yesterday night and proposed to push this one as well.
I elected to do this cha" [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [09:17:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P42594 and previous config saved to /var/cache/conftool/dbconfig/20221208-091712-ladsgroup.json [09:17:35] !log hashar@deploy1002 Started deploy [integration/docroot@2e0d44b]: Warm up contint1002 and test php-fpm restart # T313832 [09:17:38] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [09:17:39] !log hashar@deploy1002 Finished deploy [integration/docroot@2e0d44b]: Warm up contint1002 and test php-fpm restart # T313832 (duration: 00m 03s) [09:20:09] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) [09:20:36] (03CR) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm) [09:23:19] 10SRE, 10Infrastructure-Foundations, 10vm-requests: CODFW: 1 VM requested for test of reimaging cookbook - https://phabricator.wikimedia.org/T324744 (10SLyngshede-WMF) [09:23:33] Hello we shall be proceeding with the varnishkafka certs renewal T323771. 
Disabling puppet on all cp hosts for a while, kindly let me know if there's any issue [09:23:34] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [09:23:51] 10SRE, 10Infrastructure-Foundations, 10vm-requests: CODFW: 1 VM requested for test of reimaging cookbook - https://phabricator.wikimedia.org/T324744 (10SLyngshede-WMF) p:05Triage→03Low a:03SLyngshede-WMF [09:24:44] steve_munene: ack, thanks for the heads up [09:24:53] !log hashar@deploy1002 Started deploy [zuul/deploy@4c6859c]: Install Zuul virtualenv on contint1002 # T313832 [09:24:56] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [09:25:00] !log hashar@deploy1002 Finished deploy [zuul/deploy@4c6859c]: Install Zuul virtualenv on contint1002 # T313832 (duration: 00m 07s) [09:29:58] (03CR) 10Clément Goubert: [C: 03+2] scripts/run_ci_locally.sh: Fix arm Mac docker platform warning [puppet] - 10https://gerrit.wikimedia.org/r/823122 (owner: 10Clément Goubert) [09:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P42595 and previous config saved to /var/cache/conftool/dbconfig/20221208-093218-ladsgroup.json [09:32:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [09:32:23] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:32:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [09:32:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:32:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P42596 and previous config saved to /var/cache/conftool/dbconfig/20221208-093255-ladsgroup.json [09:35:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P42597 and previous config saved to /var/cache/conftool/dbconfig/20221208-093511-ladsgroup.json [09:38:07] !log contint1001: manually stopped and masked zuul-merger. It is under maintenance mode in Icinga # T313832 [09:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:10] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [09:38:56] (03PS1) 10Muehlenhoff: Fix hiera config for ganeti5005 [puppet] - 10https://gerrit.wikimedia.org/r/866273 [09:43:23] !log contint1002: stopped puppet and manually started zuul-merger. I am monitoring it cause last time we have bring up a new one it had some issues here and there # T313832 [09:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:26] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [09:46:34] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host test-reimage2001.codfw.wmnet [09:46:36] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [09:48:51] (03CR) 10David Caro: [C: 04-1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [09:49:08] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM test-reimage2001.codfw.wmnet - slyngshede@cumin1001" [09:50:14] !log slyngshede@cumin1001 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM test-reimage2001.codfw.wmnet - slyngshede@cumin1001" [09:50:14] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:50:14] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache test-reimage2001.codfw.wmnet on all recursors [09:50:17] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) test-reimage2001.codfw.wmnet on all recursors [09:50:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P42598 and previous config saved to /var/cache/conftool/dbconfig/20221208-095017-ladsgroup.json [09:50:50] (03PS1) 10Hashar: contint: move zuul-merger from contint1001 to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/866277 (https://phabricator.wikimedia.org/T313832) [09:51:11] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/866277 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [09:52:02] there might be some CI failure occuring cause I have started a new zuul-merger on contint1002 [09:54:13] (03CR) 10Muehlenhoff: [C: 03+2] Fix hiera config for ganeti5005 [puppet] - 10https://gerrit.wikimedia.org/r/866273 (owner: 10Muehlenhoff) [09:55:14] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/output/866277/1496/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/866277 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [09:56:22] !log restarting varnishkafka-webrequest.service on host cp1075 T323771 [09:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:26] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [09:56:43] (03PS5) 10Cathal Mooney: Example strategy for marking DSCP with 
ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [09:57:16] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [09:57:29] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host test-reimage2001.codfw.wmnet [09:57:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42599 and previous config saved to /var/cache/conftool/dbconfig/20221208-095741-ladsgroup.json [09:57:44] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:58:00] (03PS6) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [09:58:31] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [09:59:02] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:08] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [09:59:56] (03CR) 10Clément Goubert: [C: 03+2] contint: move zuul-merger from contint1001 to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/866277 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [10:01:17] (03PS1) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/866278 [10:01:35] !log Deploying puppet enforcement of zuul-merger on contint1002 [10:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:48] (03PS7) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [10:02:20] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:03:01] (03PS2) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. [puppet] - 10https://gerrit.wikimedia.org/r/866278 [10:04:20] 10SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon https://tracker.debian.org/news/1396077/accepted-rclone-1601dfsg-1-source-into-unstable/ [10:04:22] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [10:05:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P42600 and previous config saved to /var/cache/conftool/dbconfig/20221208-100524-ladsgroup.json [10:05:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [10:05:53] (03PS8) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [10:06:42] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:10:39] (03PS9) 10Cathal Mooney: 
Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [10:11:11] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:11:14] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P42602 and previous config saved to /var/cache/conftool/dbconfig/20221208-101247-ladsgroup.json [10:16:08] (03PS10) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [10:16:42] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:18:52] !log contint1002: activated Icinga monitoring , all services are up and running # T313832 [10:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:56] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [10:19:04] (03CR) 10Muehlenhoff: dhcp: add test-reimage2001 Ganeti VM. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866278 (owner: 10Slyngshede) [10:20:28] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) [10:20:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P42603 and previous config saved to /var/cache/conftool/dbconfig/20221208-102030-ladsgroup.json [10:20:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [10:20:35] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:20:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [10:20:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42604 and previous config saved to /var/cache/conftool/dbconfig/20221208-102052-ladsgroup.json [10:21:07] Re enabling puppet on cp hosts T323771 [10:21:08] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [10:21:51] steve_munene: log it ;) [10:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:56] (03PS11) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [10:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42605 and previous config saved to /var/cache/conftool/dbconfig/20221208-102308-ladsgroup.json [10:23:23] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) 05Open→03Resolved a:03hashar contint1002 is now attached as a Jenkins agent and run... [10:23:27] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:23:32] jouncebot: nowandnext [10:23:33] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [10:23:33] In 0 hour(s) and 36 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1100) [10:24:20] (03PS3) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. [puppet] - 10https://gerrit.wikimedia.org/r/866278 [10:24:25] (03CR) 10Ladsgroup: [C: 03+2] Set externallinks migration to WRITE_BOTH in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865828 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [10:24:52] (03PS4) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/866278 [10:25:38] (03Merged) 10jenkins-bot: Set externallinks migration to WRITE_BOTH in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865828 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [10:25:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [10:26:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865828 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [10:26:42] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:865828|Set externallinks migration to WRITE_BOTH in testwiki (T321662)]] [10:26:45] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [10:27:14] (03CR) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866278 (owner: 10Slyngshede) [10:27:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P42606 and previous config saved to /var/cache/conftool/dbconfig/20221208-102754-ladsgroup.json [10:28:40] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:865828|Set externallinks migration to WRITE_BOTH in testwiki (T321662)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [10:29:19] (03PS1) 10Hashar: contint: remove references to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/866280 (https://phabricator.wikimedia.org/T324698) [10:35:43] !log batch restarting varnishkafka-eventlogging.service in batches of 3 30 seconds in between [10:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:00] !log ladsgroup@deploy1002 Finished scap: Backport for 
[[gerrit:865828|Set externallinks migration to WRITE_BOTH in testwiki (T321662)]] (duration: 09m 17s) [10:36:03] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [10:36:03] (03PS1) 10Stang: Revert "Revert "specieswiki: Install GeoData extension"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 [10:36:49] (03PS2) 10Stang: specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 (https://phabricator.wikimedia.org/T324348) [10:37:07] (03PS3) 10Stang: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 (https://phabricator.wikimedia.org/T324348) [10:37:13] (03CR) 10Thiemo Kreuz (WMDE): "Is it possible to make the commit message explain this a little better? Essentially, how did this became unused?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865654 (owner: 10Awight) [10:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P42608 and previous config saved to /var/cache/conftool/dbconfig/20221208-103815-ladsgroup.json [10:43:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42609 and previous config saved to /var/cache/conftool/dbconfig/20221208-104300-ladsgroup.json [10:43:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:43:04] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:43:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P42610 and previous config saved to /var/cache/conftool/dbconfig/20221208-104322-ladsgroup.json [10:43:57] !log batch restarting varnishkafka-eventlogging.service in batches of 3 30 seconds in between T323771 [10:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:00] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [10:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42611 and previous config saved to /var/cache/conftool/dbconfig/20221208-104432-ladsgroup.json [10:47:48] (03PS2) 10Effie Mouzeli: site: Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012) [10:50:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5005.eqsin.wmnet to cluster eqsin and group 1 [10:50:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5005.eqsin.wmnet to cluster eqsin and group 1 [10:51:39] (03CR) 10Ladsgroup: [C: 03+1] restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586 (owner: 10Hnowlan) [10:52:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/866278 (owner: 10Slyngshede) [10:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P42612 and previous config saved to /var/cache/conftool/dbconfig/20221208-105321-ladsgroup.json [10:53:28] (03CR) 10Hnowlan: [C: 03+2] restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586 (owner: 10Hnowlan) [10:54:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5005.eqsin.wmnet to cluster eqsin and group 1 [10:55:13] ~ [10:56:37] !log batch restarting varnishkafka-statsv.service in batches of 3 30 seconds in between T323771 [10:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:41] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [10:57:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5005.eqsin.wmnet to cluster eqsin and group 1 [10:57:59] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: exclude confd-reload-vcl from textfile staleness [alerts] - 10https://gerrit.wikimedia.org/r/866264 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [10:58:54] ^Goodby 48 crits :D [10:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:59:17] lol yeah [10:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P42613 and previous config saved to /var/cache/conftool/dbconfig/20221208-105938-ladsgroup.json [11:00:04] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1100). 
[11:02:06] (03PS3) 10Awight: Remove some unused LAMP config [deployment-charts] - 10https://gerrit.wikimedia.org/r/865654 [11:02:16] (03CR) 10Awight: Remove some unused LAMP config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865654 (owner: 10Awight) [11:02:40] (NodeTextfileStale) resolved: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:31] (03PS3) 10Effie Mouzeli: site: Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012) [11:04:42] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Remove some unused LAMP config [deployment-charts] - 10https://gerrit.wikimedia.org/r/865654 (owner: 10Awight) [11:08:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42614 and previous config saved to /var/cache/conftool/dbconfig/20221208-110828-ladsgroup.json [11:08:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:08:32] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:08:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:08:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42615 and previous config saved to /var/cache/conftool/dbconfig/20221208-110849-ladsgroup.json [11:08:59] (03PS7) 10Clément Goubert: P:mediawiki::php:monitoring: Retry opcache probe [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) [11:09:32] (03CR) 
10Clément Goubert: [V: 03+1] P:mediawiki::php:monitoring: Retry opcache probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [11:09:53] !log batch restarting varnishkafka-webrequest.service in batches of 3 30 seconds in between T323771 [11:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:56] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [11:10:15] (03CR) 10Muehlenhoff: [C: 03+2] prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:11:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42616 and previous config saved to /var/cache/conftool/dbconfig/20221208-111105-ladsgroup.json [11:11:21] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:mediawiki::php:monitoring: Retry opcache probe [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [11:12:28] (03CR) 10Effie Mouzeli: [C: 03+2] site: Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [11:12:58] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10Nikerabbit) >>! In T318209#8451675, @jhathaway wrote: > @Nikerabbit when does their contract expire, so I can document it in our user database? June 30, 2023. 
[11:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P42617 and previous config saved to /var/cache/conftool/dbconfig/20221208-111444-ladsgroup.json [11:19:00] (03PS2) 10KartikMistry: Update cxserver to 2022-12-06-121330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) [11:19:12] (03PS1) 10Effie Mouzeli: site: Productionise mc2039 too [puppet] - 10https://gerrit.wikimedia.org/r/866298 [11:20:16] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@73d1267]: Create dag generating weekly snapshot of HDFS usage - analytics_test [airflow-dags@73d1267] [11:20:17] (03CR) 10KartikMistry: Update cxserver to 2022-12-06-121330-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) (owner: 10KartikMistry) [11:20:26] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@73d1267]: Create dag generating weekly snapshot of HDFS usage - analytics_test [airflow-dags@73d1267] (duration: 00m 09s) [11:20:51] (03CR) 10Effie Mouzeli: [C: 03+2] site: Productionise mc2039 too [puppet] - 10https://gerrit.wikimedia.org/r/866298 (owner: 10Effie Mouzeli) [11:21:40] !log drain ganeti5002 for eventual decom T324610 [11:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:44] T324610: ganeti500[567] implementation tracking for serviceops - https://phabricator.wikimedia.org/T324610 [11:22:54] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@73d1267]: Create dag generating weekly snapshot of HDFS usage - analytics [airflow-dags@73d1267] [11:23:12] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@73d1267]: Create dag generating weekly snapshot of HDFS usage - analytics [airflow-dags@73d1267] (duration: 00m 18s) [11:26:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P42618 and previous config saved to /var/cache/conftool/dbconfig/20221208-112612-ladsgroup.json [11:29:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42619 and previous config saved to /var/cache/conftool/dbconfig/20221208-112951-ladsgroup.json [11:29:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:29:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:30:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:30:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42620 and previous config saved to /var/cache/conftool/dbconfig/20221208-113030-ladsgroup.json [11:32:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42621 and previous config saved to /var/cache/conftool/dbconfig/20221208-113240-ladsgroup.json [11:33:35] 10SRE, 10CX-cxserver, 10Language-Team (Language-2022-October-December), 10Patch-For-Review: cxserver: Update Flores/NLLB-200 MT secret in Production - https://phabricator.wikimedia.org/T324534 (10KartikMistry) [11:33:48] (03PS1) 10Muehlenhoff: miscweb: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866302 
(https://phabricator.wikimedia.org/T135991) [11:35:08] claime: Amir1 Please see: https://phabricator.wikimedia.org/T324534 (We can deploy it early next week and I need to deploy server once it is done, as December end is our switch deadline from the MT provider - Meta) [11:35:32] Let me know if any other information needed - on the task, would be nice. [11:36:33] I can help with that [11:36:54] Let's coordinate so I make sure I do it right [11:37:05] When does work for you? [11:37:13] can do the helmfile deployment if needed [11:39:43] (03PS1) 10Muehlenhoff: piwik: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866329 (https://phabricator.wikimedia.org/T135991) [11:41:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P42622 and previous config saved to /var/cache/conftool/dbconfig/20221208-114120-ladsgroup.json [11:42:00] (03CR) 10Clément Goubert: [C: 03+1] thumbor: move replicas to main values, use swift discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595 (owner: 10Hnowlan) [11:44:51] (03CR) 10Hnowlan: [C: 03+2] thumbor: move replicas to main values, use swift discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595 (owner: 10Hnowlan) [11:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P42623 and previous config saved to /var/cache/conftool/dbconfig/20221208-114748-ladsgroup.json [11:50:07] (03Merged) 10jenkins-bot: thumbor: move replicas to main values, use swift discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595 (owner: 10Hnowlan) [11:51:34] 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10Aklapper) @jhathaway: I am no authority; I can only point out that using self-created SUL 
accounts often creates problems for verification of further access... [11:51:40] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache Add mc2055 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866334 (https://phabricator.wikimedia.org/T293012) [11:52:32] (03PS1) 10Muehlenhoff: aphlict: Enable profile::auto_restarts::service for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/866335 (https://phabricator.wikimedia.org/T135991) [11:54:47] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2055 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866334 (https://phabricator.wikimedia.org/T293012) [11:56:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P42624 and previous config saved to /var/cache/conftool/dbconfig/20221208-115627-ladsgroup.json [11:56:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:56:31] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:56:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P42625 and previous config saved to /var/cache/conftool/dbconfig/20221208-115659-ladsgroup.json [11:59:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P42626 and previous config saved to /var/cache/conftool/dbconfig/20221208-115915-ladsgroup.json [12:00:08] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:55] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P42627 and previous config saved to /var/cache/conftool/dbconfig/20221208-120255-ladsgroup.json [12:02:57] (03PS1) 10Muehlenhoff: an-web: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866337 (https://phabricator.wikimedia.org/T135991) [12:03:12] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:44] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:07:54] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:08:04] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: ganeti500[567] implementation tracking for serviceops - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:08:42] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations: ganeti500[567] implementation tracking - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) [12:10:19] (03PS5) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. [puppet] - 10https://gerrit.wikimedia.org/r/866278 [12:10:44] (03CR) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866278 (owner: 10Slyngshede) [12:11:50] (03PS6) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. 
[puppet] - 10https://gerrit.wikimedia.org/r/866278 [12:12:13] (03PS7) 10Slyngshede: dhcp: add test-reimage2001 Ganeti VM. [puppet] - 10https://gerrit.wikimedia.org/r/866278 [12:12:21] (03CR) 10Muehlenhoff: [C: 03+1] dhcp: add test-reimage2001 Ganeti VM. [puppet] - 10https://gerrit.wikimedia.org/r/866278 (owner: 10Slyngshede) [12:13:28] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:12] (03PS1) 10Stang: frwikiversity: Set wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866339 (https://phabricator.wikimedia.org/T324277) [12:14:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P42628 and previous config saved to /var/cache/conftool/dbconfig/20221208-121422-ladsgroup.json [12:14:53] (03CR) 10Slyngshede: [C: 03+2] dhcp: add test-reimage2001 Ganeti VM. [puppet] - 10https://gerrit.wikimedia.org/r/866278 (owner: 10Slyngshede) [12:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42629 and previous config saved to /var/cache/conftool/dbconfig/20221208-121801-ladsgroup.json [12:18:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:18:06] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:18:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:18:22] Amir1: Thanks. If you've time now, it should be OK too else we can finalize on the task which time/window works for you next week. 
[12:18:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42630 and previous config saved to /var/cache/conftool/dbconfig/20221208-121823-ladsgroup.json [12:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42631 and previous config saved to /var/cache/conftool/dbconfig/20221208-122032-ladsgroup.json [12:22:36] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [12:25:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [12:29:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P42632 and previous config saved to /var/cache/conftool/dbconfig/20221208-122928-ladsgroup.json [12:34:36] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [12:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P42633 and previous config saved to /var/cache/conftool/dbconfig/20221208-123538-ladsgroup.json [12:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:37:32] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/864739 (owner: 10L10n-bot) [12:37:34] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/862864 (owner: 10L10n-bot) [12:37:36] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/866340 (owner: 10L10n-bot) [12:37:38] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/861837 (owner: 10L10n-bot) [12:37:40] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/866358 (owner: 10L10n-bot) [12:40:35] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/866358 (owner: 10L10n-bot) [12:43:52] (03PS1) 10Slyngshede: site.pp: role::test for test-reimage2001. [puppet] - 10https://gerrit.wikimedia.org/r/866359 (https://phabricator.wikimedia.org/T324744) [12:44:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P42634 and previous config saved to /var/cache/conftool/dbconfig/20221208-124435-ladsgroup.json [12:44:40] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:45:07] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/864739 (owner: 10L10n-bot) [12:45:14] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/862864 (owner: 10L10n-bot) [12:45:22] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/866340 (owner: 10L10n-bot) [12:45:32] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/861837 (owner: 10L10n-bot) [12:49:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti5002.eqsin.wmnet with reason: Remove for eventual decom [12:49:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti5002.eqsin.wmnet with reason: Remove for eventual decom [12:50:44] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P42635 and previous config saved to /var/cache/conftool/dbconfig/20221208-125045-ladsgroup.json [12:51:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/866359 (https://phabricator.wikimedia.org/T324744) (owner: 10Slyngshede) [12:52:24] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:44] (03CR) 10Slyngshede: [C: 03+2] site.pp: role::test for test-reimage2001. 
[puppet] - 10https://gerrit.wikimedia.org/r/866359 (https://phabricator.wikimedia.org/T324744) (owner: 10Slyngshede) [13:00:42] (03CR) 10Muehlenhoff: update role_contacts for thanos (front|back)end (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron) [13:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42637 and previous config saved to /var/cache/conftool/dbconfig/20221208-130551-ladsgroup.json [13:05:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:05:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:06:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:06:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P42638 and previous config saved to /var/cache/conftool/dbconfig/20221208-130612-ladsgroup.json [13:07:51] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [13:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P42639 and previous config saved to /var/cache/conftool/dbconfig/20221208-130822-ladsgroup.json [13:09:36] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [13:10:22] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:36] (03CR) 10Clément Goubert: [C: 03+2] contint: remove references to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/866280 (https://phabricator.wikimedia.org/T324698) (owner: 10Hashar) 
[13:14:01] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [13:15:23] (03CR) 10JMeybohm: [C: 03+1] knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [13:17:34] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:09] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@455d142]: Hotfix on HDFS usage (Unicode in comment) - analytics_test [airflow-dags@455d142] [13:19:18] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@455d142]: Hotfix on HDFS usage (Unicode in comment) - analytics_test [airflow-dags@455d142] (duration: 00m 09s) [13:20:05] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@455d142]: Hotfix on HDFS usage (Remove the specific unicode char in comment) - analytics [airflow-dags@455d142] [13:20:21] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@455d142]: Hotfix on HDFS usage (Remove the specific unicode char in comment) - analytics [airflow-dags@455d142] (duration: 00m 15s) [13:22:30] 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10Papaul) @Aklapper thanks for the clarification [13:23:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P42640 and previous config saved to /var/cache/conftool/dbconfig/20221208-132329-ladsgroup.json [13:26:44] (03CR) 10Clément Goubert: [C: 03+2] P:docker::builder: Add otelcol-contrib uid mapping [puppet] - 10https://gerrit.wikimedia.org/r/865623 (owner: 10Clément Goubert) 
[13:28:24] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:48] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Add a new production image for otelcol [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [13:29:52] (03CR) 10JMeybohm: [C: 04-1] "Not sure how you feel about it but for other upstream charts bringing in a bunch of CRDs I did split those crds.yaml files into one file p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [13:31:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:13] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T323339 (10Papaul) 05Open→03Resolved a:03Papaul This was already resolved on https://phabricator.wikimedia.org/T321254 [13:38:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P42641 and previous config saved to /var/cache/conftool/dbconfig/20221208-133835-ladsgroup.json [13:43:26] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [13:43:34] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [13:43:58] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T324752 (10phaultfinder) [13:48:04] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:49:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:36] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:45] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2055 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866334 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [13:53:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P42642 and previous config saved to /var/cache/conftool/dbconfig/20221208-135341-ladsgroup.json [13:53:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:53:46] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:53:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [13:54:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42643 and previous config saved to /var/cache/conftool/dbconfig/20221208-135402-ladsgroup.json [13:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42644 and previous config saved to /var/cache/conftool/dbconfig/20221208-135611-ladsgroup.json [13:57:04] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for SDelbecque - https://phabricator.wikimedia.org/T324753 (10SDelbecque-WMF) [13:58:55] (03PS1) 10Jgiannelos: chromium-render: Bump to 
latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/866373 [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1400). [14:00:05] cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:35] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/865583 (owner: 10KarlBeecken) [14:00:37] o/ [14:00:44] I’d prefer if someone else could deploy, I haven’t had lunch yet 😅 [14:01:24] (03CR) 10Jgiannelos: [C: 03+2] chromium-render: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/866373 (owner: 10Jgiannelos) [14:02:02] (03CR) 10Stang: "Please run this command before merge this patch:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [14:05:25] (03Merged) 10jenkins-bot: mobileapps: bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/865583 (owner: 10KarlBeecken) [14:06:09] (03Merged) 10jenkins-bot: chromium-render: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/866373 (owner: 10Jgiannelos) [14:07:36] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [14:08:55] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:08:56] (03CR) 10Herron: [C: 03+2] update role_contacts for thanos (front|back)end (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron) [14:09:03] !log jgiannelos@deploy1002 helmfile [codfw] START 
helmfile.d/services/proton: apply [14:10:58] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P42645 and previous config saved to /var/cache/conftool/dbconfig/20221208-141118-ladsgroup.json [14:13:11] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:15:08] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:45] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) [14:16:50] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:17:26] anyone else around who can deploy cirno’s changes? (GeoData on specieswiki, and $wgRestrictDisplayTitle on frwikiversity) [14:17:28] * Lucas_WMDE about to go afk [14:18:43] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:18:55] unfortunately there's no one around available to deploy...? 
[14:19:13] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:19:19] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:20:08] maybe I can deploy later if you’re still around [14:20:25] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:20:31] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:21:07] (03CR) 10Muehlenhoff: [C: 03+1] "PCC looks fine: https://puppet-compiler.wmflabs.org/output/865731/38647/" [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [14:21:16] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:29] that's ok, I could wait at here [14:24:13] (03PS1) 10Effie Mouzeli: hieradata: enable maps replication and tile_generation timers [puppet] - 10https://gerrit.wikimedia.org/r/866379 (https://phabricator.wikimedia.org/T314472) [14:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P42646 and previous config saved to /var/cache/conftool/dbconfig/20221208-142625-ladsgroup.json [14:29:38] (03PS6) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) [14:30:36] (03CR) 10Jgiannelos: [C: 03+1] hieradata: enable maps replication and tile_generation timers [puppet] - 10https://gerrit.wikimedia.org/r/866379 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) 
[14:32:00] (03CR) 10Jgiannelos: tegola-vector-tiles: use new tegola swift container in eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [14:38:04] (03PS7) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) [14:38:32] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (035 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [14:40:07] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:40:36] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:41:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42647 and previous config saved to /var/cache/conftool/dbconfig/20221208-144131-ladsgroup.json [14:41:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:41:35] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:41:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P42648 and previous config saved to /var/cache/conftool/dbconfig/20221208-144152-ladsgroup.json [14:46:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P42649 and previous config saved to 
/var/cache/conftool/dbconfig/20221208-144602-ladsgroup.json [14:47:23] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:47:41] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:47:42] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:47:54] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:47:55] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:48:23] (03PS1) 10Muehlenhoff: Extend access for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/866384 [14:48:57] (03PS2) 10Effie Mouzeli: tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) [14:49:55] (03CR) 10Muehlenhoff: update role_contacts for thanos (front|back)end (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron) [14:50:07] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/866384 (owner: 10Muehlenhoff) [14:50:52] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:50:53] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:51:31] (03CR) 10Effie Mouzeli: tegola-vector-tiles: use new tegola swift container in eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [14:52:07] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:52:08] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:52:13] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:52:14] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:52:22] !log jiji@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply [14:53:28] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:53:44] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:53:45] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:54:13] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:56:21] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [14:58:21] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) [14:59:40] !log Restarting Gerrit replica TWICE on gerrit2002.wikimedia.org to apply `-Dh2.maxCompactTime` and get it to trigger compaction # T323754 [14:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:44] T323754: Investigate Gerrit h2 cache being way too large - https://phabricator.wikimedia.org/T323754 [15:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P42650 and previous config saved to /var/cache/conftool/dbconfig/20221208-150109-ladsgroup.json [15:02:07] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2054 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866387 (https://phabricator.wikimedia.org/T293012) [15:03:35] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2054 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866387 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [15:05:16] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:05:20] !log jiji@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: apply [15:05:21] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:05:22] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev2001.codfw.wmnet [15:05:25] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:05:26] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:05:30] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:05:31] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:05:46] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:05:47] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:07:15] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:07:16] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:07:20] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [15:08:09] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:08:21] !log jiji@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [15:08:22] !log jiji@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [15:08:31] !log jiji@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [15:08:32] !log jiji@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:08:36] !log jiji@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [15:08:36] !log jiji@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [15:08:40] !log jiji@deploy1002 helmfile [eqiad] [canary] DONE 
helmfile.d/services/mw-jobrunner : sync [15:08:40] !log jiji@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:08:41] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:08:43] Power outage [15:08:56] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:08:57] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:08:58] On UPS so should be ok [15:09:11] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:09:43] Yep, it's back, just long enough that I have to reset all the damn appliance clocks [15:10:12] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [15:10:26] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:33] i am restarting Gerrit to apply a JVM property change [15:11:01] oh hashar [15:11:08] can you tell us when it is ok ? [15:11:21] yeah it is back up in a minute usually [15:11:45] restarting it once more [15:11:45] cool [15:12:11] it is restarting [15:12:37] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti5002.eqsin.wmnet [15:12:44] !log Restarted Gerrit TWICE on gerrit1001.wikimedia.org to apply `-Dh2.maxCompactTime` and get it to trigger compaction # T323754 [15:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] T323754: Investigate Gerrit h2 cache being way too large - https://phabricator.wikimedia.org/T323754 [15:12:48] effie: Gerrit is back [15:12:57] \m/! 
[15:13:22] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [15:15:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [15:15:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev2001.codfw.wmnet [15:15:34] (03PS4) 10Effie Mouzeli: tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) [15:15:41] (03PS1) 10Muehlenhoff: Update ganeti references for Blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/866426 [15:16:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P42653 and previous config saved to /var/cache/conftool/dbconfig/20221208-151616-ladsgroup.json [15:17:35] (03CR) 10Muehlenhoff: [C: 03+2] Update ganeti references for Blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/866426 (owner: 10Muehlenhoff) [15:18:04] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [15:19:53] (03CR) 10Effie Mouzeli: [V: 03+2] tegola-vector-tiles: use new tegola swift container in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/866377 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [15:21:25] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:21:32] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [15:22:20] claime: you might remember Gerrit running out of disk space a couple weeks ago. I found a way to get it to garbage collect some oversized caches. They went from 12G/8.2G down to just 500 M each https://phabricator.wikimedia.org/T323754#8454319 ;) [15:22:29] moar disk space [15:22:34] Nice job [15:22:58] and I have learned a few things about the H2 Database [15:24:37] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:25:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2002.codfw.wmnet with OS bullseye [15:25:32] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2002.codfw.wmnet with OS bullseye [15:25:56] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:26:57] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [15:27:23] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [15:27:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:27:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti5002.eqsin.wmnet [15:28:00] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti5002.eqsin.wmnet` - ganeti5002.eqsin.wmnet (**WARN**) - Downti... 
[15:28:54] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:10] (03PS1) 10Muehlenhoff: Remove remaining Puppet references to ganeti5002 (decommed) [puppet] - 10https://gerrit.wikimedia.org/r/866428 [15:29:58] (03PS2) 10Effie Mouzeli: hieradata: enable maps replication and tile_generation timers [puppet] - 10https://gerrit.wikimedia.org/r/866379 (https://phabricator.wikimedia.org/T314472) [15:31:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P42654 and previous config saved to /var/cache/conftool/dbconfig/20221208-153123-ladsgroup.json [15:31:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:31:27] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:31:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:32:35] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable maps replication and tile_generation timers [puppet] - 10https://gerrit.wikimedia.org/r/866379 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [15:35:21] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:37:00] (03PS2) 10Cwhite: logstash: move alertmanager severity field to labels.alert_severity [puppet] - 10https://gerrit.wikimedia.org/r/865631 (https://phabricator.wikimedia.org/T324684) [15:37:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove remaining Puppet references to ganeti5002 (decommed) [puppet] - 10https://gerrit.wikimedia.org/r/866428 (owner: 10Muehlenhoff) [15:37:32] (03CR) 10Cwhite: logstash: move alertmanager severity field to labels.alert_severity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865631 (https://phabricator.wikimedia.org/T324684) (owner: 10Cwhite) [15:39:49] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:18] godog: something to do on thanos-fe1001 issues ? 
[15:40:33] (03PS1) 10Stang: extwiki: Add new logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866432 (https://phabricator.wikimedia.org/T318766) [15:42:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2002.codfw.wmnet with reason: host reimage [15:42:55] (03CR) 10Bking: [C: 03+2] wdqs data-reload.py: fix usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/865788 (owner: 10Bking) [15:45:24] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_zuul-merger.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:30] (03PS1) 10Bking: Revert "wdqs data-reload.py: fix usage comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/866466 [15:45:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2002.codfw.wmnet with reason: host reimage [15:45:43] hoy icinga, contint is downtimed what are you doing [15:46:25] (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [15:47:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) 05Resolved→03Open Thank you @Marostegui ! I now have access to datahub. Hooray! Now, I do not have access to idp.wikimendia.org. Pretty su... 
[15:47:49] (03CR) 10Bking: [C: 03+2] Revert "wdqs data-reload.py: fix usage comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/866466 (owner: 10Bking) [15:48:03] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 365 days, 0:00:00 on contint1001.wikimedia.org with reason: awaiting decom [15:48:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 365 days, 0:00:00 on contint1001.wikimedia.org with reason: awaiting decom [15:50:15] hashar: FYI ^ [15:50:49] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [15:55:06] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:30] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:02:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2002.codfw.wmnet with OS bullseye [16:02:12] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2002.codfw.wmnet with OS bullseye completed: - thanos-be2002 (**PASS**) - Downtimed on Icinga/Alertmanager... 
[16:08:57] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [16:08:58] !log eevans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [16:10:02] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [16:12:22] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename restbase-dev2001 to cassandra-dev2001 - eevans@cumin1001" [16:12:29] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for SDelbecque - https://phabricator.wikimedia.org/T324753 (10RBrounley_WMF) Approved. (if needed) [16:13:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename restbase-dev2001 to cassandra-dev2001 - eevans@cumin1001" [16:13:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:11] !log eevans@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2001 [16:14:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2001 [16:16:30] PROBLEM - swift eqiad container availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe1001 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [16:18:11] (03CR) 10JHathaway: [C: 03+2] Add Jennifer Hancock to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585) (owner: 10JHathaway) [16:18:18] RECOVERY - swift eqiad container availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [16:18:20] (03PS3) 10JHathaway: Add Jennifer Hancock to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585) [16:18:25] (03CR) 10JHathaway: [V: 03+2] Add Jennifer Hancock to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585) (owner: 10JHathaway) [16:18:38] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38649/console" [puppet] - 10https://gerrit.wikimedia.org/r/865799 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [16:18:50] (03PS4) 10BBlack: Remove legacy varnish-fe + ats-tls conftool keys [puppet] - 10https://gerrit.wikimedia.org/r/863338 (https://phabricator.wikimedia.org/T324336) [16:21:30] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/863338/38650/cp2042.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/863338 (https://phabricator.wikimedia.org/T324336) (owner: 10BBlack) [16:22:44] (03CR) 10Ssingh: [C: 03+1] "Looks good! Checked PCC for at least the cp-hosts changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/863338 (https://phabricator.wikimedia.org/T324336) (owner: 10BBlack) [16:24:15] (03PS1) 10Eevans: Rename codfw restbase-dev nodes to cassandra-dev [puppet] - 10https://gerrit.wikimedia.org/r/866438 (https://phabricator.wikimedia.org/T324113) [16:24:49] claime: well done :-] [16:27:00] (03CR) 10BBlack: [C: 03+2] Remove legacy varnish-fe + ats-tls conftool keys [puppet] - 10https://gerrit.wikimedia.org/r/863338 (https://phabricator.wikimedia.org/T324336) (owner: 10BBlack) [16:33:50] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [16:34:10] 10SRE, 10Traffic: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' - https://phabricator.wikimedia.org/T324336 (10BBlack) 05Open→03Resolved a:03BBlack This is completed now. AFAIK all relevant scripts/automations/etc were updated to match. The conftool `service` keys f... [16:35:30] (03PS1) 10Ssingh: install_server: remove obsolete cp hosts partman config [puppet] - 10https://gerrit.wikimedia.org/r/866440 (https://phabricator.wikimedia.org/T323830) [16:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:38:37] (03CR) 10BBlack: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/866440 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [16:39:27] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866438 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [16:39:42] (03PS1) 10Ssingh: site.pp: update LVS hosts in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/866441 (https://phabricator.wikimedia.org/T317247) [16:41:00] (03PS1) 10Jgiannelos: maps: Use new swift container for eqiad pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/866442 
(https://phabricator.wikimedia.org/T314472) [16:42:27] (03CR) 10Ssingh: [C: 03+2] install_server: remove obsolete cp hosts partman config [puppet] - 10https://gerrit.wikimedia.org/r/866440 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [16:45:05] (03CR) 10Eevans: Rename codfw restbase-dev nodes to cassandra-dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866438 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [16:51:27] (03CR) 10David Caro: [C: 03+1] "This works for me! \o/" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [16:54:53] (03PS7) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [16:56:27] (03PS1) 10JMeybohm: k8s: Remove authz_mode hiera key [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) [16:57:04] (03PS2) 10JMeybohm: k8s: Remove authz_mode hiera key [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) [16:58:31] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38652/console" [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:58:53] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38651/console" [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. 
[17:00:56] (03PS4) 10JMeybohm: pki: Allow to override the default expiry per intermediate [puppet] - 10https://gerrit.wikimedia.org/r/865075 [17:00:58] (03PS4) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591 [17:01:00] (03PS8) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [17:01:02] (03PS3) 10JMeybohm: k8s: Remove authz_mode hiera key [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) [17:03:24] (03CR) 10JMeybohm: kubeadm: Declare /etc/kubernetes directory resource directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [17:08:36] (03CR) 10David Caro: [C: 03+1] kubeadm: Declare /etc/kubernetes directory resource directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [17:09:19] (03CR) 10David Caro: [C: 03+1] kubeadm: Declare /etc/kubernetes directory resource directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [17:10:30] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:11:21] (03PS1) 10Hnowlan: conftool: add kubernetes nodes as thumbor nodes [puppet] - 10https://gerrit.wikimedia.org/r/866445 (https://phabricator.wikimedia.org/T233196) [17:12:20] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:13:19] (03PS2) 10Hnowlan: conftool: add kubernetes nodes as thumbor nodes [puppet] - 10https://gerrit.wikimedia.org/r/866445 (https://phabricator.wikimedia.org/T233196) [17:14:12] (03CR) 10JMeybohm: kubeadm: Declare /etc/kubernetes directory resource directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [17:19:27] 10SRE, 10Traffic-Icebox: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10BCornwall) Considering there's an ongoing effort to upgrade traffic hosts to Buster (T321309), is this necessary any more? I do see a [[ https://debmonitor.wikimedia.org/packag... [17:26:25] 10SRE, 10Traffic-Icebox: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10CDanis) No, the only thing remaining was just making sure there were no surprise incompatibilities or other issues. With any upgrade to Buster or later, this should indeed be... 
[17:31:14] 10SRE, 10Traffic-Icebox: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10BCornwall) [17:31:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:43:14] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:45:04] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:50:30] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:54:10] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:58:50] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [18:00:04] bd808: Your horoscope predicts another unfortunate Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1800). [18:06:13] * bd808 should have something to deploy and thus looks around to remember what [18:06:27] (03PS1) 10Tsevener: Add event stream config for ios.talk_page_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866489 (https://phabricator.wikimedia.org/T324340) [18:11:24] (03PS3) 10Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) [18:11:26] (03PS8) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [18:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:36:02] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:41:40] (03PS10) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [18:45:50] (03CR) 10Cwhite: [C: 03+2] logstash: move alertmanager severity field to labels.alert_severity [puppet] - 10https://gerrit.wikimedia.org/r/865631 (https://phabricator.wikimedia.org/T324684) (owner: 10Cwhite) [18:46:54] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:48:42] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:57:55] (03PS1) 10Papaul: Change sretest2002 partman to test nmve [puppet] - 10https://gerrit.wikimedia.org/r/866492 (https://phabricator.wikimedia.org/T322578) [19:00:05] ^demon and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1900). [19:00:29] 10SRE, 10Traffic-Icebox, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10BCornwall) @Vgutierrez Since there is a project to replace LVS in the horizon, is this still worth pursuing? [19:08:19] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [19:08:49] jouncebot: nowandnext [19:08:49] For the next 1 hour(s) and 51 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T1900) [19:08:50] In 1 hour(s) and 51 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T2100) [19:25:48] (03PS1) 10Andrew Bogott: trove-guestagent.conf.erb: catch guest agents up with rabbitmq refactors [puppet] - 10https://gerrit.wikimedia.org/r/866494 [19:29:15] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [19:39:58] (03CR) 10Bking: [C: 03+2] Add 
extra-analysis-ukrainian and bump extra plugins to 7.10.2-wmf4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/859064 (https://phabricator.wikimedia.org/T322776) (owner: 10DCausse) [19:45:05] (03PS6) 10Southparkfan: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [19:45:07] (03CR) 10Ssingh: [C: 03+1] "Looks good to me, specifically the late_command.sh part! I will let Brandon confirm if the same cacheproxy config should be used, or somet" [puppet] - 10https://gerrit.wikimedia.org/r/866492 (https://phabricator.wikimedia.org/T322578) (owner: 10Papaul) [19:45:52] (03PS1) 10Andrea Denisse: librenms: Lower the TTL for LibreNMS [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) [19:46:10] (03CR) 10Papaul: [C: 03+2] Change sretest2002 partman to test nmve [puppet] - 10https://gerrit.wikimedia.org/r/866492 (https://phabricator.wikimedia.org/T322578) (owner: 10Papaul) [19:48:53] (03PS2) 10Andrea Denisse: netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) [19:53:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [19:53:46] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [19:55:31] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [19:55:35] (03CR) 10Ryan Kemper: "Built here: https://apt.wikimedia.org/wikimedia/dists/bullseye-wikimedia/thirdparty/elastic710/" 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/859064 (https://phabricator.wikimedia.org/T322776) (owner: 10DCausse) [19:59:12] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [19:59:16] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [19:59:53] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:02:08] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [20:02:08] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:03:44] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866498 (https://phabricator.wikimedia.org/T320518) [20:03:46] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866498 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot) [20:04:27] (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866498 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot) [20:04:49] (03CR) 10Andrew Bogott: [C: 03+2] 
trove-guestagent.conf.erb: catch guest agents up with rabbitmq refactors [puppet] - 10https://gerrit.wikimedia.org/r/866494 (owner: 10Andrew Bogott) [20:05:39] (03PS1) 10Ssingh: certspotter: temporarily disable certspotter (and the systemd timer) [puppet] - 10https://gerrit.wikimedia.org/r/866499 (https://phabricator.wikimedia.org/T318911) [20:06:38] (03PS1) 10Ryan Kemper: elastic: no longer need es6-specific stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/866500 [20:07:17] (03CR) 10Gehel: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/866500 (owner: 10Ryan Kemper) [20:07:32] (03CR) 10Ryan Kemper: [V: 03+2] elastic: no longer need es6-specific stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/866500 (owner: 10Ryan Kemper) [20:07:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38653/console" [puppet] - 10https://gerrit.wikimedia.org/r/866499 (https://phabricator.wikimedia.org/T318911) (owner: 10Ssingh) [20:08:33] (03CR) 10Bking: [C: 03+2] add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:08:39] (03CR) 10Bking: [V: 03+2 C: 03+2] add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:09:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: temporarily disable certspotter (and the systemd timer) [puppet] - 10https://gerrit.wikimedia.org/r/866499 (https://phabricator.wikimedia.org/T318911) (owner: 10Ssingh) [20:09:13] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:09:17] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 
[20:12:26] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.13 refs T320518 [20:12:29] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [20:13:06] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 159 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 159, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [20:13:06] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:16:46] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866501 (https://phabricator.wikimedia.org/T128546) [20:17:23] !log T323064 Merged https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/862178 and deployed new dashboard, visible here: https://grafana.wikimedia.org/d/slo-wdqs-tmpl/wdqs-slos-grizzly-template?orgId=1 [20:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:28] T323064: Create WDQS Uptime SLO && WDQS/WCQS update lag SLO dashboards in Grizzly - https://phabricator.wikimedia.org/T323064 [20:18:37] (03Abandoned) 10Ssingh: [In case of emergency/Stage 3] depool eqsin for hardware refresh [dns] - 10https://gerrit.wikimedia.org/r/856664 (owner: 10Ssingh) [20:21:18] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: Plugin upgrade for T322776 [20:21:46] T322776: 
Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [20:21:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: Plugin upgrade for T322776 [20:22:11] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:22:18] (03PS1) 10Bartosz Dziewoński: Deemphasize "Learn more about this page" link [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866467 (https://phabricator.wikimedia.org/T324702) [20:22:24] (03PS1) 10Bartosz Dziewoński: Reinitialize edit links after page content is reloaded [extensions/MobileFrontend] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866468 (https://phabricator.wikimedia.org/T324686) [20:22:56] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:27:26] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [20:27:35] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [20:28:30] (03PS1) 10Bartosz Dziewoński: Start mobile DiscussionTools A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866502 (https://phabricator.wikimedia.org/T321961) [20:29:18] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 159 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 159, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [20:29:18] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:29:36] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [20:29:36] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:30:10] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:31:02] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:31:05] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [20:33:50] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:54] !log [Cloudelastic] Cleaned up stale (not running but files not removed) elasticsearch 6 units which broke the previous rolling upgrade run on cloudelastic1005 [20:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:18] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [20:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:48:26] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM for VRTS - https://phabricator.wikimedia.org/T324030 (10Arnoldokoth) Resolved: https://netbox.wikimedia.org/virtualization/virtual-machines/539/ [20:49:11] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM for VRTS - https://phabricator.wikimedia.org/T324030 (10Arnoldokoth) 05In 
progress→03Resolved a:03Arnoldokoth [20:50:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [20:50:29] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - Downtimed on Icinga/Alert... [20:57:05] (03PS1) 10Jdrewniak: Add elwiki and arwiki to desktop-improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866505 (https://phabricator.wikimedia.org/T322391) [21:00:04] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221208T2100). [21:00:04] jdrewniak, cirno, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:20] o/ I can deploy [21:00:23] hi [21:00:27] o/ [21:00:47] I can do my portal deploy first [21:00:56] o/ [21:01:27] jan_drewniak: please do :) all yours [21:01:48] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866501 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:02:29] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866501 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:03:40] (ping me when you're done please?) 
[21:03:47] TheresNoTime: will do [21:10:43] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:866501| Bumping portals to master (T128546)]] (duration: 07m 07s) [21:10:47] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:12:43] ... still syncing... [21:13:00] no worries, will you also self-serve https://gerrit.wikimedia.org/r/c/866505/ ? [21:13:06] yes [21:13:12] ack [21:15:05] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [21:15:09] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [21:17:39] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:866501| Bumping portals to master (T128546)]] (duration: 06m 55s) [21:17:42] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:17:48] ok the portal one finally done [21:18:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866505 (https://phabricator.wikimedia.org/T322391) (owner: 10Jdrewniak) [21:18:48] (03Merged) 10jenkins-bot: Add elwiki and arwiki to desktop-improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866505 (https://phabricator.wikimedia.org/T322391) (owner: 10Jdrewniak) [21:19:02] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:866505|Add elwiki and arwiki to desktop-improvements group (T322391)]] [21:19:06] T322391: [S] Deploy Vector 2022 skin to next set of Wikipedias - https://phabricator.wikimedia.org/T322391 [21:21:02] !log jdrewniak@deploy1002 jdrewniak and jdrewniak: Backport for [[gerrit:866505|Add 
elwiki and arwiki to desktop-improvements group (T322391)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:23:03] (03CR) 10Samtar: [C: 03+2] "starting merge for deploy" [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865748 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:23:07] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [21:23:11] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [21:23:59] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [21:25:32] (03Merged) 10jenkins-bot: createExtensionTables: Add extension GeoData [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865748 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:25:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10BCornwall) Hi, @VirginiaPoundstone! You should be able to log in to Turnilo with your Wikitech user/pass. Does that work? [21:26:58] TheresNoTime: there are a few backports scheduled, do you mind +2-ing them ahead of time, so that we don't have to wait for CI when it's our turn? 
[21:27:14] MatmaRex: sure [21:27:34] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:866505|Add elwiki and arwiki to desktop-improvements group (T322391)]] (duration: 08m 31s) [21:27:38] T322391: [S] Deploy Vector 2022 skin to next set of Wikipedias - https://phabricator.wikimedia.org/T322391 [21:27:46] (03CR) 10Samtar: [C: 03+2] "starting merge for deploy" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866467 (https://phabricator.wikimedia.org/T324702) (owner: 10Bartosz Dziewoński) [21:27:51] (03CR) 10Samtar: [C: 03+2] "starting merge for deploy" [extensions/MobileFrontend] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866468 (https://phabricator.wikimedia.org/T324686) (owner: 10Bartosz Dziewoński) [21:28:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:28:13] TheresNoTime: ok!! and 30 minutes later I'm done. [21:28:22] jan_drewniak: :D thanks [21:28:26] cirno: ready?
[21:28:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:29:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865748 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:29:42] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865748|createExtensionTables: Add extension GeoData (T324348)]] [21:29:46] T324348: Add Extension:GeoData to Wikispecies wiki - https://phabricator.wikimedia.org/T324348 [21:31:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:28] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:865748|createExtensionTables: Add extension GeoData (T324348)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:31:38] syncing that [21:32:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49122 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:32:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [21:32:59] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [21:34:02] (03Merged) 10jenkins-bot: Deemphasize "Learn more about this page" link [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866467 (https://phabricator.wikimedia.org/T324702) (owner: 10Bartosz Dziewoński) [21:34:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for
2:00:00 on sretest2002.codfw.wmnet with reason: host reimage [21:37:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage [21:37:43] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865748|createExtensionTables: Add extension GeoData (T324348)]] (duration: 08m 01s) [21:37:46] T324348: Add Extension:GeoData to Wikispecies wiki - https://phabricator.wikimedia.org/T324348 [21:38:02] (03PS4) 10Samtar: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:39:36] !log T324348 : `[samtar@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php specieswiki geodata` [21:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:41:19] cirno: are you available to test your patches? 
[21:41:39] (03Merged) 10jenkins-bot: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865763 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:41:48] yeah [21:41:53] :) [21:42:49] (03Merged) 10jenkins-bot: Reinitialize edit links after page content is reloaded [extensions/MobileFrontend] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866468 (https://phabricator.wikimedia.org/T324686) (owner: 10Bartosz Dziewoński) [21:43:05] TheresNoTime: I noticed the task, is such output an error [21:44:21] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865763|specieswiki: Install GeoData extension (T324348)]] [21:44:25] T324348: Add Extension:GeoData to Wikispecies wiki - https://phabricator.wikimedia.org/T324348 [21:44:47] cirno: doesn't seem to be, no — just saying the table already existed [21:45:37] (03PS2) 10Samtar: frwikiversity: Set wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866339 (https://phabricator.wikimedia.org/T324277) (owner: 10Stang) [21:46:08] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:865763|specieswiki: Install GeoData extension (T324348)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:46:18] cirno: please test ^ [21:46:48] looking [21:47:42] Extension "GeoData" appears in Special:Version, and parser function "coordinates" works, so LGTM [21:47:48] syncing [21:48:19] cirno: are you happy to test 866339 and 866432 together after this?
[21:49:09] TheresNoTime: it it's possible, I'm ok to do so [21:49:24] be a bit quicker :) [21:53:38] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865763|specieswiki: Install GeoData extension (T324348)]] (duration: 09m 16s) [21:53:42] T324348: Add Extension:GeoData to Wikispecies wiki - https://phabricator.wikimedia.org/T324348 [21:53:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866339 (https://phabricator.wikimedia.org/T324277) (owner: 10Stang) [21:53:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866432 (https://phabricator.wikimedia.org/T318766) (owner: 10Stang) [21:54:29] (03Merged) 10jenkins-bot: frwikiversity: Set wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866339 (https://phabricator.wikimedia.org/T324277) (owner: 10Stang) [21:54:32] (03Merged) 10jenkins-bot: extwiki: Add new logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866432 (https://phabricator.wikimedia.org/T318766) (owner: 10Stang) [21:54:49] !log samtar@deploy1002 Started scap: Backport for [[gerrit:866339|frwikiversity: Set wgRestrictDisplayTitle to false (T324277)]], [[gerrit:866432|extwiki: Add new logo (T318766)]] [21:54:55] T318766: Requesting logo change for ext.wikipedia.org - https://phabricator.wikimedia.org/T318766 [21:54:55] T324277: Set $wgRestrictDisplayTitle to false on fr.wikiversity - https://phabricator.wikimedia.org/T324277 [21:55:24] 10SRE, 10Traffic-Icebox: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10BCornwall) I spoke with @BBlack and he suggests that, since it's relatively simple to upgrade the remaining hosts to 1.31 we should do that, at the very least, for consistency... 
[21:55:45] PROBLEM - Check systemd state on mw1358 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:36] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:866339|frwikiversity: Set wgRestrictDisplayTitle to false (T324277)]], [[gerrit:866432|extwiki: Add new logo (T318766)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:56:51] cirno: okay, both of those are live on mwdebug, can you test each? [21:56:58] looking [21:57:51] (03PS9) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [21:59:51] wgRestrictDisplayTitle works a expected, the logo on extwiki changed (on vector skin) [21:59:55] (03PS1) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [21:59:58] *as [22:00:04] syncing [22:00:41] (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [22:00:58] MatmaRex: are you okay to hang on? and if so, can you test both your `wmf.13` patches at the same time? [22:01:12] TheresNoTime: yes and yes, if you're okay with that [22:03:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [22:03:14] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - Removed from Puppet and P... 
[22:05:49] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:53] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:866339|frwikiversity: Set wgRestrictDisplayTitle to false (T324277)]], [[gerrit:866432|extwiki: Add new logo (T318766)]] (duration: 11m 04s) [22:05:57] T318766: Requesting logo change for ext.wikipedia.org - https://phabricator.wikimedia.org/T318766 [22:05:58] T324277: Set $wgRestrictDisplayTitle to false on fr.wikiversity - https://phabricator.wikimedia.org/T324277 [22:06:05] cirno: should be live :) [22:06:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866467 (https://phabricator.wikimedia.org/T324702) (owner: 10Bartosz Dziewoński) [22:06:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866468 (https://phabricator.wikimedia.org/T324686) (owner: 10Bartosz Dziewoński) [22:06:25] !log samtar@deploy1002 Started scap: Backport for [[gerrit:866467|Deemphasize "Learn more about this page" link (T324702)]], [[gerrit:866468|Reinitialize edit links after page content is reloaded (T324686)]] [22:06:29] T324702: Deemphasize treatment of "Learn more about this page" Link - https://phabricator.wikimedia.org/T324702 [22:06:30] T324686: [Regression ?]
The section edit icon on MobileFrontend stops working after posting a reply on production - https://phabricator.wikimedia.org/T324686 [22:06:35] TheresNoTime: please purge the logo files [22:06:40] cirno: ack [22:07:29] (we all hope T322370 got resolved by someone in the future :) [22:07:29] T322370: Automagically purge static images if present in a 'scap backport'-ed patch - https://phabricator.wikimedia.org/T322370 [22:08:06] cirno: done, can you check? [22:08:11] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:866467|Deemphasize "Learn more about this page" link (T324702)]], [[gerrit:866468|Reinitialize edit links after page content is reloaded (T324686)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:08:14] looking [22:08:30] MatmaRex: those two wmf.13 patches are live on mwdebug [22:08:37] yeah, look fine [22:08:46] thanks, looking [22:08:46] (ack) [22:09:11] (03PS2) 10Samtar: Start mobile DiscussionTools A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866502 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [22:10:31] TheresNoTime: looks good [22:10:38] syncing [22:16:31] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:866467|Deemphasize "Learn more about this page" link (T324702)]], [[gerrit:866468|Reinitialize edit links after page content is reloaded (T324686)]] (duration: 10m 06s) [22:16:37] T324702: Deemphasize treatment of "Learn more about this page" Link - https://phabricator.wikimedia.org/T324702 [22:16:37] T324686: [Regression ?]
The section edit icon on MobileFrontend stops working after posting a reply on production - https://phabricator.wikimedia.org/T324686 [22:16:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866502 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [22:16:55] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:33] (03Merged) 10jenkins-bot: Start mobile DiscussionTools A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866502 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [22:17:48] !log samtar@deploy1002 Started scap: Backport for [[gerrit:866502|Start mobile DiscussionTools A/B test (T321961)]] [22:17:51] T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961 [22:19:33] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:866502|Start mobile DiscussionTools A/B test (T321961)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [22:19:37] MatmaRex: ^ is live on mwdebug, can you test?
[22:19:46] yeah [22:21:04] (03PS2) 10Eevans: Rename codfw restbase-dev nodes to cassandra-dev [puppet] - 10https://gerrit.wikimedia.org/r/866438 (https://phabricator.wikimedia.org/T324113) [22:21:16] (03CR) 10Eevans: [C: 03+2] Rename codfw restbase-dev nodes to cassandra-dev [puppet] - 10https://gerrit.wikimedia.org/r/866438 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [22:21:52] TheresNoTime: looks good as well [22:21:58] syncing :) [22:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:25:27] 10SRE, 10Traffic-Icebox: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10BCornwall) [22:25:52] 10SRE, 10Traffic-Icebox: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10BCornwall) 05Open→03Resolved a:03BCornwall All the remaining hosts are now running the backported package. [22:27:46] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:866502|Start mobile DiscussionTools A/B test (T321961)]] (duration: 09m 57s) [22:27:50] T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961 [22:27:55] all live MatmaRex :) [22:28:02] thanks [22:28:19] sorry about the patch overload, and thanks for staying longer :) [22:28:27] no problem! 
:) [22:28:44] !log close UTC late backport and config training (+28m) [22:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:55] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [22:29:58] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [22:37:20] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS buster [22:40:48] 10SRE, 10Traffic-Icebox: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10BCornwall) a:03BCornwall [22:45:56] (03PS1) 10Ladsgroup: File pages: Add mobile targets to modules that are silently being removed [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866469 (https://phabricator.wikimedia.org/T324723) [22:46:19] cwhite: until when are you planning to be around? [22:47:01] (03PS2) 10Ladsgroup: File pages: Add mobile targets to modules that are silently being removed [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866469 (https://phabricator.wikimedia.org/T324723) [22:47:07] (03CR) 10Ladsgroup: [C: 03+2] File pages: Add mobile targets to modules that are silently being removed [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866469 (https://phabricator.wikimedia.org/T324723) (owner: 10Ladsgroup) [22:47:19] jouncebot: nowandnext [22:47:20] No deployments scheduled for the next 9 hour(s) and 12 minute(s) [22:47:20] In 9 hour(s) and 12 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221209T0800) [22:47:31] Amir1: until 00:00Z [22:47:57] cool, I'm backporting a fix that should reduce the flood [22:48:12] did it page? [22:48:22] any related ticket? 
[22:49:13] It did not page and has no related ticket. I happened to notice it while tending to the logging cluster. [22:49:58] awesome [22:55:49] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [22:58:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [23:00:54] (03Merged) 10jenkins-bot: File pages: Add mobile targets to modules that are silently being removed [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866469 (https://phabricator.wikimedia.org/T324723) (owner: 10Ladsgroup) [23:01:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866469 (https://phabricator.wikimedia.org/T324723) (owner: 10Ladsgroup) [23:02:10] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:866469|File pages: Add mobile targets to modules that are silently being removed (T324723 T320518)]] [23:02:15] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [23:02:16] T324723: Fix the most common "Module not loadable on target mobile" warnings (December 2022) - https://phabricator.wikimedia.org/T324723 [23:03:58] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:866469|File pages: Add mobile targets to modules that are silently being removed (T324723 T320518)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [23:05:13] works fine, let's go [23:08:25] (03PS1) 10Ladsgroup: Make wikibase.client.init module target mobile [extensions/Wikibase] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866470 (https://phabricator.wikimedia.org/T235712) [23:08:43] (03CR) 10Ladsgroup: [C: 03+2] Make wikibase.client.init module target mobile [extensions/Wikibase] 
(wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866470 (https://phabricator.wikimedia.org/T235712) (owner: 10Ladsgroup) [23:10:56] (03PS1) 10Andrew Bogott: trove: update DNS hack for Yoga [puppet] - 10https://gerrit.wikimedia.org/r/866517 [23:11:23] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:866469|File pages: Add mobile targets to modules that are silently being removed (T324723 T320518)]] (duration: 09m 12s) [23:11:28] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [23:11:28] T324723: Fix the most common "Module not loadable on target mobile" warnings (December 2022) - https://phabricator.wikimedia.org/T324723 [23:11:50] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10colewhite) I'm not a kafka expert, but this seems like a reasonable place to start. Pre-creating the topics is definitely the way... 
[23:11:52] (03PS1) 10Sbailey: enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) [23:13:03] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [23:14:28] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [23:14:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS buster [23:15:46] (03PS2) 10Andrew Bogott: trove: update DNS hack for Yoga [puppet] - 10https://gerrit.wikimedia.org/r/866517 [23:17:00] (03CR) 10Andrew Bogott: [C: 03+2] trove: update DNS hack for Yoga [puppet] - 10https://gerrit.wikimedia.org/r/866517 (owner: 10Andrew Bogott) [23:17:38] (03CR) 10Sbailey: "Override the dark launch config variable LinterMigrateNamespaceStage in InitializeSettings-labs.php to allow testing of the maintenance sc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [23:22:34] (03Merged) 10jenkins-bot: Make wikibase.client.init module target mobile [extensions/Wikibase] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866470 (https://phabricator.wikimedia.org/T235712) (owner: 10Ladsgroup) [23:23:18] cwhite: the first one made its impact, the second one is being deployed https://grafana.wikimedia.org/d/000000561/logstash?from=now-6h&orgId=1&to=now&viewPanel=45 [23:23:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866470 (https://phabricator.wikimedia.org/T235712) (owner: 10Ladsgroup) [23:23:32] !log ladsgroup@deploy1002 Started scap: Backport for 
[[gerrit:866470|Make wikibase.client.init module target mobile (T235712)]] [23:23:37] T235712: Fix the most common "Module not loadable on target mobile" warnings (Oct 2019) - https://phabricator.wikimedia.org/T235712 [23:25:13] (03CR) 10Cwhite: librenms: Lower the TTL for LibreNMS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [23:25:22] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:866470|Make wikibase.client.init module target mobile (T235712)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [23:31:58] trending the right direction :) [23:32:15] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:866470|Make wikibase.client.init module target mobile (T235712)]] (duration: 08m 42s) [23:32:19] T235712: Fix the most common "Module not loadable on target mobile" warnings (Oct 2019) - https://phabricator.wikimedia.org/T235712 [23:54:41] (03PS1) 10Dduvall: P:gitlab::runner: Add environment variable for kokkuri [puppet] - 10https://gerrit.wikimedia.org/r/866520