[00:41:04] SRE, Infrastructure-Foundations, Mail: Evaluate whether and how to route abuse@ emails to Legal - https://phabricator.wikimedia.org/T302549 (RLazarus) p:Triage→Low
[00:51:21] (PS5) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[00:52:05] (PS6) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[00:53:09] (PS7) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[00:55:05] (CR) Razzi: "Ok I was inspired by @Elukey to actually make the cookbook automated, and with helpful input from @Majavah and @Volans I'm getting somewhe" [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[00:55:51] (CR) jerkins-bot: [V: -1] Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[01:02:47] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:32] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:44:59] (PS1) Ebernhardson: query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462)
[01:47:18] (PS2) Ebernhardson: query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462)
[01:47:20] (CR) Ebernhardson: query_service: pass cookies on to blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[01:47:53] (CR) jerkins-bot: [V: -1] query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[01:48:57] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:34] (PS3) Ebernhardson: query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462)
[01:54:19] (CR) Ebernhardson: "Tested by manually applying change to codfw hosts and seeing my username come through the kafka topics, this might finally be the last ste" [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[01:54:39] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[02:04:33] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (MZMcBride) p:Triage→High
[02:15:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:13] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:47:49] (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active - https://alerts.wikimedia.org
[03:04:17] SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (AndyRussG) Wooohooo thanks so much once again, everyone!!! :) :)
[03:22:47] (Processor usage over 85%) firing: Alert for device scs-ulsfo.mgmt.ulsfo.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[03:32:47] (Processor usage over 85%) firing: (2) Alert for device scs-eqsin.mgmt.eqsin.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[03:49:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[04:06:57] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:15:57] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:57] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:48] (CR) Gergő Tisza: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[04:40:01] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:43:55] SRE, observability: Investigate "Ops Monitor (WMF)" wiki account and associated global user group - https://phabricator.wikimedia.org/T302552 (Legoktm)
[05:49:05] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:49] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:15:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:49] (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active - https://alerts.wikimedia.org
[06:52:15] (CR) Giuseppe Lavagetto: conftool: add request-actions / request-patterns (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763486 (owner: Giuseppe Lavagetto)
[06:52:34] (CR) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763557 (owner: Giuseppe Lavagetto)
[07:06:03] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:13:27] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:28:53] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Legoktm) >>! In T292322#7711900, @Joe wrote: > @tstarling @Legoktm do you think we can enable this on commons as well? The only negative effect will be to...
[07:41:38] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) >>! In T292322#7737027, @Legoktm wrote: >>>! In T292322#7711900, @Joe wrote: >> @tstarling @Legoktm do you think we can enable this on commons as wel...
[07:46:59] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:48] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: ditch automatic icmp probes for service::catalog [puppet] - https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[07:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[07:54:09] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:18] (CR) Filippo Giunchedi: "See inline" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220225T0800)
[08:01:22] SRE, Developer-Advocacy, Gerrit, serviceops: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611 (hashar) Open→Declined This was an idea that floated around in the early day of us adopting Gerrit. The point was to save the hassle of having to use `ssh -p 294...
[08:14:47] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:18:35] (PS3) Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493)
[08:19:13] (CR) Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git (2 comments) [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:19:56] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33991/console" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:29:32] (PS2) Muehlenhoff: Add drmrs to Hiera list of datacentres [puppet] - https://gerrit.wikimedia.org/r/737328
[08:32:24] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:27] (PS1) Muehlenhoff: Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690
[08:39:37] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:06] (CR) jerkins-bot: [V: -1] Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690 (owner: Muehlenhoff)
[08:45:23] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:27] (PS2) Muehlenhoff: Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690
[08:51:00] (CR) Muehlenhoff: "FYI, I readded the two packages guarded for bullseye and later in https://gerrit.wikimedia.org/r/c/operations/puppet/+/765690/" [puppet] - https://gerrit.wikimedia.org/r/765648 (owner: Jbond)
[08:51:07] (CR) Muehlenhoff: [C: +2] Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690 (owner: Muehlenhoff)
[09:05:02] (CR) Ayounsi: [C: -1] "One typo then lgtm! We can deploy on Monday" [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[09:12:45] SRE, Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (MoritzMuehlenhoff) That sounds like a very promising lead, nice detective work! I think we can test following as a fix: /var/run/elasticsearch gets shipped via /usr/lib/tmpf...
[09:12:49] (Juniper alarm active) resolved: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active - https://alerts.wikimedia.org
[09:20:37] (CR) MVernon: [C: +1] "LGTM; as clinic duty person shall I +2 and merge also?" [puppet] - https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: Dzahn)
[09:20:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:51] (CR) Ayounsi: [C: -1] "I made the current mechanism to stop advertising publicly the anycast prefixes if the local anycast servers are offline for any reasons." [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[09:28:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[09:28:09] (PS1) Majavah: admin: update my dotfiles [puppet] - https://gerrit.wikimedia.org/r/766063
[09:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:34] (CR) Muehlenhoff: [C: +2] admin: update my dotfiles [puppet] - https://gerrit.wikimedia.org/r/766063 (owner: Majavah)
[09:33:11] (CR) Filippo Giunchedi: [C: +1] "Nice! LGTM" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[09:34:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[09:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:01] (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Use HAProxy 2.4 [puppet] - https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:36:51] (CR) MVernon: [C: +2] admin: add ammarpad to ldap_only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: Dzahn)
[09:37:02] (PS2) MVernon: admin: add ammarpad to ldap_only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: Dzahn)
[09:43:56] SRE, LDAP-Access-Requests, Patch-For-Review: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (MatthewVernon) In progress→Resolved a:MatthewVernon Hi, I've done this now. Thanks, Matthew
[09:44:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:12] (PS4) Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - https://gerrit.wikimedia.org/r/765309
[09:48:15] (PS1) Muehlenhoff: Add ganeti2029 as new node in codfw [puppet] - https://gerrit.wikimedia.org/r/766065 (https://phabricator.wikimedia.org/T298998)
[09:48:24] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (LSobanski) Sounds good to me, CC'ing @MatthewVernon for visibility.
[09:58:29] (PS3) Filippo Giunchedi: logstash: add blackbox-exporter filter config [puppet] - https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946)
[09:59:07] (PS1) Vgutierrez: cache::haproxy: Provide a haproxy-restart script [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005)
[10:00:01] (CR) Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/765309 (owner: Muehlenhoff)
[10:00:08] (PS5) Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - https://gerrit.wikimedia.org/r/765309
[10:00:21] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33992/console" [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:03:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:11] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:10:27] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:33] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:12:05] (CR) Muehlenhoff: [C: +2] Add ganeti2029 as new node in codfw [puppet] - https://gerrit.wikimedia.org/r/766065 (https://phabricator.wikimedia.org/T298998) (owner: Muehlenhoff)
[10:13:39] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (ayounsi) Current status is that this is virtually solved (removing the last software blocker for drmrs), the CR above will be needed to allow adver...
[10:17:04] !log rolling upgrade to HAProxy 2.4.13 on HAProxy cache nodes - T290005
[10:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:12] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:20:13] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[10:22:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[10:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:57] (CR) Ayounsi: "Wow, thanks!" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[10:26:19] (CR) MMandere: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:27:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2029.codfw.wmnet with reason: Enable virtualisation in BIOS
[10:27:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2029.codfw.wmnet with reason: Enable virtualisation in BIOS
[10:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[10:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:20] (PS1) David Caro: openstack:galera:node: make sure prometheus-mysqld-exporter is running [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557)
[10:38:30] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33994/console" [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557) (owner: David Caro)
[10:41:00] !log enabled virtualisation in BIOS for ganeti2029 T298998
[10:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:06] T298998: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998
[10:42:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[10:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[10:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:58] (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Provide a haproxy-restart script [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:44:07] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:59] (PS1) Vgutierrez: site: Reimage cp4025 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766073 (https://phabricator.wikimedia.org/T290005)
[10:50:51] (CR) Tchanders: [C: +1] Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[10:53:17] (CR) Vgutierrez: [C: +2] site: Reimage cp4025 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766073 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:54:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4025.ulsfo.wmnet with OS buster
[10:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:31] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster
[11:00:36] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:04:04] (CR) Ayounsi: "Thanks for looking at it. FYI the Netbox error is caught by the network report." [software/homer/deploy] - https://gerrit.wikimedia.org/r/765581 (owner: Volans)
[11:04:08] !log added ganeti2029 to codfw Ganeti cluster T298998
[11:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:14] T298998: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998
[11:07:49] SRE, Wiki Loves Monuments 2022, Wikimedia-Mailing-lists, User-Ladsgroup: Request for creation: WLM-Network Mailing List - https://phabricator.wikimedia.org/T302510 (Ciell) For the purpose of the pilot, let's make it public and with archive please.
[11:10:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4025.ulsfo.wmnet with reason: host reimage
[11:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:41] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:13:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4025.ulsfo.wmnet with reason: host reimage
[11:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:38] !log re-activate BGP session to Seabone in esams
[11:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:33] (PS1) Hokwelum: Add IP address to bringyour mirror and this was a request from Brien the mirror contact person [puppet] - https://gerrit.wikimedia.org/r/766076
[11:28:35] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/766076 (owner: Hokwelum)
[11:29:03] (CR) jerkins-bot: [V: -1] Add IP address to bringyour mirror and this was a request from Brien the mirror contact person [puppet] - https://gerrit.wikimedia.org/r/766076 (owner: Hokwelum)
[11:29:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[11:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[11:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:04] (PS2) Hokwelum: Add IP address to bringyour mirror [puppet] - https://gerrit.wikimedia.org/r/766076
[11:40:19] !log pool cp4025 running HAProxy as TLS termination layer - T290005 T271421
[11:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:27] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:40:27] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[11:40:46] (CR) ArielGlenn: [C: +2] Add IP address to bringyour mirror [puppet] - https://gerrit.wikimedia.org/r/766076 (owner: Hokwelum)
[11:41:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4025.ulsfo.wmnet with OS buster
[11:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:38] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster c...
[11:42:43] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez) In progress→Resolved envoy instances are currently being reimaged as HAProxy ones. We're cleaning up and pausing the envoyproxy experiment
[11:42:49] (PS4) Cathal Mooney: Change CR policy for creating aggregate Anycast routes [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315)
[11:42:57] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[11:45:36] (JobUnavailable) resolved: (2) Reduced availability for job cache_envoy in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:46:14] (PS1) Vgutierrez: site: Reimage cp2040 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766078 (https://phabricator.wikimedia.org/T290005)
[11:47:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[11:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:43] (CR) Cathal Mooney: "Thanks Arzhel. Fixed up the semi-colon, and put down some other comments. Unsure if you think we should merge this or not? I'm open to " [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[11:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[11:52:53] (CR) Vgutierrez: [C: +2] site: Reimage cp2040 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766078 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[11:53:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to ganeti01.svc.codfw.wmnet
[11:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2040.codfw.wmnet with OS buster
[11:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:06] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2040.codfw.wmnet with OS buster
[11:54:28] (CR) Cathal Mooney: wmf-netbox: fix UnboundLocalError (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/765581 (owner: Volans)
[11:54:45] (PS1) Muehlenhoff: Add ganeti2030 to list of codfw Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/766079
[11:55:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to ganeti01.svc.codfw.wmnet
[11:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:01] (CR) Muehlenhoff: [C: +2] Add ganeti2030 to list of codfw Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/766079 (owner: Muehlenhoff)
[12:00:36] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:11:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2040.codfw.wmnet with reason: host reimage
[12:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:30] SRE, Infrastructure-Foundations, Traffic-Icebox, netops, User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Volans) Although it does not do what we need, some logic to download the lists from multiple clouds can be gath...
[12:12:32] (JobUnavailable) resolved: Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:12:58] (PS1) Hnowlan: restbase-dev: change role of new hosts [puppet] - https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375)
[12:13:48] (CR) Cathal Mooney: [C: +1] sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - https://gerrit.wikimedia.org/r/765309 (owner: Muehlenhoff)
[12:14:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2040.codfw.wmnet with reason: host reimage
[12:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:32] (JobUnavailable) firing: Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:25:36] (JobUnavailable) resolved: Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:31:04] (CR) Ayounsi: Change CR policy for creating aggregate Anycast routes (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[12:32:45] !log pool cp2040 running HAProxy as TLS termination layer - T290005 T271421
[12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:53] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[12:32:53] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[12:34:04] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[12:37:42] (PS1) MMandere: varnish: remove obsolete repo path reference [puppet] - https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579)
[12:38:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2040.codfw.wmnet with OS buster
[12:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:44] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2040.codfw.wmnet with OS buster c...
[12:38:48] (PS5) Cathal Mooney: Change CR policy for creating aggregate Anycast routes [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315)
[12:39:08] (CR) Cathal Mooney: "Thanks for feedback, policy term name updated." [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[12:39:14] !log drain instances off ganeti2007 T302577
[12:39:16] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): updating wmf-proxy-dashboard
[12:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:20] T302577: decommission ganeti2007 - https://phabricator.wikimedia.org/T302577
[12:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:54] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): updating wmf-proxy-dashboard (duration: 00m 37s)
[12:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:28] (CR) MMandere: "Sample showing container downloading varnish6 and dependencies from main component:" [puppet] - https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579) (owner: MMandere)
[12:43:46] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:00] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: updating wmf-proxy-dashboard on eqiad1
[12:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:11] (CR) Vivian Rook: [C: +1] openstack:galera:node: make sure prometheus-mysqld-exporter is running [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557) (owner: David Caro)
[12:45:52] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:46:04] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: updating wmf-proxy-dashboard
on eqiad1 (duration: 02m 04s) [12:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:16] (03CR) 10Ayounsi: [C: 03+1] Change CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: 10Cathal Mooney) [13:13:04] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:29:44] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process [13:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:50] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process (duration: 00m 05s) [13:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:24] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process [13:30:28] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process [13:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:34] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process (duration: 00m 06s) [13:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:24] (03PS2) 10Krinkle: Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) [13:33:29] (03PS3) 10Krinkle: Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) [13:35:37] (03CR) 10David Caro: [V: 03+1 C: 03+2] 
openstack:galera:node: make sure prometheus-mysqld-exporter is running [puppet] - 10https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557) (owner: 10David Caro) [13:46:41] hello please hold on any netbox changes for a few minutes, we're restoring a backup after I clicked the wrong button [13:48:57] !log restoring psql-all-dbs-20220225.sql.gz into netbox [13:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:50] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:52:00] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: deploying wmf-proxy-dashboard and wmf-puppet-dashboard changes for real after fixing the scap config [13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:52] netbox backup has been restored, all looks good, it should be good to resume normal operations [13:54:50] (03CR) 10Vgutierrez: [C: 03+1] varnish: remove obsolete repo path reference [puppet] - 10https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579) (owner: 10MMandere) [13:56:50] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: deploying wmf-proxy-dashboard and wmf-puppet-dashboard changes for real after fixing the scap config (duration: 04m 50s) [13:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:33] (03PS1) 10Vgutierrez: site: Reimage cp5005 as cache::haproxy_upload [puppet] - 10https://gerrit.wikimedia.org/r/766102 (https://phabricator.wikimedia.org/T290005) [13:59:54] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:17] (03CR) 10MMandere: [C: 03+2] varnish: remove obsolete repo path reference [puppet] - 10https://gerrit.wikimedia.org/r/766088 
(https://phabricator.wikimedia.org/T302579) (owner: 10MMandere) [14:03:43] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5005 as cache::haproxy_upload [puppet] - 10https://gerrit.wikimedia.org/r/766102 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:04:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5005.eqsin.wmnet with OS buster [14:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5005.eqsin.wmnet with OS buster [14:05:25] (03PS1) 10Muehlenhoff: Add repository component component/ganeti3 for Buster [puppet] - 10https://gerrit.wikimedia.org/r/766106 [14:05:55] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: fix wmf-puppet-dashboard routes [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:43] (03CR) 10Ayounsi: [C: 03+1] wmf-netbox: fix UnboundLocalError (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/765581 (owner: 10Volans) [14:13:42] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: fix wmf-puppet-dashboard routes (duration: 07m 47s) [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:02] (03PS1) 10Majavah: dynamicproxy: fix delete_records_for() method call [puppet] - 10https://gerrit.wikimedia.org/r/766107 [14:19:14] (03CR) 10Vivian Rook: [C: 03+2] dynamicproxy: fix delete_records_for() method call [puppet] - 10https://gerrit.wikimedia.org/r/766107 
(owner: 10Majavah) [14:21:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi With the icmp probes gone I don'... [14:25:13] (03CR) 10Physikerwelt: [C: 03+1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/751439 (owner: 10PipelineBot) [14:28:52] (03PS1) 10Majavah: dynamicproxy: fix condition [puppet] - 10https://gerrit.wikimedia.org/r/766109 [14:30:27] (03CR) 10Muehlenhoff: [C: 03+2] Add repository component component/ganeti3 for Buster [puppet] - 10https://gerrit.wikimedia.org/r/766106 (owner: 10Muehlenhoff) [14:32:01] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5005.eqsin.wmnet with reason: host reimage [14:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:50] (03CR) 10Vivian Rook: [C: 03+2] dynamicproxy: fix condition [puppet] - 10https://gerrit.wikimedia.org/r/766109 (owner: 10Majavah) [14:35:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5005.eqsin.wmnet with reason: host reimage [14:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:11] 10Puppet, 10Horizon, 10Infrastructure-Foundations, 10Patch-For-Review: Invalid yaml in horizon hiera editor results in confusing error message - https://phabricator.wikimedia.org/T241999 (10Majavah) a:03Majavah The PS above updates the error to look like this: {F34966331} [15:19:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5005.eqsin.wmnet with OS buster [15:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:37] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with 
real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5005.eqsin.wmnet with OS buster c... [15:23:19] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:25:19] !log pool cp5005 running HAProxy as TLS termination layer - T290005 T271421 [15:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:26] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:25:26] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [15:34:31] (03PS1) 10Vgutierrez: site: Reimage cp3063 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/766119 (https://phabricator.wikimedia.org/T290005) [15:36:44] !log imported PHP 7.4 7.4.28-1+0~20220217.59+debian10~1.gbp1950+wmf1+buster1 to component/php74 for buster-wikimedia T271736 [15:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:51] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 [15:37:17] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3063 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/766119 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:38:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS buster [15:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:37] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster [15:39:21] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10Vgutierrez) [15:39:43] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:40:03] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:41:33] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:43:21] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:43:56] (03CR) 10Eevans: [C: 03+1] restbase-dev: change role of new hosts [puppet] - 10https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [15:44:30] "upstream connect error or disconnect/reset before headers. reset reason: overflow" [15:44:52] You broke it Tamzin :P [15:44:57] I keep doing that [15:45:09] Hehehe [15:45:25] known issues? 
[15:45:35] Request from - via cp1081.eqiad.wmnet, ATS/8.0.8 [15:45:35] Error: 502, Next Hop Connection Failed at 2022-02-25 15:45:07 GMT [15:45:36] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:45:37] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1085.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/ [15:45:37] al [15:45:38] Experiencing in here as well [15:45:40] UK [15:45:49] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [15:45:51] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:45:52] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.002598 lt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [15:45:56] API on meta is running fine-ish though [15:46:09] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:46:13] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:46:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:46:28] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [15:46:32] yo [15:46:43] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:46:43] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:46:44] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:46:48] Front end ddos? 
[15:46:53] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:05] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:05] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:05] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:09] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:09] looking at network [15:47:09] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:09] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 43.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:47:11] here [15:47:13] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:13] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:13] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [15:47:13] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:13] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:14] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:15] <_joe_> here we go again heh [15:47:17] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:17] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:19] gah [15:47:20] hey [15:47:23] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:23] <_joe_> XioNoX: not network [15:47:25] hi [15:47:33] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:33] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:35] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:35] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:35] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [15:47:35] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:35] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:42] (03CR) 10Ladsgroup: [C: 03+1] Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) (owner: 10Krinkle) [15:47:43] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:47:53] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:47:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [15:47:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [15:48:01] yeah network looks fine [15:48:03] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 50.21 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:48:13] here [15:48:26] same [15:48:28] here as well, cache busting again? 
[15:48:44] Acked the alerts [15:49:59] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [15:50:00] <_joe_> please not here [15:50:06] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6898 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [15:50:09] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:50:11] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:50:13] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 79.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:50:14] <_joe_> and yes it's over [15:50:36] (JobUnavailable) resolved: (2) Reduced availability for job cache_envoy in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:50:45] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:50:45] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:50:45] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK 
- 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:50:50] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [15:50:57] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:09] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:09] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:09] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:13] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:13] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:13] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:17] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:17] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 
second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:17] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:19] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:19] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:19] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:51:33] :) [15:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [15:52:51] (03PS1) 10ZPapierski: Replace Swift native API with S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) [15:52:55] (ProbeHttpFailed) resolved: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [15:53:00] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [15:53:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:53:55] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:53:55] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:54:59] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:05] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:56:05] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:19] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:49] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:51] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:51] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5009 is OK: HTTP 
OK: HTTP/1.1 200 OK - 474 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:51] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:51] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:51] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:58:52] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.452 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:59:07] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:00:17] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:02:35] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:03:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) dumpsdata1006 E1 U19 port19 cableid#20220257 dumpsdata1006 F1 U19 port19 cableid#20220258 [16:03:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) [16:05:57] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3063.esams.wmnet with reason: host reimage [16:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3063.esams.wmnet with reason: host reimage [16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:33] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) | name | rack| port | cableid elastic1089 E1 21 20220145 elastic1090 E1 22 20220146 elastic1091 E2 21 20220148 elastic1092 E2 22... 
[16:17:23] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Jclark-ctr) | name |rack_name |port |cableid ml-cache1001 E1 23 20220147 ml-cache1002 E2 23 20220137 ml-cache1003 F1 23 20220125 |
[16:17:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:18:06] Can someone with root access on deploy1002 send me the contents of /var/lib/deploy-mwdebug/error ?
[16:18:32] (or just make a copy of it that I can read from that machine)
[16:18:53] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:23:57] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:27:32] SRE, serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (Joe) Open→Resolved I just removed the cert from puppet.
[16:29:37] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[16:35:06] !log pool cp3063 running HAProxy as TLS termination layer - T290005 T271421
[16:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:14] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[16:35:14] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[16:35:24] SRE, Wiki Loves Monuments 2022, Wikimedia-Mailing-lists, User-Ladsgroup: Request for creation: WLM-Network Mailing List - https://phabricator.wikimedia.org/T302510 (Ladsgroup) Open→Resolved https://lists.wikimedia.org/postorius/lists/wlm-network.lists.wikimedia.org
[16:36:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3063.esams.wmnet with OS buster
[16:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:35] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster c...
[16:37:30] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [16:40:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [16:40:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [16:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21511 and previous config saved to /var/cache/conftool/dbconfig/20220225-164020-ladsgroup.json [16:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:35] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [16:40:57] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:43:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21512 and previous config saved to /var/cache/conftool/dbconfig/20220225-164323-ladsgroup.json [16:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:03] (03PS2) 10Hnowlan: restbase-dev: change role of new hosts [puppet] - 10https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375) [16:45:10] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, and 2 others: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10EChetty) [16:47:07] 10SRE, 10Data-Engineering, 
10Data-Engineering-Kanban, 10observability, and 2 others: Upgrade Kafka Risk Evaluation - https://phabricator.wikimedia.org/T302610 (10EChetty) [16:48:25] 10SRE, 10Data-Engineering, 10observability, 10serviceops, 10Epic: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10EChetty) [16:53:32] (03PS2) 10BBlack: eqiad lvs: add interfaces and IPs for rows E and F [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) [16:53:34] (03PS1) 10BBlack: Reject invalid hex encoding in URIs [puppet] - 10https://gerrit.wikimedia.org/r/766162 [16:54:09] (03PS2) 10BBlack: Reject invalid hex encoding in URIs [puppet] - 10https://gerrit.wikimedia.org/r/766162 [16:56:05] (03CR) 10Zabe: "This seems to have already been done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/748757" [puppet] - 10https://gerrit.wikimedia.org/r/737328 (owner: 10Muehlenhoff) [16:56:10] (03PS3) 10BBlack: Reject invalid hex encoding in URIs [puppet] - 10https://gerrit.wikimedia.org/r/766162 [16:58:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21513 and previous config saved to /var/cache/conftool/dbconfig/20220225-165828-ladsgroup.json [16:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:58] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates: better error messages and code cleanup [17:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:55] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates: better error messages and code cleanup (duration: 01m 57s) [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:43] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10Dzahn) @Ammarpad Here is a list of 
things that come with the NDA group you have now: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#NDA_group [17:12:35] !log manual trigger of cirrus SaneitizeJobs for with 2hr refresh [17:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21514 and previous config saved to /var/cache/conftool/dbconfig/20220225-171333-ladsgroup.json [17:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:14] (03PS9) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [17:14:16] (03PS7) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 [17:21:15] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates: better error messages and code cleanup (prod) [17:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 67 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:28:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21515 and previous config saved to /var/cache/conftool/dbconfig/20220225-172837-ladsgroup.json [17:28:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [17:28:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [17:28:43] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:45] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [17:28:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21516 and previous config saved to /var/cache/conftool/dbconfig/20220225-172845-ladsgroup.json [17:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:36] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates: better error messages and code cleanup (prod) (duration: 08m 20s) [17:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21517 and previous config saved to /var/cache/conftool/dbconfig/20220225-173356-ladsgroup.json [17:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:03] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [17:37:36] 10Puppet, 10Horizon, 10Infrastructure-Foundations: Invalid yaml in horizon hiera editor results in confusing error message - https://phabricator.wikimedia.org/T241999 (10Majavah) 05Open→03Resolved [17:40:46] (03PS3) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) [17:41:04] (03CR) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [17:41:34] (03CR) 10jerkins-bot: [V: 04-1] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [17:42:12] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:42:18] (03PS4) 10MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) [17:46:34] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10SCherukuwada) [17:46:54] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10SCherukuwada) In case someone's wondering, DuckDuckGo doesn't actually have a webmaster console. Strange. 
[17:49:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21518 and previous config saved to /var/cache/conftool/dbconfig/20220225-174901-ladsgroup.json
[17:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:12] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:02:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:04:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21519 and previous config saved to /var/cache/conftool/dbconfig/20220225-180406-ladsgroup.json
[18:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:08] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:11:42] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:11:58] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:12:28] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:19:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21520 and previous config saved to /var/cache/conftool/dbconfig/20220225-181911-ladsgroup.json
[18:19:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[18:19:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[18:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:18] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[18:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21521 and previous config saved to /var/cache/conftool/dbconfig/20220225-181918-ladsgroup.json
[18:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21522 and previous config saved to /var/cache/conftool/dbconfig/20220225-182223-ladsgroup.json
[18:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:09] SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (SCherukuwada) Just filed https://phabricator.wikimedia.org/T302617 to start discussing domain ownership verification on various consoles with SRE. I'll be unavailable for a few days...
[18:32:37] SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (SCherukuwada)
[18:37:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21523 and previous config saved to /var/cache/conftool/dbconfig/20220225-183728-ladsgroup.json
[18:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:00] (CR) Gergő Tisza: [C: +1] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[18:46:25] (CR) Gergő Tisza: [C: +2] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[18:47:04] (Merged) jenkins-bot: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[18:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21524 and previous config saved to /var/cache/conftool/dbconfig/20220225-185233-ladsgroup.json
[18:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:08] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) The answer to most of this was, for the most part, established at T298723- verification will be done through DNS in all cases- as it was done from Google (for centralization, uniformization a...
[19:03:23] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) Checking the DNS records, there seems to be entries only of the 7 or so top level domains (e.g. wikipedia.org), and maybe that was enough for all subdomains? Do you know if that would work fo...
[19:05:26] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada) You're absolutely right to be concerned about traffic from search engines. That said, I'm familiar enough with how this works to be comfortable owning it, and my PM counterpart and I (an...
[19:07:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21525 and previous config saved to /var/cache/conftool/dbconfig/20220225-190737-ladsgroup.json
[19:07:40] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) Thanks to you for working on this!
[19:07:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [19:07:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [19:07:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [19:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:47] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [19:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [19:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:09:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21526 and previous config saved to /var/cache/conftool/dbconfig/20220225-190939-ladsgroup.json [19:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved 
to https://phabricator.wikimedia.org/P21527 and previous config saved to /var/cache/conftool/dbconfig/20220225-191144-ladsgroup.json [19:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21528 and previous config saved to /var/cache/conftool/dbconfig/20220225-192649-ladsgroup.json [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:23] (03CR) 10CDanis: [C: 03+1] "This looks reasonable for v0, just some nits." [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [19:28:39] (03CR) 10CDanis: varnish/frontend: consume etcd data for dynamic banning of requests. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [19:38:01] (03PS1) 10BBlack: Cache Badtitle 400s for 60s in varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/766187 [19:40:12] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:41:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21529 and previous config saved to /var/cache/conftool/dbconfig/20220225-194153-ladsgroup.json [19:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:08] (03PS2) 10BBlack: Cache Badtitle 400s for 60s in varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/766187 [19:47:22] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [19:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21530 and previous config saved to /var/cache/conftool/dbconfig/20220225-195658-ladsgroup.json [19:57:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:57:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:05] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [19:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:58:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:59:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:59:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21531 and previous config saved to /var/cache/conftool/dbconfig/20220225-195917-ladsgroup.json [19:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21532 and previous config saved to /var/cache/conftool/dbconfig/20220225-200322-ladsgroup.json [20:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:28] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [20:10:54] (03CR) 10BBlack: [C: 03+2] Cache Badtitle 400s for 60s in varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/766187 (owner: 10BBlack) [20:14:35] (03PS8) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [20:17:20] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [20:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1161', diff saved to https://phabricator.wikimedia.org/P21533 and previous config saved to /var/cache/conftool/dbconfig/20220225-201826-ladsgroup.json
[20:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:15] Need to roll a quick wdqs deploy to get the latest version of WDQS out. Apologies for the friday noise:
[20:22:26] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.104`. Pre-deploy tests passing on canary `wdqs1003`
[20:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:53] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@5d384a5]: 0.3.104
[20:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:48] !log [WDQS Deploy] Tests passing following deploy of `0.3.104` on canary `wdqs1003`; proceeding to rest of fleet
[20:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:11] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@5d384a5]: 0.3.104 (duration: 07m 18s)
[20:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:39] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[20:31:44] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[20:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:50] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[20:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21534 and previous config saved to /var/cache/conftool/dbconfig/20220225-203331-ladsgroup.json
[20:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:32] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[20:38:36] ^ Looking. Suspect it's just a transient alert, but checking to make sure
[20:40:22] (CR) Dzahn: "fwiw: Debugged this in cloud VPS and the reason is a missing Apache module providing SetEnvIfNoCase" [deployment-charts] - https://gerrit.wikimedia.org/r/765382 (owner: JMeybohm)
[20:42:17] (CR) Dzahn: "and the reason we use that is this:" [deployment-charts] - https://gerrit.wikimedia.org/r/765382 (owner: JMeybohm)
[20:45:24] WDQS updates are processing fine in eqiad, it might be that prometheus is having trouble talking to the blazegraph exporter. Digging a bit more
[20:47:58] (PS9) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:48:19] SRE, Abstract Wikipedia team (Phase λ – Launch), Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (sbassett)
[20:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21535 and previous config saved to /var/cache/conftool/dbconfig/20220225-204836-ladsgroup.json
[20:48:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[20:48:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[20:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:44] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[20:48:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21536 and previous config saved to /var/cache/conftool/dbconfig/20220225-204844-ladsgroup.json
[20:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:44] (CR) jerkins-bot: [V: -1] Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[20:51:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21537 and previous config saved to /var/cache/conftool/dbconfig/20220225-205149-ladsgroup.json
[20:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:01] (PS10) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[21:01:00] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good. Still looking into `Reduced availability for job jmx_wdqs_updater`; will try restarting blazegraph exporters in eqiad
[21:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:12] !log [WDQS] Restarted wdqs eqiad exporters: `ryankemper@cumin1001:~$ sudo -E cumin -b 1 'wdqs1*' 'systemctl restart prometheus-blazegraph-exporter-wdqs-blazegraph.service'`
[21:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:28] (CR) jerkins-bot: [V: -1] Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[21:02:48] (PS1) Dzahn: load apache module setenvif [container/miscweb] - https://gerrit.wikimedia.org/r/766190 (https://phabricator.wikimedia.org/T300171)
[21:03:33] (CR) Dzahn: [C: +2] load apache module setenvif [container/miscweb] - https://gerrit.wikimedia.org/r/766190 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[21:05:49] (Merged) jenkins-bot: load apache module setenvif [container/miscweb] - https://gerrit.wikimedia.org/r/766190 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[21:06:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21538 and previous config saved to /var/cache/conftool/dbconfig/20220225-210654-ladsgroup.json
[21:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:44] (PS1) Dzahn: miscweb: bumb staging to 2022-02-25-210804-production [deployment-charts] - https://gerrit.wikimedia.org/r/766192
[21:21:12] (CR) Dzahn: [C: +2] miscweb: bumb staging to 2022-02-25-210804-production [deployment-charts] - https://gerrit.wikimedia.org/r/766192 (owner: Dzahn)
[21:22:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21539 and previous config saved to /var/cache/conftool/dbconfig/20220225-212159-ladsgroup.json
[21:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:06] (Merged) jenkins-bot: miscweb: bumb staging to 2022-02-25-210804-production [deployment-charts] - https://gerrit.wikimedia.org/r/766192 (owner: Dzahn)
[21:37:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21540 and previous config saved to /var/cache/conftool/dbconfig/20220225-213704-ladsgroup.json
[21:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:11] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[21:49:00] Puppet, Infrastructure-Foundations, good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (TheresNoTime) Just a note, https://routinator.docs.nlnetlabs.nl/en/latest/installation-notes.html#using-tmpfs-for-the-rpki-cache moved to https://routinator.docs.nlnetlabs.nl/en/lat...
[22:06:46] (PS2) BryanDavis: toolforge: redirect legacy ru_monuments to ru-monuments [puppet] - https://gerrit.wikimedia.org/r/762900 (https://phabricator.wikimedia.org/T301720)
[22:10:29] (CR) Cwhite: "Thanks for putting this together!" [puppet] - https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[22:27:32] (PS1) RLazarus: varnish: Add an explicit "apt install docker.io" step to tests/README.md [puppet] - https://gerrit.wikimedia.org/r/766194
[22:40:22] (PS11) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[22:42:45] (CR) jerkins-bot: [V: -1] Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[22:53:21] (PS1) Razzi: Test change [cookbooks] - https://gerrit.wikimedia.org/r/766197
[22:53:50] (CR) Razzi: "Just to see if CI is broken on master, to compare with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/760880" [cookbooks] - https://gerrit.wikimedia.org/r/766197 (owner: Razzi)
[22:55:51] (CR) jerkins-bot: [V: -1] Test change [cookbooks] - https://gerrit.wikimedia.org/r/766197 (owner: Razzi)
[23:03:36] (CR) Razzi: "It appears CI is failing on master; I made a trivial change for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/766197 and sure en" [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[23:14:33] (CR) CDanis: [C: +1] varnish: Add an explicit "apt install docker.io" step to tests/README.md (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766194 (owner: RLazarus)
[23:17:43] (PS2) RLazarus: varnish: Add an explicit "apt install docker.io" step to tests/README.md [puppet] - https://gerrit.wikimedia.org/r/766194
[23:18:36] (CR) RLazarus: [C: +2] varnish: Add an explicit "apt install docker.io" step to tests/README.md (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766194 (owner: RLazarus)
[23:30:30] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[23:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:52] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[23:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:34] SRE, Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (bking) Thanks Moritz! I started [[ https://github.com/inflatador/ansible-deployment-prep/blob/main/roles/elastic/tasks/main.yml | writing a playbook ]] to make the changes a...
[23:45:28] (CR) Dzahn: "deployed on staging: curl --compressed --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org'" [deployment-charts] - https://gerrit.wikimedia.org/r/766192 (owner: Dzahn)
[23:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org