[00:41:04] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Evaluate whether and how to route abuse@ emails to Legal - https://phabricator.wikimedia.org/T302549 (RLazarus) p:Triage→Low
[00:51:21] <wikibugs>	 (PS5) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[00:52:05] <wikibugs>	 (PS6) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[00:53:09] <wikibugs>	 (PS7) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[00:55:05] <wikibugs>	 (CR) Razzi: "Ok I was inspired by @Elukey to actually make the cookbook automated, and with helpful input from @Majavah and @Volans I'm getting somewhe" [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[00:55:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[01:02:47] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:36] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:44:59] <wikibugs>	 (PS1) Ebernhardson: query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462)
[01:47:18] <wikibugs>	 (PS2) Ebernhardson: query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462)
[01:47:20] <wikibugs>	 (CR) Ebernhardson: query_service: pass cookies on to blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[01:47:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[01:48:57] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:34] <wikibugs>	 (PS3) Ebernhardson: query_service: pass cookies on to blazegraph [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462)
[01:54:19] <wikibugs>	 (CR) Ebernhardson: "Tested by manually applying change to codfw hosts and seeing my username come through the kafka topics, this might finally be the last ste" [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[01:54:39] <wikibugs>	 (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[02:04:33] <wikibugs>	 SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (MZMcBride) p:Triage→High
[02:15:05] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:13] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:47:49] <jinxer-wm>	 (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active   - https://alerts.wikimedia.org
[03:04:17] <wikibugs>	 SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (AndyRussG) Wooohooo thanks so much once again, everyone!!! :) :)
[03:22:47] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-ulsfo.mgmt.ulsfo.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[03:32:47] <jinxer-wm>	 (Processor usage over 85%) firing: (2) Alert for device scs-eqsin.mgmt.eqsin.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[03:49:39] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:47] <jinxer-wm>	 (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[04:06:57] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:15:57] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:57] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:48] <wikibugs>	 (CR) Gergő Tisza: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta cluster (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[04:40:01] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:43:55] <wikibugs>	 SRE, observability: Investigate "Ops Monitor (WMF)" wiki account and associated global user group - https://phabricator.wikimedia.org/T302552 (Legoktm)
[05:49:05] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:49] <icinga-wm>	 PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:15:23] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:49] <jinxer-wm>	 (Juniper alarm active) firing: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active   - https://alerts.wikimedia.org
[06:52:15] <wikibugs>	 (CR) Giuseppe Lavagetto: conftool: add request-actions / request-patterns (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763486 (owner: Giuseppe Lavagetto)
[06:52:34] <wikibugs>	 (CR) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763557 (owner: Giuseppe Lavagetto)
[07:06:03] <icinga-wm>	 RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:13:27] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:28:53] <wikibugs>	 SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Legoktm) >>! In T292322#7711900, @Joe wrote: > @tstarling @Legoktm do you think we can enable this on commons as well? The only negative effect will be to...
[07:41:38] <wikibugs>	 SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) >>! In T292322#7737027, @Legoktm wrote: >>>! In T292322#7711900, @Joe wrote: >> @tstarling @Legoktm do you think we can enable this on commons as wel...
[07:46:59] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:48] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: ditch automatic icmp probes for service::catalog [puppet] - https://gerrit.wikimedia.org/r/765548 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[07:52:47] <jinxer-wm>	 (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[07:54:09] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:18] <wikibugs>	 (CR) Filippo Giunchedi: "See inline" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220225T0800)
[08:01:22] <wikibugs>	 SRE, Developer-Advocacy, Gerrit, serviceops: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611 (hashar) Open→Declined This was an idea that floated around in the early days of us adopting Gerrit.  The point was to save the hassle of having to use `ssh -p 294...
[08:14:47] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:18:35] <wikibugs>	 (PS3) Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493)
[08:19:13] <wikibugs>	 (CR) Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git (2 comments) [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:19:56] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33991/console" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:29:32] <wikibugs>	 (PS2) Muehlenhoff: Add drmrs to Hiera list of datacentres [puppet] - https://gerrit.wikimedia.org/r/737328
[08:32:24] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:27] <wikibugs>	 (PS1) Muehlenhoff: Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690
[08:39:37] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690 (owner: Muehlenhoff)
[08:45:23] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:27] <wikibugs>	 (PS2) Muehlenhoff: Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690
[08:51:00] <wikibugs>	 (CR) Muehlenhoff: "FYI, I readded the two packages guarded for bullseye and later in https://gerrit.wikimedia.org/r/c/operations/puppet/+/765690/" [puppet] - https://gerrit.wikimedia.org/r/765648 (owner: Jbond)
[08:51:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Readd packages needed for node but with conditional for Bullseye [puppet] - https://gerrit.wikimedia.org/r/765690 (owner: Muehlenhoff)
[09:05:02] <wikibugs>	 (CR) Ayounsi: [C: -1] "One typo then lgtm! We can deploy on Monday" [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[09:12:45] <wikibugs>	 SRE, Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (MoritzMuehlenhoff) That sounds like a very promising lead, nice detective work! I think we can test the following as a fix:  /var/run/elasticsearch gets shipped via /usr/lib/tmpf...
[09:12:49] <jinxer-wm>	 (Juniper alarm active) resolved: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active   - https://alerts.wikimedia.org
[09:20:37] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM; as clinic duty person shall I +2 and merge also?" [puppet] - https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: Dzahn)
[09:20:39] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:51] <wikibugs>	 (CR) Ayounsi: [C: -1] "I made the current mechanism to stop advertising publicly the anycast prefixes if the local anycast servers are offline for any reason." [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[09:28:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[09:28:09] <wikibugs>	 (PS1) Majavah: admin: update my dotfiles [puppet] - https://gerrit.wikimedia.org/r/766063
[09:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] admin: update my dotfiles [puppet] - https://gerrit.wikimedia.org/r/766063 (owner: Majavah)
[09:33:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Nice! LGTM" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[09:34:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[09:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:01] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Use HAProxy 2.4 [puppet] - https://gerrit.wikimedia.org/r/765299 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:36:51] <wikibugs>	 (CR) MVernon: [C: +2] admin: add ammarpad to ldap_only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: Dzahn)
[09:37:02] <wikibugs>	 (PS2) MVernon: admin: add ammarpad to ldap_only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/765589 (https://phabricator.wikimedia.org/T302250) (owner: Dzahn)
[09:43:56] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (MatthewVernon) In progress→Resolved a:MatthewVernon Hi, I've done this now. Thanks, Matthew
[09:44:35] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:12] <wikibugs>	 (PS4) Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - https://gerrit.wikimedia.org/r/765309
[09:48:15] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti2029 as new node in codfw [puppet] - https://gerrit.wikimedia.org/r/766065 (https://phabricator.wikimedia.org/T298998)
[09:48:24] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (LSobanski) Sounds good to me, CC'ing @MatthewVernon for visibility.
[09:58:29] <wikibugs>	 (PS3) Filippo Giunchedi: logstash: add blackbox-exporter filter config [puppet] - https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946)
[09:59:07] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Provide a haproxy-restart script [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005)
[10:00:01] <wikibugs>	 (CR) Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/765309 (owner: Muehlenhoff)
[10:00:08] <wikibugs>	 (PS5) Muehlenhoff: sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - https://gerrit.wikimedia.org/r/765309
[10:00:21] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33992/console" [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:03:15] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:11] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:10:27] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:33] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:12:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add ganeti2029 as new node in codfw [puppet] - https://gerrit.wikimedia.org/r/766065 (https://phabricator.wikimedia.org/T298998) (owner: Muehlenhoff)
[10:13:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (ayounsi) Current status is that this is virtually solved (removing the last software blocker for drmrs), the CR above will be needed to allow adver...
[10:17:04] <vgutierrez>	 !log rolling upgrade to HAProxy 2.4.13 on HAProxy cache nodes - T290005
[10:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:12] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:20:13] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[10:22:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[10:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:57] <wikibugs>	 (CR) Ayounsi: "Wow, thanks!" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[10:26:19] <wikibugs>	 (CR) MMandere: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:27:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2029.codfw.wmnet with reason: Enable virtualisation in BIOS
[10:27:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2029.codfw.wmnet with reason: Enable virtualisation in BIOS
[10:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[10:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:20] <wikibugs>	 (PS1) David Caro: openstack:galera:node: make sure prometheus-mysqld-exporter is running [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557)
[10:38:30] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33994/console" [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557) (owner: David Caro)
[10:41:00] <moritzm>	 !log enabled virtualisation in BIOS for ganeti2029 T298998
[10:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:06] <stashbot>	 T298998: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998
[10:42:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[10:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[10:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:58] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Provide a haproxy-restart script [puppet] - https://gerrit.wikimedia.org/r/766069 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:44:07] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:59] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp4025 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766073 (https://phabricator.wikimedia.org/T290005)
[10:50:51] <wikibugs>	 (CR) Tchanders: [C: +1] Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[10:53:17] <wikibugs>	 (CR) Vgutierrez: [C: +2] site: Reimage cp4025 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766073 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:54:19] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4025.ulsfo.wmnet with OS buster
[10:54:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:31] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster
[11:00:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:04:04] <wikibugs>	 (CR) Ayounsi: "Thanks for looking at it. FYI the Netbox error is caught by the network report." [software/homer/deploy] - https://gerrit.wikimedia.org/r/765581 (owner: Volans)
[11:04:08] <moritzm>	 !log added ganeti2029 to codfw Ganeti cluster T298998
[11:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:14] <stashbot>	 T298998: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998
[11:07:49] <wikibugs>	 SRE, Wiki Loves Monuments 2022, Wikimedia-Mailing-lists, User-Ladsgroup: Request for creation: WLM-Network Mailing List - https://phabricator.wikimedia.org/T302510 (Ciell) For the purpose of the pilot, let's make it public and with archive please.
[11:10:57] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4025.ulsfo.wmnet with reason: host reimage
[11:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:41] <icinga-wm>	 PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:13:42] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4025.ulsfo.wmnet with reason: host reimage
[11:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:38] <XioNoX>	 !log re-activate BGP session to Seabone in esams
[11:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:33] <wikibugs>	 (PS1) Hokwelum: Add IP address to bringyour mirror and this was a request from Brien the mirror contact person [puppet] - https://gerrit.wikimedia.org/r/766076
[11:28:35] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/766076 (owner: Hokwelum)
[11:29:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add IP address to bringyour mirror and this was a request from Brien the mirror contact person [puppet] - https://gerrit.wikimedia.org/r/766076 (owner: Hokwelum)
[11:29:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[11:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[11:35:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:04] <wikibugs>	 (PS2) Hokwelum: Add IP address to bringyour mirror [puppet] - https://gerrit.wikimedia.org/r/766076
[11:40:19] <vgutierrez>	 !log pool cp4025 running HAProxy as TLS termination layer - T290005 T271421
[11:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:27] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:40:27] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[11:40:46] <wikibugs>	 (CR) ArielGlenn: [C: +2] Add IP address to bringyour mirror [puppet] - https://gerrit.wikimedia.org/r/766076 (owner: Hokwelum)
[11:41:21] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4025.ulsfo.wmnet with OS buster
[11:41:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:38] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster c...
[11:42:43] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez) In progress→Resolved envoy instances are currently being reimaged as HAProxy ones. We're cleaning up and pausing the envoyproxy experiment
[11:42:49] <wikibugs>	 (PS4) Cathal Mooney: Change CR policy for creating aggregate Anycast routes [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315)
[11:42:57] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[11:45:36] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job cache_envoy in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:46:14] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp2040 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766078 (https://phabricator.wikimedia.org/T290005)
[11:47:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2029.codfw.wmnet to ganeti01.svc.codfw.wmnet
[11:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:43] <wikibugs>	 (CR) Cathal Mooney: "Thanks Arzhel.  Fixed up the semi-colon, and put down some other comments.  Unsure if you think we should merge this or not?  I'm open to " [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[11:52:47] <jinxer-wm>	 (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[11:52:53] <wikibugs>	 (CR) Vgutierrez: [C: +2] site: Reimage cp2040 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766078 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[11:53:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to ganeti01.svc.codfw.wmnet
[11:53:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:54] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2040.codfw.wmnet with OS buster
[11:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:06] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2040.codfw.wmnet with OS buster
[11:54:28] <wikibugs>	 (CR) Cathal Mooney: wmf-netbox: fix UnboundLocalError (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/765581 (owner: Volans)
[11:54:45] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti2030 to list of codfw Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/766079
[11:55:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to ganeti01.svc.codfw.wmnet
[11:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add ganeti2030 to list of codfw Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/766079 (owner: Muehlenhoff)
[12:00:36] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:11:59] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2040.codfw.wmnet with reason: host reimage
[12:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:30] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Volans) Although it does not do what we need, some logic to download the lists from multiple clouds can be gath...
[12:12:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:12:58] <wikibugs>	 (PS1) Hnowlan: restbase-dev: change role of new hosts [puppet] - https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375)
[12:13:48] <wikibugs>	 (CR) Cathal Mooney: [C: +1] sre.ganeti.addnode: Validate bridge config of the switches [cookbooks] - https://gerrit.wikimedia.org/r/765309 (owner: Muehlenhoff)
[12:14:43] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2040.codfw.wmnet with reason: host reimage
[12:14:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:25:36] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[12:31:04] <wikibugs>	 (CR) Ayounsi: Change CR policy for creating aggregate Anycast routes (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[12:32:45] <vgutierrez>	 !log pool cp2040 running HAProxy as TLS termination layer - T290005 T271421
[12:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:53] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[12:32:53] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[12:34:04] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[12:37:42] <wikibugs>	 (PS1) MMandere: varnish: remove obsolete repo path reference [puppet] - https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579)
[12:38:31] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2040.codfw.wmnet with OS buster
[12:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:44] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2040.codfw.wmnet with OS buster c...
[12:38:48] <wikibugs>	 (PS5) Cathal Mooney: Change CR policy for creating aggregate Anycast routes [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315)
[12:39:08] <wikibugs>	 (CR) Cathal Mooney: "Thanks for feedback, policy term name updated." [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[12:39:14] <moritzm>	 !log drain instances off ganeti2007 T302577
[12:39:16] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): updating wmf-proxy-dashboard
[12:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:20] <stashbot>	 T302577: decommission ganeti2007 - https://phabricator.wikimedia.org/T302577
[12:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:54] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): updating wmf-proxy-dashboard (duration: 00m 37s)
[12:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:28] <wikibugs>	 (CR) MMandere: "Sample showing container downloading varnish6 and dependencies from main component:" [puppet] - https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579) (owner: MMandere)
[12:43:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:00] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: updating wmf-proxy-dashboard on eqiad1
[12:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:11] <wikibugs>	 (CR) Vivian Rook: [C: +1] openstack:galera:node: make sure prometheus-mysqld-exporter is running [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557) (owner: David Caro)
[12:45:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:46:04] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: updating wmf-proxy-dashboard on eqiad1 (duration: 02m 04s)
[12:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:16] <wikibugs>	 (CR) Ayounsi: [C: +1] Change CR policy for creating aggregate Anycast routes [homer/public] - https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: Cathal Mooney)
[13:13:04] <icinga-wm>	 RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:29:44] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process
[13:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:50] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process (duration: 00m 05s)
[13:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:24] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process
[13:30:28] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process
[13:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:34] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): debugging deployment process (duration: 00m 06s)
[13:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:24] <wikibugs>	 (PS2) Krinkle: Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451)
[13:33:29] <wikibugs>	 (PS3) Krinkle: Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451)
[13:35:37] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] openstack:galera:node: make sure prometheus-mysqld-exporter is running [puppet] - https://gerrit.wikimedia.org/r/766071 (https://phabricator.wikimedia.org/T302557) (owner: David Caro)
[13:46:41] <XioNoX>	 hello please hold on any netbox changes for a few minutes, we're restoring a backup after I clicked the wrong button
[13:48:57] <volans|off>	 !log restoring psql-all-dbs-20220225.sql.gz into netbox
[13:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:50] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:52:00] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: deploying wmf-proxy-dashboard and wmf-puppet-dashboard changes for real after fixing the scap config
[13:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:52] <volans|off>	 netbox backup has been restored, all looks good, it should be good to resume normal operations
[13:54:50] <wikibugs>	 (CR) Vgutierrez: [C: +1] varnish: remove obsolete repo path reference [puppet] - https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579) (owner: MMandere)
[13:56:50] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: deploying wmf-proxy-dashboard and wmf-puppet-dashboard changes for real after fixing the scap config (duration: 04m 50s)
[13:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:33] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp5005 as cache::haproxy_upload [puppet] - https://gerrit.wikimedia.org/r/766102 (https://phabricator.wikimedia.org/T290005)
[13:59:54] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:17] <wikibugs>	 (CR) MMandere: [C: +2] varnish: remove obsolete repo path reference [puppet] - https://gerrit.wikimedia.org/r/766088 (https://phabricator.wikimedia.org/T302579) (owner: MMandere)
[14:03:43] <wikibugs>	 (CR) Vgutierrez: [C: +2] site: Reimage cp5005 as cache::haproxy_upload [puppet] - https://gerrit.wikimedia.org/r/766102 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[14:04:32] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5005.eqsin.wmnet with OS buster
[14:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:44] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5005.eqsin.wmnet with OS buster
[14:05:25] <wikibugs>	 (PS1) Muehlenhoff: Add repository component component/ganeti3 for Buster [puppet] - https://gerrit.wikimedia.org/r/766106
[14:05:55] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: fix wmf-puppet-dashboard routes
[14:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:54] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:43] <wikibugs>	 (CR) Ayounsi: [C: +1] wmf-netbox: fix UnboundLocalError (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/765581 (owner: Volans)
[14:13:42] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: fix wmf-puppet-dashboard routes (duration: 07m 47s)
[14:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:02] <wikibugs>	 (PS1) Majavah: dynamicproxy: fix delete_records_for() method call [puppet] - https://gerrit.wikimedia.org/r/766107
[14:19:14] <wikibugs>	 (CR) Vivian Rook: [C: +2] dynamicproxy: fix delete_records_for() method call [puppet] - https://gerrit.wikimedia.org/r/766107 (owner: Majavah)
[14:21:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) Open→Resolved a:fgiunchedi With the icmp probes gone I don'...
[14:25:13] <wikibugs>	 (CR) Physikerwelt: [C: +1] mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751439 (owner: PipelineBot)
[14:28:52] <wikibugs>	 (PS1) Majavah: dynamicproxy: fix condition [puppet] - https://gerrit.wikimedia.org/r/766109
[14:30:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add repository component component/ganeti3 for Buster [puppet] - https://gerrit.wikimedia.org/r/766106 (owner: Muehlenhoff)
[14:32:01] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5005.eqsin.wmnet with reason: host reimage
[14:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:50] <wikibugs>	 (CR) Vivian Rook: [C: +2] dynamicproxy: fix condition [puppet] - https://gerrit.wikimedia.org/r/766109 (owner: Majavah)
[14:35:28] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5005.eqsin.wmnet with reason: host reimage
[14:35:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:11] <wikibugs>	 Puppet, Horizon, Infrastructure-Foundations, Patch-For-Review: Invalid yaml in horizon hiera editor results in confusing error message - https://phabricator.wikimedia.org/T241999 (Majavah) a:Majavah The PS above updates the error to look like this: {F34966331}
[15:19:24] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5005.eqsin.wmnet with OS buster
[15:19:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:37] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5005.eqsin.wmnet with OS buster c...
[15:23:19] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:25:19] <vgutierrez>	 !log pool cp5005 running HAProxy as TLS termination layer - T290005 T271421
[15:25:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:26] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[15:25:26] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[15:34:31] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp3063 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766119 (https://phabricator.wikimedia.org/T290005)
[15:36:44] <moritzm>	 !log imported PHP 7.4 7.4.28-1+0~20220217.59+debian10~1.gbp1950+wmf1+buster1 to component/php74 for buster-wikimedia T271736
[15:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:51] <stashbot>	 T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736
[15:37:17] <wikibugs>	 (CR) Vgutierrez: [C: +2] site: Reimage cp3063 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/766119 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:38:25] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3063.esams.wmnet with OS buster
[15:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:37] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster
[15:39:21] <wikibugs>	 SRE, ops-ulsfo, Traffic, decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (Vgutierrez)
[15:39:43] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:40:03] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:41:33] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:43:21] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:43:56] <wikibugs>	 (CR) Eevans: [C: +1] restbase-dev: change role of new hosts [puppet] - https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375) (owner: Hnowlan)
[15:44:30] <Tamzin>	 "upstream connect error or disconnect/reset before headers. reset reason: overflow"
[15:44:52] <firefly_wp>	 You broke it Tamzin :P
[15:44:57] <Tamzin>	 I keep doing that
[15:45:09] <firefly_wp>	 Hehehe
[15:45:25] <Tks4Fish>	 known issues?
[15:45:35] <Tks4Fish>	 Request from - via cp1081.eqiad.wmnet, ATS/8.0.8
[15:45:35] <Tks4Fish>	 Error: 502, Next Hop Connection Failed at 2022-02-25 15:45:07 GMT
[15:45:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[15:45:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1085.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/
[15:45:37] <icinga-wm>	 al
[15:45:38] <Seddon>	 Experiencing in here as well
[15:45:40] <Seddon>	 UK
[15:45:49] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[15:45:51] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[15:45:52] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.002598 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[15:45:56] <Tks4Fish>	 API on meta is running fine-ish though
[15:46:09] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:46:13] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:46:21] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:46:28] <icinga-wm>	 PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[15:46:32] <XioNoX>	 yo
[15:46:43] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:46:43] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:46:44] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:46:48] <Seddon>	 Front end ddos?
[15:46:53] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:05] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:05] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:05] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:09] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:09] <XioNoX>	 looking at network
[15:47:09] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:09] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:09] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 43.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:47:11] <taavi>	 here
[15:47:13] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:13] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:13] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:13] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:13] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:14] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:15] <_joe_>	 here we go again heh
[15:47:17] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:17] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:19] <apergos>	 gah
[15:47:20] <herron>	 hey
[15:47:23] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:23] <_joe_>	 XioNoX: not network
[15:47:25] <sukhe>	 hi
[15:47:33] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:33] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:35] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:35] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:35] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:35] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:35] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:42] <wikibugs>	 (CR) Ladsgroup: [C: +1] Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) (owner: Krinkle)
[15:47:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:47:53] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:47:55] <jinxer-wm>	 (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[15:47:55] <jinxer-wm>	 (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[15:48:01] <XioNoX>	 yeah network looks fine
[15:48:03] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 50.21 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:48:13] <Amir1>	 here
[15:48:26] <akosiaris>	 same
[15:48:28] <jhathaway>	 here as well, cache busting again?
[15:48:44] <sobanski>	 Acked the alerts
[15:49:59] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[15:50:00] <_joe_>	 please not here
[15:50:06] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6898 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[15:50:09] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:50:11] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:50:13] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 79.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:50:14] <_joe_>	 and yes it's over
[15:50:36] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job cache_envoy in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[15:50:45] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:50:45] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:50:45] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:50:50] <icinga-wm>	 RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[15:50:57] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:09] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:09] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:09] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:13] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:13] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:13] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:17] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:17] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:17] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:19] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:19] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:19] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:51:33] <TheresNoTime>	 :)
[15:52:47] <jinxer-wm>	 (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[15:52:51] <wikibugs>	 (PS1) ZPapierski: Replace Swift native API with S3 [deployment-charts] - https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494)
[15:52:55] <jinxer-wm>	 (ProbeHttpFailed) resolved: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[15:53:00] <jinxer-wm>	 (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org
[15:53:13] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:53:55] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:53:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:54:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:05] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:56:05] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:19] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:49] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:51] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:51] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:51] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:51] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:51] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:58:52] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.452 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:59:07] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:00:17] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:02:35] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[16:03:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (Jclark-ctr) dumpsdata1006 E1  U19  port19  cableid#20220257 dumpsdata1006 F1   U19 port19  cableid#20220258
[16:03:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (Jclark-ctr)
[16:05:57] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:06:33] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3063.esams.wmnet with reason: host reimage
[16:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:04] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3063.esams.wmnet with reason: host reimage
[16:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:33] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:12:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (Jclark-ctr) | name | rack| port | cableid elastic1089 E1 21 20220145 elastic1090 E1 22 20220146 elastic1091 E2 21 20220148 elastic1092 E2 22...
[16:17:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Jclark-ctr) | name |rack_name |port |cableid ml-cache1001 E1 23 20220147 ml-cache1002 E2 23 20220137 ml-cache1003 F1 23 20220125 |
[16:17:37] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:18:06] <dancy>	 Can someone with root access on deploy1002 send me the contents of /var/lib/deploy-mwdebug/error ?
[16:18:32] <dancy>	 (or just make a copy of it that I can read from that machine)
[16:18:53] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:23:57] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:27:32] <wikibugs>	 SRE, serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (Joe) Open→Resolved I just removed the cert from puppet.
[16:29:37] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[16:35:06] <vgutierrez>	 !log pool cp3063 running HAProxy as TLS termination layer - T290005 T271421
[16:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:14] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[16:35:14] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[16:35:24] <wikibugs>	 SRE, Wiki Loves Monuments 2022, Wikimedia-Mailing-lists, User-Ladsgroup: Request for creation: WLM-Network Mailing List - https://phabricator.wikimedia.org/T302510 (Ladsgroup) Open→Resolved https://lists.wikimedia.org/postorius/lists/wlm-network.lists.wikimedia.org
[16:36:23] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3063.esams.wmnet with OS buster
[16:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:35] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3063.esams.wmnet with OS buster c...
[16:37:30] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[16:40:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[16:40:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[16:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21511 and previous config saved to /var/cache/conftool/dbconfig/20220225-164020-ladsgroup.json
[16:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:35] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[16:40:57] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:43:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21512 and previous config saved to /var/cache/conftool/dbconfig/20220225-164323-ladsgroup.json
[16:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:03] <wikibugs>	 (PS2) Hnowlan: restbase-dev: change role of new hosts [puppet] - https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375)
[16:45:10] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, observability, and 2 others: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (EChetty)
[16:47:07] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, observability, and 2 others: Upgrade Kafka Risk Evaluation - https://phabricator.wikimedia.org/T302610 (EChetty)
[16:48:25] <wikibugs>	 SRE, Data-Engineering, observability, serviceops, Epic: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (EChetty)
[16:53:32] <wikibugs>	 (PS2) BBlack: eqiad lvs: add interfaces and IPs for rows E and F [puppet] - https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419)
[16:53:34] <wikibugs>	 (PS1) BBlack: Reject invalid hex encoding in URIs [puppet] - https://gerrit.wikimedia.org/r/766162
[16:54:09] <wikibugs>	 (PS2) BBlack: Reject invalid hex encoding in URIs [puppet] - https://gerrit.wikimedia.org/r/766162
[16:56:05] <wikibugs>	 (CR) Zabe: "This seems to have already been done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/748757" [puppet] - https://gerrit.wikimedia.org/r/737328 (owner: Muehlenhoff)
[16:56:10] <wikibugs>	 (PS3) BBlack: Reject invalid hex encoding in URIs [puppet] - https://gerrit.wikimedia.org/r/766162
[16:58:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21513 and previous config saved to /var/cache/conftool/dbconfig/20220225-165828-ladsgroup.json
[16:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:58] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates: better error messages and code cleanup
[17:00:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:55] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates: better error messages and code cleanup (duration: 01m 57s)
[17:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:43] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (Dzahn) @Ammarpad Here is a list of things that come with the NDA group you have now:  https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#NDA_group
[17:12:35] <ebernhardson>	 !log manual trigger of cirrus SaneitizeJobs for with 2hr refresh
[17:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21514 and previous config saved to /var/cache/conftool/dbconfig/20220225-171333-ladsgroup.json
[17:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:14] <wikibugs>	 (PS9) Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - https://gerrit.wikimedia.org/r/763486
[17:14:16] <wikibugs>	 (PS7) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557
[17:21:15] <logmsgbot>	 !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates: better error messages and code cleanup (prod)
[17:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:00] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 67 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:28:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21515 and previous config saved to /var/cache/conftool/dbconfig/20220225-172837-ladsgroup.json
[17:28:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[17:28:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[17:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:45] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[17:28:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21516 and previous config saved to /var/cache/conftool/dbconfig/20220225-172845-ladsgroup.json
[17:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:36] <logmsgbot>	 !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates: better error messages and code cleanup (prod) (duration: 08m 20s)
[17:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21517 and previous config saved to /var/cache/conftool/dbconfig/20220225-173356-ladsgroup.json
[17:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:03] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[17:37:36] <wikibugs>	 Puppet, Horizon, Infrastructure-Foundations: Invalid yaml in horizon hiera editor results in confusing error message - https://phabricator.wikimedia.org/T241999 (Majavah) Open→Resolved
[17:40:46] <wikibugs>	 (PS3) MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029)
[17:41:04] <wikibugs>	 (CR) MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[17:41:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[17:42:12] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 65 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:42:18] <wikibugs>	 (PS4) MewOphaswongse: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029)
[17:46:34] <wikibugs>	 SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada)
[17:46:54] <wikibugs>	 SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada) In case someone's wondering, DuckDuckGo doesn't actually have a webmaster console. Strange.
[17:49:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21518 and previous config saved to /var/cache/conftool/dbconfig/20220225-174901-ladsgroup.json
[17:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:02:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:04:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21519 and previous config saved to /var/cache/conftool/dbconfig/20220225-180406-ladsgroup.json
[18:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:08] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:11:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:11:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:12:28] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 64 probes of 664 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:19:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21520 and previous config saved to /var/cache/conftool/dbconfig/20220225-181911-ladsgroup.json
[18:19:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[18:19:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[18:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:18] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[18:19:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21521 and previous config saved to /var/cache/conftool/dbconfig/20220225-181918-ladsgroup.json
[18:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21522 and previous config saved to /var/cache/conftool/dbconfig/20220225-182223-ladsgroup.json
[18:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:09] <wikibugs>	 SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (SCherukuwada) Just filed https://phabricator.wikimedia.org/T302617 to start discussing domain ownership verification on various consoles with SRE. I'll be unavailable for a few days...
[18:32:37] <wikibugs>	 SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (SCherukuwada)
[18:37:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21523 and previous config saved to /var/cache/conftool/dbconfig/20220225-183728-ladsgroup.json
[18:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:00] <wikibugs>	 (CR) Gergő Tisza: [C: +1] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[18:46:25] <wikibugs>	 (CR) Gergő Tisza: [C: +2] GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[18:47:04] <wikibugs>	 (Merged) jenkins-bot: GLAM event: add wgGECampaigns and wgGECampaignTopics configs for beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/765349 (https://phabricator.wikimedia.org/T301029) (owner: MewOphaswongse)
[18:52:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21524 and previous config saved to /var/cache/conftool/dbconfig/20220225-185233-ladsgroup.json
[18:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:08] <wikibugs>	 SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) The answer to most of this was, for the most part, established at T298723- verification will be done through DNS in all cases- as it was done from Google (for centralization, uniformization a...
[19:03:23] <wikibugs>	 SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) Checking the DNS records, there seems to be entries only of the 7 or so top level domains (e.g. wikipedia.org), and maybe that was enough for all subdomains? Do you know if that would work fo...
[19:05:26] <wikibugs>	 SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada) You're absolutely right to be concerned about traffic from search engines. That said, I'm familiar enough with how this works to be comfortable owning it, and my PM counterpart and I (an...
[19:07:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21525 and previous config saved to /var/cache/conftool/dbconfig/20220225-190737-ladsgroup.json
[19:07:40] <wikibugs>	 SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) Thanks to you for working on this!
[19:07:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[19:07:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[19:07:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[19:07:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:47] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[19:07:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[19:07:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:09:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21526 and previous config saved to /var/cache/conftool/dbconfig/20220225-190939-ladsgroup.json
[19:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21527 and previous config saved to /var/cache/conftool/dbconfig/20220225-191144-ladsgroup.json
[19:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21528 and previous config saved to /var/cache/conftool/dbconfig/20220225-192649-ladsgroup.json
[19:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:23] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "This looks reasonable for v0, just some nits." [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto)
[19:28:39] <wikibugs>	 (03CR) 10CDanis: varnish/frontend: consume etcd data for dynamic banning of requests. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto)
[19:38:01] <wikibugs>	 (03PS1) 10BBlack: Cache Badtitle 400s for 60s in varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/766187
[19:40:12] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:41:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21529 and previous config saved to /var/cache/conftool/dbconfig/20220225-194153-ladsgroup.json
[19:41:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:08] <wikibugs>	 (03PS2) 10BBlack: Cache Badtitle 400s for 60s in varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/766187
[19:47:22] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:52:47] <jinxer-wm>	 (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[19:56:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21530 and previous config saved to /var/cache/conftool/dbconfig/20220225-195658-ladsgroup.json
[19:57:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:57:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:05] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[19:57:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:58:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[19:59:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[19:59:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:59:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21531 and previous config saved to /var/cache/conftool/dbconfig/20220225-195917-ladsgroup.json
[19:59:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21532 and previous config saved to /var/cache/conftool/dbconfig/20220225-200322-ladsgroup.json
[20:03:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:28] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[20:10:54] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Cache Badtitle 400s for 60s in varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/766187 (owner: 10BBlack)
[20:14:35] <wikibugs>	 (03PS8) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:17:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi)
[20:18:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21533 and previous config saved to /var/cache/conftool/dbconfig/20220225-201826-ladsgroup.json
[20:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:15] <ryankemper>	 Need to roll a quick wdqs deploy to get the latest version of WDQS out. Apologies for the Friday noise:
[20:22:26] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.104`. Pre-deploy tests passing on canary `wdqs1003`
[20:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:53] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@5d384a5]: 0.3.104
[20:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:48] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.104` on canary `wdqs1003`; proceeding to rest of fleet
[20:23:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:11] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@5d384a5]: 0.3.104 (duration: 07m 18s)
[20:30:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:39] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[20:31:44] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[20:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:50] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[20:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21534 and previous config saved to /var/cache/conftool/dbconfig/20220225-203331-ladsgroup.json
[20:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[20:38:36] <ryankemper>	 ^ Looking. Suspect it's just a transient alert, but checking to make sure
[20:40:22] <wikibugs>	 (03CR) 10Dzahn: "fwiw: Debugged this in cloud VPS and the reason is a missing Apache module providing SetEnvIfNoCase" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 (owner: 10JMeybohm)
[20:42:17] <wikibugs>	 (03CR) 10Dzahn: "and the reason we use that is this:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/765382 (owner: 10JMeybohm)
[20:45:24] <ryankemper>	 WDQS updates are processing fine in eqiad; it might be that Prometheus is having trouble talking to the blazegraph exporter. Digging a bit more
[20:47:58] <wikibugs>	 (03PS9) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:48:19] <wikibugs>	 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10sbassett)
[20:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21535 and previous config saved to /var/cache/conftool/dbconfig/20220225-204836-ladsgroup.json
[20:48:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[20:48:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[20:48:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:44] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[20:48:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21536 and previous config saved to /var/cache/conftool/dbconfig/20220225-204844-ladsgroup.json
[20:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi)
[20:51:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21537 and previous config saved to /var/cache/conftool/dbconfig/20220225-205149-ladsgroup.json
[20:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:01] <wikibugs>	 (03PS10) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[21:01:00] <ryankemper>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good. Still looking into `Reduced availability for job jmx_wdqs_updater`; will try restarting blazegraph exporters in eqiad
[21:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:12] <ryankemper>	 !log [WDQS] Restarted wdqs eqiad exporters: `ryankemper@cumin1001:~$ sudo -E cumin -b 1 'wdqs1*' 'systemctl restart prometheus-blazegraph-exporter-wdqs-blazegraph.service'`
[21:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi)
[21:02:48] <wikibugs>	 (03PS1) 10Dzahn: load apache module setenvif [container/miscweb] - 10https://gerrit.wikimedia.org/r/766190 (https://phabricator.wikimedia.org/T300171)
[21:03:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] load apache module setenvif [container/miscweb] - 10https://gerrit.wikimedia.org/r/766190 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[21:05:49] <wikibugs>	 (03Merged) 10jenkins-bot: load apache module setenvif [container/miscweb] - 10https://gerrit.wikimedia.org/r/766190 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[21:06:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21538 and previous config saved to /var/cache/conftool/dbconfig/20220225-210654-ladsgroup.json
[21:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:44] <wikibugs>	 (03PS1) 10Dzahn: miscweb: bumb staging to 2022-02-25-210804-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/766192
[21:21:12] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: bumb staging to 2022-02-25-210804-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/766192 (owner: 10Dzahn)
[21:22:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21539 and previous config saved to /var/cache/conftool/dbconfig/20220225-212159-ladsgroup.json
[21:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:06] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: bumb staging to 2022-02-25-210804-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/766192 (owner: 10Dzahn)
[21:37:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21540 and previous config saved to /var/cache/conftool/dbconfig/20220225-213704-ladsgroup.json
[21:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:11] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[21:49:00] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10TheresNoTime) Just a note, https://routinator.docs.nlnetlabs.nl/en/latest/installation-notes.html#using-tmpfs-for-the-rpki-cache moved to https://routinator.docs.nlnetlabs.nl/en/lat...
[22:06:46] <wikibugs>	 (03PS2) 10BryanDavis: toolforge: redirect legacy ru_monuments to ru-monuments [puppet] - 10https://gerrit.wikimedia.org/r/762900 (https://phabricator.wikimedia.org/T301720)
[22:10:29] <wikibugs>	 (03CR) 10Cwhite: "Thanks for putting this together!" [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[22:27:32] <wikibugs>	 (03PS1) 10RLazarus: varnish: Add an explicit "apt install docker.io" step to tests/README.md [puppet] - 10https://gerrit.wikimedia.org/r/766194
[22:40:22] <wikibugs>	 (03PS11) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[22:42:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi)
[22:53:21] <wikibugs>	 (03PS1) 10Razzi: Test change [cookbooks] - 10https://gerrit.wikimedia.org/r/766197
[22:53:50] <wikibugs>	 (03CR) 10Razzi: "Just to see if CI is broken on master, to compare with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/760880" [cookbooks] - 10https://gerrit.wikimedia.org/r/766197 (owner: 10Razzi)
[22:55:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Test change [cookbooks] - 10https://gerrit.wikimedia.org/r/766197 (owner: 10Razzi)
[23:03:36] <wikibugs>	 (03CR) 10Razzi: "It appears CI is failing on master; I made a trivial change for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/766197 and sure en" [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi)
[23:14:33] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] varnish: Add an explicit "apt install docker.io" step to tests/README.md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766194 (owner: 10RLazarus)
[23:17:43] <wikibugs>	 (03PS2) 10RLazarus: varnish: Add an explicit "apt install docker.io" step to tests/README.md [puppet] - 10https://gerrit.wikimedia.org/r/766194
[23:18:36] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] varnish: Add an explicit "apt install docker.io" step to tests/README.md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766194 (owner: 10RLazarus)
[23:30:30] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[23:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:52] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[23:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:34] <wikibugs>	 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) Thanks Moritz!  I started [[ https://github.com/inflatador/ansible-deployment-prep/blob/main/roles/elastic/tasks/main.yml | writing a playbook ]] to make the changes a...
[23:45:28] <wikibugs>	 (03CR) 10Dzahn: "deployed on staging: curl --compressed --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/766192 (owner: 10Dzahn)
[23:52:47] <jinxer-wm>	 (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org