[00:38:39] <jinxer-wm>	 (NodeTextfileStale) firing: (12) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:23:59] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Tgr) Just as a heads-up: we recently increased AQS traffic from MediaWiki PHP code (T324675) which seems to work fine (it's causing some timeouts:...
[02:01:24] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:34:01] <wikibugs>	 (CR) Ladsgroup: [C: +1] "I checked all of them with orch/dbctl and they all are correct." [dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[04:38:39] <jinxer-wm>	 (NodeTextfileStale) firing: (12) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:29:58] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Joe) >>! In T327920#8647335, @Tgr wrote: > Just as a heads-up: we recently increased AQS traffic from MediaWiki PHP code (T324675) which seems to w...
[06:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:18:55] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Tgr) MwHttpRequest (that is, Guzzle/php-curl) and the URLs from https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews. I don't know if RESTBa...
[07:42:25] <marostegui>	 !log Enable replication codfw -> eqiad on pcX T330619
[07:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:30] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[07:45:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add d-i config for Bookworm [puppet] - https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[07:45:49] <wikibugs>	 (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[07:49:31] <wikibugs>	 (PS1) Elukey: admin_ng: add configuration for the istio knative local gateway [deployment-charts] - https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767)
[07:51:39] <wikibugs>	 (PS2) Elukey: admin_ng: add configuration for the istio knative local gateway [deployment-charts] - https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767)
[07:53:54] <wikibugs>	 (CR) Jelto: [C: +1] Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[07:55:15] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, should be merged during the switchover" [puppet] - https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[07:57:17] <marostegui>	 !log Enable replication codfw -> eqiad on x1 T330619
[07:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:21] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[07:57:21] <wikibugs>	 (CR) Muehlenhoff: "Merging since Simon is off this week." [puppet] - https://gerrit.wikimedia.org/r/891798 (https://phabricator.wikimedia.org/T330364) (owner: Slyngshede)
[07:57:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Access to analytics-privatedata-users for bscarone [puppet] - https://gerrit.wikimedia.org/r/891798 (https://phabricator.wikimedia.org/T330364) (owner: Slyngshede)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:42] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:19] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, should be merged during the switchover. We should carefully rebase this once the TTL change is merged" [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[08:05:58] <wikibugs>	 (PS1) Muehlenhoff: Readd email address for reactivated bscarone account [puppet] - https://gerrit.wikimedia.org/r/892355 (https://phabricator.wikimedia.org/T330364)
[08:07:35] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[08:17:59] <wikibugs>	 (PS1) Nicolas Fraison: presto: add gc tag -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution [puppet] - https://gerrit.wikimedia.org/r/892357
[08:21:34] <wikibugs>	 (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39838/console" [puppet] - https://gerrit.wikimedia.org/r/892357 (owner: Nicolas Fraison)
[08:21:57] <wikibugs>	 (CR) Nicolas Fraison: [V: +1 C: +2] presto: add gc tag -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution [puppet] - https://gerrit.wikimedia.org/r/892357 (owner: Nicolas Fraison)
[08:25:31] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! To be safe please disable puppet on an-coord100[12] before merging and test the change on an-test-coord1001 first." [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison)
[08:29:13] <wikibugs>	 (PS3) Nicolas Fraison: hive: add gc logs to hiveservers and hive-metastore [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168)
[08:30:36] <wikibugs>	 (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39839/console" [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison)
[08:32:25] <marostegui>	 !log Enable replication codfw -> eqiad on es4 and es5 T330619
[08:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:30] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[08:36:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Readd email address for reactivated bscarone account [puppet] - https://gerrit.wikimedia.org/r/892355 (https://phabricator.wikimedia.org/T330364) (owner: Muehlenhoff)
[08:38:39] <jinxer-wm>	 (NodeTextfileStale) firing: (12) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:46:19] <wikibugs>	 SRE, SRE-Access-Requests, Research, Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @bscarone I have activated your access. You should have also gotte...
[08:51:36] <marostegui>	 !log Enable replication codfw -> eqiad on s2 T330619
[08:51:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:41] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[08:51:52] <marostegui>	 !log Disable GTID on es% x1 and s% on codfw masters T330619
[08:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:15] <wikibugs>	 (PS2) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[08:53:10] <wikibugs>	 (CR) Jelto: [C: -1] "see in-ine" [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[08:54:57] <vgutierrez>	 !log test haproxy hardening in cp4045 - T323944
[08:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:01] <stashbot>	 T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944
[08:55:01] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: enable haproxy systemd hardening on cp4045 [puppet] - https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: Ssingh)
[08:56:00] <moritzm>	 !log updating mw/codfw to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270
[08:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:30] <wikibugs>	 (PS2) EoghanGaffney: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931)
[09:00:04] <jouncebot>	 hashar and jnuche: Time to snap out of that daydream and deploy Jenkins upgrade. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T0900).
[09:00:10] <wikibugs>	 (PS3) EoghanGaffney: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931)
[09:01:34] <wikibugs>	 (PS3) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[09:01:43] <wikibugs>	 (PS4) Jbond: Add bookworm pxelinux.cfg [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[09:01:45] <wikibugs>	 (PS1) Jbond: spdx: update spdx new files to ignore files regardless of path [puppet] - https://gerrit.wikimedia.org/r/892361
[09:02:42] <wikibugs>	 (CR) Jbond: [C: +2] spdx: update spdx new files to ignore files regardless of path [puppet] - https://gerrit.wikimedia.org/r/892361 (owner: Jbond)
[09:03:13] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[09:03:40] <hashar>	 jelto: can you leave us gitlab up for a few minutes? jnuche and I would like to update the CI Jenkins right now
[09:04:13] <hashar>	 or maybe it is not even needed ;)
[09:04:14] <jelto>	 hashar: GitLab maintenance is planned for 10UTC, so in one hour
[09:04:19] <hashar>	 AH great
[09:04:46] <hashar>	 so we have ample time
[09:04:49] <hashar>	 thank you!
[09:06:24] <wikibugs>	 (PS4) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[09:06:38] <wikibugs>	 (CR) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover (1 comment) [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:07:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:10:26] <wikibugs>	 (CR) Marostegui: [C: +1] "We can do pcX later on." [dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[09:12:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:12:45] <marostegui>	 !log Enable replication codfw -> eqiad on s3 T330619
[09:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:50] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:13:04] <jynus>	 latency spike since 8:57
[09:13:19] <jynus>	 for parsoid
[09:13:54] <jynus>	 eqiad only
[09:14:51] <wikibugs>	 (PS5) Jelto: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:15:17] <logmsgbot>	 !log hashar@deploy1002 Started deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided)
[09:15:30] <hashar>	 we are doing the Jenkins updates
[09:16:04] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided) (duration: 00m 46s)
[09:16:26] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: Ayounsi)
[09:16:40] <wikibugs>	 (PS1) Filippo Giunchedi: sre: more readable varnish/haproxy frontend unavailable [alerts] - https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405)
[09:16:58] <wikibugs>	 (CR) Elukey: [C: +2] Re-image: clear DHCP cache sooner [cookbooks] - https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: Ayounsi)
[09:17:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:19:25] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:20:08] <hashar>	 !log Restarting CI Jenkins T330045
[09:20:10] <wikibugs>	 (CR) Jelto: [C: +1] "IPv6 PTR record for gitlab.wikimedia.org was missing, I amended it with 300 seconds." [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:13] <stashbot>	 T330045: Upgrade Jenkins to latest LTS 2.375.3 - https://phabricator.wikimedia.org/T330045
[09:20:27] <wikibugs>	 (CR) Jelto: [C: +2] Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:21:50] <wikibugs>	 (PS1) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 7th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551)
[09:21:52] <wikibugs>	 (PS1) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 8th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892364 (https://phabricator.wikimedia.org/T308133)
[09:21:54] <wikibugs>	 (PS1) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 9th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892365 (https://phabricator.wikimedia.org/T308134)
[09:21:57] <elukey>	 latency seems trending down for parsoid
[09:22:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:24:18] <wikibugs>	 (CR) Volans: [C: +2] icinga: uniform code and add test [software/spicerack] - https://gerrit.wikimedia.org/r/891803 (owner: Volans)
[09:26:15] <wikibugs>	 (PS1) Volans: OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495)
[09:26:30] <marostegui>	 !log Enable replication codfw -> eqiad on s8 T330619
[09:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:34] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:27:44] <wikibugs>	 (CR) Jcrespo: "I think this is more readable and less confusing (among a lot more work to be done in this regard), but it should be ultimately the servic" [alerts] - https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) (owner: Filippo Giunchedi)
[09:27:52] <marostegui>	 !log Enable replication codfw -> eqiad on s7 T330619
[09:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:05] <wikibugs>	 (Merged) jenkins-bot: icinga: uniform code and add test [software/spicerack] - https://gerrit.wikimedia.org/r/891803 (owner: Volans)
[09:28:48] <wikibugs>	 (CR) Muehlenhoff: [C: +1] OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495) (owner: Volans)
[09:30:15] <wikibugs>	 (PS1) Marostegui: realm.pp: Add private tables [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502)
[09:31:08] <wikibugs>	 (PS4) Jelto: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:31:47] <wikibugs>	 (CR) Volans: [C: +2] OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495) (owner: Volans)
[09:32:31] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:33:14] <wikibugs>	 (CR) Nicolas Fraison: [V: +1 C: +2] hive: add gc logs to hiveservers and hive-metastore [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison)
[09:33:18] <wikibugs>	 (PS6) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[09:33:20] <wikibugs>	 (PS5) Jelto: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:33:47] <wikibugs>	 (Merged) jenkins-bot: OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495) (owner: Volans)
[09:34:27] <marostegui>	 !log Enable replication codfw -> eqiad on s6 T330619
[09:34:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:32] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:34:39] <hashar>	 jelto: we have completed the Jenkins upgrades ;]
[09:35:00] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:35:16] <jelto>	 hashar: thanks for letting us know!
[09:35:39] <wikibugs>	 (CR) Jbond: "thanks updated" [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[09:36:44] <wikibugs>	 (CR) CI reject: [V: -1] redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[09:36:51] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:36:59] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:37:53] <wikibugs>	 (PS1) Elukey: Revert "Re-image: clear DHCP cache sooner" [cookbooks] - https://gerrit.wikimedia.org/r/891981
[09:38:17] <wikibugs>	 (CR) Elukey: [C: +2] "The facter command fails to run, reverting.." [cookbooks] - https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: Ayounsi)
[09:39:29] <marostegui>	 !log Enable replication codfw -> eqiad on s5 T330619
[09:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:34] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:39:44] <wikibugs>	 (PS3) Majavah: P:toolforge::k8s::haproxy: remove support for hash node list [puppet] - https://gerrit.wikimedia.org/r/891826
[09:39:46] <wikibugs>	 (PS5) Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443)
[09:39:48] <wikibugs>	 (PS1) Majavah: P:toolforge: use api gateway for jobs cli [puppet] - https://gerrit.wikimedia.org/r/892370 (https://phabricator.wikimedia.org/T329443)
[09:39:59] <wikibugs>	 (CR) Klausman: [C: +1] admin_ng: add configuration for the istio knative local gateway [deployment-charts] - https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:40:36] <wikibugs>	 (CR) Elukey: [C: +2] Revert "Re-image: clear DHCP cache sooner" [cookbooks] - https://gerrit.wikimedia.org/r/891981 (owner: Elukey)
[09:43:50] <wikibugs>	 (PS7) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[09:44:54] <marostegui>	 !log Enable replication codfw -> eqiad on s1 T330619
[09:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:59] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:46:45] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:46:52] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:47:30] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:49:49] <wikibugs>	 (CR) Marostegui: "This can be merged anytime btw" [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[09:52:16] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:53:22] <marostegui>	 elukey: could that be you?^
[10:00:45] <wikibugs>	 SRE-OnFire, Observability-Alerting, SRE Observability (FY2022/2023-Q3), User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (fgiunchedi) I think the titles are indeed far easier to read and already led to other improvements (T330405)...
[10:02:23] <wikibugs>	 (CR) Jaime Nuche: scap: add required Python3 venv package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[10:04:40] <wikibugs>	 (CR) Clément Goubert: [C: -2] "Preparation for 2023-02-28 datacenter service switchover." [dns] - https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[10:05:29] <wikibugs>	 (CR) Clément Goubert: [C: -2] "Preparation for 2023-02-28 datacenter service switchover." [puppet] - https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[10:05:33] <elukey>	 marostegui: o/ in theory no, I didn't merge puppet changes today
[10:05:38] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[10:06:28] <marostegui>	 elukey: ah ok, I saw the +2 from you 
[10:06:48] <wikibugs>	 (CR) Clément Goubert: [C: +1] re-introduce webserver-misc-static.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn)
[10:07:06] <marostegui>	 it looks like it is from nfraison 
[10:07:10] <elukey>	 nfraison, jbond --^
[10:07:15] <elukey>	 there are two commits from you
[10:07:46] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.switchdc.mediawiki: use python warmup script [cookbooks] - https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: Clément Goubert)
[10:07:47] <nfraison>	 elukey: yes I've requested jbond if I can merge in #sre
[10:08:32] <marostegui>	 !log Enable replication codfw -> eqiad on s4 T330619
[10:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:37] <stashbot>	 T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[10:09:15] <elukey>	 nfraison: ahh okok
[10:09:45] <wikibugs>	 (Merged) jenkins-bot: sre.switchdc.mediawiki: use python warmup script [cookbooks] - https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: Clément Goubert)
[10:10:19] <wikibugs>	 (PS8) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[10:11:52] <wikibugs>	 (PS1) Jbond: ceph: remove cloud data [labs/private] - https://gerrit.wikimedia.org/r/892376
[10:13:38] <wikibugs>	 (PS1) Elukey: sre.hosts.reimage: add full path for facter and run clear dchp earlier [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421)
[10:13:51] <wikibugs>	 (PS9) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[10:17:50] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931
[10:17:54] <stashbot>	 T329931: Switchover gitlab (gitlab1004 -> gitlab2002) - https://phabricator.wikimedia.org/T329931
[10:18:03] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931
[10:18:16] <wikibugs>	 (CR) Volans: sre.hosts.reimage: add full path for facter and run clear dchp earlier (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:19:07] <claime>	 !log live testing cache warmup cookbook
[10:19:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:24] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches
[10:20:11] <wikibugs>	 (CR) Jbond: [C: +2] redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[10:20:34] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] "Many thanks jbond." [labs/private] - https://gerrit.wikimedia.org/r/892376 (owner: Jbond)
[10:20:51] <wikibugs>	 (CR) Ladsgroup: [C: +1] "do you want me to merge it (and restart sanitarium hosts and their masters?)" [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[10:21:11] <wikibugs>	 (CR) Marostegui: realm.pp: Add private tables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[10:21:16] <wikibugs>	 (CR) Marostegui: [C: +2] realm.pp: Add private tables [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[10:22:00] <wikibugs>	 (PS2) Elukey: sre.hosts.reimage: add full path for facter and run clear dchp earlier [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421)
[10:22:09] <wikibugs>	 (CR) Elukey: sre.hosts.reimage: add full path for facter and run clear dchp earlier (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:22:30] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=0)
[10:23:24] <icinga-wm>	 PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[10:23:35] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM to try again" [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey)
[10:23:42] <icinga-wm>	 PROBLEM - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[10:23:43] <logmsgbot>	 !log dcaro@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[10:23:47] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: Add simple supermicro class [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond)
[10:23:54] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003']
[10:24:06] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[10:24:17] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/885363 (owner: 10Jbond)
[10:24:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add bookworm pxelinux.cfg [puppet] - 10https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[10:25:00] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[10:25:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hosts.reimage: add full path for facter and run clear dchp earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey)
[10:25:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:26:12] <marostegui>	 !log Restart codfw sanitarium hosts T330502
[10:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:16] <stashbot>	 T330502: Create oathauth_types and oathauth_devices tables - https://phabricator.wikimedia.org/T330502
[10:26:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Blacklist f2fs [puppet] - 10https://gerrit.wikimedia.org/r/891817 (owner: 10Muehlenhoff)
[10:26:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[10:26:57] <wikibugs>	 (03CR) 10Jbond: "i see this is merged so i wouldn't worry about comments below unless you end up touching things again 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey)
[10:29:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] sre.hosts.reimage: add full path for facter and run clear dchp earlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey)
[10:30:01] <Amir1>	 jouncebot: nowandnext
[10:30:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[10:30:01] <jouncebot>	 In 0 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1100)
[10:30:33] <wikibugs>	 (03PS2) 10Jbond: ssh config: Add ControlPath and ControlPersist parameters [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891568
[10:30:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] ssh config: Add ControlPath and ControlPersist parameters [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891568 (owner: 10Jbond)
[10:31:50] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1003']
[10:31:59] <wikibugs>	 (03PS1) 10Ladsgroup: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147)
[10:32:24] <marostegui>	 !log Restart eqiad sanitarium hosts T330502
[10:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:28] <stashbot>	 T330502: Create oathauth_types and oathauth_devices tables - https://phabricator.wikimedia.org/T330502
[10:35:03] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup)
[10:36:15] <wikibugs>	 (03CR) 10Jbond: sre.hosts.reimage: add full path for facter and run clear dchp earlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey)
[10:38:21] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378
[10:39:04] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003']
[10:39:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond)
[10:39:15] <wikibugs>	 (03PS2) 10Jbond: sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378
[10:41:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond)
[10:42:17] <wikibugs>	 (03CR) 10Jbond: "also see comment at:" [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond)
[10:42:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond)
[10:42:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: add configuration for the istio knative local gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[10:43:56] <logmsgbot>	 !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1003']
[10:44:16] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond)
[10:46:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup)
[10:48:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:48:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:49:39] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Marostegui)
[10:52:53] <wikibugs>	 (03PS1) 10Btullis: Move ceph profile authentication token to the role [labs/private] - 10https://gerrit.wikimedia.org/r/892380 (https://phabricator.wikimedia.org/T324660)
[10:53:50] <wikibugs>	 (03PS2) 10David Caro: cloud: add tests for >buster os [puppet] - 10https://gerrit.wikimedia.org/r/891593
[10:53:52] <wikibugs>	 (03CR) 10David Caro: cloud: add tests for >buster os (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro)
[10:54:09] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003']
[10:54:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] clamd.conf: Remove some config entries [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff)
[10:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:55:05] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Move ceph profile authentication token to the role [labs/private] - 10https://gerrit.wikimedia.org/r/892380 (https://phabricator.wikimedia.org/T324660) (owner: 10Btullis)
[10:55:14] <wikibugs>	 (03PS2) 10Ladsgroup: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147)
[10:56:18] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] cloud: add tests for >buster os [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro)
[10:56:45] <wikibugs>	 (03PS3) 10ArielGlenn: Add dumpsdata1004 and dumpsdata1005 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/892033 (https://phabricator.wikimedia.org/T330573)
[10:59:28] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Add dumpsdata1004 and dumpsdata1005 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/892033 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn)
[10:59:54] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1003']
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1100)
[11:04:25] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui)
[11:04:26] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003']
[11:04:31] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) Circular replication is now enabled (T330619) everywhere where it is supposed to be. It is one of our pr...
[11:05:51] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) >>! In T330302#8648213, @Marostegui wrote: > Circular replication is now enabled (T330619) everywhe...
[11:05:59] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[11:07:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[11:08:25] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[11:08:37] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) It is probably something we still need to test before the switch anyways, as it is key, especially for t...
[11:10:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup)
[11:10:52] <apergos>	 !log rsync private xmldatadumps dir from dumpsdata1003 to dumpsdata1004; running from ariel screen session on dumpsdata1003, no bandwidth cap
[11:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:16] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10jbond) >>! In T326848#8645012, @Papaul wrote: > @jbond  > ` > poweredge-r450: picking DellDriverCategory.BIOS update file > We have found multiple ent...
[11:15:09] <wikibugs>	 (03PS1) 10Btullis: Add keydata for ceph mgr daemons [labs/private] - 10https://gerrit.wikimedia.org/r/892388 (https://phabricator.wikimedia.org/T324660)
[11:15:45] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add keydata for ceph mgr daemons [labs/private] - 10https://gerrit.wikimedia.org/r/892388 (https://phabricator.wikimedia.org/T324660) (owner: 10Btullis)
[11:16:04] <wikibugs>	 (03CR) 10Ladsgroup: "This change is ready for review." [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891984 (owner: 10Ladsgroup)
[11:16:46] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) We can probably just run `sre.switchdc.mediawiki.03-set-db-readonly` and `sre.switchdc.mediawiki.06...
[11:17:26] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) That works for me :) We might need to make a note that having circular replication is a hard dependency
[11:20:20] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[11:21:28] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931) (owner: 10EoghanGaffney)
[11:22:08] <icinga-wm>	 PROBLEM - Host ms-fe2013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:22:22] <vgutierrez>	 hmmm Emperor ^^ :?
[11:23:18] <Emperor>	 vgutierrez: it's being worked on and isn't in service
[11:23:28] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) Added https://wikitech.wikimedia.org/wiki/Switch_Datacenter#03-set-db-readonly as well as a note in...
[11:23:40] <vgutierrez>	 ack, I've missed the SAL entry, sorry
[11:23:58] <Emperor>	 j.bond is working on it, I've asked him to downtime it in the mean time :)
[11:24:06] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] .nvmrc: Update to 16.19.1 after CI update [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891984 (owner: 10Ladsgroup)
[11:27:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/892389 (https://phabricator.wikimedia.org/T330129)
[11:28:10] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on ms-fe2013.codfw.wmnet with reason: testing redfish T326848
[11:28:14] <stashbot>	 T326848: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848
[11:28:25] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-fe2013.codfw.wmnet with reason: testing redfish T326848
[11:29:30] <apergos>	 !log rsync public (huge!) xmldatadumps dir from dumpsdata1003 to dumpsdata1004; running from ariel screen session on dumpsdata1003, no bandwidth cap
[11:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44759 and previous config saved to /var/cache/conftool/dbconfig/20230227-112937-root.json
[11:31:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P44760 and previous config saved to /var/cache/conftool/dbconfig/20230227-113130-root.json
[11:34:26] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - 10https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: 10EoghanGaffney)
[11:35:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:37:26] <icinga-wm>	 RECOVERY - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is OK: OK - Certificate gitlab.wikimedia.org will expire on Mon 01 May 2023 06:51:05 PM GMT +0000. https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:38:42] <wikibugs>	 (03Merged) 10jenkins-bot: .nvmrc: Update to 16.19.1 after CI update [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891984 (owner: 10Ladsgroup)
[11:39:58] <wikibugs>	 (03PS3) 10Ladsgroup: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147)
[11:40:01] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup)
[11:42:25] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@bcb0a69]: Add azwikimedia T317120
[11:42:30] <stashbot>	 T317120: Add azwikimedia to RESTBase - https://phabricator.wikimedia.org/T317120
[11:42:54] <herzog>	 jouncebot: next
[11:42:54] <jouncebot>	 In 2 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1400)
[11:43:50] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@bcb0a69]: Add azwikimedia T317120 (duration: 01m 25s)
[11:44:03] <herzog>	 Superpes: hi, are you planning on scheduling T330470?
[11:44:04] <stashbot>	 T330470: New protection level and right to edit through it for eswiki - https://phabricator.wikimedia.org/T330470
[11:44:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P44761 and previous config saved to /var/cache/conftool/dbconfig/20230227-114442-root.json
[11:45:07] <Superpes>	 herzog Yep In the afternoon or evening (or maybe tomorrow - based on commitments in RL) :P
[11:45:29] <herzog>	 Superpes: copy, I was asked :)
[11:45:47] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393
[11:45:50] <herzog>	 I will tell em to take a cup of tea
[11:46:04] <Superpes>	 Well there was the weekend in between otherwise I would have already scheduled it lol :P
[11:46:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm)
[11:47:10] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 141880 bytes in 1.778 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:48:56] <icinga-wm>	 RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:49:36] <vgutierrez>	 !log set "X-Content-Type-Options: nosniff" on upload.wm.o requests - T309787
[11:49:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:41] <stashbot>	 T309787: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787
[11:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:51:04] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:28] <icinga-wm>	 RECOVERY - Host ms-fe2013 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[11:51:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[11:53:36] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[11:55:13] <wikibugs>	 (03PS2) 10Jbond: sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848)
[11:55:52] <wikibugs>	 (03Merged) 10jenkins-bot: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup)
[11:56:46] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10jbond) @MatthewVernon the firmware, bios and network have all been upgraded so should be good to proceed
[11:58:43] <wikibugs>	 10SRE, 10MediaWiki-File-management, 10Traffic, 10MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), and 2 others: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10Vgutierrez) ` vgutierrez@cp6001:~$ curl -H 'Host: upload.wikimedia.org' -k https://127.0.0.1/favicon.ico -s -v -o /dev/null 2>&1...
[11:59:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44762 and previous config saved to /var/cache/conftool/dbconfig/20230227-115947-root.json
[12:00:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P44763 and previous config saved to /var/cache/conftool/dbconfig/20230227-120002-root.json
[12:00:28] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:04:13] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: March 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert)
[12:04:55] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:05:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:05:46] <wikibugs>	 (03PS2) 10Clément Goubert: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T330650)
[12:06:04] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:24] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert)
[12:08:41] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert)
[12:08:47] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[12:09:01] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) p:05Triage→03High
[12:09:52] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert)
[12:10:02] <wikibugs>	 (03PS2) 10Clément Goubert: wmnet: Switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651)
[12:10:09] <wikibugs>	 (03PS3) 10Clément Goubert: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651)
[12:10:36] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: host still being configured - T327970
[12:10:41] <stashbot>	 T327970: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970
[12:10:51] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: host still being configured - T327970
[12:10:59] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert)
[12:11:17] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert)
[12:12:20] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert) p:05Triage→03High
[12:12:20] <moritzm>	 !log installing apr-util security updates on buster
[12:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P44764 and previous config saved to /var/cache/conftool/dbconfig/20230227-121846-root.json
[12:21:31] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:21:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44765 and previous config saved to /var/cache/conftool/dbconfig/20230227-122131-root.json
[12:22:56] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[12:23:16] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service,monitor_refine_eventlogging_legacy.service John Bond T330652 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:19] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[12:25:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-2] "Preparation for 2023-03-01 mediawiki switchover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892428 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert)
[12:26:04] <icinga-wm>	 RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1200 db1111 db1168 db1143 T330653', diff saved to https://phabricator.wikimedia.org/P44766 and previous config saved to /var/cache/conftool/dbconfig/20230227-122804-root.json
[12:28:09] <stashbot>	 T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653
[12:31:36] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[12:34:31] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:34:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44767 and previous config saved to /var/cache/conftool/dbconfig/20230227-123447-root.json
[12:34:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44768 and previous config saved to /var/cache/conftool/dbconfig/20230227-123454-root.json
[12:34:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44769 and previous config saved to /var/cache/conftool/dbconfig/20230227-123459-root.json
[12:35:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 T330653', diff saved to https://phabricator.wikimedia.org/P44770 and previous config saved to /var/cache/conftool/dbconfig/20230227-123514-root.json
[12:35:18] <stashbot>	 T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653
[12:36:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P44771 and previous config saved to /var/cache/conftool/dbconfig/20230227-123636-root.json
[12:37:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44772 and previous config saved to /var/cache/conftool/dbconfig/20230227-123701-root.json
[12:37:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44773 and previous config saved to /var/cache/conftool/dbconfig/20230227-123742-root.json
[12:38:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 es2022 T330653', diff saved to https://phabricator.wikimedia.org/P44774 and previous config saved to /var/cache/conftool/dbconfig/20230227-123814-root.json
[12:39:21] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafkatee.service John Bond T330654 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44775 and previous config saved to /var/cache/conftool/dbconfig/20230227-124050-root.json
[12:41:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44776 and previous config saved to /var/cache/conftool/dbconfig/20230227-124100-root.json
[12:41:32] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:43:33] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[12:45:20] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for apr-util [puppet] - https://gerrit.wikimedia.org/r/892436
[12:48:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44777 and previous config saved to /var/cache/conftool/dbconfig/20230227-124952-root.json
[12:49:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44778 and previous config saved to /var/cache/conftool/dbconfig/20230227-124959-root.json
[12:50:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44779 and previous config saved to /var/cache/conftool/dbconfig/20230227-125003-root.json
[12:51:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add library hint for apr-util [puppet] - https://gerrit.wikimedia.org/r/892436 (owner: Muehlenhoff)
[12:51:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44780 and previous config saved to /var/cache/conftool/dbconfig/20230227-125141-root.json
[12:52:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44781 and previous config saved to /var/cache/conftool/dbconfig/20230227-125206-root.json
[12:52:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44782 and previous config saved to /var/cache/conftool/dbconfig/20230227-125247-root.json
[12:54:32] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:35] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part I (T308932) synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[12:55:39] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[12:55:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44783 and previous config saved to /var/cache/conftool/dbconfig/20230227-125555-root.json
[12:56:00] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Joe) We should probably test that both scap works and a scap3 deployment also works (e.g. `docker-pkg`) when we've migrated the deployment server....
[12:56:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44784 and previous config saved to /var/cache/conftool/dbconfig/20230227-125605-root.json
[12:56:55] <wikibugs>	 (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: Majavah)
[12:59:28] <wikibugs>	 (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/891826 (owner: Majavah)
[13:02:00] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44785 and previous config saved to /var/cache/conftool/dbconfig/20230227-130457-root.json
[13:05:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44786 and previous config saved to /var/cache/conftool/dbconfig/20230227-130503-root.json
[13:05:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44787 and previous config saved to /var/cache/conftool/dbconfig/20230227-130508-root.json
[13:05:40] <moritzm>	 !log installing openssl security updates on Buster
[13:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44788 and previous config saved to /var/cache/conftool/dbconfig/20230227-130646-root.json
[13:07:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44789 and previous config saved to /var/cache/conftool/dbconfig/20230227-130711-root.json
[13:07:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44790 and previous config saved to /var/cache/conftool/dbconfig/20230227-130752-root.json
[13:08:38] <wikibugs>	 (PS1) ArielGlenn: for dumpsdata1004,5 use the partman recipe for reimaging [puppet] - https://gerrit.wikimedia.org/r/892437 (https://phabricator.wikimedia.org/T330573)
[13:11:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44791 and previous config saved to /var/cache/conftool/dbconfig/20230227-131100-root.json
[13:14:58] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:17] <wikibugs>	 (PS1) Jbond: sre.hosts.reimage: uses networking.mac [cookbooks] - https://gerrit.wikimedia.org/r/892438
[13:19:34] <wikibugs>	 (PS2) Jbond: sre.hosts.reimage: uses networking.mac [cookbooks] - https://gerrit.wikimedia.org/r/892438
[13:20:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44793 and previous config saved to /var/cache/conftool/dbconfig/20230227-132002-root.json
[13:20:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44794 and previous config saved to /var/cache/conftool/dbconfig/20230227-132008-root.json
[13:20:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44795 and previous config saved to /var/cache/conftool/dbconfig/20230227-132013-root.json
[13:21:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (JMeybohm)
[13:21:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44796 and previous config saved to /var/cache/conftool/dbconfig/20230227-132151-root.json
[13:22:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44797 and previous config saved to /var/cache/conftool/dbconfig/20230227-132215-root.json
[13:22:30] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM.  We always want to issue this for the primary interface (one that has done DHCP), so if facter will take one with GW it should be sa" [cookbooks] - https://gerrit.wikimedia.org/r/892438 (owner: Jbond)
[13:22:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44798 and previous config saved to /var/cache/conftool/dbconfig/20230227-132257-root.json
[13:25:39] <wikibugs>	 (CR) Jelto: [C: +2] scap: add required Python3 venv package [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[13:26:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44799 and previous config saved to /var/cache/conftool/dbconfig/20230227-132605-root.json
[13:26:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (JMeybohm) Adding @KFrancis for signing NDA
[13:26:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44800 and previous config saved to /var/cache/conftool/dbconfig/20230227-132615-root.json
[13:30:07] <logmsgbot>	 !log ladsgroup@deploy1002 sync-file aborted: Completely get rid of responsiveimages removal, part I (T308932) (duration: 44m 38s)
[13:30:11] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[13:32:06] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part I (T326147) synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:32:11] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147
[13:32:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2175 T330653', diff saved to https://phabricator.wikimedia.org/P44805 and previous config saved to /var/cache/conftool/dbconfig/20230227-133231-root.json
[13:32:36] <stashbot>	 T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653
[13:32:57] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39845/console" [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[13:35:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44808 and previous config saved to /var/cache/conftool/dbconfig/20230227-133506-root.json
[13:35:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44809 and previous config saved to /var/cache/conftool/dbconfig/20230227-133506-root.json
[13:35:13] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39846/console" [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[13:35:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44810 and previous config saved to /var/cache/conftool/dbconfig/20230227-133513-root.json
[13:35:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44811 and previous config saved to /var/cache/conftool/dbconfig/20230227-133518-root.json
[13:36:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44813 and previous config saved to /var/cache/conftool/dbconfig/20230227-133657-root.json
[13:37:17] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) (owner: Jbond)
[13:37:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44814 and previous config saved to /var/cache/conftool/dbconfig/20230227-133720-root.json
[13:38:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44815 and previous config saved to /var/cache/conftool/dbconfig/20230227-133801-root.json
[13:39:51] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Volans) [nit] the `enable-puppet` + `run-puppet-agent` can be simplified with `run-puppet-agent --enable "reason"`.
[13:40:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2122 T330653', diff saved to https://phabricator.wikimedia.org/P44817 and previous config saved to /var/cache/conftool/dbconfig/20230227-134018-root.json
[13:40:23] <stashbot>	 T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653
[13:41:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44818 and previous config saved to /var/cache/conftool/dbconfig/20230227-134110-root.json
[13:41:20] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.24/extensions/MobileFrontend/extension.json: Completely get rid of responsiveimages removal, part I (T326147) (duration: 10m 48s)
[13:41:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44819 and previous config saved to /var/cache/conftool/dbconfig/20230227-134120-root.json
[13:41:24] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147
[13:42:09] <wikibugs>	 (PS1) Cathal Mooney: Move execution of clear_dhcp_cache() until after Puppet run [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421)
[13:44:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44821 and previous config saved to /var/cache/conftool/dbconfig/20230227-134405-root.json
[13:47:12] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations, Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (Volans) @MoritzMuehlenhoff in theory no, the makevm cookbook should call the reimage one directly and do all a...
[13:47:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44823 and previous config saved to /var/cache/conftool/dbconfig/20230227-134753-root.json
[13:47:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44824 and previous config saved to /var/cache/conftool/dbconfig/20230227-134756-root.json
[13:50:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44825 and previous config saved to /var/cache/conftool/dbconfig/20230227-135010-root.json
[13:50:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44826 and previous config saved to /var/cache/conftool/dbconfig/20230227-135011-root.json
[13:50:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44827 and previous config saved to /var/cache/conftool/dbconfig/20230227-135018-root.json
[13:50:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44828 and previous config saved to /var/cache/conftool/dbconfig/20230227-135023-root.json
[13:50:32] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part II (T326147) synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:50:37] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147
[13:52:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44830 and previous config saved to /var/cache/conftool/dbconfig/20230227-135202-root.json
[13:52:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44831 and previous config saved to /var/cache/conftool/dbconfig/20230227-135225-root.json
[13:52:37] <wikibugs>	 (PS1) Jbond: standard_packages: also manage the rasdaemon service [puppet] - https://gerrit.wikimedia.org/r/892444
[13:53:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44832 and previous config saved to /var/cache/conftool/dbconfig/20230227-135306-root.json
[13:54:29] <wikibugs>	 (PS2) Jbond: standard_packages: also manage the rasdaemon service [puppet] - https://gerrit.wikimedia.org/r/892444
[13:55:25] <wikibugs>	 (PS2) Cathal Mooney: Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421)
[13:56:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44833 and previous config saved to /var/cache/conftool/dbconfig/20230227-135615-root.json
[13:56:22] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44834 and previous config saved to /var/cache/conftool/dbconfig/20230227-135625-root.json
[13:56:27] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations, Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (MoritzMuehlenhoff) >>! In T306661#8648754, @Volans wrote: > @MoritzMuehlenhoff in theory no, the makevm cookbo...
[13:56:34] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.24/extensions/MobileFrontend/includes/MobileFrontendHooks.php: Completely get rid of responsiveimages removal, part II (T326147) (duration: 07m 24s)
[13:56:38] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147
[13:57:38] <wikibugs>	 (CR) Volans: "LGTM, question inline" [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[13:58:37] <moritzm>	 !log restarting apache on mw canaries to pick up apr-util updates
[13:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2178 db2146 db2180 T330653', diff saved to https://phabricator.wikimedia.org/P44835 and previous config saved to /var/cache/conftool/dbconfig/20230227-135856-root.json
[13:59:01] <stashbot>	 T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653
[13:59:04] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part III (T326147) synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:59:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44836 and previous config saved to /var/cache/conftool/dbconfig/20230227-135910-root.json
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1400). nyaa~
[14:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:02:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44837 and previous config saved to /var/cache/conftool/dbconfig/20230227-140244-root.json
[14:02:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44838 and previous config saved to /var/cache/conftool/dbconfig/20230227-140249-root.json
[14:02:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44839 and previous config saved to /var/cache/conftool/dbconfig/20230227-140255-root.json
[14:03:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44840 and previous config saved to /var/cache/conftool/dbconfig/20230227-140301-root.json
[14:03:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44841 and previous config saved to /var/cache/conftool/dbconfig/20230227-140310-root.json
[14:04:22] <wikibugs>	 (PS3) Cathal Mooney: Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421)
[14:05:05] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations, Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (Volans) Right, and also re-thinking about it given that VMs can't change cluster currently and we don't use ot...
[14:05:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44842 and previous config saved to /var/cache/conftool/dbconfig/20230227-140515-root.json
[14:05:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44843 and previous config saved to /var/cache/conftool/dbconfig/20230227-140523-root.json
[14:05:26] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.24/extensions/MobileFrontend/includes/MobileContext.php: Completely get rid of responsiveimages removal, part III (T326147) (duration: 07m 36s)
[14:05:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44844 and previous config saved to /var/cache/conftool/dbconfig/20230227-140527-root.json
[14:05:31] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147
[14:07:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44845 and previous config saved to /var/cache/conftool/dbconfig/20230227-140707-root.json
[14:07:09] <wikibugs>	 (CR) Cathal Mooney: sre.hosts.reimage: uses networking.mac [cookbooks] - https://gerrit.wikimedia.org/r/892438 (owner: Jbond)
[14:08:02] <wikibugs>	 (Abandoned) Cathal Mooney: sre.hosts.reimage: uses networking.mac [cookbooks] - https://gerrit.wikimedia.org/r/892438 (owner: Jbond)
[14:08:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44846 and previous config saved to /var/cache/conftool/dbconfig/20230227-140811-root.json
[14:08:26] <wikibugs>	 (CR) Cathal Mooney: Move execution of clear_dhcp_cache() and use default facter MAC (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[14:08:31] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[14:08:53] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM, 👍" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: Jbond)
[14:09:44] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[14:09:56] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (Jclark-ctr) Open→Resolved rebalanced power
[14:11:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44847 and previous config saved to /var/cache/conftool/dbconfig/20230227-141120-root.json
[14:11:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44848 and previous config saved to /var/cache/conftool/dbconfig/20230227-141130-root.json
[14:11:36] <wikibugs>	 (Merged) jenkins-bot: Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[14:11:50] <icinga-wm>	 RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:48] <wikibugs>	 (PS1) Raymond Ndibe: puppet: update firewall rules for cloudcontrol1005 [puppet] - https://gerrit.wikimedia.org/r/892446 (https://phabricator.wikimedia.org/T303663)
[14:13:55] <wikibugs>	 (CR) Bking: [C: +2] dse-k8s: raise memory for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: Bking)
[14:14:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44849 and previous config saved to /var/cache/conftool/dbconfig/20230227-141415-root.json
[14:14:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:14:52] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert)
[14:16:20] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert)
[14:17:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:17:37] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (RobH)
[14:17:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (RobH) Open→In progress a:Jclark-ctr→RobH If I have an overwhelming number of notifications in a short period (seems I did around January 18th) I may miss...
[14:17:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:17:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44850 and previous config saved to /var/cache/conftool/dbconfig/20230227-141749-root.json
[14:17:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44851 and previous config saved to /var/cache/conftool/dbconfig/20230227-141754-root.json
[14:18:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44852 and previous config saved to /var/cache/conftool/dbconfig/20230227-141800-root.json
[14:18:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44853 and previous config saved to /var/cache/conftool/dbconfig/20230227-141806-root.json
[14:18:07] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[14:18:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:18:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44854 and previous config saved to /var/cache/conftool/dbconfig/20230227-141815-root.json
[14:18:18] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[14:18:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:18:52] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert)
[14:19:20] <wikibugs>	 (Merged) jenkins-bot: dse-k8s: raise memory for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: Bking)
[14:20:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44855 and previous config saved to /var/cache/conftool/dbconfig/20230227-142020-root.json
[14:21:15] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Jhancock.wm) @Papaul network cable was reseated and showing as connected now on wdqs2022.
[14:21:25] <wikibugs>	 (03PS1) 10AikoChou: httpbb: update tests for revert-risk and outlink model [puppet] - 10https://gerrit.wikimedia.org/r/892448 (https://phabricator.wikimedia.org/T327787)
[14:22:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[14:23:06] <wikibugs>	 (03PS2) 10AikoChou: httpbb: update tests for revert-risk and outlink model [puppet] - 10https://gerrit.wikimedia.org/r/892448 (https://phabricator.wikimedia.org/T327787)
[14:23:50] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata)
[14:27:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) (owner: 10Jbond)
[14:28:34] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage
[14:29:00] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) (owner: 10Jbond)
[14:29:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond)
[14:29:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44857 and previous config saved to /var/cache/conftool/dbconfig/20230227-142919-root.json
[14:30:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] httpbb: update tests for revert-risk and outlink model [puppet] - 10https://gerrit.wikimedia.org/r/892448 (https://phabricator.wikimedia.org/T327787) (owner: 10AikoChou)
[14:30:54] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) @Jhancock.wm thank you I have also wdqs2015 see my comment on the 23rd. Thanks
[14:31:30] <wikibugs>	 (03Merged) 10jenkins-bot: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond)
[14:31:30] <claime>	 jouncebot: nowandnext
[14:31:30] <jouncebot>	 For the next 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1400)
[14:31:30] <jouncebot>	 In 1 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1630)
[14:32:06] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage
[14:32:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44858 and previous config saved to /var/cache/conftool/dbconfig/20230227-143254-root.json
[14:33:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44859 and previous config saved to /var/cache/conftool/dbconfig/20230227-143259-root.json
[14:33:01] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797
[14:33:05] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797
[14:33:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44860 and previous config saved to /var/cache/conftool/dbconfig/20230227-143305-root.json
[14:33:07] <stashbot>	 T320797: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797
[14:33:11] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797
[14:33:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44861 and previous config saved to /var/cache/conftool/dbconfig/20230227-143311-root.json
[14:33:15] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797
[14:33:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44862 and previous config saved to /var/cache/conftool/dbconfig/20230227-143321-root.json
[14:33:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:34:18] <claime>	 !log live testing sre.switchdc.mediawiki.03-set-db-readonly and sre.switchdc.mediawiki.06-set-db-readwrite back to back - T330302
[14:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:23] <stashbot>	 T330302: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302
[14:34:28] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[14:34:59] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0)
[14:35:01] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[14:35:03] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[14:35:13] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39850/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[14:35:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44863 and previous config saved to /var/cache/conftool/dbconfig/20230227-143525-root.json
[14:35:51] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye
[14:35:53] <claime>	 !log done live testing sre.switchdc.mediawiki.03-set-db-readonly and sre.switchdc.mediawiki.06-set-db-readwrite back to back - T330302
[14:35:57] <wikibugs>	 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
[14:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:21] <wikibugs>	 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) 05Open→03Resolved Looks good, resolving.
[14:37:24] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert)
[14:37:57] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[14:38:21] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) 05In progress→03Resolved All code paths exercised and fixes applied and tested. Resolving.
[14:38:31] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert)
[14:38:47] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[14:38:54] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) 05In progress→03Resolved All code paths exercised for multi-DC, fixes applied and working. Resolving.
[14:39:06] <wikibugs>	 (03CR) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[14:39:12] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) 05Open→03Resolved All blockers resolved.
[14:43:24] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[14:43:52] <wikibugs>	 (03PS24) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123)
[14:44:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44864 and previous config saved to /var/cache/conftool/dbconfig/20230227-144424-root.json
[14:45:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd2001.codfw.wmnet with OS bullseye
[14:45:54] <Superpes>	 Is anyone around for a deployment? Sorry, I just got home :/
[14:46:01] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44865 and previous config saved to /var/cache/conftool/dbconfig/20230227-144759-root.json
[14:48:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44866 and previous config saved to /var/cache/conftool/dbconfig/20230227-144804-root.json
[14:48:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44867 and previous config saved to /var/cache/conftool/dbconfig/20230227-144810-root.json
[14:48:13] <Lucas_WMDE>	 Superpes: I’m around, but there’s nothing in the calendar afaict
[14:48:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44868 and previous config saved to /var/cache/conftool/dbconfig/20230227-144816-root.json
[14:48:21] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44869 and previous config saved to /var/cache/conftool/dbconfig/20230227-144826-root.json
[14:49:27] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:49:57] <Superpes>	 Lucas_WMDE Yes, I know, I didn't add anything because I didn't know if I could be here in time :(
[14:50:10] <Lucas_WMDE>	 if it’s a config change I can probably deploy it
[14:50:13] <wikibugs>	 (03PS1) 10Nicolas Fraison: Failover hive to standby server [dns] - 10https://gerrit.wikimedia.org/r/892460 (https://phabricator.wikimedia.org/T303168)
[14:50:19] <Lucas_WMDE>	 probably not enough time for a backport gate-and-submit now though
[14:50:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44870 and previous config saved to /var/cache/conftool/dbconfig/20230227-145030-root.json
[14:51:29] <Superpes>	 Oh, thanks, so it's probably better to schedule it for another window 
[14:52:13] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye
[14:52:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Ottomata) Approved.
[14:52:21] <wikibugs>	 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**)   - Removed from Puppet and PuppetD...
[14:52:30] <wikibugs>	 (03PS1) 10Elukey: role::etcd::v3::ml_etcd: set cluster status to existing [puppet] - 10https://gerrit.wikimedia.org/r/892462 (https://phabricator.wikimedia.org/T330662)
[14:52:34] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye
[14:52:41] <wikibugs>	 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
[14:53:41] <icinga-wm>	 RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:56] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] role::etcd::v3::ml_etcd: set cluster status to existing [puppet] - 10https://gerrit.wikimedia.org/r/892462 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey)
[14:54:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[14:54:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd: set cluster status to existing [puppet] - 10https://gerrit.wikimedia.org/r/892462 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey)
[14:54:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:56:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd2001.codfw.wmnet with reason: host reimage
[14:56:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10MoritzMuehlenhoff)
[14:58:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ArielGlenn) Awesome, I would have looked for you on irc in a few days if I hadn't heard anything, no worries. Happy to see this moving along!
[14:59:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44871 and previous config saved to /var/cache/conftool/dbconfig/20230227-145929-root.json
[15:01:04] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd2001.codfw.wmnet with reason: host reimage
[15:02:06] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. On a totally unrelated, yet important note: It seems this should have been ge buster, I guess we need rasdaemon Bullseye as we" [puppet] - 10https://gerrit.wikimedia.org/r/892444 (owner: 10Jbond)
[15:03:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44872 and previous config saved to /var/cache/conftool/dbconfig/20230227-150304-root.json
[15:03:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44873 and previous config saved to /var/cache/conftool/dbconfig/20230227-150309-root.json
[15:03:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44874 and previous config saved to /var/cache/conftool/dbconfig/20230227-150315-root.json
[15:03:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44875 and previous config saved to /var/cache/conftool/dbconfig/20230227-150322-root.json
[15:03:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44876 and previous config saved to /var/cache/conftool/dbconfig/20230227-150331-root.json
[15:04:32] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye
[15:04:37] <wikibugs>	 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**)   - Removed from Puppet and PuppetD...
[15:05:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) LVM data still exists on disks from a previous failed install attempt and the dd method didn't seem to remove, suspended installation on dumpsdata1006 and set it to...
[15:05:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44877 and previous config saved to /var/cache/conftool/dbconfig/20230227-150535-root.json
[15:06:01] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[15:06:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10jbond) > The wmf-auto-restart failure is ultimately fallout from earlier failures of Exim itself should we create a new task to add a proper systemd unit file for exim.   as this did...
[15:08:15] <wikibugs>	 (03PS1) 10Elukey: role::etcd::v3::ml_etcd: set the new discovery records [puppet] - 10https://gerrit.wikimedia.org/r/892466 (https://phabricator.wikimedia.org/T330662)
[15:08:17] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Remove vendored thumbor-community-core [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888210 (owner: 10Hnowlan)
[15:08:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[15:08:46] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] role::etcd::v3::ml_etcd: set the new discovery records [puppet] - 10https://gerrit.wikimedia.org/r/892466 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey)
[15:08:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd: set the new discovery records [puppet] - 10https://gerrit.wikimedia.org/r/892466 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey)
[15:10:29] <wikibugs>	 (03CR) 10Hnowlan: service, k8s: Add service definitions for rest-gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[15:11:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[15:11:58] <inflatador>	 !log bking@deploy1002 applying https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/891577 on dse-k8s-cluster via helmfile
[15:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:24] <wikibugs>	 (03PS1) 10Superpes15: [extwiki] Change wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588)
[15:12:26] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:12:27] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:13:33] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:13:36] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:14:17] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] filebackend: Replace stringified class names with ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891962 (owner: 10Reedy)
[15:14:22] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] filebackend: Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy)
[15:14:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44878 and previous config saved to /var/cache/conftool/dbconfig/20230227-151434-root.json
[15:14:42] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert)
[15:15:07] <wikibugs>	 (03PS2) 10Hnowlan: WIP: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967)
[15:15:44] <wikibugs>	 (03Merged) 10jenkins-bot: Remove vendored thumbor-community-core [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888210 (owner: 10Hnowlan)
[15:15:49] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: 10Superpes15)
[15:17:17] <wikibugs>	 (03CR) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[15:18:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44880 and previous config saved to /var/cache/conftool/dbconfig/20230227-151808-root.json
[15:18:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44881 and previous config saved to /var/cache/conftool/dbconfig/20230227-151813-root.json
[15:18:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44882 and previous config saved to /var/cache/conftool/dbconfig/20230227-151819-root.json
[15:18:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44883 and previous config saved to /var/cache/conftool/dbconfig/20230227-151826-root.json
[15:18:28] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1003.eqiad.wmnet with reason: host reimage
[15:18:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44884 and previous config saved to /var/cache/conftool/dbconfig/20230227-151836-root.json
[15:18:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[15:19:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[15:19:38] <icinga-wm>	 PROBLEM - Check systemd state on ml-etcd1003 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:20] <elukey>	 this is probably me, checking --^
[15:20:24] <icinga-wm>	 PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[15:20:44] <icinga-wm>	 PROBLEM - Check systemd state on ml-etcd2003 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:31] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1003.eqiad.wmnet with reason: host reimage
[15:21:40] <icinga-wm>	 RECOVERY - Etcd cluster health on ml-etcd1003 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[15:23:44] <wikibugs>	 (03PS3) 10Superpes15: [extwiki] Change wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588)
[15:24:18] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee
[15:24:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:24:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:24:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ml_etcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:24:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:25:26] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aqu-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:40] <icinga-wm>	 RECOVERY - Check systemd state on ml-etcd2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:30] <icinga-wm>	 PROBLEM - Host ms-fe2013 is DOWN: PING CRITICAL - Packet loss = 100%
[15:29:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job ml_etcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:29:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:29:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (18) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:31:10] <icinga-wm>	 PROBLEM - Check systemd state on ml-etcd1001 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:38] <wikibugs>	 (03PS3) 10Urbanecm: cswiki: Grant changetags only to bots/sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383)
[15:31:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) (owner: 10Urbanecm)
[15:32:33] <wikibugs>	 (03Merged) 10jenkins-bot: cswiki: Grant changetags only to bots/sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) (owner: 10Urbanecm)
[15:32:49] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:891503|cswiki: Grant changetags only to bots/sysops (T330383)]]
[15:32:54] <stashbot>	 T330383: Remove changetags from user at cswiki - https://phabricator.wikimedia.org/T330383
[15:33:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44886 and previous config saved to /var/cache/conftool/dbconfig/20230227-153313-root.json
[15:33:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44887 and previous config saved to /var/cache/conftool/dbconfig/20230227-153318-root.json
[15:33:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44888 and previous config saved to /var/cache/conftool/dbconfig/20230227-153324-root.json
[15:34:06] <icinga-wm>	 RECOVERY - Check systemd state on ml-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:34] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:891503|cswiki: Grant changetags only to bots/sysops (T330383)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[15:34:52] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[15:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (25) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:35:50] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[15:36:19] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001"
[15:37:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10MoritzMuehlenhoff) >>! In T330660#8649059, @jbond wrote: > should we create a new task to add a proper systemd unit file for exim.   as this did not show in icinga or systemd status d...
[15:37:16] <wikibugs>	 10SRE, 10Traffic: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10Vgutierrez) 05Stalled→03In progress yes, it's currently running on cp4045 and I'm planning to extend the experiment to ulsfo tomorrow EU morning
[15:37:52] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[15:39:37] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[15:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (34) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:40:28] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:891503|cswiki: Grant changetags only to bots/sysops (T330383)]] (duration: 07m 39s)
[15:40:33] <stashbot>	 T330383: Remove changetags from user at cswiki - https://phabricator.wikimedia.org/T330383
[15:40:48] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[15:41:11] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001"
[15:41:16] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[15:41:42] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1004']
[15:41:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2015.codfw.wmnet with OS bullseye
[15:41:56] <marostegui>	 jouncebot: next
[15:41:56] <jouncebot>	 In 0 hour(s) and 48 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1630)
[15:42:01] <marostegui>	 jouncebot: now
[15:42:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[15:42:12] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye
[15:42:56] <wikibugs>	 (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653)
[15:43:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) `   Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 6%                     Virtual Disk 239: RAID1, 446.625GB, Ready...
[15:43:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[15:44:19] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[15:44:22] <icinga-wm>	 RECOVERY - Check systemd state on ml-etcd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:02] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.180:9042 on restbase1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[15:46:36] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.181:9042 on restbase1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[15:47:19] <wikibugs>	 (03PS2) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653)
[15:47:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:09] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1004']
[15:48:46] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[15:50:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10jbond) >>! In T330660#8649219, @MoritzMuehlenhoff wrote: >>>! In T330660#8649059, @jbond wrote: >> should we create a new task to add a proper systemd unit file for exim.   as this di...
[15:52:00] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ml-etcd2001.codfw.wmnet with reason: etcd cluster upgrade failed, waiting for k8s upgrade
[15:52:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-etcd2001.codfw.wmnet with reason: etcd cluster upgrade failed, waiting for k8s upgrade
[15:52:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye
[15:52:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye
[15:52:38] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1004']
[15:56:25] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host ml-etcd2001.codfw.wmnet with OS bullseye
[15:58:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:49] <logmsgbot>	 !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1004']
[16:00:23] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::{master,worker}: upgrade ml-serve-codfw to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669)
[16:02:26] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@cd7c263]: build: Pin PHPUnit to 9.5.28 like in other repos
[16:02:38] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@cd7c263]: build: Pin PHPUnit to 9.5.28 like in other repos (duration: 00m 12s)
[16:02:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage
[16:03:09] <wikibugs>	 (03PS1) 10Elukey: admin_ng: upgrade ml-serve-codfw's settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892483 (https://phabricator.wikimedia.org/T330669)
[16:03:56] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable haproxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944)
[16:05:34] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] admin_ng: upgrade ml-serve-codfw's settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892483 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey)
[16:06:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage
[16:06:02] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] role::ml_k8s::{master,worker}: upgrade ml-serve-codfw to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey)
[16:06:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage
[16:07:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[16:07:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:07:36] <icinga-wm>	 PROBLEM - Host dse-k8s-worker1008 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:36] <icinga-wm>	 PROBLEM - Host dse-k8s-worker1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:09] <wikibugs>	 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh)
[16:08:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage
[16:08:39] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1004.eqiad.wmnet with OS bullseye
[16:08:42] <wikibugs>	 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) p:05Triage→03Medium
[16:09:25] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39853/console" [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez)
[16:09:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:09:51] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aqu-singleuser-conda-analytics.service John Bond T330671 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:00] <vgutierrez>	 uh?
[16:10:04] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez)
[16:11:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:11:47] <cdanis>	 here
[16:11:57] * jbond here
[16:12:02] <jbond>	 vgutierrez: im guessing not expected
[16:12:14] * Emperor here
[16:12:49] <volans>	 esams
[16:12:57] <cdanis>	 but only ip6?
[16:13:04] <jbond>	 possibly already cleared https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=1677513366216&to=1677514341111
[16:13:07] <volans>	 from AM seems so
[16:13:14] * brett here (somehow still on call)
[16:13:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2022.codfw.wmnet with reason: host reimage
[16:14:16] <Emperor>	 esams v6 still looks only 50% available to me
[16:14:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:27] <brett>	 Is it just probes or is there an actual problem?
[16:15:42] <Emperor>	 quite a rise in slow-but-successful too
[16:16:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:16:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2022.codfw.wmnet with reason: host reimage
[16:19:48] <jinxer-wm>	 (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:21:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:23:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:25:12] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[16:25:54] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:17] <wikibugs>	 (03PS1) 10Volans: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487
[16:26:23] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Persistence, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MPhamWMF)
[16:28:29] <wikibugs>	 (03CR) 10David Caro: "Looks ok to me, some questions about `Optional` there, and feel free to ignore any `nit` thingies." [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[16:28:46] <icinga-wm>	 RECOVERY - Host dse-k8s-worker1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[16:28:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage
[16:30:04] <jouncebot>	 jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1630). nyaa~
[16:31:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage
[16:32:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:33:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:33:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2015.codfw.wmnet with OS bullseye
[16:33:14] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye completed: - wdqs2015 (**PA...
[16:38:24] <jbond>	 q/a
[16:40:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:41:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:41:39] <wikibugs>	 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10Volans) @ssing  1) for the cookbooks all that I see is that they use the `A:dns-auth` cumin alias, so they will follow along.  2) for pywmflib there is a [[ https://gerrit.wi...
[16:43:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:44:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:46:09] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:46:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:46:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:46:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2022.codfw.wmnet with OS bullseye
[16:47:01] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye completed: - wdqs2022 (**PA...
[16:47:18] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:47:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[16:47:26] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul)
[16:48:05] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880)
[16:48:30] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) 05Open→03Resolved complete @bking @Gehel  all yours
[16:48:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) (owner: 10Zabe)
[16:49:14] <wikibugs>	 (03PS2) 10Zabe: Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880)
[16:54:28] <logmsgbot>	 !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1004.eqiad.wmnet with OS bullseye
[16:54:33] <wikibugs>	 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (10Jelto) p:05Triage→03Medium a:03Arnoldokoth
[16:54:52] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1004.eqiad.wmnet with OS bullseye
[16:55:09] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/892389 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff)
[17:03:57] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) (owner: 10Marostegui)
[17:04:48] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:08:49] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10Jdlrobson)
[17:09:39] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1004.eqiad.wmnet with reason: host reimage
[17:10:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:11:09] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[17:12:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:12:45] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1004.eqiad.wmnet with reason: host reimage
[17:14:48] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:15:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:15:53] <zabe>	 jouncebot: nowandnext
[17:15:53] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 44 minute(s)
[17:15:53] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800)
[17:15:53] <jouncebot>	 In 0 hour(s) and 44 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800)
[17:16:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[17:16:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:19:30] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) (owner: 10Zabe)
[17:20:14] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) (owner: 10Zabe)
[17:20:29] <wikibugs>	 (03PS3) 10Dzahn: re-introduce webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090)
[17:20:48] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10Jdlrobson) @cwhite I might need your help with this sometime this week.
[17:21:06] <icinga-wm>	 RECOVERY - Host dse-k8s-worker1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[17:22:18] <zabe>	 !log create Wikipedia Wayuu # T321880
[17:22:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:22] <stashbot>	 T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880
[17:24:15] <wikibugs>	 (03PS1) 10Dzahn: planet: add https://design.wikimedia.org/blog/feed.xml to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892510
[17:25:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] planet: add https://design.wikimedia.org/blog/feed.xml to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892510 (owner: 10Dzahn)
[17:26:46] <logmsgbot>	 !log zabe@deploy1002 Started scap: create gucwiki T321880
[17:27:00] <icinga-wm>	 PROBLEM - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:27:01] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T330681 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:27:07] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10ops-monitoring-bot)
[17:28:33] <logmsgbot>	 !log zabe@deploy1002 zabe: create gucwiki T321880 synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[17:28:37] <stashbot>	 T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880
[17:28:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "please let me know if there are more URLs in https://gist.github.com/Krinkle/e0d13f84b91e829afffa7b27822482be or elsewhere that are Wikime" [puppet] - 10https://gerrit.wikimedia.org/r/892510 (owner: 10Dzahn)
[17:29:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:29:44] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:33:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] re-introduce webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[17:33:48] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[17:35:13] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[17:35:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[17:35:41] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:36:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[17:36:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:36:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[17:36:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[17:36:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[17:36:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[17:36:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[17:37:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[17:37:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[17:37:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[17:37:51] <logmsgbot>	 !log zabe@deploy1002 Finished scap: create gucwiki T321880 (duration: 11m 05s)
[17:37:56] <stashbot>	 T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880
[17:37:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:38:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[17:38:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[17:38:25] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[17:38:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1008.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1008.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:39:32] <wikibugs>	 (03PS1) 10BCornwall: ntp/esams: set to dns3002 [dns] - 10https://gerrit.wikimedia.org/r/892516
[17:42:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[17:42:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:42:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:43:59] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] ntp/esams: set to dns3002 [dns] - 10https://gerrit.wikimedia.org/r/892516 (owner: 10BCornwall)
[17:44:29] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] ntp/esams: set to dns3002 [dns] - 10https://gerrit.wikimedia.org/r/892516 (owner: 10BCornwall)
[17:44:34] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:50:02] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add istio network policy config for DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/892519 (https://phabricator.wikimedia.org/T330261)
[17:50:06] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:50:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) I need the partman recipe for those nodes
[17:51:39] <wikibugs>	 (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[17:52:39] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891883
[17:52:41] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891883 (owner: 10Zabe)
[17:53:26] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891883 (owner: 10Zabe)
[17:53:46] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:54:53] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui)
[17:55:04] <wikibugs>	 (03PS1) 10Elukey: role::dse_k8s::worker: update istio-cni version [puppet] - 10https://gerrit.wikimedia.org/r/892522 (https://phabricator.wikimedia.org/T330261)
[17:55:10] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) p:05Triage→03Medium
[17:55:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: add istio network policy config for DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/892519 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey)
[17:56:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::dse_k8s::worker: update istio-cni version [puppet] - 10https://gerrit.wikimedia.org/r/892522 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey)
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800)
[18:00:04] <jouncebot>	 ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800).
[18:00:08] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: add istio network policy config for DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/892519 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey)
[18:01:03] <logmsgbot>	 !log zabe@deploy1002 Synchronized wmf-config/interwiki.php: (no justification provided) (duration: 06m 54s)
[18:01:10] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:01:18] <zabe>	 whoops, forgot to mention the patch
[18:02:58] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:03:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[18:03:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[18:07:12] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@c8dc6d5]: cirrus namespaces: Work arround missing domain_name in upstream
[18:08:31] <herzog>	 zabe: you doing RESTBase and Pywikibot too? I can +2 the latter
[18:08:57] <zabe>	 sure, can do
[18:09:23] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] "Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) (owner: 10Filippo Giunchedi)
[18:09:42] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@c8dc6d5]: cirrus namespaces: Work arround missing domain_name in upstream (duration: 02m 29s)
[18:10:22] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:10:25] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10wiki_willy) a:03Papaul
[18:10:47] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10wiki_willy) a:03Papaul
[18:11:21] <zabe>	 maybe I am going to wait with RESTBase so that it can go together with gurwiki, since that is a bit of work to deploy
[18:11:22] <wikibugs>	 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330372 (10wiki_willy) a:03Papaul
[18:12:43] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10wiki_willy) a:03Jclark-ctr
[18:13:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10wiki_willy) a:03Jclark-ctr
[18:13:43] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (10wiki_willy) a:03Jclark-ctr
[18:15:14] <herzog>	 ack, and since the wiki shouldn't be edited until they finish importing stuff, it's not important
[18:17:57] <wikibugs>	 (03PS1) 10David Caro: cloudcephosd1004: use the right interface names [puppet] - 10https://gerrit.wikimedia.org/r/892526 (https://phabricator.wikimedia.org/T329502)
[18:18:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] cloudcephosd1004: use the right interface names [puppet] - 10https://gerrit.wikimedia.org/r/892526 (https://phabricator.wikimedia.org/T329502) (owner: 10David Caro)
[18:20:37] <wikibugs>	 (03CR) 10David Caro: Configure the new ceph servers with mon and mgr daemons (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[18:24:47] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001"
[18:28:48] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:29:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:29:48] <zabe>	 !log start running "foreachwikiindblist s3.dblist migrateRevisionCommentTemp.php --sleep 2" in screen # T275246
[18:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:57] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[18:31:34] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-be nodes - pt1979@cumin2002"
[18:31:34] <wikibugs>	 (03PS1) 10Urbanecm: Enable Growth features by default on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529
[18:31:57] <wikibugs>	 (03PS2) 10Urbanecm: Enable Growth features by default on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529
[18:32:34] <zabe>	 apparently I somehow managed to break my account on gucwiki :|
[18:32:55] <Bsadowski1>	 oh
[18:36:06] <urbanecm>	 zabe: what does that mean? and anything i can help with?
[18:36:17] <urbanecm>	 gucwiki works for me
[18:37:01] <zabe>	 when I try to login it throws TypeError: Argument 1 passed to MediaWiki\Extension\OATHAuth\Auth\SecondaryAuthenticationProvider::getProviderForModule() must be an instance of MediaWiki\Extension\OATHAuth\IModule, null given,
[18:37:06] <zabe>	 also my global userpage is gone
[18:37:47] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001"
[18:37:51] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1004.eqiad.wmnet with OS bullseye
[18:38:59] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-be nodes - pt1979@cumin2002"
[18:39:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:39:04] <zabe>	 for the error: 2d75670c-06e7-4523-b81b-fb30cc8c96e2
[18:39:17] <urbanecm>	 i tested it with my bot account, and i managed to log in
[18:40:34] <zabe>	 hmm, I still get it.
[18:41:12] <urbanecm>	 your row in oathauth_users seems OK
[18:42:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[18:42:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:43:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[18:43:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:43:52] <zabe>	 ok, clearing global user cache through shell.php worked
[18:44:27] <urbanecm>	 great
[18:47:07] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-1] "Thanks! I want to make a phab task for this, for documentation and to share with the team. But once that is done, we could sync this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529 (owner: 10Urbanecm)
[18:47:32] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[18:47:33] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:51:00] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:56:23] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10bscarone) Thanks @MoritzMuehlenhoff, I am not being able to log in to JupyterHub, who should I contact regarding this issue?
[18:58:16] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:58:23] <taavi>	 zabe: do you know what happened to your account there? I'm worried some of my OATHAuth changes broke something
[18:58:51] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10Urbanecm) 05Resolved→03Open According to [LDAP tool](https://ldap.toolforge.org/user/bscarone), this is missing the `nda` LDAP group, which is requir...
[18:59:03] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[18:59:56] <zabe>	 tbh, not really, I just guess it is due to the account being autocreated at a time where the new didn't exist on all wikis yet, although still this never happened before
[19:00:34] <wikibugs>	 (03CR) 10Raymond Ndibe: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39855/console" [puppet] - 10https://gerrit.wikimedia.org/r/892446 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe)
[19:01:49] <wikibugs>	 (03PS1) 10Dzahn: planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538
[19:01:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538 (owner: 10Dzahn)
[19:02:12] <wikibugs>	 (03PS2) 10Dzahn: planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538
[19:02:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10KFrancis) >>! In T330364#8643699, @MoritzMuehlenhoff wrote: >>>! In T330364#8643473, @SLyngshede-WMF wrote: >> @KFrancis Given that this is a reactivatio...
[19:03:50] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:03:55] <zabe>	 * where the new wiki didn't exist on all appservers yet
[19:04:01] <zabe>	 not sure what I wrote above
[19:07:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) `Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 42%`
[19:09:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2070.mgmt.codfw.wmnet with reboot policy FORCED
[19:11:12] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:13:30] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10KFrancis) @JMeybohm Please provide Norman Schwirz's email address and I'll put the agreement together.  Please send it to kfrancis@wikimedia.org if you'd rather not post it here.
[19:14:25] <Jhs>	 Amir1, could you do your magic for gucwiki in Wikidata? :)
[19:14:38] <Amir1>	 sure
[19:14:59] <Jhs>	 🎉
[19:16:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10MoritzMuehlenhoff) >>! In T330364#8650188, @Urbanecm wrote: > According to [LDAP tool](https://ldap.toolforge.org/user/bscarone), this is missing the `nd...
[19:18:36] <icinga-wm>	 PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:18:51] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10bscarone) @MoritzMuehlenhoff works now, thanks for the quick response!
[19:18:55] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[19:23:26] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena)
[19:25:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[19:25:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:25:43] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata)
[19:26:33] <wikibugs>	 (03PS1) 10Sbailey: enable Linter use namespace field and tag and template UI in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892543 (https://phabricator.wikimedia.org/T175177)
[19:28:52] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813)
[19:29:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) (owner: 10Zabe)
[19:30:18] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[19:30:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:30:27] <wikibugs>	 (03PS2) 10Zabe: Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813)
[19:31:11] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena)
[19:32:09] <zabe>	 jouncebot: nowandnext
[19:32:09] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 27 minute(s)
[19:32:09] <jouncebot>	 In 1 hour(s) and 27 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T2100)
[19:32:20] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) (owner: 10Zabe)
[19:33:08] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) (owner: 10Zabe)
[19:33:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[19:33:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:34:36] <zabe>	 !log create Wikipedia Farefare (Gurene) # T327813
[19:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:41] <stashbot>	 T327813: Create Wikipedia Farefare (Gurene) - https://phabricator.wikimedia.org/T327813
[19:35:04] <logmsgbot>	 !log zabe@deploy1002 Started scap: create gurwiki T327813
[19:36:31] <TheresNoTime>	 zabe: am I okay to 'deploy' a beta-only change (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/892543) ?
[19:37:01] <zabe>	 TheresNoTime: yep
[19:37:02] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2070.mgmt.codfw.wmnet with reboot policy FORCED
[19:38:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[19:38:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:38:30] <logmsgbot>	 !log samtar@deploy1002 Backport cancelled.
[19:38:41] <TheresNoTime>	 (okay that really doesn't need to log)
[19:38:57] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "beta deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892543 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey)
[19:39:44] <wikibugs>	 (03Merged) 10jenkins-bot: enable Linter use namespace field and tag and template UI in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892543 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey)
[19:40:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[19:40:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:42:24] <logmsgbot>	 !log zabe@deploy1002 Finished scap: create gurwiki T327813 (duration: 07m 19s)
[19:42:28] <stashbot>	 T327813: Create Wikipedia Farefare (Gurene) - https://phabricator.wikimedia.org/T327813
[19:45:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[19:45:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:50:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[19:50:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:50:56] <wikibugs>	 (03PS1) 10Gmodena: page-content-change: docker image version bump. [deployment-charts] - 10https://gerrit.wikimedia.org/r/892549
[19:55:05] <urandom>	 !log power cycling restbase1026
[19:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[19:55:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:56:22] <icinga-wm>	 PROBLEM - Host restbase1026 is DOWN: PING CRITICAL - Packet loss = 100%
[19:57:24] <icinga-wm>	 RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:57:26] <icinga-wm>	 RECOVERY - Host restbase1026 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[19:57:28] <icinga-wm>	 PROBLEM - confd service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:36] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:57:36] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:57:42] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:42] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:58:06] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:58:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[19:58:23] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:58:54] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:59:18] <icinga-wm>	 RECOVERY - confd service on restbase1026 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:59:56] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1026 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:00:42] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:01:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1026 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:01:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:02:06] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-b valid until 2025-02-21 18:43:46 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:02:06] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2025-02-21 18:43:48 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:02:16] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.181:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.181 port 9042 https://phabricator.wikimedia.org/T93886
[20:02:30] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.182 port 9042 https://phabricator.wikimedia.org/T93886
[20:02:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538 (owner: 10Dzahn)
[20:03:12] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-a valid until 2025-02-21 18:43:44 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:03:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[20:03:23] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:03:30] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.180:9042 on restbase1026 is OK: TCP OK - 0.001 second response time on 10.64.48.180 port 9042 https://phabricator.wikimedia.org/T93886
[20:03:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] peopleweb: add bacula file set srv-org-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/891920 (owner: 10Dzahn)
[20:05:47] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[20:06:05] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:06:40] <MatmaRex>	 hi, could someone check on a maintenance script run for me? it probably has finished, but i'd like to confirm. https://phabricator.wikimedia.org/T315510#8630577
[20:07:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:08:29] <taavi>	 MatmaRex: still running, 'Processed 1306600 (updated 2293) of 7328137 rows'
[20:08:57] <mutante>	 is mwmaint failing over to other DC? I haven't checked
[20:09:12] <taavi>	 I would assume it is
[20:09:13] <MatmaRex>	 taavi: hmm, thanks
[20:09:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[20:09:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:10:05] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[20:11:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:11:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002"
[20:12:35] <wikibugs>	 (PS2) Zabe: Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:12:55] <wikibugs>	 (PS3) Zabe: Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:13:53] <wikibugs>	 (CR) CI reject: [V: -1] Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:16:41] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:17:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002"
[20:17:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:20:32] <wikibugs>	 (PS4) Zabe: Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:22:25] <wikibugs>	 (PS5) Zabe: Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:23:45] <wikibugs>	 (PS6) Zabe: Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:24:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[20:24:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:24:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED
[20:25:55] <wikibugs>	 (CR) Zabe: [C: +2] Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:26:42] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: Urbanecm)
[20:29:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[20:29:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:29:52] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED
[20:30:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED
[20:31:14] <wikibugs>	 (PS1) BCornwall: config: Add brett for router configuration [homer/public] - https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309)
[20:31:51] <wikibugs>	 (CR) CI reject: [V: -1] config: Add brett for router configuration [homer/public] - https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[20:33:24] <wikibugs>	 (PS1) Zabe: Set language code for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892552 (https://phabricator.wikimedia.org/T320890)
[20:33:34] <wikibugs>	 (CR) Zabe: [C: +2] Set language code for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892552 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[20:33:40] <wikibugs>	 (PS2) BCornwall: config: Add brett for router configuration [homer/public] - https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309)
[20:34:17] <wikibugs>	 (Merged) jenkins-bot: Set language code for vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892552 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[20:34:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[20:34:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:35:46] <mutante>	 zabe: you just became a Wikidata Q number, heh
[20:35:53] <wikibugs>	 (CR) Ottomata: [C: +2] page-content-change: docker image version bump. [deployment-charts] - https://gerrit.wikimedia.org/r/892549 (owner: Gmodena)
[20:36:28] <zabe>	 heh
[20:36:59] <mutante>	 because I liked to add a value for "creator" for the Wikipedia editions
[20:37:08] <mutante>	 and that's you
[20:38:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED
[20:38:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2072.mgmt.codfw.wmnet with reboot policy FORCED
[20:39:27] <zabe>	 !log create Wikimedia Venezuela wiki # T320890
[20:39:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:32] <stashbot>	 T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890
[20:40:21] <TheresNoTime>	 *cough cough* any chance I can do the next wiki creation then? ^^'
[20:40:30] <taavi>	 yeah I should probably do one too :D
[20:40:32] <logmsgbot>	 !log zabe@deploy1002 Started scap: create vewikimedia T320890
[20:40:42] <zabe>	 :p
[20:41:24] <TheresNoTime>	 free wikidata number :D
[20:41:42] <zabe>	 just doing wiki number 1000
[20:41:57] <TheresNoTime>	 woo!
[20:42:01] <RhinosF1>	 Ace!
[20:42:13] <mutante>	 party time :)
[20:42:40] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:42:41] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:43:06] <TheresNoTime>	 zabe: are you going to be clear by the deployment window in 15m?
[20:43:30] <zabe>	 I should
[20:44:09] <mutante>	 sign up at https://phabricator.wikimedia.org/project/profile/2941/ :)
[20:44:33] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[20:44:33] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:44:35] <wikibugs>	 (PS1) Zabe: Enable Translate on vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892553 (https://phabricator.wikimedia.org/T320890)
[20:45:16] <wikibugs>	 (CR) Zabe: [C: +2] Enable Translate on vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892553 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[20:45:59] <wikibugs>	 (Merged) jenkins-bot: Enable Translate on vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892553 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[20:48:02] <logmsgbot>	 !log zabe@deploy1002 Finished scap: create vewikimedia T320890 (duration: 07m 29s)
[20:48:07] <stashbot>	 T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890
[20:48:41] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[20:48:47] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[20:49:02] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/891884
[20:49:04] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/891884 (owner: Zabe)
[20:49:33] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[20:49:33] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:49:44] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/891884 (owner: Zabe)
[20:49:50] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[20:50:15] <logmsgbot>	 !log zabe@deploy1002 Started scap: install Translate on vewikimedia and update interwiki cache
[20:50:21] <logmsgbot>	 !log zabe@deploy1002 sync-world aborted: install Translate on vewikimedia and update interwiki cache (duration: 00m 06s)
[20:50:24] <logmsgbot>	 !log zabe@deploy1002 Started scap: install Translate on vewikimedia and update interwiki cache T320890
[20:52:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2072.mgmt.codfw.wmnet with reboot policy FORCED
[20:52:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2073.mgmt.codfw.wmnet with reboot policy FORCED
[20:53:24] <wikibugs>	 ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330372 (Papaul) Open→Resolved This was one of the new ms-be nodes; it should be good now
[20:55:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[20:55:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:57:50] <logmsgbot>	 !log zabe@deploy1002 Finished scap: install Translate on vewikimedia and update interwiki cache T320890 (duration: 07m 26s)
[20:57:55] <stashbot>	 T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T2100).
[21:00:04] <jouncebot>	 Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] <taavi>	 ok, who wants to do the backports? :D
[21:00:14] <Superpes>	 Hi :D
[21:00:17] * TheresNoTime can deploy
[21:00:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[21:00:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:00:26] <TheresNoTime>	 zabe: clear to?
[21:00:31] <taavi>	 also what's up with the flapping latency alerts?
[21:00:51] <TheresNoTime>	 (no idea, been flapping for a few hours iirc)
[21:01:00] <zabe>	 yep, have fun
[21:01:12] <wikibugs>	 (PS3) Samtar: [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: Superpes15)
[21:01:12] <wikibugs>	 ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330701 (phaultfinder)
[21:02:19] <TheresNoTime>	 Superpes: starting with 891814 :)
[21:02:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: Superpes15)
[21:02:30] <Superpes>	 TheresNoTime Perfect ;)
[21:03:08] <RhinosF1>	 taavi: has something happened to prep for tomorrow/Wednesday
[21:03:10] <wikibugs>	 (Merged) jenkins-bot: [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: Superpes15)
[21:03:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2073.mgmt.codfw.wmnet with reboot policy FORCED
[21:03:27] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:891814|[eswiki] Create new 'templateeditor' usergroup and protection level (T330470)]]
[21:03:31] <stashbot>	 T330470: New protection level and right to edit through it for eswiki - https://phabricator.wikimedia.org/T330470
[21:04:17] <zabe>	 !log zabe@mwmaint1002:~$ mwscript createAndPromote.php --wiki vewikimedia --bureaucrat Zabe REDACTED
[21:04:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[21:04:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:06] <logmsgbot>	 !log samtar@deploy1002 superpes and samtar: Backport for [[gerrit:891814|[eswiki] Create new 'templateeditor' usergroup and protection level (T330470)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[21:05:27] <TheresNoTime>	 Superpes: that's live on (any) mwdebug, can you test? (Have you done a backport before?)
[21:05:31] <Superpes>	 Checking :)
[21:05:35] <TheresNoTime>	 cool :)
[21:05:50] <wikibugs>	 (PS4) Samtar: [extwiki] Change wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: Superpes15)
[21:05:54] <wikibugs>	 (CR) RLazarus: Switch deployment server to deploy2002.codfw.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: Clément Goubert)
[21:06:26] <Superpes>	 TheresNoTime everything seems fine! Thanks :)
[21:06:31] <TheresNoTime>	 syncing
[21:07:48] <wikibugs>	 SRE-swift-storage, Data-Engineering, Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata)
[21:09:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[21:09:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:10:51] <wikibugs>	 (PS1) Samtar: Add Apache configuration for amical.wikimedia [puppet] - https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390)
[21:12:09] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:891814|[eswiki] Create new 'templateeditor' usergroup and protection level (T330470)]] (duration: 08m 42s)
[21:12:14] <stashbot>	 T330470: New protection level and right to edit through it for eswiki - https://phabricator.wikimedia.org/T330470
[21:12:23] <TheresNoTime>	 Superpes: that should be live now :) starting 892467
[21:12:32] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: Superpes15)
[21:12:37] <Superpes>	 TheresNoTime Thanks :P
[21:13:19] <wikibugs>	 (Merged) jenkins-bot: [extwiki] Change wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: Superpes15)
[21:13:34] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:892467|[extwiki] Change wordmark and tagline (T330588)]]
[21:13:39] <stashbot>	 T330588: Extremaduran Wikipedia - Updates in the address bar and in some versions of the wiki - https://phabricator.wikimedia.org/T330588
[21:13:56] <LuchoCR>	 It's live https://usercontent.irccloud-cdn.com/file/4DDeRrpe/image.png
[21:13:59] <LuchoCR>	 Tyvm! 
[21:15:14] <logmsgbot>	 !log samtar@deploy1002 samtar and superpes: Backport for [[gerrit:892467|[extwiki] Change wordmark and tagline (T330588)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:15:25] <TheresNoTime>	 Superpes: that's live on mwdebug, can you test?
[21:15:41] <Superpes>	 Yep it works :D TheresNoTime
[21:15:55] <herzog>	 hello LuchoCR
[21:15:56] <wikibugs>	 SRE-swift-storage, Data-Engineering, Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena)
[21:16:03] <TheresNoTime>	 ack
[21:21:49] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:892467|[extwiki] Change wordmark and tagline (T330588)]] (duration: 08m 14s)
[21:21:53] <stashbot>	 T330588: Extremaduran Wikipedia - Updates in the address bar and in some versions of the wiki - https://phabricator.wikimedia.org/T330588
[21:22:10] <TheresNoTime>	 Superpes: should be live (and have purged the cache)
[21:22:19] <Superpes>	 Wonderful!!!
[21:22:32] <Superpes>	 Many thanks for your time and support TheresNoTime :D
[21:22:43] <TheresNoTime>	 you're very welcome :)
[21:22:55] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) After multiple reviews, fixes, and the last translations being done, the message has been sent to 832 c...
[21:24:52] <wikibugs>	 SRE-swift-storage, Data-Engineering, Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata)
[21:25:21] <TheresNoTime>	 !log close UTC late backport window
[21:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:08] <wikibugs>	 (PS1) Dzahn: httpbb: update/fix tests for miscweb [puppet] - https://gerrit.wikimedia.org/r/892564 (https://phabricator.wikimedia.org/T330090)
[21:28:37] <wikibugs>	 (CR) Dzahn: [C: +2] httpbb: update/fix tests for miscweb [puppet] - https://gerrit.wikimedia.org/r/892564 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn)
[21:28:50] <wikibugs>	 (CR) Dzahn: [C: +2] "[deploy1002:~] $ httpbb --hosts miscweb2002.codfw.wmnet ./test_miscweb.yaml" [puppet] - https://gerrit.wikimedia.org/r/892564 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn)
[21:36:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2070']
[21:50:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[21:50:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:53:29] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2070']
[21:53:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2071']
[21:55:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[21:55:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:56:39] <wikibugs>	 (CR) Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (1 comment) [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[21:58:24] <wikibugs>	 (PS65) Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[21:58:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:58:49] <wikibugs>	 (PS1) Zabe: Remove vewikimedia from deleted wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892566 (https://phabricator.wikimedia.org/T320890)
[21:58:52] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:16] <wikibugs>	 (CR) Zabe: [C: +2] Remove vewikimedia from deleted wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892566 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[22:00:02] <wikibugs>	 (Merged) jenkins-bot: Remove vewikimedia from deleted wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892566 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T2200)
[22:01:31] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:892566|Remove vewikimedia from deleted wikis (T320890)]]
[22:01:35] <stashbot>	 T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890
[22:02:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[22:02:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:02:42] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2071']
[22:02:56] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39856/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[22:03:16] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:892566|Remove vewikimedia from deleted wikis (T320890)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[22:04:27] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:04:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2072']
[22:07:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[22:07:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:09:02] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:892566|Remove vewikimedia from deleted wikis (T320890)]] (duration: 07m 30s)
[22:09:07] <stashbot>	 T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890
[22:15:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:15:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[22:15:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:15:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:16:12] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[22:16:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:18:30] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2072']
[22:18:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2073']
[22:19:26] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[22:19:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:25:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[22:25:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:26:04] <wikibugs>	 (PS1) Jon Harald Søby: Add `guc` and `gur` to InterwikiSortOrders [mediawiki-config] - https://gerrit.wikimedia.org/r/892568
[22:29:50] <wikibugs>	 (CR) Dzahn: [C: +2] switch annual.wikimedia.org from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/891406 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn)
[22:30:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[22:30:17] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:31:11] <ryankemper>	 !log [apifeatureusage] T329957 Restarted `logstash` on `apifeatureusage[1-2]001`
[22:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:16] <stashbot>	 T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957
[22:31:45] <wikibugs>	 (PS1) RLazarus: mediawiki-cache-warmup: Rename `Request` to `Task` [puppet] - https://gerrit.wikimedia.org/r/892569 (https://phabricator.wikimedia.org/T290989)
[22:31:47] <wikibugs>	 (PS1) RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989)
[22:35:08] <wikibugs>	 (PS2) RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989)
[22:35:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[22:35:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:35:48] <wikibugs>	 (03PS3) 10Ryan Kemper: [WIP] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167)
[22:35:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10MoritzMuehlenhoff) 05Open→03Resolved Great :-) Closing the task, then.
[22:35:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper)
[22:36:54] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2073']
[22:36:57] <mutante>	 !log switching https://annual.wikimedia.org from eqiad to codfw T330090
[22:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:02] <stashbot>	 T330090: Switchover static miscweb services to codfw - https://phabricator.wikimedia.org/T330090
[22:40:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "all tests pass - using new discovery name and using codfw deployment server" [puppet] - 10https://gerrit.wikimedia.org/r/891406 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[22:42:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2070']
[22:43:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[22:43:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:48:18] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[22:48:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:49:34] <wikibugs>	 (03PS1) 10Dzahn: switch https://15.wikipedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892571 (https://phabricator.wikimedia.org/T330090)
[22:52:02] <wikibugs>	 (03PS4) 10Ryan Kemper: Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167)
[22:52:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper)
[22:54:38] <wikibugs>	 (03PS5) 10Ryan Kemper: Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167)
[22:59:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[22:59:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:00:10] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:00:49] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:00:50] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[23:01:07] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:04:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[23:04:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:07:06] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:09:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] switch https://15.wikipedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892571 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[23:09:21] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:11:09] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Add `guc` and `gur` to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby)
[23:11:38] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] "Thanks for catching this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby)
[23:11:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby)
[23:11:55] <wikibugs>	 (03Merged) 10jenkins-bot: Add `guc` and `gur` to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby)
[23:12:09] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:892568|Add `guc` and `gur` to InterwikiSortOrders]]
[23:12:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[23:12:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:13:59] <logmsgbot>	 !log zabe@deploy1002 jhsoby and zabe: Backport for [[gerrit:892568|Add `guc` and `gur` to InterwikiSortOrders]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[23:15:15] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:17:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[23:17:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:18:17] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) firing: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[23:19:51] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:892568|Add `guc` and `gur` to InterwikiSortOrders]] (duration: 07m 41s)
[23:19:52] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:23:17] <jinxer-wm>	 (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[23:25:52] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:26:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[23:26:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:26:19] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[23:31:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[23:31:18] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:32:12] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:32:30] <wikibugs>	 (03PS1) 10Dzahn: switch https://bienvenida.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892574 (https://phabricator.wikimedia.org/T330090)
[23:34:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] switch https://bienvenida.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892574 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn)
[23:37:45] * herzog wonders what ^ is about
[23:40:23] <mutante>	 herzog: well, see https://bienvenida.wikimedia.org/
[23:40:28] * zabe wonders about lists being listed as out of scope on T329193
[23:40:34] <mutante>	 "here is some music from Latin America and download the app"
[23:40:38] <herzog>	 mutante: I did
[23:40:49] <zabe>	 how could something be out of scope when this is supposed to test emergency failover capabilities
[23:42:26] <mutante>	 herzog: https://phabricator.wikimedia.org/T207816
[23:42:38] <mutante>	 Mexico Awareness
[23:42:50] <wikibugs>	 (03PS1) 10Samtar: Initial configuration for amicalwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892575 (https://phabricator.wikimedia.org/T330390)
[23:43:52] <wikibugs>	 (03CR) 10Zabe: Add Apache configuration for amical.wikimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) (owner: 10Samtar)
[23:45:01] <herzog>	 mutante: 2018 campaign, site still needed?
[23:45:30] <zabe>	 we tend to not delete stuff
[23:46:28] <mutante>	 herzog: yea, URLs are needed forever because of https://www.w3.org/Provider/Style/URI and if you do delete one it means new work to add rewrite rules
[23:46:57] <mutante>	 there's nothing to gain from deleting a virtual host on miscweb only to have to add it on the cluster
[23:47:33] <wikibugs>	 (03PS2) 10Samtar: Add Apache configuration for amical.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390)
[23:48:23] <wikibugs>	 (03CR) 10Samtar: Add Apache configuration for amical.wikimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) (owner: 10Samtar)
[23:49:48] <mutante>	 but we can move it to k8s. then there won't be failovers like above anymore
[23:50:20] <mutante>	 also see nostalgia.wikipedia.org
[23:51:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:52:23] <mutante>	 maybe that 9/11 wiki was actually deleted
[23:53:16] <mutante>	 https://sep11.wikipedia.org
[23:54:16] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)