[00:38:39] (NodeTextfileStale) firing: (12) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:23:59] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Tgr) Just as a heads-up: we recently increased AQS traffic from MediaWiki PHP code (T324675) which seems to work fine (it's causing some timeouts:...
[02:01:24] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:34:01] (CR) Ladsgroup: [C: +1] "I checked all of them with orch/dbctl and they all are correct."
[dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[04:38:39] (NodeTextfileStale) firing: (12) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:29:58] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Joe) >>! In T327920#8647335, @Tgr wrote: > Just as a heads-up: we recently increased AQS traffic from MediaWiki PHP code (T324675) which seems to w...
[06:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:18:55] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Tgr) MwHttpRequest (that is, Guzzle/php-curl) and the URLs from https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews. I don't know if RESTBa...
[07:42:25] !log Enable replication codfw -> eqiad on pcX T330619
[07:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:30] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[07:45:26] (CR) Muehlenhoff: [C: +2] Add d-i config for Bookworm [puppet] - https://gerrit.wikimedia.org/r/891832 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[07:45:49] (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[07:49:31] (PS1) Elukey: admin_ng: add configuration for the istio knative local gateway [deployment-charts] - https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767)
[07:51:39] (PS2) Elukey: admin_ng: add configuration for the istio knative local gateway [deployment-charts] - https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767)
[07:53:54] (CR) Jelto: [C: +1] Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[07:55:15] (CR) Jelto: [C: +1] "lgtm, should be merged during the switchover" [puppet] - https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[07:57:17] !log Enable replication codfw -> eqiad on x1 T330619
[07:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:21] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[07:57:21] (CR) Muehlenhoff: "Merging since Simon is off this week."
[puppet] - https://gerrit.wikimedia.org/r/891798 (https://phabricator.wikimedia.org/T330364) (owner: Slyngshede)
[07:57:23] (CR) Muehlenhoff: [C: +2] Access to analytics-privatedata-users for bscarone [puppet] - https://gerrit.wikimedia.org/r/891798 (https://phabricator.wikimedia.org/T330364) (owner: Slyngshede)
[08:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:42] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:19] (CR) Jelto: [C: +1] "lgtm, should be merged during the switchover. We should carefully rebase this once the TTL change is merged" [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[08:05:58] (PS1) Muehlenhoff: Readd email address for reactivated bscarone account [puppet] - https://gerrit.wikimedia.org/r/892355 (https://phabricator.wikimedia.org/T330364)
[08:07:35] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[08:17:59] (PS1) Nicolas Fraison: presto: add gc tag -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution [puppet] - https://gerrit.wikimedia.org/r/892357
[08:21:34] (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39838/console" [puppet] - https://gerrit.wikimedia.org/r/892357 (owner: Nicolas Fraison)
[08:21:57] (CR) Nicolas Fraison: [V: +1 C: +2] presto: add gc tag -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution [puppet] - https://gerrit.wikimedia.org/r/892357
(owner: Nicolas Fraison)
[08:25:31] (CR) Elukey: [C: +1] "LGTM! To be safe please disable puppet on an-coord100[12] before merging and test the change on an-test-coord1001 first." [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison)
[08:29:13] (PS3) Nicolas Fraison: hive: add gc logs to hiveservers and hive-metastore [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168)
[08:30:36] (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39839/console" [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison)
[08:32:25] !log Enable replication codfw -> eqiad on es4 and es5 T330619
[08:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:30] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[08:36:05] (CR) Muehlenhoff: [C: +2] Readd email address for reactivated bscarone account [puppet] - https://gerrit.wikimedia.org/r/892355 (https://phabricator.wikimedia.org/T330364) (owner: Muehlenhoff)
[08:38:39] (NodeTextfileStale) firing: (12) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:46:19] SRE, SRE-Access-Requests, Research, Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @bscarone I have activated your access. You should have also gotte...
[08:51:36] !log Enable replication codfw -> eqiad on s2 T330619
[08:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:41] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[08:51:52] !log Disable GTID on es% x1 and s% on codfw masters T330619
[08:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:15] (PS2) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[08:53:10] (CR) Jelto: [C: -1] "see in-ine" [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[08:54:57] !log test haproxy hardening in cp4045 - T323944
[08:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:01] T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944
[08:55:01] (CR) Vgutierrez: [C: +2] hiera: enable haproxy systemd hardening on cp4045 [puppet] - https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: Ssingh)
[08:56:00] !log updating mw/codfw to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2 T330270
[08:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:30] (PS2) EoghanGaffney: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931)
[09:00:04] hashar and jnuche: Time to snap out of that daydream and deploy Jenkins upgrade. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T0900).
[09:00:10] (PS3) EoghanGaffney: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931)
[09:01:34] (PS3) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[09:01:43] (PS4) Jbond: Add bookworm pxelinux.cfg [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[09:01:45] (PS1) Jbond: spdx: update spdx new files to ignore files regardless of path [puppet] - https://gerrit.wikimedia.org/r/892361
[09:02:42] (CR) Jbond: [C: +2] spdx: update spdx new files to ignore files regardless of path [puppet] - https://gerrit.wikimedia.org/r/892361 (owner: Jbond)
[09:03:13] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[09:03:40] jelto: can you leave us gitlab up for a few minutes? jnuche and I would like to update the CI Jenkins right now
[09:04:13] or maybe it is not even needed ;)
[09:04:14] hashar: GitLab maintenance is planned for 10UTC, so in one hour
[09:04:19] AH great
[09:04:46] so we have ample time
[09:04:49] thank you!
[09:06:24] (PS4) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931)
[09:06:38] (CR) EoghanGaffney: Lower TTL on gitlab records to 300 seconds to facilitate failover (1 comment) [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:07:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:10:26] (CR) Marostegui: [C: +1] "We can do pcX later on." [dns] - https://gerrit.wikimedia.org/r/891552 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[09:12:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:12:45] !log Enable replication codfw -> eqiad on s3 T330619
[09:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:50] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:13:04] latency spike since 8:57
[09:13:19] for parsoid
[09:13:54] eqiad only
[09:14:51] (PS5) Jelto: Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] -
https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:15:17] !log hashar@deploy1002 Started deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided)
[09:15:30] we are doing the Jenkins updates
[09:16:04] !log hashar@deploy1002 Finished deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided) (duration: 00m 46s)
[09:16:26] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: Ayounsi)
[09:16:40] (PS1) Filippo Giunchedi: sre: more readable varnish/haproxy frontend unavailable [alerts] - https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405)
[09:16:58] (CR) Elukey: [C: +2] Re-image: clear DHCP cache sooner [cookbooks] - https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: Ayounsi)
[09:17:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:19:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:20:08] !log Restarting CI Jenkins T330045
[09:20:10] (CR) Jelto: [C: +1] "IPv6 PTR record for gitlab.wikimedia.org was missing, I amended it with 300 seconds."
[dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:13] T330045: Upgrade Jenkins to latest LTS 2.375.3 - https://phabricator.wikimedia.org/T330045
[09:20:27] (CR) Jelto: [C: +2] Lower TTL on gitlab records to 300 seconds to facilitate failover [dns] - https://gerrit.wikimedia.org/r/891886 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:21:50] (PS1) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 7th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892363 (https://phabricator.wikimedia.org/T304551)
[09:21:52] (PS1) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 8th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892364 (https://phabricator.wikimedia.org/T308133)
[09:21:54] (PS1) Sergio Gimeno: GrowthExperiments: Enable link recommendation for 9th round wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/892365 (https://phabricator.wikimedia.org/T308134)
[09:21:57] latency seems trending down for parsoid
[09:22:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:24:18] (CR) Volans: [C: +2] icinga: uniform code and add test [software/spicerack] - https://gerrit.wikimedia.org/r/891803 (owner: Volans)
[09:26:15] (PS1) Volans: OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495)
[09:26:30] !log
Enable replication codfw -> eqiad on s8 T330619
[09:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:34] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:27:44] (CR) Jcrespo: "I think this is more readable and less confusing (among a lot more work to be done in this regard), but it should be ultimately the servic" [alerts] - https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) (owner: Filippo Giunchedi)
[09:27:52] !log Enable replication codfw -> eqiad on s7 T330619
[09:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:05] (Merged) jenkins-bot: icinga: uniform code and add test [software/spicerack] - https://gerrit.wikimedia.org/r/891803 (owner: Volans)
[09:28:48] (CR) Muehlenhoff: [C: +1] OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495) (owner: Volans)
[09:30:15] (PS1) Marostegui: realm.pp: Add private tables [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502)
[09:31:08] (PS4) Jelto: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:31:47] (CR) Volans: [C: +2] OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495) (owner: Volans)
[09:32:31] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:33:14] (CR) Nicolas Fraison: [V: +1 C: +2] hive: add gc logs to hiveservers and hive-metastore [puppet] - https://gerrit.wikimedia.org/r/891850 (https://phabricator.wikimedia.org/T303168) (owner: Nicolas Fraison)
[09:33:18] (PS6) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[09:33:20] (PS5) Jelto: Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:33:47] (Merged) jenkins-bot: OS_VERSIONS: add bookworm to the allowed versions [cookbooks] - https://gerrit.wikimedia.org/r/892368 (https://phabricator.wikimedia.org/T330495) (owner: Volans)
[09:34:27] !log Enable replication codfw -> eqiad on s6 T330619
[09:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:32] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:34:39] jelto: we have completed the Jenkins upgrades ;]
[09:35:00] (CR) Jelto: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: EoghanGaffney)
[09:35:16] hashar: thanks for letting us know!
[09:35:39] (CR) Jbond: "thanks updated" [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[09:36:44] (CR) CI reject: [V: -1] redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[09:36:51] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:36:59] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:37:53] (PS1) Elukey: Revert "Re-image: clear DHCP cache sooner" [cookbooks] - https://gerrit.wikimedia.org/r/891981
[09:38:17] (CR) Elukey: [C: +2] "The facter command fails to run, reverting.."
[cookbooks] - https://gerrit.wikimedia.org/r/891848 (https://phabricator.wikimedia.org/T306421) (owner: Ayounsi)
[09:39:29] !log Enable replication codfw -> eqiad on s5 T330619
[09:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:34] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:39:44] (PS3) Majavah: P:toolforge::k8s::haproxy: remove support for hash node list [puppet] - https://gerrit.wikimedia.org/r/891826
[09:39:46] (PS5) Majavah: P:toolforge::k8s::haproxy: add api gateway load balancer [puppet] - https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443)
[09:39:48] (PS1) Majavah: P:toolforge: use api gateway for jobs cli [puppet] - https://gerrit.wikimedia.org/r/892370 (https://phabricator.wikimedia.org/T329443)
[09:39:59] (CR) Klausman: [C: +1] admin_ng: add configuration for the istio knative local gateway [deployment-charts] - https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:40:36] (CR) Elukey: [C: +2] Revert "Re-image: clear DHCP cache sooner" [cookbooks] - https://gerrit.wikimedia.org/r/891981 (owner: Elukey)
[09:43:50] (PS7) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[09:44:54] !log Enable replication codfw -> eqiad on s1 T330619
[09:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:59] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[09:46:45] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:46:52] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:47:30] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for
host cloudcephosd1003.eqiad.wmnet with OS bullseye
[09:49:49] (CR) Marostegui: "This can be merged anytime btw" [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[09:52:16] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:53:22] elukey: could that be you?^
[10:00:45] SRE-OnFire, Observability-Alerting, SRE Observability (FY2022/2023-Q3), User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (fgiunchedi) I think the titles are indeed far easier to read and already led to other improvements (T330405)...
[10:02:23] (CR) Jaime Nuche: scap: add required Python3 venv package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[10:04:40] (CR) Clément Goubert: [C: -2] "Preparation for 2023-02-28 datacenter service switchover." [dns] - https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[10:05:29] (CR) Clément Goubert: [C: -2] "Preparation for 2023-02-28 datacenter service switchover."
[puppet] - https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T327920) (owner: Clément Goubert)
[10:05:33] marostegui: o/ in theory no, I didn't merge puppet changes today
[10:05:38] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[10:06:28] elukey: ah ok, I saw the +2 from you
[10:06:48] (CR) Clément Goubert: [C: +1] re-introduce webserver-misc-static.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090) (owner: Dzahn)
[10:07:06] it looks like it is from nfraison
[10:07:10] nfraison, jbond --^
[10:07:15] there are two commits from you
[10:07:46] (CR) Clément Goubert: [C: +2] sre.switchdc.mediawiki: use python warmup script [cookbooks] - https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: Clément Goubert)
[10:07:47] elukey: yes I've requested jbond if I can merge in #sre
[10:08:32] !log Enable replication codfw -> eqiad on s4 T330619
[10:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:37] T330619: Enable DB replication eqiad -> codfw before the switchover - https://phabricator.wikimedia.org/T330619
[10:09:15] nfraison: ahh okok
[10:09:45] (Merged) jenkins-bot: sre.switchdc.mediawiki: use python warmup script [cookbooks] - https://gerrit.wikimedia.org/r/891520 (https://phabricator.wikimedia.org/T288867) (owner: Clément Goubert)
[10:10:19] (PS8) Jbond: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363
[10:11:52] (PS1) Jbond: ceph: remove cloud data [labs/private] - https://gerrit.wikimedia.org/r/892376
[10:13:38] (PS1) Elukey: sre.hosts.reimage: add full path for facter and run clear dchp earlier [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421)
[10:13:51] (PS9) Jbond: redfish: Add simple supermicro class
[software/spicerack] - https://gerrit.wikimedia.org/r/885363
[10:17:50] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931
[10:17:54] T329931: Switchover gitlab (gitlab1004 -> gitlab2002) - https://phabricator.wikimedia.org/T329931
[10:18:03] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931
[10:18:16] (CR) Volans: sre.hosts.reimage: add full path for facter and run clear dchp earlier (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:19:07] !log live testing cache warmup cookbook
[10:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:24] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches
[10:20:11] (CR) Jbond: [C: +2] redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[10:20:34] (CR) Btullis: [V: +2 C: +2] "Many thanks jbond."
[labs/private] - https://gerrit.wikimedia.org/r/892376 (owner: Jbond)
[10:20:51] (CR) Ladsgroup: [C: +1] "do you want me to merge it (and restart sanitarium hosts and their masters?)" [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[10:21:11] (CR) Marostegui: realm.pp: Add private tables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[10:21:16] (CR) Marostegui: [C: +2] realm.pp: Add private tables [puppet] - https://gerrit.wikimedia.org/r/892369 (https://phabricator.wikimedia.org/T330502) (owner: Marostegui)
[10:22:00] (PS2) Elukey: sre.hosts.reimage: add full path for facter and run clear dchp earlier [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421)
[10:22:09] (CR) Elukey: sre.hosts.reimage: add full path for facter and run clear dchp earlier (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:22:30] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=0)
[10:23:24] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[10:23:35] (CR) Volans: [C: +1] "LGTM to try again" [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:23:42] PROBLEM - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused
https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[10:23:43] !log dcaro@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1003.eqiad.wmnet with OS bullseye
[10:23:47] (Merged) jenkins-bot: redfish: Add simple supermicro class [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[10:23:54] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003']
[10:24:06] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[10:24:17] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/885363 (owner: Jbond)
[10:24:51] (CR) Muehlenhoff: [C: +2] Add bookworm pxelinux.cfg [puppet] - https://gerrit.wikimedia.org/r/891839 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:25:00] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[10:25:11] (CR) Elukey: [C: +2] sre.hosts.reimage: add full path for facter and run clear dchp earlier [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:25:45] (JobUnavailable) firing: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:26:12] !log Restart codfw sanitarium hosts T330502
[10:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:16] T330502: Create oathauth_types and oathauth_devices tables - https://phabricator.wikimedia.org/T330502
[10:26:20] (CR) Muehlenhoff: [C: +2] Blacklist f2fs [puppet] - https://gerrit.wikimedia.org/r/891817 (owner: Muehlenhoff)
[10:26:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[10:26:57] (CR) Jbond: "i see this is merged so i wouldn't worry about comments below unless you end up touching things again 😊" [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:29:41] (CR) Volans: [C: +1] sre.hosts.reimage: add full path for facter and run clear dchp earlier (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: Elukey)
[10:30:01] jouncebot: nowandnext
[10:30:01] No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[10:30:01] In 0 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1100)
[10:30:33] (PS2) Jbond: ssh config: Add ControlPath and ControlPersist parameters [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/891568
[10:30:40] (03CR) 10Jbond: [V: 03+2 C: 03+2] ssh config: Add ControlPath and ControlPersist parameters [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/891568 (owner: 10Jbond) [10:31:50] !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1003'] [10:31:59] (03PS1) 10Ladsgroup: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) [10:32:24] !log Restart eqiad sanitarium hosts T330502 [10:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:28] T330502: Create oathauth_types and oathauth_devices tables - https://phabricator.wikimedia.org/T330502 [10:35:03] (03CR) 10Ladsgroup: [C: 03+2] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup) [10:36:15] (03CR) 10Jbond: sre.hosts.reimage: add full path for facter and run clear dchp earlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892377 (https://phabricator.wikimedia.org/T306421) (owner: 10Elukey) [10:38:21] (03PS1) 10Jbond: sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 [10:39:04] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003'] [10:39:07] (03CR) 10Elukey: [C: 03+1] sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond) [10:39:15] (03PS2) 10Jbond: sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 [10:41:22] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond) [10:42:17] (03CR) 10Jbond: "also see comment at:" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond) [10:42:25] (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond) [10:42:45] (03CR) 10Elukey: [C: 03+2] admin_ng: add configuration for the istio knative local gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/892353 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:43:56] !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1003'] [10:44:16] (03Merged) 10jenkins-bot: sre.hosts.reimage: Drop -p and add no-external-facts [cookbooks] - 10https://gerrit.wikimedia.org/r/892378 (owner: 10Jbond) [10:46:44] (03CR) 10CI reject: [V: 04-1] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup) [10:48:28] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:48:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:49:39] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Marostegui) [10:52:53] (03PS1) 10Btullis: Move ceph profile authentication token to the role [labs/private] - 10https://gerrit.wikimedia.org/r/892380 (https://phabricator.wikimedia.org/T324660) [10:53:50] (03PS2) 10David Caro: cloud: add tests for >buster os [puppet] - 10https://gerrit.wikimedia.org/r/891593 [10:53:52] (03CR) 10David Caro: cloud: add tests for >buster os (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro) [10:54:09] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003'] [10:54:24] (03CR) 10Muehlenhoff: [C: 03+2] clamd.conf: Remove some config entries [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [10:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:55:05] (03CR) 10Btullis: [V: 03+2 C: 03+2] Move ceph profile authentication token to the role [labs/private] - 10https://gerrit.wikimedia.org/r/892380 (https://phabricator.wikimedia.org/T324660) (owner: 10Btullis) [10:55:14] (03PS2) 10Ladsgroup: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) [10:56:18] (03CR) 10David Caro: [C: 03+2] cloud: add tests for >buster os [puppet] - 10https://gerrit.wikimedia.org/r/891593 (owner: 10David Caro) [10:56:45] (03PS3) 10ArielGlenn: Add dumpsdata1004 and dumpsdata1005 as spare dumps hosts and rsync 
pullers [puppet] - 10https://gerrit.wikimedia.org/r/892033 (https://phabricator.wikimedia.org/T330573) [10:59:28] (03CR) 10ArielGlenn: [C: 03+2] Add dumpsdata1004 and dumpsdata1005 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/892033 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [10:59:54] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1003'] [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1100) [11:04:25] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) [11:04:26] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1003'] [11:04:31] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) Circular replication is now enabled (T330619) everywhere where it is supposed to be. It is one of our pr... [11:05:51] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) >>! In T330302#8648213, @Marostegui wrote: > Circular replication is now enabled (T330619) everywhe... 
[11:05:59] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [11:07:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [11:08:25] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [11:08:37] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) It is probably something we still need to test before the switch anyways, as it is key, especially for t... [11:10:03] (03CR) 10CI reject: [V: 04-1] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup) [11:10:52] !log rsync private xmldatadumps dir from dumpsdata1003 to dumpsdata1004; running from ariel screen session on dumpsdata1003, no bandwidth cap [11:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10jbond) >>! In T326848#8645012, @Papaul wrote: > @jbond > ` > poweredge-r450: picking DellDriverCategory.BIOS update file > We have found multiple ent... [11:15:09] (03PS1) 10Btullis: Add keydata for ceph mgr daemons [labs/private] - 10https://gerrit.wikimedia.org/r/892388 (https://phabricator.wikimedia.org/T324660) [11:15:45] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add keydata for ceph mgr daemons [labs/private] - 10https://gerrit.wikimedia.org/r/892388 (https://phabricator.wikimedia.org/T324660) (owner: 10Btullis) [11:16:04] (03CR) 10Ladsgroup: "This change is ready for review." 
[extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891984 (owner: 10Ladsgroup) [11:16:46] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) We can probably just run `sre.switchdc.mediawiki.03-set-db-readonly` and `sre.switchdc.mediawiki.06... [11:17:26] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Marostegui) That works for me :) We might need to make a not that having circular replication is a hard dependency [11:20:20] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [11:21:28] (03CR) 10EoghanGaffney: [C: 03+2] Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/891863 (https://phabricator.wikimedia.org/T329931) (owner: 10EoghanGaffney) [11:22:08] PROBLEM - Host ms-fe2013 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:22] hmmm Emperor ^^ :? [11:23:18] vgutierrez: it's being worked on and isn't in service [11:23:28] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) Added https://wikitech.wikimedia.org/wiki/Switch_Datacenter#03-set-db-readonly as well as a note in... 
[11:23:40] ack, I've missed the SAL entry, sorry [11:23:58] j.bond is working on it, I've asked him to downtime it in the mean time :) [11:24:06] (03CR) 10Ladsgroup: [C: 03+2] .nvmrc: Update to 16.19.1 after CI update [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891984 (owner: 10Ladsgroup) [11:27:08] (03PS1) 10Muehlenhoff: Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/892389 (https://phabricator.wikimedia.org/T330129) [11:28:10] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on ms-fe2013.codfw.wmnet with reason: testing redfish T326848 [11:28:14] T326848: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 [11:28:25] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-fe2013.codfw.wmnet with reason: testing redfish T326848 [11:29:30] !log rsync public (huge!) xmldatadumps dir from dumpsdata1003 to dumpsdata1004; running from ariel screen session on dumpsdata1003, no bandwidth cap [11:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44759 and previous config saved to /var/cache/conftool/dbconfig/20230227-112937-root.json [11:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P44760 and previous config saved to /var/cache/conftool/dbconfig/20230227-113130-root.json [11:34:26] (03CR) 10EoghanGaffney: [C: 03+2] Update records for gitlab from gitlab1004 -> gitlab2002 [dns] - 10https://gerrit.wikimedia.org/r/891888 (https://phabricator.wikimedia.org/T329931) (owner: 10EoghanGaffney) [11:35:45] (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:37:26] RECOVERY - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is OK: OK - Certificate gitlab.wikimedia.org will expire on Mon 01 May 2023 06:51:05 PM GMT +0000. https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:38:42] (03Merged) 10jenkins-bot: .nvmrc: Update to 16.19.1 after CI update [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891984 (owner: 10Ladsgroup) [11:39:58] (03PS3) 10Ladsgroup: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) [11:40:01] (03CR) 10Ladsgroup: [C: 03+2] Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup) [11:42:25] !log hnowlan@deploy1002 Started deploy [restbase/deploy@bcb0a69]: Add azwikimedia T317120 [11:42:30] T317120: Add azwikimedia to RESTBase - https://phabricator.wikimedia.org/T317120 [11:42:54] jouncebot: next [11:42:54] In 2 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1400) [11:43:50] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@bcb0a69]: Add azwikimedia T317120 (duration: 01m 25s) [11:44:03] Superpes: hi, are you planning on scheduling T330470? 
[11:44:04] T330470: New protection level and right to edit through it for eswiki - https://phabricator.wikimedia.org/T330470 [11:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P44761 and previous config saved to /var/cache/conftool/dbconfig/20230227-114442-root.json [11:45:07] herzog Yep In the afternoon or evening (or maybe tomorrow - based on commitments in RL) :P [11:45:29] Superpes: copy, I was asked :) [11:45:47] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 [11:45:50] I will tell em to take a cup of tea [11:46:04] Well there was the weekend in between otherwise I would have already scheduled it lol :P [11:46:34] (03CR) 10Vgutierrez: [C: 03+2] varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - 10https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: 10Legoktm) [11:47:10] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 141880 bytes in 1.778 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:48:56] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:49:36] !log set "X-Content-Type-Options: nosniff" on upload.wm.o requests - T309787 [11:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:41] T309787: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 [11:50:45] (JobUnavailable) firing: (2) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:04] PROBLEM - Check systemd state on 
dumpsdata1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:28] RECOVERY - Host ms-fe2013 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [11:51:59] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [11:53:36] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [11:55:13] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) [11:55:52] (03Merged) 10jenkins-bot: Completely get rid of responsiveimages removal [extensions/MobileFrontend] (wmf/1.40.0-wmf.24) - 10https://gerrit.wikimedia.org/r/891982 (https://phabricator.wikimedia.org/T326147) (owner: 10Ladsgroup) [11:56:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10jbond) @MatthewVernon the firmware, bios and network have all been upgraded so should be good to procead [11:58:43] 10SRE, 10MediaWiki-File-management, 10Traffic, 10MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), and 2 others: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (10Vgutierrez) ` vgutierrez@cp6001:~$ curl -H 'Host: upload.wikimedia.org' -k https://127.0.0.1/favicon.ico -s -v -o /dev/null 2>&1... 
[11:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44762 and previous config saved to /var/cache/conftool/dbconfig/20230227-115947-root.json [12:00:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P44763 and previous config saved to /var/cache/conftool/dbconfig/20230227-120002-root.json [12:00:28] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:13] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: March 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert) [12:04:55] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [12:05:45] (JobUnavailable) resolved: (2) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:05:46] (03PS2) 10Clément Goubert: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/891554 (https://phabricator.wikimedia.org/T330650) [12:06:04] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:24] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [12:08:41] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [12:08:47] 10SRE, 
10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [12:09:01] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) p:05Triage→03High [12:09:52] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [12:10:02] (03PS2) 10Clément Goubert: wmnet: Switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/892372 (https://phabricator.wikimedia.org/T330651) [12:10:09] (03PS3) 10Clément Goubert: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) [12:10:36] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: host still been configuered - T327970 [12:10:41] T327970: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 [12:10:51] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: host still been configuered - T327970 [12:10:59] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert) [12:11:17] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [12:12:20] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Traffic Switchover checklist - https://phabricator.wikimedia.org/T330650 (10Clement_Goubert) p:05Triage→03High [12:12:20] !log installing apr-util security 
updates on buster [12:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P44764 and previous config saved to /var/cache/conftool/dbconfig/20230227-121846-root.json [12:21:31] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [12:21:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44765 and previous config saved to /var/cache/conftool/dbconfig/20230227-122131-root.json [12:22:56] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [12:23:16] ACKNOWLEDGEMENT - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service,monitor_refine_eventlogging_legacy.service John Bond T330652 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:19] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [12:25:27] (03CR) 10Clément Goubert: [C: 04-2] "Preparation for 2023-03-01 mediawiki switchover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892428 (https://phabricator.wikimedia.org/T327920) (owner: 10Clément Goubert) [12:26:04] RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1200 db1111 db1168 db1143 T330653', diff saved to https://phabricator.wikimedia.org/P44766 and previous config saved to 
/var/cache/conftool/dbconfig/20230227-122804-root.json [12:28:09] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [12:31:36] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [12:34:31] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [12:34:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44767 and previous config saved to /var/cache/conftool/dbconfig/20230227-123447-root.json [12:34:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44768 and previous config saved to /var/cache/conftool/dbconfig/20230227-123454-root.json [12:34:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44769 and previous config saved to /var/cache/conftool/dbconfig/20230227-123459-root.json [12:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 T330653', diff saved to https://phabricator.wikimedia.org/P44770 and previous config saved to /var/cache/conftool/dbconfig/20230227-123514-root.json [12:35:18] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [12:36:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P44771 and previous config saved to /var/cache/conftool/dbconfig/20230227-123636-root.json [12:37:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44772 and previous config saved to /var/cache/conftool/dbconfig/20230227-123701-root.json [12:37:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 
(re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44773 and previous config saved to /var/cache/conftool/dbconfig/20230227-123742-root.json [12:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 es2022 T330653', diff saved to https://phabricator.wikimedia.org/P44774 and previous config saved to /var/cache/conftool/dbconfig/20230227-123814-root.json [12:39:21] ACKNOWLEDGEMENT - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafkatee.service John Bond T330654 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44775 and previous config saved to /var/cache/conftool/dbconfig/20230227-124050-root.json [12:41:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44776 and previous config saved to /var/cache/conftool/dbconfig/20230227-124100-root.json [12:41:32] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:33] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [12:45:20] (03PS1) 10Muehlenhoff: Add library hint for apr-util [puppet] - 10https://gerrit.wikimedia.org/r/892436 [12:48:56] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44777 and previous config saved to 
/var/cache/conftool/dbconfig/20230227-124952-root.json [12:49:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44778 and previous config saved to /var/cache/conftool/dbconfig/20230227-124959-root.json [12:50:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44779 and previous config saved to /var/cache/conftool/dbconfig/20230227-125003-root.json [12:51:07] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for apr-util [puppet] - 10https://gerrit.wikimedia.org/r/892436 (owner: 10Muehlenhoff) [12:51:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44780 and previous config saved to /var/cache/conftool/dbconfig/20230227-125141-root.json [12:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44781 and previous config saved to /var/cache/conftool/dbconfig/20230227-125206-root.json [12:52:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44782 and previous config saved to /var/cache/conftool/dbconfig/20230227-125247-root.json [12:54:32] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:35] !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part I (T308932) synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [12:55:39] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 
[12:55:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44783 and previous config saved to /var/cache/conftool/dbconfig/20230227-125555-root.json [12:56:00] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Joe) We should probably test that both scap works and a scap3 deployment also works (e.g. `docker-pkg`) when we've migrated the deployment server.... [12:56:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44784 and previous config saved to /var/cache/conftool/dbconfig/20230227-125605-root.json [12:56:55] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/891827 (https://phabricator.wikimedia.org/T329443) (owner: 10Majavah) [12:59:28] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/891826 (owner: 10Majavah) [13:02:00] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44785 and previous config saved to /var/cache/conftool/dbconfig/20230227-130457-root.json [13:05:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44786 and previous config saved to /var/cache/conftool/dbconfig/20230227-130503-root.json [13:05:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44787 and previous config saved to 
/var/cache/conftool/dbconfig/20230227-130508-root.json [13:05:40] !log installing openssl security updates on Buster [13:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44788 and previous config saved to /var/cache/conftool/dbconfig/20230227-130646-root.json [13:07:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44789 and previous config saved to /var/cache/conftool/dbconfig/20230227-130711-root.json [13:07:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44790 and previous config saved to /var/cache/conftool/dbconfig/20230227-130752-root.json [13:08:38] (03PS1) 10ArielGlenn: for dumpsdata1004,5 use the partman recipe for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/892437 (https://phabricator.wikimedia.org/T330573) [13:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44791 and previous config saved to /var/cache/conftool/dbconfig/20230227-131100-root.json [13:14:58] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:17] (03PS1) 10Jbond: sre.hosts.reimage: uses networking.mac [cookbooks] - 10https://gerrit.wikimedia.org/r/892438 [13:19:34] (03PS2) 10Jbond: sre.hosts.reimage: uses networking.mac [cookbooks] - 10https://gerrit.wikimedia.org/r/892438 [13:20:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44793 and previous config saved to /var/cache/conftool/dbconfig/20230227-132002-root.json 
[13:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44794 and previous config saved to /var/cache/conftool/dbconfig/20230227-132008-root.json [13:20:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44795 and previous config saved to /var/cache/conftool/dbconfig/20230227-132013-root.json [13:21:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10JMeybohm) [13:21:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44796 and previous config saved to /var/cache/conftool/dbconfig/20230227-132151-root.json [13:22:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44797 and previous config saved to /var/cache/conftool/dbconfig/20230227-132215-root.json [13:22:30] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. 
We always want to issue this for the primary interface (one that has done DHCP), so if facter will take one with GW it should be sa" [cookbooks] - 10https://gerrit.wikimedia.org/r/892438 (owner: 10Jbond) [13:22:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44798 and previous config saved to /var/cache/conftool/dbconfig/20230227-132257-root.json [13:25:39] (03CR) 10Jelto: [C: 03+2] scap: add required Python3 venv package [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [13:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44799 and previous config saved to /var/cache/conftool/dbconfig/20230227-132605-root.json [13:26:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10JMeybohm) Adding @KFrancis for signing NDA [13:26:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44800 and previous config saved to /var/cache/conftool/dbconfig/20230227-132615-root.json [13:30:07] !log ladsgroup@deploy1002 sync-file aborted: Completely get rid of responsiveimages removal, part I (T308932) (duration: 44m 38s) [13:30:11] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [13:32:06] !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part I (T326147) synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:32:11] T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [13:32:32] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depool db2175 T330653', diff saved to https://phabricator.wikimedia.org/P44805 and previous config saved to /var/cache/conftool/dbconfig/20230227-133231-root.json [13:32:36] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [13:32:57] (03CR) 10Jelto: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39845/console" [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [13:35:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44808 and previous config saved to /var/cache/conftool/dbconfig/20230227-133506-root.json [13:35:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44809 and previous config saved to /var/cache/conftool/dbconfig/20230227-133506-root.json [13:35:13] (03CR) 10Jelto: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39846/console" [puppet] - 10https://gerrit.wikimedia.org/r/891837 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [13:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44810 and previous config saved to /var/cache/conftool/dbconfig/20230227-133513-root.json [13:35:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44811 and previous config saved to /var/cache/conftool/dbconfig/20230227-133518-root.json [13:36:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44813 and previous config saved to 
/var/cache/conftool/dbconfig/20230227-133657-root.json [13:37:17] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) (owner: 10Jbond) [13:37:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44814 and previous config saved to /var/cache/conftool/dbconfig/20230227-133720-root.json [13:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44815 and previous config saved to /var/cache/conftool/dbconfig/20230227-133801-root.json [13:39:51] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Volans) [nit] the `enable-puppet` + `run-puppet-agent` can be simplified with `run-puppet-agent --enable "reason"`. [13:40:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2122 T330653', diff saved to https://phabricator.wikimedia.org/P44817 and previous config saved to /var/cache/conftool/dbconfig/20230227-134018-root.json [13:40:23] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [13:41:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44818 and previous config saved to /var/cache/conftool/dbconfig/20230227-134110-root.json [13:41:20] !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.24/extensions/MobileFrontend/extension.json: Completely get rid of responsiveimages removal, part I (T326147) (duration: 10m 48s) [13:41:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44819 and previous config saved to /var/cache/conftool/dbconfig/20230227-134120-root.json [13:41:24] T326147: Stop 
fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [13:42:09] (03PS1) 10Cathal Mooney: Move execution of clear_dhcp_cache() until after Puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) [13:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44821 and previous config saved to /var/cache/conftool/dbconfig/20230227-134405-root.json [13:47:12] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10Volans) @MoritzMuehlenhoff in theory no, the makevm cookbook should call the reimage one directly and do all a... [13:47:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44823 and previous config saved to /var/cache/conftool/dbconfig/20230227-134753-root.json [13:47:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44824 and previous config saved to /var/cache/conftool/dbconfig/20230227-134756-root.json [13:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44825 and previous config saved to /var/cache/conftool/dbconfig/20230227-135010-root.json [13:50:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44826 and previous config saved to /var/cache/conftool/dbconfig/20230227-135011-root.json [13:50:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44827 and previous config saved to 
/var/cache/conftool/dbconfig/20230227-135018-root.json [13:50:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44828 and previous config saved to /var/cache/conftool/dbconfig/20230227-135023-root.json [13:50:32] !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part II (T326147) synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:50:37] T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [13:52:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44830 and previous config saved to /var/cache/conftool/dbconfig/20230227-135202-root.json [13:52:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44831 and previous config saved to /var/cache/conftool/dbconfig/20230227-135225-root.json [13:52:37] (03PS1) 10Jbond: standard_packages: also manage the rasdaemon service [puppet] - 10https://gerrit.wikimedia.org/r/892444 [13:53:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44832 and previous config saved to /var/cache/conftool/dbconfig/20230227-135306-root.json [13:54:29] (03PS2) 10Jbond: standard_packages: also manage the rasdaemon service [puppet] - 10https://gerrit.wikimedia.org/r/892444 [13:55:25] (03PS2) 10Cathal Mooney: Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) [13:56:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44833 and previous 
config saved to /var/cache/conftool/dbconfig/20230227-135615-root.json [13:56:22] RECOVERY - Check systemd state on dumpsdata1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44834 and previous config saved to /var/cache/conftool/dbconfig/20230227-135625-root.json [13:56:27] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10MoritzMuehlenhoff) >>! In T306661#8648754, @Volans wrote: > @MoritzMuehlenhoff in theory no, the makevm cookbo... [13:56:34] !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.24/extensions/MobileFrontend/includes/MobileFrontendHooks.php: Completely get rid of responsiveimages removal, part II (T326147) (duration: 07m 24s) [13:56:38] T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [13:57:38] (03CR) 10Volans: "LGTM, question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [13:58:37] !log restarting apache on mw canaries to pick up apr-util updates [13:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2178 db2146 db2180 T330653', diff saved to https://phabricator.wikimedia.org/P44835 and previous config saved to /var/cache/conftool/dbconfig/20230227-135856-root.json [13:59:01] T330653: Upgrade 10.6.10 hosts to 10.6.12 - https://phabricator.wikimedia.org/T330653 [13:59:04] !log ladsgroup@deploy1002 ladsgroup: Completely get rid of responsiveimages removal, part III (T326147) synced to the testservers: mwdebug2002.codfw.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:59:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44836 and previous config saved to /var/cache/conftool/dbconfig/20230227-135910-root.json [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1400). nyaa~ [14:00:04] No Gerrit patches in the queue for this window AFAICS. [14:02:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44837 and previous config saved to /var/cache/conftool/dbconfig/20230227-140244-root.json [14:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44838 and previous config saved to /var/cache/conftool/dbconfig/20230227-140249-root.json [14:02:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44839 and previous config saved to /var/cache/conftool/dbconfig/20230227-140255-root.json [14:03:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44840 and previous config saved to /var/cache/conftool/dbconfig/20230227-140301-root.json [14:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44841 and previous config saved to /var/cache/conftool/dbconfig/20230227-140310-root.json [14:04:22] (03PS3) 10Cathal Mooney: Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) [14:05:05] 
10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10Volans) Right, and also re-thinking about it given that VMs can't change cluster currently and we don't use ot... [14:05:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44842 and previous config saved to /var/cache/conftool/dbconfig/20230227-140515-root.json [14:05:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44843 and previous config saved to /var/cache/conftool/dbconfig/20230227-140523-root.json [14:05:26] !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.24/extensions/MobileFrontend/includes/MobileContext.php: Completely get rid of responsiveimages removal, part III (T326147) (duration: 07m 36s) [14:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44844 and previous config saved to /var/cache/conftool/dbconfig/20230227-140527-root.json [14:05:31] T326147: Stop fragmenting ParserCache entries for mobile frontend - https://phabricator.wikimedia.org/T326147 [14:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44845 and previous config saved to /var/cache/conftool/dbconfig/20230227-140707-root.json [14:07:09] (03CR) 10Cathal Mooney: sre.hosts.reimage: uses networking.mac [cookbooks] - 10https://gerrit.wikimedia.org/r/892438 (owner: 10Jbond) [14:08:02] (03Abandoned) 10Cathal Mooney: sre.hosts.reimage: uses networking.mac [cookbooks] - 10https://gerrit.wikimedia.org/r/892438 (owner: 10Jbond) [14:08:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling', diff 
saved to https://phabricator.wikimedia.org/P44846 and previous config saved to /var/cache/conftool/dbconfig/20230227-140811-root.json [14:08:26] (03CR) 10Cathal Mooney: Move execution of clear_dhcp_cache() and use default facter MAC (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [14:08:31] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [14:08:53] (03CR) 10David Caro: [C: 03+1] "LGTM, 👍" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [14:09:44] (03CR) 10Cathal Mooney: [C: 03+2] Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [14:09:56] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10Jclark-ctr) 05Open→03Resolved rebalanced power [14:11:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44847 and previous config saved to /var/cache/conftool/dbconfig/20230227-141120-root.json [14:11:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44848 and previous config saved to /var/cache/conftool/dbconfig/20230227-141130-root.json [14:11:36] (03Merged) 10jenkins-bot: Move execution of clear_dhcp_cache() and use default facter MAC [cookbooks] - 10https://gerrit.wikimedia.org/r/892442 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [14:11:50] RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:48] (03PS1) 10Raymond 
Ndibe: puppet: update firewall rules for cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/892446 (https://phabricator.wikimedia.org/T303663) [14:13:55] (03CR) 10Bking: [C: 03+2] dse-k8s: raise memory for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [14:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44849 and previous config saved to /var/cache/conftool/dbconfig/20230227-141415-root.json [14:14:43] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [14:14:52] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [14:16:20] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [14:17:14] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:17:37] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (10RobH) [14:17:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) 05Open→03In progress a:05Jclark-ctr→03RobH If I have an overwhelming number of notifications in a short period (seems I did around January 18th) I may miss... 
[14:17:47] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:17:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44850 and previous config saved to /var/cache/conftool/dbconfig/20230227-141749-root.json [14:17:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44851 and previous config saved to /var/cache/conftool/dbconfig/20230227-141754-root.json [14:18:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44852 and previous config saved to /var/cache/conftool/dbconfig/20230227-141800-root.json [14:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44853 and previous config saved to /var/cache/conftool/dbconfig/20230227-141806-root.json [14:18:07] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:18:09] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:18:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44854 and previous config saved to /var/cache/conftool/dbconfig/20230227-141815-root.json [14:18:18] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:18:30] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:18:52] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [14:19:20] (03Merged) 10jenkins-bot: dse-k8s: raise memory for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/891577 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [14:20:21] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44855 and previous config saved to /var/cache/conftool/dbconfig/20230227-142020-root.json [14:21:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Jhancock.wm) @Papaul network cable was reseated and showing as connected now on wdqs2022. [14:21:25] (03PS1) 10AikoChou: httpbb: update tests for revert-risk and outlink model [puppet] - 10https://gerrit.wikimedia.org/r/892448 (https://phabricator.wikimedia.org/T327787) [14:22:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [14:23:06] (03PS2) 10AikoChou: httpbb: update tests for revert-risk and outlink model [puppet] - 10https://gerrit.wikimedia.org/r/892448 (https://phabricator.wikimedia.org/T327787) [14:23:50] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) [14:27:20] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) (owner: 10Jbond) [14:28:34] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage [14:29:00] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: incorrect error status [cookbooks] - 10https://gerrit.wikimedia.org/r/892393 (https://phabricator.wikimedia.org/T326848) (owner: 10Jbond) [14:29:10] (03CR) 10Jbond: [C: 03+2] differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [14:29:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 
25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44857 and previous config saved to /var/cache/conftool/dbconfig/20230227-142919-root.json [14:30:50] (03CR) 10Elukey: [C: 03+2] httpbb: update tests for revert-risk and outlink model [puppet] - 10https://gerrit.wikimedia.org/r/892448 (https://phabricator.wikimedia.org/T327787) (owner: 10AikoChou) [14:30:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) @Jhancock.wm thank you I have also wdqs2015 see my comment on the 23rd. Thanks [14:31:30] (03Merged) 10jenkins-bot: differ: add more types to the exclusion of core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/891844 (https://phabricator.wikimedia.org/T330484) (owner: 10Jbond) [14:31:30] jouncebot: nowandnext [14:31:30] For the next 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1400) [14:31:30] In 1 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1630) [14:32:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage [14:32:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44858 and previous config saved to /var/cache/conftool/dbconfig/20230227-143254-root.json [14:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44859 and previous config saved to /var/cache/conftool/dbconfig/20230227-143259-root.json [14:33:01] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797 [14:33:05] !log 
jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797 [14:33:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44860 and previous config saved to /var/cache/conftool/dbconfig/20230227-143305-root.json [14:33:07] T320797: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 [14:33:11] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797 [14:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44861 and previous config saved to /var/cache/conftool/dbconfig/20230227-143311-root.json [14:33:15] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on idm2001.wikimedia.org with reason: host still been configuered - T320797 [14:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44862 and previous config saved to /var/cache/conftool/dbconfig/20230227-143321-root.json [14:33:47] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:34:18] !log live testing sre.switchdc.mediawiki.03-set-db-readonly and sre.switchdc.mediawiki.06-set-db-readwrite back to back - T330302 [14:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:23] T330302: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 [14:34:28] !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:34:59] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:35:01] !log 
cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:35:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:35:13] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39850/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [14:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44863 and previous config saved to /var/cache/conftool/dbconfig/20230227-143525-root.json [14:35:51] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:35:53] !log done live testing sre.switchdc.mediawiki.03-set-db-readonly and sre.switchdc.mediawiki.06-set-db-readwrite back to back - T330302 [14:35:57] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:21] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover: sre.switchdc.mediawiki.03-set-db-readonly fails in live-test mode - https://phabricator.wikimedia.org/T330302 (10Clement_Goubert) 05Open→03Resolved Looks good, resolving. 
[14:37:24] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) [14:37:57] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:38:21] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover live test - https://phabricator.wikimedia.org/T330271 (10Clement_Goubert) 05In progress→03Resolved All code paths exercised and fixes applied and tested. Resolving. [14:38:31] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [14:38:47] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:38:54] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) 05In progress→03Resolved All code paths exercised for multi-DC, fixes applied and working. Resolving. [14:39:06] (03CR) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [14:39:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) 05Open→03Resolved All blockers resolved. 
[14:43:24] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [14:43:52] (03PS24) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [14:44:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44864 and previous config saved to /var/cache/conftool/dbconfig/20230227-144424-root.json [14:45:12] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-etcd2001.codfw.wmnet with OS bullseye [14:45:54] Anyone is around for a deployment? Sorry I just got home :/ [14:46:01] RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44865 and previous config saved to /var/cache/conftool/dbconfig/20230227-144759-root.json [14:48:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44866 and previous config saved to /var/cache/conftool/dbconfig/20230227-144804-root.json [14:48:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44867 and previous config saved to /var/cache/conftool/dbconfig/20230227-144810-root.json [14:48:13] Superpes: I’m around, but there’s nothing in the calendar afaict [14:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44868 and previous config saved to /var/cache/conftool/dbconfig/20230227-144816-root.json 
[14:48:21] RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44869 and previous config saved to /var/cache/conftool/dbconfig/20230227-144826-root.json [14:49:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye [14:49:57] Lucas_WMDE Yes, I know, I didn't add anything because I didn't know if I could be here in time :( [14:50:10] if it’s a config change I can probably deploy it [14:50:13] (03PS1) 10Nicolas Fraison: Failover hive to standby server [dns] - 10https://gerrit.wikimedia.org/r/892460 (https://phabricator.wikimedia.org/T303168) [14:50:19] probably not enough time for a backport gate-and-submit now though [14:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44870 and previous config saved to /var/cache/conftool/dbconfig/20230227-145030-root.json [14:51:29] Oh, thanks, so it's probably better to schedule it for another window [14:52:13] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:52:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Ottomata) Approved. [14:52:21] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... 
[14:52:30] (03PS1) 10Elukey: role::etcd::v3::ml_etcd: set cluster status to existing [puppet] - 10https://gerrit.wikimedia.org/r/892462 (https://phabricator.wikimedia.org/T330662) [14:52:34] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:52:41] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [14:53:41] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:56] (03CR) 10Klausman: [C: 03+1] role::etcd::v3::ml_etcd: set cluster status to existing [puppet] - 10https://gerrit.wikimedia.org/r/892462 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey) [14:54:09] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye [14:54:46] (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd: set cluster status to existing [puppet] - 10https://gerrit.wikimedia.org/r/892462 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey) [14:54:58] (KubernetesCalicoDown) firing: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:56:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd2001.codfw.wmnet with reason: host reimage [14:56:30] 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10MoritzMuehlenhoff) [14:58:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install 
dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ArielGlenn) Awesome, I would have looked for you on irc in a few days if I hadn't heard anything, no worries. Happy to see this moving along! [14:59:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44871 and previous config saved to /var/cache/conftool/dbconfig/20230227-145929-root.json [15:01:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd2001.codfw.wmnet with reason: host reimage [15:02:06] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. On a totally unrelated, yet important note: It seems this should have been ge buster, I guess we need rasdaemon Bullseye as we" [puppet] - 10https://gerrit.wikimedia.org/r/892444 (owner: 10Jbond) [15:03:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44872 and previous config saved to /var/cache/conftool/dbconfig/20230227-150304-root.json [15:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44873 and previous config saved to /var/cache/conftool/dbconfig/20230227-150309-root.json [15:03:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44874 and previous config saved to /var/cache/conftool/dbconfig/20230227-150315-root.json [15:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44875 and previous config saved to /var/cache/conftool/dbconfig/20230227-150322-root.json [15:03:32] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44876 and previous config saved to /var/cache/conftool/dbconfig/20230227-150331-root.json [15:04:32] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [15:04:37] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... [15:05:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) LVM data still exists on disks from a previous failed install attempt and the dd method didn't seem to remove, suspended instllation on dumpsdata1006 and set it to... [15:05:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44877 and previous config saved to /var/cache/conftool/dbconfig/20230227-150535-root.json [15:06:01] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1003.eqiad.wmnet with OS bullseye [15:06:19] 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10jbond) > The wmf-auto-restart failure is ultimately fallout from earlier failures of Exim itself should we create a new task to add a proper systemd unit file for exim. as this did... 
[15:08:15] (03PS1) 10Elukey: role::etcd::v3::ml_etcd: set the new discovery records [puppet] - 10https://gerrit.wikimedia.org/r/892466 (https://phabricator.wikimedia.org/T330662) [15:08:17] (03CR) 10Hnowlan: [C: 03+2] Remove vendored thumbor-community-core [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888210 (owner: 10Hnowlan) [15:08:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [15:08:46] (03CR) 10Klausman: [C: 03+1] role::etcd::v3::ml_etcd: set the new discovery records [puppet] - 10https://gerrit.wikimedia.org/r/892466 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey) [15:08:50] (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd: set the new discovery records [puppet] - 10https://gerrit.wikimedia.org/r/892466 (https://phabricator.wikimedia.org/T330662) (owner: 10Elukey) [15:10:29] (03CR) 10Hnowlan: service, k8s: Add service definitions for rest-gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [15:11:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [15:11:58] !log bking@deploy1002 applying https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/891577 on dse-k8s-cluster via helmfile [15:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:24] (03PS1) 10Superpes15: [extwiki] Change wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) [15:12:26] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:12:27] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:13:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:13:36] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:14:17] (03CR) 10Jforrester: [C: 03+1] filebackend: Replace stringified class names with ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891962 (owner: 10Reedy) [15:14:22] (03CR) 10Jforrester: [C: 03+1] filebackend: Opinionated reformatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [15:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44878 and previous config saved to /var/cache/conftool/dbconfig/20230227-151434-root.json [15:14:42] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [15:15:07] (03PS2) 10Hnowlan: WIP: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [15:15:44] (03Merged) 10jenkins-bot: Remove vendored thumbor-community-core [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888210 (owner: 10Hnowlan) [15:15:49] (03CR) 10Superpes15: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: 10Superpes15) [15:17:17] (03CR) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [15:18:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44880 and previous config saved to /var/cache/conftool/dbconfig/20230227-151808-root.json [15:18:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44881 and previous config saved to /var/cache/conftool/dbconfig/20230227-151813-root.json [15:18:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44882 and previous config saved to /var/cache/conftool/dbconfig/20230227-151819-root.json [15:18:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44883 and previous config saved to /var/cache/conftool/dbconfig/20230227-151826-root.json [15:18:28] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1003.eqiad.wmnet with reason: host reimage [15:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44884 and previous config saved to /var/cache/conftool/dbconfig/20230227-151836-root.json [15:18:59] (03CR) 10CI reject: [V: 04-1] WIP: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:19:17] (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver 
GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [15:19:38] PROBLEM - Check systemd state on ml-etcd1003 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:20] this is probably me, checking --^ [15:20:24] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [15:20:44] PROBLEM - Check systemd state on ml-etcd2003 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:31] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1003.eqiad.wmnet with reason: host reimage [15:21:40] RECOVERY - Etcd cluster health on ml-etcd1003 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:23:44] (03PS3) 10Superpes15: [extwiki] Change wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) [15:24:18] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [15:24:32] !log cgoubert@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:24:36] !log cgoubert@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:24:45] (JobUnavailable) firing: Reduced availability for job ml_etcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:25:26] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aqu-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:27:40] RECOVERY - Check systemd state on ml-etcd2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:30] PROBLEM - Host ms-fe2013 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:45] (JobUnavailable) resolved: Reduced availability for job ml_etcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1007.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:58] (KubernetesAPILatency) firing: (18) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:31:10] PROBLEM - Check systemd state on ml-etcd1001 is CRITICAL: CRITICAL - degraded: The following units failed: etcd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:38] (03PS3) 10Urbanecm: cswiki: Grant changetags only to bots/sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) [15:31:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) (owner: 10Urbanecm) [15:32:33] (03Merged) 10jenkins-bot: cswiki: Grant changetags only to bots/sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891503 (https://phabricator.wikimedia.org/T330383) (owner: 10Urbanecm) [15:32:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:891503|cswiki: Grant changetags only to bots/sysops (T330383)]] [15:32:54] T330383: Remove changetags from user at cswiki - https://phabricator.wikimedia.org/T330383 [15:33:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44886 and previous config saved to /var/cache/conftool/dbconfig/20230227-153313-root.json [15:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44887 and previous config saved to /var/cache/conftool/dbconfig/20230227-153318-root.json [15:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44888 and previous config saved to /var/cache/conftool/dbconfig/20230227-153324-root.json [15:34:06] RECOVERY - Check systemd 
state on ml-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:34] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:891503|cswiki: Grant changetags only to bots/sysops (T330383)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:34:52] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [15:34:58] (KubernetesAPILatency) firing: (25) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:35:50] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye [15:36:19] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001" [15:37:01] 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10MoritzMuehlenhoff) >>! In T330660#8649059, @jbond wrote: > should we create a new task to add a proper systemd unit file for exim. as this did not show in icinga or systemd status d... 
[15:37:16] 10SRE, 10Traffic: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10Vgutierrez) 05Stalled→03In progress yes, it's currently running on cp4045 and I'm planning to extend the experiment to ulsfo tomorrow EU morning [15:37:52] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:39:37] (03CR) 10Ottomata: [C: 03+1] Update airflow conf compatibility with airflow 2.5.0 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [15:39:58] (KubernetesAPILatency) resolved: (34) High Kubernetes API latency (GET clusterinformations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:28] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:891503|cswiki: Grant changetags only to bots/sysops (T330383)]] (duration: 07m 39s) [15:40:33] T330383: Remove changetags from user at cswiki - https://phabricator.wikimedia.org/T330383 [15:40:48] PROBLEM - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:41:11] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001" [15:41:16] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1003.eqiad.wmnet with OS bullseye [15:41:42] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1004'] [15:41:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2015.codfw.wmnet with OS bullseye [15:41:56] jouncebot: next [15:41:56] In 0 hour(s) and 48 minute(s): Wikimedia Portals 
Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1630) [15:42:01] jouncebot: now [15:42:01] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [15:42:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye [15:42:56] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) [15:43:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) ` Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 6% Virtual Disk 239: RAID1, 446.625GB, Ready... [15:43:49] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye [15:44:19] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [15:44:22] RECOVERY - Check systemd state on ml-etcd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:02] PROBLEM - cassandra-a CQL 10.64.48.180:9042 on restbase1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:46:36] PROBLEM - cassandra-b CQL 10.64.48.181:9042 on restbase1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:47:19] (03PS2) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) [15:47:20] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following 
units failed: export_smart_data_dump.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:09] !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1004'] [15:48:46] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:50:36] 10SRE, 10Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660 (10jbond) >>! In T330660#8649219, @MoritzMuehlenhoff wrote: >>>! In T330660#8649059, @jbond wrote: >> should we create a new task to add a proper systemd unit file for exim. as this di... [15:52:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ml-etcd2001.codfw.wmnet with reason: etcd cluster upgrade failed, waiting for k8s upgrade [15:52:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-etcd2001.codfw.wmnet with reason: etcd cluster upgrade failed, waiting for k8s upgrade [15:52:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye [15:52:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye [15:52:38] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1004'] [15:56:25] !log elukey@cumin1001 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host ml-etcd2001.codfw.wmnet with OS bullseye [15:58:10] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:58:49] !log root@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1004'] [16:00:23] (03PS1) 10Elukey: role::ml_k8s::{master,worker}: upgrade ml-serve-codfw to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) [16:02:26] !log hashar@deploy1002 Started deploy [integration/docroot@cd7c263]: build: Pin PHPUnit to 9.5.28 like in other repos [16:02:38] !log hashar@deploy1002 Finished deploy [integration/docroot@cd7c263]: build: Pin PHPUnit to 9.5.28 like in other repos (duration: 00m 12s) [16:02:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage [16:03:09] (03PS1) 10Elukey: admin_ng: upgrade ml-serve-codfw's settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892483 (https://phabricator.wikimedia.org/T330669) [16:03:56] (03PS1) 10Vgutierrez: hiera: Enable haproxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944) [16:05:34] (03CR) 10Klausman: [C: 03+1] admin_ng: upgrade ml-serve-codfw's settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/892483 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey) [16:06:01] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage [16:06:02] (03CR) 10Klausman: [C: 03+1] role::ml_k8s::{master,worker}: upgrade ml-serve-codfw to k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/892482 (https://phabricator.wikimedia.org/T330669) (owner: 10Elukey) [16:06:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2015.codfw.wmnet with reason: host reimage [16:07:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[16:07:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:07:36] PROBLEM - Host dse-k8s-worker1008 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:36] PROBLEM - Host dse-k8s-worker1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:09] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) [16:08:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage [16:08:39] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1004.eqiad.wmnet with OS bullseye [16:08:42] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 (10ssingh) p:05Triage→03Medium [16:09:25] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39853/console" [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [16:09:48] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:51] ACKNOWLEDGEMENT - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aqu-singleuser-conda-analytics.service John Bond T330671 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:00] uh? [16:10:04] (03CR) 10Ssingh: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/892484 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [16:11:18] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:47] here [16:11:57] * jbond here [16:12:02] vgutierrez: I'm guessing not expected [16:12:14] * Emperor here [16:12:49] esams [16:12:57] but only ip6? 
[16:13:04] possibly allready cleared https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=1677513366216&to=1677514341111 [16:13:07] from AM seems so [16:13:14] * brett here (somehow still on call) [16:13:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2022.codfw.wmnet with reason: host reimage [16:14:16] esams v6 still looks only 50% available to me [16:14:48] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:27] Is it just probes or is there an actual problem? [16:15:42] quite a rise in slow-but-successful too [16:16:18] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2022.codfw.wmnet with reason: host reimage [16:19:48] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:23:58] (KubernetesCalicoDown) firing: (2) dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:25:12] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:25:54] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:17] (03PS1) 10Volans: sre.hosts.reimage: clear DHCP cache for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/892487 [16:26:23] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MPhamWMF) [16:28:29] (03CR) 10David Caro: "Looks ok to me, some questions about `Optional` there, and feel free to ignore any `nit` thingies." [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [16:28:46] RECOVERY - Host dse-k8s-worker1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:28:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage [16:30:04] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1630). 
nyaa~ [16:31:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage [16:32:07] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:33:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:33:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2015.codfw.wmnet with OS bullseye [16:33:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2015.codfw.wmnet with OS bullseye completed: - wdqs2015 (**PA... [16:38:24] q/a [16:40:34] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:18] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:39] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10Volans) @ssing 1) for the cookbooks all that I see is that they use the `A:dns-auth` cumin alias, so they will follow along. 2) for pywmflib there is a [[ https://gerrit.wi... 
[16:43:58] (KubernetesCalicoDown) firing: (2) dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:44:48] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:09] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:18] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:46:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2022.codfw.wmnet with OS bullseye [16:47:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye completed: - wdqs2022 (**PA... 
[16:47:18] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:47:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye [16:47:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) [16:48:05] (03PS1) 10Zabe: Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) [16:48:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) 05Open→03Resolved complete @bking @Gehel all yours [16:48:46] (03CR) 10CI reject: [V: 04-1] Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) (owner: 10Zabe) [16:49:14] (03PS2) 10Zabe: Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) [16:54:28] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1004.eqiad.wmnet with OS bullseye [16:54:33] 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (10Jelto) p:05Triage→03Medium a:03Arnoldokoth [16:54:52] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1004.eqiad.wmnet with OS bullseye [16:55:09] (03CR) 10AOkoth: [C: 03+1] Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/892389 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [17:03:57] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/892478 (https://phabricator.wikimedia.org/T330653) (owner: 10Marostegui) [17:04:48] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:49] 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10Jdlrobson) [17:09:39] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1004.eqiad.wmnet with reason: host reimage [17:10:18] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:09] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[17:12:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:12:45] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1004.eqiad.wmnet with reason: host reimage [17:14:48] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:18] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:53] jouncebot: nowandnext [17:15:53] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [17:15:53] In 0 hour(s) and 44 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800) [17:15:53] In 0 hour(s) and 44 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800) [17:16:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[17:16:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:19:30] (03CR) 10Zabe: [C: 03+2] Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) (owner: 10Zabe) [17:20:14] (03Merged) 10jenkins-bot: Initial configuration for gucwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892492 (https://phabricator.wikimedia.org/T321880) (owner: 10Zabe) [17:20:29] (03PS3) 10Dzahn: re-introduce webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/891384 (https://phabricator.wikimedia.org/T330090) [17:20:48] 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10Jdlrobson) @cwhite I might need your help with this sometime this week. 
[17:21:06] RECOVERY - Host dse-k8s-worker1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:22:18] !log create Wikipedia Wayuu # T321880 [17:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:22] T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880 [17:24:15] (03PS1) 10Dzahn: planet: add https://design.wikimedia.org/blog/feed.xml to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892510 [17:25:09] (03CR) 10Dzahn: [C: 03+2] planet: add https://design.wikimedia.org/blog/feed.xml to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892510 (owner: 10Dzahn) [17:26:46] !log zabe@deploy1002 Started scap: create gucwiki T321880 [17:27:00] PROBLEM - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:27:01] ACKNOWLEDGEMENT - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T330681 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:27:07] 10SRE, 10ops-codfw: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10ops-monitoring-bot) [17:28:33] !log zabe@deploy1002 zabe: create gucwiki T321880 synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [17:28:37] T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880 [17:28:50] (03CR) 10Dzahn: [C: 03+2] "please let me know if there are more URLs in https://gist.github.com/Krinkle/e0d13f84b91e829afffa7b27822482be or elsewhere that are Wikime" [puppet] - 10https://gerrit.wikimedia.org/r/892510 (owner: 10Dzahn) [17:29:14] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:29:44] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:33:24] (03CR) 10Dzahn: [C: 03+2] re-introduce webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/891384 
(https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [17:33:48] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:35:13] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [17:35:16] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [17:35:41] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:36:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [17:36:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:36:18] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [17:36:19] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [17:36:30] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [17:36:50] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [17:36:56] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [17:37:01] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [17:37:11] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [17:37:22] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[17:37:51] !log zabe@deploy1002 Finished scap: create gucwiki T321880 (duration: 11m 05s) [17:37:56] T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880 [17:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:38:01] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [17:38:02] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [17:38:25] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye [17:38:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1008.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1008.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:39:32] (03PS1) 10BCornwall: ntp/esams: set to dns3002 [dns] - 10https://gerrit.wikimedia.org/r/892516 [17:42:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[17:42:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:43:59] (03CR) 10BBlack: [C: 03+1] ntp/esams: set to dns3002 [dns] - 10https://gerrit.wikimedia.org/r/892516 (owner: 10BCornwall) [17:44:29] (03CR) 10BCornwall: [C: 03+2] ntp/esams: set to dns3002 [dns] - 10https://gerrit.wikimedia.org/r/892516 (owner: 10BCornwall) [17:44:34] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:50:02] (03PS1) 10Elukey: admin_ng: add istio network policy config for DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/892519 (https://phabricator.wikimedia.org/T330261) [17:50:06] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:50:48] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) I need the partman recipe for those nodes [17:51:39] (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [17:52:39] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891883 [17:52:41] (03CR) 10Zabe: 
[C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891883 (owner: 10Zabe) [17:53:26] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891883 (owner: 10Zabe) [17:53:46] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:54:53] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) [17:55:04] (03PS1) 10Elukey: role::dse_k8s::worker: update istio-cni version [puppet] - 10https://gerrit.wikimedia.org/r/892522 (https://phabricator.wikimedia.org/T330261) [17:55:10] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) p:05Triage→03Medium [17:55:43] (03CR) 10Elukey: [C: 03+2] admin_ng: add istio network policy config for DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/892519 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [17:56:18] (03CR) 10Elukey: [C: 03+2] role::dse_k8s::worker: update istio-cni version [puppet] - 10https://gerrit.wikimedia.org/r/892522 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800) [18:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T1800). 
[18:00:08] (03Merged) 10jenkins-bot: admin_ng: add istio network policy config for DSE [deployment-charts] - 10https://gerrit.wikimedia.org/r/892519 (https://phabricator.wikimedia.org/T330261) (owner: 10Elukey) [18:01:03] !log zabe@deploy1002 Synchronized wmf-config/interwiki.php: (no justification provided) (duration: 06m 54s) [18:01:10] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:01:18] whops forgot mentioning the patch [18:02:58] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:03:08] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [18:03:10] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [18:07:12] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@c8dc6d5]: cirrus namespaces: Work arround missing domain_name in upstream [18:08:31] zabe: you doing RESTBase and Pywikibot too? I can +2 the later [18:08:57] sure, can do [18:09:23] (03CR) 10BCornwall: [C: 03+1] "Thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/892362 (https://phabricator.wikimedia.org/T330405) (owner: 10Filippo Giunchedi) [18:09:42] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@c8dc6d5]: cirrus namespaces: Work arround missing domain_name in upstream (duration: 02m 29s) [18:10:22] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:10:25] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10wiki_willy) a:03Papaul [18:10:47] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10wiki_willy) a:03Papaul [18:11:21] maybe I am going to wait with RESTBase so that it can go together with gurwiki, since that is a bit of work to deploy [18:11:22] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330372 (10wiki_willy) a:03Papaul [18:12:43] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10wiki_willy) a:03Jclark-ctr [18:13:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10wiki_willy) a:03Jclark-ctr [18:13:43] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (10wiki_willy) a:03Jclark-ctr [18:15:14] ack, and since the wiki shouldn't be editted yet until they finish importing stuff it's not important [18:17:57] (03PS1) 10David Caro: cloudcephosd1004: use the right interface names [puppet] - 10https://gerrit.wikimedia.org/r/892526 (https://phabricator.wikimedia.org/T329502) [18:18:31] (03CR) 10David Caro: [C: 03+2] cloudcephosd1004: use the right interface names [puppet] - 10https://gerrit.wikimedia.org/r/892526 (https://phabricator.wikimedia.org/T329502) (owner: 10David Caro) [18:20:37] (03CR) 10David Caro: Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [18:24:47] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001" [18:28:48] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:29:19] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:29:48] !log start running "foreachwikiindblist s3.dblist migrateRevisionCommentTemp.php --sleep 2" in screen # T275246 [18:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:57] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [18:31:34] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-be nodes - pt1979@cumin2002" [18:31:34] (03PS1) 10Urbanecm: Enable Growth features by default on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529 [18:31:57] (03PS2) 10Urbanecm: Enable Growth features by default on testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529 [18:32:34] apparently I somehow managed to break my account on gucwiki :| [18:32:55] oh [18:36:06] zabe: what does that mean? and anything i can help with? 
[18:36:17] gucwiki works for me [18:37:01] when I try to login it throws TypeError: Argument 1 passed to MediaWiki\Extension\OATHAuth\Auth\SecondaryAuthenticationProvider::getProviderForModule() must be an instance of MediaWiki\Extension\OATHAuth\IModule, null given, [18:37:06] also my global userpage is gone [18:37:47] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001" [18:37:51] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1004.eqiad.wmnet with OS bullseye [18:38:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-be nodes - pt1979@cumin2002" [18:39:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:04] for the error: 2d75670c-06e7-4523-b81b-fb30cc8c96e2 [18:39:17] i tested it with my bot account, and i managed to log in [18:40:34] hmm, I still get it. [18:41:12] your row in oathauth_users seems OK [18:42:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [18:42:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:43:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[18:43:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:43:52] ok, clearing global user cache through shell.php worked [18:44:27] great [18:47:07] (03CR) 10Kosta Harlan: [C: 04-1] "Thanks! I want to make a phab task for this, for documentation and to share with the team. But once that is done, we could sync this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892529 (owner: 10Urbanecm) [18:47:32] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [18:47:33] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:51:00] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:56:23] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10bscarone) Thanks @MoritzMuehlenhoff, I am not being able to log in to JupyterHub, who should I contact regarding this issue? [18:58:16] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:58:23] zabe: do you know what happened to your account there? 
I'm worried some of my OATHAuth changes broke something [18:58:51] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10Urbanecm) 05Resolved→03Open According to [LDAP tool](https://ldap.toolforge.org/user/bscarone), this is missing the `nda` LDAP group, which is requir... [18:59:03] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:59:56] tbh, not really, I just guess it is due to the account being autocreated at a time where the new didn't exist on all wikis yet, although still this never happened before [19:00:34] (03CR) 10Raymond Ndibe: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39855/console" [puppet] - 10https://gerrit.wikimedia.org/r/892446 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [19:01:49] (03PS1) 10Dzahn: planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538 [19:01:59] (03CR) 10CI reject: [V: 04-1] planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538 (owner: 10Dzahn) [19:02:12] (03PS2) 10Dzahn: planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538 [19:02:42] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10KFrancis) >>! In T330364#8643699, @MoritzMuehlenhoff wrote: >>>! In T330364#8643473, @SLyngshede-WMF wrote: >> @KFrancis Given that this is a reactivatio... 
[19:03:50] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:03:55] * where the new wiki didn't exist on all appservers yet [19:04:01] not sure what I wrote above [19:07:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) ` Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 42% ` [19:09:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2070.mgmt.codfw.wmnet with reboot policy FORCED [19:11:12] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:13:30] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10KFrancis) @JMeybohm Please provide Norman Schwirz's email address and I'll put the agreement together. Please send it to kfrancis@wikimedia.org if you'd rather not post it here. [19:14:25] Amir1, could you do your magic for gucwiki in Wikidata? :) [19:14:38] sure [19:14:59] 🎉 [19:16:20] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10MoritzMuehlenhoff) >>! In T330364#8650188, @Urbanecm wrote: > According to [LDAP tool](https://ldap.toolforge.org/user/bscarone), this is missing the `nd... [19:18:36] PROBLEM - SSH on restbase1026 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:18:51] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10bscarone) @MoritzMuehlenhoff works now, thanks for the quick response! 
[19:18:55] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [19:23:26] 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) [19:25:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [19:25:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:25:43] 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [19:26:33] (03PS1) 10Sbailey: enable Linter use namespace field and tag and template UI in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892543 (https://phabricator.wikimedia.org/T175177) [19:28:52] (03PS1) 10Zabe: Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) [19:29:40] (03CR) 10CI reject: [V: 04-1] Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) (owner: 10Zabe) [19:30:18] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[19:30:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:30:27] (03PS2) 10Zabe: Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) [19:31:11] 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) [19:32:09] jouncebot: nowandnext [19:32:09] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [19:32:09] In 1 hour(s) and 27 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T2100) [19:32:20] (03CR) 10Zabe: [C: 03+2] Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) (owner: 10Zabe) [19:33:08] (03Merged) 10jenkins-bot: Initial configuration for gurwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892544 (https://phabricator.wikimedia.org/T327813) (owner: 10Zabe) [19:33:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[19:33:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:34:36] !log create Wikipedia Farefare (Gurene) # T327813 [19:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:41] T327813: Create Wikipedia Farefare (Gurene) - https://phabricator.wikimedia.org/T327813 [19:35:04] !log zabe@deploy1002 Started scap: create gurwiki T327813 [19:36:31] zabe: am I okay to 'deploy' a beta-only change (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/892543) ? [19:37:01] TheresNoTime: yep [19:37:02] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2070.mgmt.codfw.wmnet with reboot policy FORCED [19:38:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [19:38:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:38:30] !log samtar@deploy1002 Backport cancelled. 
[19:38:41] (okay that really doesn't need to log) [19:38:57] (03CR) 10Samtar: [C: 03+2] "beta deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892543 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [19:39:44] (03Merged) 10jenkins-bot: enable Linter use namespace field and tag and template UI in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892543 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [19:40:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [19:40:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:42:24] !log zabe@deploy1002 Finished scap: create gurwiki T327813 (duration: 07m 19s) [19:42:28] T327813: Create Wikipedia Farefare (Gurene) - https://phabricator.wikimedia.org/T327813 [19:45:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [19:45:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:50:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[19:50:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:50:56] (03PS1) 10Gmodena: page-content-change: docker image version bump. [deployment-charts] - 10https://gerrit.wikimedia.org/r/892549 [19:55:05] !log power cycling restbase1026 [19:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [19:55:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:56:22] PROBLEM - Host restbase1026 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:24] RECOVERY - SSH on restbase1026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:57:26] RECOVERY - Host restbase1026 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:57:28] PROBLEM - confd service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:36] PROBLEM - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:57:36] PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL 
CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:57:42] PROBLEM - cassandra-a service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:42] PROBLEM - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:58:06] PROBLEM - cassandra-b service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:58:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [19:58:23] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:58:54] PROBLEM - cassandra-c service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:59:18] RECOVERY - confd service on restbase1026 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:59:56] RECOVERY - cassandra-b service on restbase1026 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:00:42] RECOVERY - cassandra-c service on restbase1026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:20] RECOVERY - cassandra-a service on restbase1026 is OK: OK - 
cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:40] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:02:06] RECOVERY - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-b valid until 2025-02-21 18:43:46 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:02:06] RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2025-02-21 18:43:48 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:02:16] RECOVERY - cassandra-b CQL 10.64.48.181:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.181 port 9042 https://phabricator.wikimedia.org/T93886 [20:02:30] RECOVERY - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.182 port 9042 https://phabricator.wikimedia.org/T93886 [20:02:53] (03CR) 10Dzahn: [C: 03+2] planet: add Kosta Harla to feeds [puppet] - 10https://gerrit.wikimedia.org/r/892538 (owner: 10Dzahn) [20:03:12] RECOVERY - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-a valid until 2025-02-21 18:43:44 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:03:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[20:03:23] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:03:30] RECOVERY - cassandra-a CQL 10.64.48.180:9042 on restbase1026 is OK: TCP OK - 0.001 second response time on 10.64.48.180 port 9042 https://phabricator.wikimedia.org/T93886 [20:03:54] (03CR) 10Dzahn: [C: 03+2] peopleweb: add bacula file set srv-org-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/891920 (owner: 10Dzahn) [20:05:47] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:06:05] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:06:40] hi, could someone check on a maintenance script run for me? it probably has finished, but i'd like to confirm. https://phabricator.wikimedia.org/T315510#8630577 [20:07:01] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:08:29] MatmaRex: still running, 'Processed 1306600 (updated 2293) of 7328137 rows' [20:08:57] is mwmaint failing over to other DC? I haven't checked [20:09:12] I would assume it is [20:09:13] taavi: hmm, thanks [20:09:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[20:09:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:10:05] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:11:30] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:11:45] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002" [20:12:35] (03PS2) 10Zabe: Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:12:55] (03PS3) 10Zabe: Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:13:53] (03CR) 10CI reject: [V: 04-1] Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:16:41] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:17:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new ms-fe and thanos nodes - pt1979@cumin2002" [20:17:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:20:32] (03PS4) 10Zabe: Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:22:25] (03PS5) 10Zabe: Initial configuration for vewikimedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:23:45] (03PS6) 10Zabe: Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:24:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [20:24:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:24:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED [20:25:55] (03CR) 10Zabe: [C: 03+2] Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:26:42] (03Merged) 10jenkins-bot: Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) (owner: 10Urbanecm) [20:29:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[20:29:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:29:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED [20:30:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED [20:31:14] (03PS1) 10BCornwall: config: Add brett for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) [20:31:51] (03CR) 10CI reject: [V: 04-1] config: Add brett for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [20:33:24] (03PS1) 10Zabe: Set language code for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892552 (https://phabricator.wikimedia.org/T320890) [20:33:34] (03CR) 10Zabe: [C: 03+2] Set language code for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892552 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [20:33:40] (03PS2) 10BCornwall: config: Add brett for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/892551 (https://phabricator.wikimedia.org/T321309) [20:34:17] (03Merged) 10jenkins-bot: Set language code for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892552 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [20:34:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[20:34:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:35:46] zabe: you just became a Wikidata Q number, heh [20:35:53] (03CR) 10Ottomata: [C: 03+2] page-content-change: docker image version bump. [deployment-charts] - 10https://gerrit.wikimedia.org/r/892549 (owner: 10Gmodena) [20:36:28] heh [20:36:59] because I liked to add a value for "creator" for the Wikipedia editions [20:37:08] and that's you [20:38:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2071.mgmt.codfw.wmnet with reboot policy FORCED [20:38:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2072.mgmt.codfw.wmnet with reboot policy FORCED [20:39:27] !log create Wikimedia Venezuela wiki # T320890 [20:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:32] T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890 [20:40:21] *cough cough* any chance I can do the next wiki creation then? ^^' [20:40:30] yeah I should probably do one too :D [20:40:32] !log zabe@deploy1002 Started scap: create vewikimedia T320890 [20:40:42] :p [20:41:24] free wikidata number :D [20:41:42] just doing wiki number 1000 [20:41:57] woo! [20:42:01] Ace! [20:42:13] party time :) [20:42:40] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:42:41] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:43:06] zabe: are you going to be clear by the deployment window in 15m? 
[20:43:30] I should [20:44:09] sign up at https://phabricator.wikimedia.org/project/profile/2941/ :) [20:44:33] (MediaWikiLatencyExceeded) firing: Average latency high: ... [20:44:33] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:44:35] (03PS1) 10Zabe: Enable Translate on vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892553 (https://phabricator.wikimedia.org/T320890) [20:45:16] (03CR) 10Zabe: [C: 03+2] Enable Translate on vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892553 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [20:45:59] (03Merged) 10jenkins-bot: Enable Translate on vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892553 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [20:48:02] !log zabe@deploy1002 Finished scap: create vewikimedia T320890 (duration: 07m 29s) [20:48:07] T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890 [20:48:41] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [20:48:47] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [20:49:02] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891884 [20:49:04] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891884 (owner: 10Zabe) [20:49:33] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[20:49:33] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:49:44] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891884 (owner: 10Zabe) [20:49:50] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [20:50:15] !log zabe@deploy1002 Started scap: install Translate on vewikimedia and update interwiki cache [20:50:21] !log zabe@deploy1002 sync-world aborted: install Translate on vewikimedia and update interwiki cache (duration: 00m 06s) [20:50:24] !log zabe@deploy1002 Started scap: install Translate on vewikimedia and update interwiki cache T320890 [20:52:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2072.mgmt.codfw.wmnet with reboot policy FORCED [20:52:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2073.mgmt.codfw.wmnet with reboot policy FORCED [20:53:24] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330372 (10Papaul) 05Open→03Resolved This was one of the new ms-be nodes; it should be good now [20:55:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[20:55:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:57:50] !log zabe@deploy1002 Finished scap: install Translate on vewikimedia and update interwiki cache T320890 (duration: 07m 26s) [20:57:55] T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890 [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T2100). [21:00:04] Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] ok, who wants to do the backports? :D [21:00:14] Hi :D [21:00:17] * TheresNoTime can deploy [21:00:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [21:00:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:00:26] zabe: clear to? [21:00:31] also what's up with the flapping latency alerts? 
[21:00:51] (no idea, been flapping for a few hours iirc) [21:01:00] yep, have fun [21:01:12] (03PS3) 10Samtar: [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15) [21:01:12] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T330701 (10phaultfinder) [21:02:19] Superpes: starting with 891814 :) [21:02:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15) [21:02:30] TheresNoTime Perfect ;) [21:03:08] taavi: has something happened to prep for tomorrow/Wednesday [21:03:10] (03Merged) 10jenkins-bot: [eswiki] Create new 'templateeditor' usergroup and protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891814 (https://phabricator.wikimedia.org/T330470) (owner: 10Superpes15) [21:03:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2073.mgmt.codfw.wmnet with reboot policy FORCED [21:03:27] !log samtar@deploy1002 Started scap: Backport for [[gerrit:891814|[eswiki] Create new 'templateeditor' usergroup and protection level (T330470)]] [21:03:31] T330470: New protection level and right to edit through it for eswiki - https://phabricator.wikimedia.org/T330470 [21:04:17] !log zabe@mwmaint1002:~$ mwscript createAndPromote.php --wiki vewikimedia --bureaucrat Zabe REDACTED [21:04:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[21:04:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:06] !log samtar@deploy1002 superpes and samtar: Backport for [[gerrit:891814|[eswiki] Create new 'templateeditor' usergroup and protection level (T330470)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:05:27] Superpes: that's live on (any) mwdebug, can you test? (Have you done a backport before?) [21:05:31] Checking :) [21:05:35] cool :) [21:05:50] (03PS4) 10Samtar: [extwiki] Change wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: 10Superpes15) [21:05:54] (03CR) 10RLazarus: Switch deployment server to deploy2002.codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892373 (https://phabricator.wikimedia.org/T330651) (owner: 10Clément Goubert) [21:06:26] TheresNoTime everything seems fine! Thanks :) [21:06:31] syncing [21:07:48] 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [21:09:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[21:09:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:10:51] (03PS1) 10Samtar: Add Apache configuration for amical.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) [21:12:09] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:891814|[eswiki] Create new 'templateeditor' usergroup and protection level (T330470)]] (duration: 08m 42s) [21:12:14] T330470: New protection level and right to edit through it for eswiki - https://phabricator.wikimedia.org/T330470 [21:12:23] Superpes: that should be live now :) starting 892467 [21:12:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: 10Superpes15) [21:12:37] TheresNoTime Thanks :P [21:13:19] (03Merged) 10jenkins-bot: [extwiki] Change wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892467 (https://phabricator.wikimedia.org/T330588) (owner: 10Superpes15) [21:13:34] !log samtar@deploy1002 Started scap: Backport for [[gerrit:892467|[extwiki] Change wordmark and tagline (T330588)]] [21:13:39] T330588: Extremaduran Wikipedia - Updates in the address bar and in some versions of the wiki - https://phabricator.wikimedia.org/T330588 [21:13:56] It's live https://usercontent.irccloud-cdn.com/file/4DDeRrpe/image.png [21:13:59] Tyvm! 
[21:15:14] !log samtar@deploy1002 samtar and superpes: Backport for [[gerrit:892467|[extwiki] Change wordmark and tagline (T330588)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:15:25] Superpes: that's live on mwdebug, can you test? [21:15:41] Yep it works :D TheresNoTime [21:15:55] hola LuchoCR [21:15:56] 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) [21:16:03] ack [21:21:49] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:892467|[extwiki] Change wordmark and tagline (T330588)]] (duration: 08m 14s) [21:21:53] T330588: Extremaduran Wikipedia - Updates in the address bar and in some versions of the wiki - https://phabricator.wikimedia.org/T330588 [21:22:10] Superpes: should be live (and have purged the cache) [21:22:19] Wonderful!!! [21:22:32] Many thanks for your time and support TheresNoTime :D [21:22:43] you're very welcome :) [21:22:55] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) After multiple reviews, fixes, and the last translations being done, the message has been sent to 832 c... 
[21:24:52] 10SRE-swift-storage, 10Data-Engineering, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [21:25:21] !log close UTC late backport window [21:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:08] (03PS1) 10Dzahn: httpbb: update/fix tests for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/892564 (https://phabricator.wikimedia.org/T330090) [21:28:37] (03CR) 10Dzahn: [C: 03+2] httpbb: update/fix tests for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/892564 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [21:28:50] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts miscweb2002.codfw.wmnet ./test_miscweb.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/892564 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [21:36:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2070'] [21:50:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [21:50:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:53:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2070'] [21:53:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2071'] [21:55:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[21:55:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:56:39] (03CR) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [21:58:24] (03PS65) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [21:58:25] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:58:49] (03PS1) 10Zabe: Remove vewikimedia from deleted wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892566 (https://phabricator.wikimedia.org/T320890) [21:58:52] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:16] (03CR) 10Zabe: [C: 03+2] Remove vewikimedia from deleted wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892566 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [22:00:02] (03Merged) 10jenkins-bot: Remove vewikimedia from deleted wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892566 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [22:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230227T2200) [22:01:31] !log zabe@deploy1002 Started scap: Backport for [[gerrit:892566|Remove vewikimedia from deleted wikis (T320890)]] [22:01:35] T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890 [22:02:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [22:02:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:02:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be2071'] [22:02:56] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39856/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [22:03:16] !log zabe@deploy1002 zabe: Backport for [[gerrit:892566|Remove vewikimedia from deleted wikis (T320890)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:04:27] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:04:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2072'] [22:07:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[22:07:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:09:02] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:892566|Remove vewikimedia from deleted wikis (T320890)]] (duration: 07m 30s) [22:09:07] T320890: Create a wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T320890 [22:15:10] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:15:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [22:15:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:15:34] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:16:12] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:16:25] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:18:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2072'] [22:18:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2073'] [22:19:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:19:47] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:25:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[22:25:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:26:04] (03PS1) 10Jon Harald Søby: Add `guc` and `gur` to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 [22:29:50] (03CR) 10Dzahn: [C: 03+2] switch annual.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/891406 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [22:30:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [22:30:17] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:31:11] !log [apifeatureusage] T329957 Restarted `logstash` on `apifeatureusage[1-2]001` [22:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:16] T329957: Restart Elastic/Blazegraph services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [22:31:45] (03PS1) 10RLazarus: mediawiki-cache-warmup: Rename `Request` to `Task` [puppet] - 10https://gerrit.wikimedia.org/r/892569 (https://phabricator.wikimedia.org/T290989) [22:31:47] (03PS1) 10RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) [22:35:08] (03PS2) 10RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - 10https://gerrit.wikimedia.org/r/892570 
(https://phabricator.wikimedia.org/T290989) [22:35:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [22:35:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:35:48] (03PS3) 10Ryan Kemper: [WIP] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) [22:35:57] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T330364 (10MoritzMuehlenhoff) 05Open→03Resolved Great :-) Closing the task, then. [22:35:59] (03CR) 10CI reject: [V: 04-1] [WIP] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper) [22:36:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2073'] [22:36:57] !log switching https://annual.wikimedia.org from eqiad to codfw T330090 [22:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:02] T330090: Switchover static miscweb services to codfw - https://phabricator.wikimedia.org/T330090 [22:40:37] (03CR) 10Dzahn: [C: 03+2] "all tests pass - using new discovery name and using codfw deploymenmt server" [puppet] - 10https://gerrit.wikimedia.org/r/891406 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [22:42:54] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2070'] [22:43:17] (MediaWikiLatencyExceeded) firing: Average 
latency high: ... [22:43:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:48:18] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [22:48:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:49:34] (03PS1) 10Dzahn: swich https://15.wikipedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892571 (https://phabricator.wikimedia.org/T330090) [22:52:02] (03PS4) 10Ryan Kemper: Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) [22:52:12] (03CR) 10CI reject: [V: 04-1] Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper) [22:54:38] (03PS5) 10Ryan Kemper: Revert "wdqs: disable notifs on not-yet-in-service hosts" [puppet] - 10https://gerrit.wikimedia.org/r/891626 (https://phabricator.wikimedia.org/T301167) [22:59:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[22:59:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:00:10] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:49] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [23:00:50] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [23:01:07] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [23:04:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [23:04:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:07:06] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:09:20] (03CR) 10Dzahn: [C: 03+2] swich https://15.wikipedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892571 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [23:09:21] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [23:11:09] (03CR) 10Zabe: [C: 03+2] Add `guc` and `gur` to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby) [23:11:38] (03CR) 10Zabe: [C: 03+2] "Thanks for catching this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby) [23:11:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby) [23:11:55] (03Merged) 10jenkins-bot: Add `guc` and `gur` to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892568 (owner: 10Jon Harald Søby) [23:12:09] !log zabe@deploy1002 Started scap: Backport for [[gerrit:892568|Add `guc` and `gur` to InterwikiSortOrders]] [23:12:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [23:12:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:13:59] !log zabe@deploy1002 jhsoby and zabe: Backport for [[gerrit:892568|Add `guc` and `gur` to InterwikiSortOrders]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [23:15:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:17:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[23:17:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:18:17] (MediaWikiMemcachedHighErrorRate) firing: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [23:19:51] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:892568|Add `guc` and `gur` to InterwikiSortOrders]] (duration: 07m 41s) [23:19:52] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [23:23:17] (MediaWikiMemcachedHighErrorRate) resolved: (2) MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [23:25:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:26:17] (MediaWikiLatencyExceeded) firing: Average latency high: ... [23:26:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:26:19] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [23:31:17] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[23:31:18] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:32:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:32:30] (03PS1) 10Dzahn: switch https://bienvenida.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892574 (https://phabricator.wikimedia.org/T330090) [23:34:02] (03CR) 10Dzahn: [C: 03+2] switch https://bienvenida.wikimedia.org from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/892574 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [23:37:45] * herzog wonders what ^ is about [23:40:23] herzog: well, see https://bienvenida.wikimedia.org/ [23:40:28] * zabe wonders about lists being listed as out of scope on T329193 [23:40:34] "here is some music from Latin America and download the app" [23:40:38] mutante: I did [23:40:49] how could something be out of scope when this is supposed to test emergency failover capabilities [23:42:26] herzog: https://phabricator.wikimedia.org/T207816 [23:42:38] Mexico Awareness [23:42:50] (03PS1) 10Samtar: Initial configuration for amicalwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/892575 (https://phabricator.wikimedia.org/T330390) [23:43:52] (03CR) 10Zabe: Add Apache configuration for amical.wikimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) (owner: 10Samtar) [23:45:01] mutante: 2018 campaign, site still needed?
[23:45:30] we tend to not delete stuff [23:46:28] herzog: yea, URLs are needed forever because https://www.w3.org/Provider/Style/URI and if you do it means new work to add rewrite rules [23:46:57] doesn't gain from deleting a virtual host on miscweb only to have to add it on cluster [23:47:33] (03PS2) 10Samtar: Add Apache configuration for amical.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) [23:48:23] (03CR) 10Samtar: Add Apache configuration for amical.wikimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) (owner: 10Samtar) [23:49:48] but we can move it to k8s. then there won't be failovers like above anymore [23:50:20] also see nostalgia.wikipedia.org [23:51:08] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:52:23] maybe that 9/11 wiki was actually deleted [23:53:16] https://sep11.wikipedia.org [23:54:16] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)