[00:10:06] (03Merged) 10jenkins-bot: help: Fix navigation in the help panel [extensions/GrowthExperiments] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942468 (https://phabricator.wikimedia.org/T342927) (owner: 10Gergő Tisza) [00:10:20] !log tgr@deploy1002 Started scap: Backport for [[gerrit:942468|help: Fix navigation in the help panel (T342927)]] [00:10:24] T342927: [wmf.19-regression] Help panel - text displayed incorrectly - https://phabricator.wikimedia.org/T342927 [00:10:53] (03PS1) 10Andrew Bogott: Update horizon version [puppet] - 10https://gerrit.wikimedia.org/r/942530 [00:11:47] !log tgr@deploy1002 tgr: Backport for [[gerrit:942468|help: Fix navigation in the help panel (T342927)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [00:14:42] !log tgr@deploy1002 tgr: Continuing with sync [00:19:17] (03PS2) 10Andrew Bogott: Update horizon version [puppet] - 10https://gerrit.wikimedia.org/r/942530 [00:19:54] (03CR) 10Andrew Bogott: [C: 03+2] Update horizon version [puppet] - 10https://gerrit.wikimedia.org/r/942530 (owner: 10Andrew Bogott) [00:20:29] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:942468|help: Fix navigation in the help panel (T342927)]] (duration: 10m 09s) [00:20:34] T342927: [wmf.19-regression] Help panel - text displayed incorrectly - https://phabricator.wikimedia.org/T342927 [00:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:27:29] Verified, works in production. I'm done with the backport. [00:38:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/941914 [00:38:57] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/941914 (owner: 10TrainBranchBot) [01:06:05] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/941914 (owner: 10TrainBranchBot) [01:15:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:20:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [02:18:23] 
PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:35] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:11] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:25] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:06] (03CR) 10RLazarus: [C: 03+1] sre.hosts.decommission: search in the DNS repo too [cookbooks] - 10https://gerrit.wikimedia.org/r/941760 (owner: 10Volans) [02:51:39] (03CR) 10RLazarus: [C: 03+1] "LGTM, thank you for this!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/941759 (https://phabricator.wikimedia.org/T297516) (owner: 10Volans) [04:54:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:03:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-wikifunctions-production.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:04:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:06:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:21:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:22:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:27:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:33:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Upgrade db2097 to use mariadb 10.6 package [puppet] - 10https://gerrit.wikimedia.org/r/942461 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [05:38:36] (03PS1) 10Marostegui: mariadb: Enable notifications for db2097, db2141 [puppet] - 10https://gerrit.wikimedia.org/r/942537 (https://phabricator.wikimedia.org/T334650) [05:38:56] (03CR) 10Marostegui: "This required manual rebasing, so it was faster just to push https://gerrit.wikimedia.org/r/c/operations/puppet/+/942537" [puppet] - 10https://gerrit.wikimedia.org/r/942467 (owner: 10Jcrespo) [05:39:18] (03CR) 10Marostegui: "Both hosts are green on icinga, so I 
am merging this" [puppet] - 10https://gerrit.wikimedia.org/r/942537 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [05:39:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notifications for db2097, db2141 [puppet] - 10https://gerrit.wikimedia.org/r/942537 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [05:43:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [05:48:25] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230728T0600) [06:14:48] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:02] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2114 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/941917 (https://phabricator.wikimedia.org/T342947) [06:40:56] (03CR) 10Slyngshede: [C: 03+2] D:apereo_cas::service support FLAT profiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [06:49:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:54:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230728T0700) [07:07:31] (03PS1) 10Slyngshede: R:idp Enable flat OIDC profiles for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) [07:08:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42721/console" [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:12:28] (03CR) 10Jelto: "lgtm. gitlab_oidc on idp-test is the only service which has no profile_format: 'FLAT' then. Does it make sense to also add that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:14:48] (03PS2) 10Slyngshede: R:idp Enable flat OIDC profiles for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) [07:16:46] (03CR) 10Jelto: "one comment in line" [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:18:27] (03PS3) 10Slyngshede: R:idp Enable flat OIDC profiles for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) [07:18:56] (03CR) 10Slyngshede: R:idp Enable flat OIDC profiles for gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:20:34] (03CR) 10Jelto: "lgtm thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:20:41] (03CR) 10Jelto: [C: 03+1] R:idp Enable flat OIDC profiles for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:21:38] (03CR) 10Slyngshede: [C: 03+2] R:idp Enable flat OIDC profiles for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/942540 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:23:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:25:15] (03CR) 10Filippo Giunchedi: [C: 03+1] flink-zk: Enable prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/942494 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [07:26:59] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users (no kerberos, no ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10karapayneWMDE) this sounds fine to me [07:28:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:28:20] (03Abandoned) 10Jcrespo: Revert "mariadb: Disable notifications for db2097, db2141" [puppet] - 10https://gerrit.wikimedia.org/r/942467 (owner: 10Jcrespo) [07:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:39:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:50:08] (03PS1) 10Jelto: gitlab: auto_sign_in_with openid_connect on test instance [puppet] - 10https://gerrit.wikimedia.org/r/942600 (https://phabricator.wikimedia.org/T320390) [07:50:49] (03CR) 10Slyngshede: [C: 03+2] gitlab: auto_sign_in_with openid_connect on test instance [puppet] - 10https://gerrit.wikimedia.org/r/942600 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [07:50:59] (03CR) 10Slyngshede: [C: 03+1] gitlab: auto_sign_in_with openid_connect on test instance [puppet] - 10https://gerrit.wikimedia.org/r/942600 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [07:51:11] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/942600 
(https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [07:53:18] (03PS1) 10Slyngshede: Cloud IDP: Set profile format to flat for GitLab. [puppet] - 10https://gerrit.wikimedia.org/r/942601 (https://phabricator.wikimedia.org/T320390) [07:55:33] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/942601 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:55:51] (03CR) 10Slyngshede: [C: 03+2] Cloud IDP: Set profile format to flat for GitLab. [puppet] - 10https://gerrit.wikimedia.org/r/942601 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [07:59:11] (03CR) 10Jelto: [C: 03+2] gitlab: auto_sign_in_with openid_connect on test instance [puppet] - 10https://gerrit.wikimedia.org/r/942600 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:08:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [08:21:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) [08:21:29] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [08:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:50:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:56:34] PROBLEM - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:56:56] PROBLEM - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:57:16] PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:06:36] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [09:07:12] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) Started working on `purged` and `prometheus-rdkafka-exporter` [09:07:26] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1108.eqiad.wmnet with reason: db1108 has been replaced with db1208 - leaving for a few days before decom [09:07:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1108.eqiad.wmnet with reason: db1108 has been replaced with db1208 - leaving for a few days before decom [09:08:23] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [09:11:32] (03PS1) 10Giuseppe Lavagetto: noc: stop serving static files from symlinks [puppet] - 10https://gerrit.wikimedia.org/r/942607 (https://phabricator.wikimedia.org/T341859) [09:30:41] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: 
migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) The fix of using a `FLAT` profile with GitLab oidc was deployed to all idp servers (including wmcs/cloud). Thanks for @SLyn... [09:36:12] (03CR) 10Elukey: [C: 03+1] ml-services: update ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942455 (https://phabricator.wikimedia.org/T342663) (owner: 10AikoChou) [09:37:09] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [09:38:19] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942455 (https://phabricator.wikimedia.org/T342663) (owner: 10AikoChou) [09:44:58] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) 05Open→03Resolved [09:51:10] (03CR) 10AikoChou: [C: 03+2] ml-services: update ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942455 (https://phabricator.wikimedia.org/T342663) (owner: 10AikoChou) [09:51:54] (03Merged) 10jenkins-bot: ml-services: update ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942455 (https://phabricator.wikimedia.org/T342663) (owner: 10AikoChou) [09:53:52] (03PS1) 10Fabfur: Bump target distribution to bookworm [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/942613 (https://phabricator.wikimedia.org/T342154) [09:55:51] 10SRE, 10Traffic: Recompile fifo-log-demux with hardening options - https://phabricator.wikimedia.org/T342900 (10Fabfur) Same can be done on the `prometheus-rdkafka-exporter` package (https://gerrit.wikimedia.org/r/admin/repos/operations/software/prometheus-rdkafka-exporter,general) [09:57:49] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:00:33] !log aikochou@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:03:57] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42722/console" [puppet] - 10https://gerrit.wikimedia.org/r/942446 (owner: 10Fabfur) [10:07:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) Both `purged` and `prometheus-rdkafka-exporter` are ready for review, and eventually inclusion in wmf repositories. Considering that the `purged` package builds in Bookworm with `pr... [10:16:57] (03PS1) 10Elukey: ml-services: update the ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942618 [10:18:20] !log T342924: created search indices for wikifunctions [10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:24] T342924: Search on wikifunctions.org results in a cirrussearch-backend-error and no results - https://phabricator.wikimedia.org/T342924 [10:19:06] (03CR) 10Elukey: [C: 03+2] ml-services: update the ores-legacy docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942618 (owner: 10Elukey) [10:25:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:27:03] (03PS1) 10Elukey: ml-services: update docker image for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/942621 [10:27:19] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10Fabfur) [10:27:31] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10Fabfur) Included dc-ops [10:28:58] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:29:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:33:06] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:33:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:35:42] (03PS1) 10Elukey: ml-services: bump prod replicas to 5 for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942625 [10:36:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:36:45] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:38:19] (03CR) 10Elukey: [C: 03+2] ml-services: update docker image for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/942621 (owner: 10Elukey) [10:38:48] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:39:07] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: bump prod replicas to 5 for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942625 (owner: 10Elukey) [10:39:12] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:41:36] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:41:41] !log hnowlan@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/rest-gateway: apply [10:41:50] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:42:42] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:43:39] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [10:44:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:45:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:45:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:45:53] (03CR) 10Elukey: [C: 03+2] ml-services: bump prod replicas to 5 for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/942625 (owner: 10Elukey) [10:49:06] (03PS1) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) [10:49:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:49:46] (03CR) 10CI reject: [V: 04-1] Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [10:50:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:51:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:52:04] (03PS1) 10Cathal Mooney: Allow connections from management networks to apt servers on 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942629 (https://phabricator.wikimedia.org/T337028) [10:54:17] (03PS2) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) [10:54:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:54:52] (03CR) 10CI reject: [V: 04-1] Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [10:59:15] (03PS3) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) [10:59:51] (03CR) 10CI reject: [V: 04-1] Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [11:00:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/942629 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [11:03:55] (03PS4) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) [11:07:16] (03PS1) 10Elukey: services: update changeprop's staging docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942634 (https://phabricator.wikimedia.org/T341140) [11:13:52] (03CR) 10Cathal Mooney: [C: 03+2] Allow connections from management networks to apt servers on 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942629 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [11:14:25] (03Merged) 10jenkins-bot: Allow connections 
from management networks to apt servers on 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942629 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [11:15:34] (03PS1) 10Jcrespo: mariadb: Switch s3 and s6 from db1140, db1225 [puppet] - 10https://gerrit.wikimedia.org/r/942636 (https://phabricator.wikimedia.org/T334650) [11:17:49] (03PS1) 10Jcrespo: mariadb: Disable notifications on db1140, db1225 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/942637 (https://phabricator.wikimedia.org/T334650) [11:18:56] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disable notifications on db1140, db1225 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/942637 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [11:19:58] (03PS1) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) [11:22:27] (03CR) 10Btullis: [C: 04-1] "At the moment, the kpayne user account only exists in the `ldap_only_users` section of the data.yaml file. This is why the CI test is fail" [puppet] - 10https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:25:35] (03PS2) 10Jcrespo: mariadb: Switch s3 and s6 from db1140, db1225 [puppet] - 10https://gerrit.wikimedia.org/r/942636 (https://phabricator.wikimedia.org/T334650) [11:27:22] (03PS3) 10Jcrespo: mariadb: Switch s3 and s6 from db1140, db1225 [puppet] - 10https://gerrit.wikimedia.org/r/942636 (https://phabricator.wikimedia.org/T334650) [11:34:39] (03CR) 10Jcrespo: [C: 03+2] mariadb: Switch s3 and s6 from db1140, db1225 [puppet] - 10https://gerrit.wikimedia.org/r/942636 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [11:39:10] (03CR) 10Jbond: "implementation looks good. however we should consider what we want to do on the first puppet run, before puppet is installed. 
We could" [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [11:42:56] (03PS1) 10Cathal Mooney: Allow HTTP return traffic from apt to network devices on TCP 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942639 (https://phabricator.wikimedia.org/T337028) [11:45:18] (03PS1) 10Slyngshede: Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) [11:45:54] (03CR) 10CI reject: [V: 04-1] Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [11:48:22] (03PS2) 10Slyngshede: Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) [11:48:57] (03CR) 10CI reject: [V: 04-1] Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [11:53:42] (03CR) 10Jbond: [C: 03+1] Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [12:07:57] (03CR) 10Hnowlan: [C: 03+1] services: update changeprop's staging docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942634 (https://phabricator.wikimedia.org/T341140) (owner: 10Elukey) [12:10:02] (03PS3) 10Slyngshede: Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) [12:11:38] (03CR) 10Jbond: "implementation looks good some thoughts inline" [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [12:14:50] (03CR) 10Jgiannelos: [C: 03+1] "Other than the redirect path patch looks OK." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [12:15:13] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db1140, db1225 [puppet] - 10https://gerrit.wikimedia.org/r/942645 (https://phabricator.wikimedia.org/T334650) [12:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:22:40] (03CR) 10Elukey: [C: 03+2] services: update changeprop's staging docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/942634 (https://phabricator.wikimedia.org/T341140) (owner: 10Elukey) [12:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:35] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [12:28:55] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [12:33:56] (03PS1) 10Jelto: gitlab: auto_sign_in_with openid_connect on all instances [puppet] - 10https://gerrit.wikimedia.org/r/942646 (https://phabricator.wikimedia.org/T320390) [12:36:06] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42723/console" [puppet] - 10https://gerrit.wikimedia.org/r/942646 
(https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:36:26] (03PS3) 10Hnowlan: rest-gateway: add routes for wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) [12:42:14] (03CR) 10Jforrester: [C: 03+1] "Let's land and deploy this on Monday." [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [12:45:41] (03CR) 10Jforrester: "I've scheduled this for deployment on Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941995 (https://phabricator.wikimedia.org/T325910) (owner: 10EpicPupper) [12:47:50] (03CR) 10Slyngshede: [C: 03+1] "LGTM, I'm very excited to see this in production." [puppet] - 10https://gerrit.wikimedia.org/r/942646 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:50:32] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications for db1140, db1225 [puppet] - 10https://gerrit.wikimedia.org/r/942645 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [12:55:16] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable Lift Wing for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942649 (https://phabricator.wikimedia.org/T342115) [12:55:55] (03CR) 10CI reject: [V: 04-1] ores-extension: enable Lift Wing for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942649 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [12:56:32] (03PS2) 10Ilias Sarantopoulos: ores-extension: enable Lift Wing for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942649 (https://phabricator.wikimedia.org/T342115) [13:00:27] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:40] (03PS1) 10Jcrespo: mariadb: Upgrade db1225 to mariadb 10.6 (and generate 10.6 backups) [puppet] - 10https://gerrit.wikimedia.org/r/942652 
(https://phabricator.wikimedia.org/T334650) [13:17:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/942646 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [13:19:25] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:51] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) [13:29:12] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10darthmon_wmde) [13:30:39] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [13:30:43] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [13:31:07] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10darthmon_wmde) please @adee_wmde fill in you ssh public key the ticket. Intructions are here: https://www.mediawiki.org/wiki/SSH_keys#Generating_a_new_SSH_key. Then remove yoursel... [13:37:12] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10darthmon_wmde) [13:37:55] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10darthmon_wmde) please @roti_WMDE fill in you ssh public key the ticket. Instructions are here: https://www.mediawiki.org/wiki/SSH_keys#Generating_a_new_SSH_key. Then remove yours... 
[13:38:41] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10darthmon_wmde) [13:38:53] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10darthmon_wmde) a:03lojo_wmde [13:39:06] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10darthmon_wmde) please @lojo_wmde fill in you ssh public key the ticket. Instructions are here: https://www.mediawiki.org/wiki/SSH_keys#Generating_a_new_SSH_key. Then remove yours... [13:39:30] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10darthmon_wmde) [13:39:53] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10darthmon_wmde) [13:40:22] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) [13:42:06] (03PS1) 10Elukey: Add nodejs14-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942656 [13:45:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [13:47:02] (03PS1) 10Andrew Bogott: cinder-backups: move workload from cloudcontrol1005 to 1007 [puppet] - 10https://gerrit.wikimedia.org/r/942659 [13:48:18] (03CR) 10Andrew Bogott: [C: 03+2] cinder-backups: move workload from cloudcontrol1005 to 1007 [puppet] - 10https://gerrit.wikimedia.org/r/942659 (owner: 10Andrew Bogott) [13:53:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T342906 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm logged in to see that PSU1 has lost input. unplugged power cable, pulled out PSU. waited 30 seconds. reseated all components. alert has cleared. 
[13:54:36] (03CR) 10Ssingh: Bookworm release. Fix minor lintian warning about missing description. (032 comments) [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/942491 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [13:59:34] (03PS3) 10Fabfur: Release 2.0.0-4 [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/942491 (https://phabricator.wikimedia.org/T342154) [13:59:59] (03CR) 10Fabfur: "Thanks for the review, should be ok now" [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/942491 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:01:53] (03PS1) 10Jbond: mirror: now bookworm is stable stop using unstable [puppet] - 10https://gerrit.wikimedia.org/r/942663 [14:03:03] (03CR) 10Jbond: [C: 03+2] mirror: now bookworm is stable stop using unstable [puppet] - 10https://gerrit.wikimedia.org/r/942663 (owner: 10Jbond) [14:03:39] (03CR) 10Ssingh: Bump target distribution to bookworm (031 comment) [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/942613 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:03:53] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [14:03:57] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with er... [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:24] (03CR) 10Ssingh: [C: 03+1] "Thanks for fixing the lintian warning as well!" 
[debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/942491 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:08:39] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [14:08:44] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [14:09:37] (03PS1) 10Elukey: ml-services: update Docker image for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/942664 [14:10:34] (03CR) 10Milimetric: [C: 03+1] Create puppet scripting for sqooping Wikifunctions tables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [14:10:58] (03PS1) 10Ilias Sarantopoulos: ml-services: update ores-legacy UI docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942665 (https://phabricator.wikimedia.org/T341479) [14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:39] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update ores-legacy UI docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942665 (https://phabricator.wikimedia.org/T341479) (owner: 10Ilias Sarantopoulos) [14:13:25] (03Merged) 10jenkins-bot: ml-services: update ores-legacy UI docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/942665 (https://phabricator.wikimedia.org/T341479) (owner: 10Ilias Sarantopoulos) [14:14:20] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on Beta Cluster role::beta::docker_services boxes - 
https://phabricator.wikimedia.org/T342038 (10Jdforrester-WMF) CCing you on this task now I've found it. :-) [14:14:29] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [14:15:05] !log milimetric@deploy1002 Started deploy [analytics/refinery@1523f12]: Patch sqoop of wikifunctions [14:15:06] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [14:15:11] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker image for articletopic-outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/942664 (owner: 10Elukey) [14:15:47] (03PS2) 10Fabfur: Release 0.4 [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/942613 (https://phabricator.wikimedia.org/T342154) [14:15:47] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [14:16:26] (03CR) 10Fabfur: Release 0.4 (031 comment) [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/942613 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:45] (03CR) 10Ssingh: [C: 03+1] Release 0.4 [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/942613 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:19:13] (03CR) 10Cory Massaro: Add timeout values in milliseconds as environment variables. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/942017 (owner: 10Cory Massaro) [14:21:17] !log milimetric@deploy1002 Finished deploy [analytics/refinery@1523f12]: Patch sqoop of wikifunctions (duration: 06m 11s) [14:22:04] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [14:22:32] (03PS4) 10Fabfur: Version 0.6.4 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) [14:22:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:23:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:23:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:25:09] !log milimetric@deploy1002 Started deploy [analytics/refinery@1523f12] (thin): Patch sqoop of wikifunctions [14:25:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [14:25:13] !log milimetric@deploy1002 Finished deploy [analytics/refinery@1523f12] (thin): Patch sqoop of wikifunctions (duration: 00m 03s) [14:26:38] (03PS5) 10Fabfur: Release 0.6.4 [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) [14:29:45] (03CR) 10Ssingh: [C: 03+1] "LGTM. Thanks for cleaning up the depends as well!" 
[software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/942414 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:31:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) The following packages are ready to be imported into bookworm-wikimedia: * fifo-log-demux * file-read-backwards * prometheus-rdkafka-exporter * prometheus-varnishkafka-exporter See... [14:35:42] (03CR) 10Jcrespo: [C: 04-1] "Not to deploy until Manuel says so. :-D" [puppet] - 10https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [14:36:58] (03CR) 10Jcrespo: [C: 04-1] "@Marostegui Re: dbprov. Installing 10.6 on top of 10.4 fails on puppet (although you may already know that). But otherwise it should "just" [puppet] - 10https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [14:38:42] (03CR) 10Marostegui: mariadb: Upgrade db1225 to mariadb 10.6 (and generate 10.6 backups) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942652 (https://phabricator.wikimedia.org/T334650) (owner: 10Jcrespo) [14:40:03] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [14:40:06] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm completed: - sre... [14:41:12] (03CR) 10Ssingh: "Thanks @rzl. Noting that certspotter is sadly no longer being worked on and is not running on the alert host. 
I will add a note about this" [puppet] - 10https://gerrit.wikimedia.org/r/941972 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [14:41:19] (03PS1) 10Giuseppe Lavagetto: noc: unify methods to fetch the current wiki versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942671 (https://phabricator.wikimedia.org/T341859) [14:41:21] (03PS1) 10Giuseppe Lavagetto: noc: don't use on-disk files but etcd directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) [14:41:23] (03PS1) 10Giuseppe Lavagetto: noc: centralize file list management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) [14:41:26] (03PS1) 10Giuseppe Lavagetto: noc: add static file server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) [14:41:28] (03PS1) 10Giuseppe Lavagetto: noc: remove symlinks and also neutralize createTxtFileSymlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) [14:41:50] (03PS2) 10Giuseppe Lavagetto: noc: stop serving static files from symlinks [puppet] - 10https://gerrit.wikimedia.org/r/942607 (https://phabricator.wikimedia.org/T341859) [14:41:55] (03CR) 10CI reject: [V: 04-1] noc: remove symlinks and also neutralize createTxtFileSymlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:42:11] (03CR) 10CI reject: [V: 04-1] noc: centralize file list management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:42:14] (03CR) 10CI reject: [V: 04-1] noc: add static file server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:46:06] 10SRE-tools, 10DC-Ops, 
10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) > I tried to reproduce it with bookworm on sretest1002 but I got an unrelated error in d-i because of the recent point release 12.1. I've updat... [14:47:40] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) [14:49:27] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add routes for wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [14:50:08] (03Merged) 10jenkins-bot: rest-gateway: add routes for wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/938265 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [14:52:05] (03CR) 10Ssingh: Use only active authdns hosts for DNS changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans) [14:52:51] (03CR) 10Hnowlan: "One nit" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942656 (owner: 10Elukey) [14:54:08] (03PS2) 10Elukey: Add nodejs14-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942656 [14:54:46] (03CR) 10Elukey: Add nodejs14-devel (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942656 (owner: 10Elukey) [15:03:47] (03CR) 10Jbond: [C: 03+2] ssh::known_hosts: add new known_hosts functions [puppet] - 10https://gerrit.wikimedia.org/r/942389 (owner: 10Jbond) [15:07:57] (03CR) 10Jforrester: Add wikifunctions.org to certspotter::monitor_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941972 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [15:08:37] (03CR) 10Ssingh: Add wikifunctions.org to certspotter::monitor_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941972 
(https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [15:08:39] (03PS2) 10Giuseppe Lavagetto: noc: unify methods to fetch the current wiki versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942671 (https://phabricator.wikimedia.org/T341859) [15:08:41] (03PS2) 10Giuseppe Lavagetto: noc: don't use on-disk files but etcd directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) [15:08:44] (03PS2) 10Giuseppe Lavagetto: noc: centralize file list management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) [15:08:48] (03PS2) 10Giuseppe Lavagetto: noc: add static file server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) [15:08:50] (03PS2) 10Giuseppe Lavagetto: noc: remove symlinks and also neutralize createTxtFileSymlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) [15:18:36] (03CR) 10Hnowlan: [C: 03+1] Add nodejs14-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942656 (owner: 10Elukey) [15:18:51] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add nodejs14-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942656 (owner: 10Elukey) [15:22:30] (03PS1) 10Elukey: Revert "Add nodejs14-devel" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942480 [15:22:36] (03CR) 10Elukey: [V: 03+2 C: 03+2] Revert "Add nodejs14-devel" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/942480 (owner: 10Elukey) [15:24:12] (03CR) 10Jbond: [C: 03+2] pcc: add support for GERRIT_PRIVATE_CHANGE_NUMBER [puppet] - 10https://gerrit.wikimedia.org/r/937530 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [15:24:15] (03CR) 10Jbond: [C: 03+2] pcc: update the parse commit method to support "Change-Private:" footer (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/937534 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [15:25:00] !log kamila@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:25:10] !log kamila@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:26:07] !log k8s: delete and recreate the benthos-cache-invalidator namespace [15:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:23] !log kamila@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:26:38] (03PS1) 10Ssingh: hiera: add comment to alerting_host.yaml about certspotter [puppet] - 10https://gerrit.wikimedia.org/r/942678 [15:27:09] (03CR) 10Ssingh: [C: 03+2] hiera: add comment to alerting_host.yaml about certspotter [puppet] - 10https://gerrit.wikimedia.org/r/942678 (owner: 10Ssingh) [15:27:39] !log kamila@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:28:08] (03PS1) 10David Martin: Wikifunctions sqoop job: Add missing commandline elements [puppet] - 10https://gerrit.wikimedia.org/r/942680 (https://phabricator.wikimedia.org/T342199) [15:32:09] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:33:17] (03CR) 10Ssingh: [C: 03+2] Release dnsdist 1.8.0-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/941966 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [15:34:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:34:21] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:40:35] (03CR) 10Jbond: "adding this here as they seem to be doing something similar" [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [15:40:38] (03PS1) 10Jforrester: DumpInterwiki: Add f: as interwiki for wikifunctions [extensions/WikimediaMaintenance] (wmf/1.41.0-wmf.19) - 
10https://gerrit.wikimedia.org/r/942482 (https://phabricator.wikimedia.org/T325908) [15:41:10] (03CR) 10Ssingh: [C: 03+1] "LGTM but as discussed, we can skip this as well." [puppet] - 10https://gerrit.wikimedia.org/r/942446 (owner: 10Fabfur) [15:42:13] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [15:42:13] (03PS1) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [15:42:37] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [15:43:39] (03PS1) 10Jforrester: DumpInterwiki: Set Forward=yes to wikifunctions: [extensions/WikimediaMaintenance] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942483 (https://phabricator.wikimedia.org/T342909) [15:44:14] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:45:01] (03PS2) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [15:45:37] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [15:46:28] (03PS1) 10Ssingh: aptrepo: add component/dnsdist to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/942683 (https://phabricator.wikimedia.org/T342154) [15:47:02] (03CR) 10Vgutierrez: [C: 03+1] fifo-log-demux: Add socat as companion package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/942446 (owner: 10Fabfur) [15:48:24] (03PS3) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP 
[puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [15:51:14] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [15:54:17] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [15:57:19] (03CR) 10RLazarus: [C: 03+2] Add wikifunctions.org to certspotter::monitor_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941972 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [15:58:45] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) Here are a couple of rough graphs - frequency distribution of thumbnails (served by swift on 24 July, and all thum... [15:59:36] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:25] (03PS2) 10Jforrester: tests: Add some PHP testing on logos/config.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942463 [16:02:27] (03PS1) 10Jforrester: Wikifunctions: Disable the Collection extension for now, broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942684 (https://phabricator.wikimedia.org/T342931) [16:02:32] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:06:02] (03PS1) 10Jforrester: Wikifunctions: Add WF as alias for NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942685 (https://phabricator.wikimedia.org/T342964) [16:09:59] (03PS1) 10Btullis: Use the new DataHub 
images built with GitLab-CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/942687 (https://phabricator.wikimedia.org/T341194) [16:13:24] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:17:49] (03PS1) 10Andrew Bogott: profile::docker::runner: Fix call to ferm [puppet] - 10https://gerrit.wikimedia.org/r/942688 (https://phabricator.wikimedia.org/T342038) [16:18:49] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet failure on Beta Cluster role::beta::docker_services boxes - https://phabricator.wikimedia.org/T342038 (10Andrew) The attached patch will fix some but not all failures. Because the port wasn't actually used here the different uses of this pro... [16:20:25] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:20:36] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:23:59] (03PS1) 10Andrew Bogott: profile::docker::runner: don't configure ferm [puppet] - 10https://gerrit.wikimedia.org/r/942690 (https://phabricator.wikimedia.org/T342038) [16:24:49] (03CR) 10Jforrester: [C: 03+1] profile::docker::runner: don't configure ferm [puppet] - 10https://gerrit.wikimedia.org/r/942690 (https://phabricator.wikimedia.org/T342038) (owner: 10Andrew Bogott) [16:25:08] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10dancy) In https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/blob/main/group-management/helpers.py#L223 the ldap group sy... [16:25:16] (03CR) 10Jforrester: "I11f3344b73bd379c905b839eea079cd8c7c23aa2 is a probably-better alternative." 
[puppet] - 10https://gerrit.wikimedia.org/r/942688 (https://phabricator.wikimedia.org/T342038) (owner: 10Andrew Bogott) [16:25:34] (03Abandoned) 10Andrew Bogott: profile::docker::runner: Fix call to ferm [puppet] - 10https://gerrit.wikimedia.org/r/942688 (https://phabricator.wikimedia.org/T342038) (owner: 10Andrew Bogott) [16:26:43] (03CR) 10Andrew Bogott: [C: 03+2] profile::docker::runner: don't configure ferm [puppet] - 10https://gerrit.wikimedia.org/r/942690 (https://phabricator.wikimedia.org/T342038) (owner: 10Andrew Bogott) [16:30:44] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on Beta Cluster role::beta::docker_services boxes - https://phabricator.wikimedia.org/T342038 (10Andrew) 05Open→03Resolved a:03Andrew https://gerrit.wikimedia.org/r/c/operations/puppet/+/942690 resolved puppet compilation on the 3 hosts I spot-tested. [16:32:06] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis) [16:41:01] (03PS1) 10Majavah: P:wmcs::graphite: disable incoming metrics [puppet] - 10https://gerrit.wikimedia.org/r/942691 (https://phabricator.wikimedia.org/T326266) [16:41:03] (03PS1) 10Majavah: wmcs: Disable Graphite query access [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) [16:51:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:52:21] (03PS4) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [16:57:00] RECOVERY - 
Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:01:41] (03PS2) 10David Martin: Wikifunctions sqoop job: Add missing commandline elements [puppet] - 10https://gerrit.wikimedia.org/r/942680 (https://phabricator.wikimedia.org/T342199) [17:03:46] (03PS4) 10Jforrester: [DNM] Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 (https://phabricator.wikimedia.org/T342820) [17:05:06] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 192, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:05:18] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:20] (03PS1) 10Andrew Bogott: profile::urldownloader: define towikimedia on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/942694 [17:33:24] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) [17:37:19] (03CR) 10Andrew Bogott: [C: 03+2] profile::urldownloader: define towikimedia on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/942694 (owner: 10Andrew Bogott) [17:40:26] (03PS1) 10Cathal Mooney: Increase the number of retries for ZTP provision cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/942695 (https://phabricator.wikimedia.org/T336485) [17:41:10] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) I've done some work on this to allow for serving the JunOS image as part of the process. In the initial commits... [17:47:24] (03PS1) 10Andrew Bogott: Revert "profile::urldownloader: define towikimedia on deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/942696 [17:50:54] (03CR) 10Andrew Bogott: [C: 03+2] Revert "profile::urldownloader: define towikimedia on deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/942696 (owner: 10Andrew Bogott) [17:54:02] (03PS1) 10Andrew Bogott: network::constants: define mw_appserver_networks_private for cloud-vps [puppet] - 10https://gerrit.wikimedia.org/r/942698 [17:54:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:56:53] (03CR) 10Andrew Bogott: [C: 03+2] network::constants: define mw_appserver_networks_private for cloud-vps [puppet] - 10https://gerrit.wikimedia.org/r/942698 (owner: 10Andrew Bogott) [17:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:33] (03PS5) 10Fabfur: fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 (https://phabricator.wikimedia.org/T342154) [18:03:54] PROBLEM - clamd running on vrts1001 
is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:03:58] (03CR) 10Fabfur: fifo-log-demux: Add socat as companion package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/942446 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [18:04:26] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:25] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=brwikimedia --logwiki=metawiki 'Viniciuspontesoficial' 'Eusouvinipontes' # T343013 [18:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:30] T343013: Unblock stuck global rename of Viniciuspontesoficial - https://phabricator.wikimedia.org/T343013 [18:13:48] zabe: did you figure out why that was stuck? 
I tried looking but found nothing from the logs [18:18:20] no, me neither, I have only unstucked it now so that the account is usable again [18:18:42] lets hope it was a random thing as it often is with ca [18:23:30] PROBLEM - Check systemd state on db2097 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:26] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:27:56] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:32] RECOVERY - Check systemd state on db2097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:14] PROBLEM - Check systemd state on db1140 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:42] RECOVERY - Check systemd state on db1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:05] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@4d8c3db]: Deploying T342926 and https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/469 [19:37:10] T342926: Don't pollute skein logs. Part II. 
- https://phabricator.wikimedia.org/T342926 [19:37:20] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@4d8c3db]: Deploying T342926 and https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/469 (duration: 00m 14s) [19:45:29] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10dr0ptp4kt) [19:54:26] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:27] !log milimetric@deploy1002 Started deploy [analytics/refinery@53db2ca]: Publish refinery-source-0.2.19 [19:56:36] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:01:04] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:02:12] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:03] (03PS1) 10Jforrester: SpecialViewObject: Don't load if action=edit etc. 
[extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/942485 (https://phabricator.wikimedia.org/T342891) [20:06:18] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:06:38] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:12:21] !log milimetric@deploy1002 Finished deploy [analytics/refinery@53db2ca]: Publish refinery-source-0.2.19 (duration: 16m 53s) [20:15:37] (03PS1) 10Sharvaniharan: Config changes for new Android schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942707 [20:31:30] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:31:46] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:14:20] 10SRE, 10Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (10fkaelin) Thank you. No need for it to be archived, and no need for it to be private. 
[21:50:33] !log milimetric@deploy1002 Started deploy [analytics/refinery@f7e74ae]: Fix wikifunction special page [22:00:52] !log milimetric@deploy1002 Finished deploy [analytics/refinery@f7e74ae]: Fix wikifunction special page (duration: 10m 18s) [22:03:17] !log milimetric@deploy1002 Started deploy [analytics/refinery@f7e74ae] (thin): Fix wikifunction special page [22:03:20] !log milimetric@deploy1002 Finished deploy [analytics/refinery@f7e74ae] (thin): Fix wikifunction special page (duration: 00m 03s) [22:16:51] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@1ff1629]: Updating webrequest refine to include wikifunctions [22:17:12] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@1ff1629]: Updating webrequest refine to include wikifunctions (duration: 00m 21s) [23:46:42] PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [23:47:00] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_ [23:49:32] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:50:02] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:50:02] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:50:12] PROBLEM - BFD status on cr3-esams is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:50:14] PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 15 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:50:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 65, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:50:36] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:51:36] (Nonwrite HTTP requests with primary DB writes alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert [23:55:23] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10odimitrijevic) Approved [23:58:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) Both CPU's replaced and mainboard replaced today @akosiaris