[00:03:05] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:04:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34922 and previous config saved to /var/cache/conftool/dbconfig/20220927-000434-ladsgroup.json [00:04:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [00:04:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [00:04:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [00:04:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:04:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1005.wikimedia.org [00:05:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:05:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34923 and previous config saved to /var/cache/conftool/dbconfig/20220927-000525-ladsgroup.json [00:07:24] 10SRE, 10InternetArchiveBot: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) IABot is now handling 429 but I still would like access to the request logs for 
IABot. [00:08:30] 10SRE, 10InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (10Cyberpower678) p:05Triage→03Medium [00:10:37] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:13:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet [00:13:55] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudnet1005.eqiad.wmnet [00:15:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet [00:15:08] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudnet1005.eqiad.wmnet [00:16:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [00:23:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:24:47] (03PS1) 10Stang: votewiki: Change wgLanguageCode to zh for Sep 2022 admins election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835291 (https://phabricator.wikimedia.org/T318147) [00:28:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:31:15] !log andrew@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host cloudcontrol1006.wikimedia.org [00:32:00] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1005.wikimedia.org [00:35:57] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:40:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.wikimedia.org [00:42:46] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1006.wikimedia.org [00:50:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1007.wikimedia.org [00:53:49] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10Papaul) [00:57:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff All your's [01:03:17] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:03:19] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) a:03Papaul [01:08:01] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) [01:15:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T314041)', diff saved to https://phabricator.wikimedia.org/P34924 and previous config 
saved to /var/cache/conftool/dbconfig/20220927-011543-ladsgroup.json [01:15:48] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:17:29] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:03] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34925 and previous config saved to /var/cache/conftool/dbconfig/20220927-013050-ladsgroup.json [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34926 and previous config saved to /var/cache/conftool/dbconfig/20220927-014556-ladsgroup.json [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:54:05] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0200) [02:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T314041)', diff saved to https://phabricator.wikimedia.org/P34927 and previous config saved to /var/cache/conftool/dbconfig/20220927-020103-ladsgroup.json [02:01:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [02:01:07] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:01:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [02:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T314041)', diff saved to https://phabricator.wikimedia.org/P34928 and previous config saved to /var/cache/conftool/dbconfig/20220927-020124-ladsgroup.json [02:04:33] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:04:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:05:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:05:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:06:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:31] (03PS1) 10TrainBranchBot: Branch 
commit for wmf/1.40.0-wmf.3 [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835301 (https://phabricator.wikimedia.org/T314192) [02:07:37] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.3 [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835301 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:25] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:24:23] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.3 [core] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835301 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [02:32:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:34:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:34:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:35:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0300) [03:01:13] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835302 (https://phabricator.wikimedia.org/T314192) [03:01:15] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835302 
(https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [03:01:33] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:57] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835302 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [03:02:25] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.3 refs T314192 [03:02:29] T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192 [03:06:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:07:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:07:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:07:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:22:51] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:24:47] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:38:26] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.3 refs T314192 (duration: 36m 01s) [03:38:30] T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192 [03:40:31] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.1 (duration: 02m 03s) [03:43:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[03:46:59] PROBLEM - Check systemd state on dbprov1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:51:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:57:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:10:07] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:10:37] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:21:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:24:05] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:41:11] RECOVERY - Check systemd state on dbprov1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:45:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:52:45] RECOVERY - Router interfaces on 
cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:52:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:02:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Marostegui) Thanks @papaul - I think these will be handled by @jcrespo :-) [05:02:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::canary_api: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/835506 [05:03:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:04:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37355/console" [puppet] - 10https://gerrit.wikimedia.org/r/835506 (owner: 10Giuseppe Lavagetto) [05:06:55] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:07:30] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Upgrade 10.6.10 [software] - 10https://gerrit.wikimedia.org/r/835508 (https://phabricator.wikimedia.org/T318128) [05:08:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:54] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) Thanks John. I am leaving the host ON, but mysql stopped, so you can proceed and power it off anytime you want to swap the new DIMM. [05:12:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:18:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:33] !log Install 10.6.10 on db1124, db1125, pc1014, pc2014 T318128 [05:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:37] T318128: Compile and install MariaDB 10.6.10 - https://phabricator.wikimedia.org/T318128 [05:28:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:32:23] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [05:33:55] 
(LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:38:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:52:19] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:58:19] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10Marostegui) a:05jcrespo→03Papaul Assigning to @Papaul per T318062#8247109 [06:00:05] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0600). 
[06:00:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:26:21] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:37:35] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:46:15] (03PS16) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 [06:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34930 and previous config saved to /var/cache/conftool/dbconfig/20220927-064925-ladsgroup.json [06:49:29] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:50:59] (03CR) 10Ayounsi: [C: 03+2] sre.network.peering: initial commit (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [06:52:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10MoritzMuehlenhoff) Thanks! [06:54:34] (03Merged) 10jenkins-bot: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [06:57:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:58:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'show' for AS: 8220 [06:59:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'show' for AS: 8220 [07:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34932 and previous config saved to /var/cache/conftool/dbconfig/20220927-070431-ladsgroup.json [07:06:41] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:11:25] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34933 and previous config saved to /var/cache/conftool/dbconfig/20220927-071938-ladsgroup.json [07:22:47] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:25:51] (03CR) 10JMeybohm: [C: 03+1] Add golang 1.18 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/833792 (owner: 10Majavah) [07:27:38] (03CR) 10JMeybohm: [C: 03+1] services_proxy: add a keepalive timeout for image-suggestion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/835205 (https://phabricator.wikimedia.org/T313973) (owner: 10Giuseppe Lavagetto) [07:30:08] jayme: may I ask you to merge/build that go 1.18 image? 
that requires ops access which I don't have [07:30:37] taavi: oh, sorry. Sure! give me a minute [07:31:17] thanks! [07:31:37] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add golang 1.18 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/833792 (owner: 10Majavah) [07:34:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T314041)', diff saved to https://phabricator.wikimedia.org/P34934 and previous config saved to /var/cache/conftool/dbconfig/20220927-073441-ladsgroup.json [07:34:46] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:34:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34935 and previous config saved to /var/cache/conftool/dbconfig/20220927-073451-ladsgroup.json [07:34:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance [07:35:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance [07:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T314041)', diff saved to https://phabricator.wikimedia.org/P34936 and previous config saved to /var/cache/conftool/dbconfig/20220927-073523-ladsgroup.json [07:36:33] !log published image docker-registry.discovery.wmnet/golang1.18:1.18-1 [07:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:37] taavi: ^ [07:39:32] !log uploaded expat 2.2.0-2+deb9u5+wmf1 to apt.wikimedia.org/stretch-wikimedia [07:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:15] (03CR) 10Ayounsi: [C: 04-1] "Is there a way to know what the final config file is going to look like?" 
[puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: 10Muehlenhoff) [07:48:09] !log installing expat security updates on stretch/buster/bullseye [07:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:35] !log upgrade python3-pynetbox to 6.6.0 on cumin2002 - T310745 [07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:39] T310745: Upgrade pynetbox - https://phabricator.wikimedia.org/T310745 [07:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P34937 and previous config saved to /var/cache/conftool/dbconfig/20220927-074948-ladsgroup.json [07:52:55] !log upgrade python3-pynetbox to 6.6.0 on cumin1001 - T310745 [07:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:54] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.thumbor rolling restart_daemons on A:thumbor-codfw [07:56:50] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:57:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.thumbor (exit_code=0) rolling restart_daemons on A:thumbor-codfw [07:58:07] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.thumbor rolling restart_daemons on A:thumbor-eqiad [08:00:20] (03CR) 10Hashar: Ship WMF-specific systemd unit parts as systemd override (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: 10Muehlenhoff) [08:00:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.thumbor (exit_code=0) rolling restart_daemons on 
A:thumbor-eqiad [08:04:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P34938 and previous config saved to /var/cache/conftool/dbconfig/20220927-080454-ladsgroup.json [08:05:36] (03PS1) 10Muehlenhoff: Make ganeti2031 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/835553 (https://phabricator.wikimedia.org/T313857) [08:08:11] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:10:37] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:13:24] (03Abandoned) 10Hashar: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/833844 (owner: 10PipelineBot) [08:15:12] !log restarting apache/FPM on mw canaries to pick up Expat security updates [08:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T314041)', diff saved to https://phabricator.wikimedia.org/P34941 and previous config saved to /var/cache/conftool/dbconfig/20220927-082001-ladsgroup.json [08:20:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:20:06] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:20:17] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:20:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T314041)', diff saved to https://phabricator.wikimedia.org/P34942 and previous config saved to /var/cache/conftool/dbconfig/20220927-082023-ladsgroup.json [08:20:46] (03CR) 10Clément Goubert: C:rsync::server: convert to concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [08:23:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:25:28] (03PS4) 10Jbond: lvs: Convert ::lvs::configuration to a profile [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [08:26:12] (03PS5) 10Jbond: lvs: Convert ::lvs::configuration to a profile [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [08:27:25] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: remove ms-be10[28-39] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/835106 (https://phabricator.wikimedia.org/T294550) (owner: 10MVernon) [08:27:30] (03CR) 10Jbond: [C: 03+1] lvs: Convert ::lvs::configuration to a profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [08:28:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [08:29:18] 10SRE, 10Traffic, 10Upstream: ATS wrongly parses requests without a leading / - https://phabricator.wikimedia.org/T317660 (10Vgutierrez) [08:29:51] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 4 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37356/console" [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [08:30:04] 10SRE, 10Traffic, 10Upstream: ATS wrongly parses requests without a leading / - https://phabricator.wikimedia.org/T317660 (10Vgutierrez) Making the task public after cleaning IP addresses from the original request that helped detecting the issue and after checking with upstream that this isn't a security bug [08:30:51] (03CR) 10Vgutierrez: [C: 03+2] Release 9.1.3-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/834045 (https://phabricator.wikimedia.org/T317660) (owner: 10Vgutierrez) [08:33:53] (03CR) 10MVernon: [C: 03+2] hieradata: remove ms-be10[28-39] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/835106 (https://phabricator.wikimedia.org/T294550) (owner: 10MVernon) [08:34:46] (03CR) 10Jbond: [C: 03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/835195 (owner: 10Muehlenhoff) [08:36:43] (03PS9) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [08:37:13] (03CR) 10Jbond: C:rsync::server: convert to concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [08:37:31] (03PS9) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
[debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [08:47:18] (03CR) 10Muehlenhoff: [C: 03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/835195 (owner: 10Muehlenhoff) [08:52:02] (03CR) 10Vgutierrez: Unlink certificate renewal and OCSP handling (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [08:57:16] (03PS1) 10Filippo Giunchedi: grafana: block external access to /metrics [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) [08:57:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:59:37] (03CR) 10CI reject: [V: 04-1] grafana: block external access to /metrics [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi) [09:00:07] (03PS2) 10Filippo Giunchedi: grafana: block external access to /metrics [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) [09:01:42] (03PS3) 10Filippo Giunchedi: grafana: block external access to /metrics [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) [09:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:03:35] hmmm is wikibugs down? [09:03:40] or just lagged? 
[09:05:18] not sure about phab but for gerrit it was pretty okay with my last update to https://gerrit.wikimedia.org/r/835559 [09:05:29] in terms of lag that is [09:05:32] (03CR) 10Vgutierrez: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/834525 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [09:05:52] yeah.. that one was immediate [09:06:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37357/console" [puppet] - 10https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [09:07:01] (03CR) 10Filippo Giunchedi: "Today we got yet another report of /metrics being publicly exposed. This patch will forbid access from the outside for Grafana." [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi) [09:12:28] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [09:13:40] !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [09:14:36] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [09:15:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:20:57] (03CR) 10Jbond: [C: 04-1] C:rsync::server: convert to concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [09:24:07] RECOVERY - MegaRAID on an-worker1146 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:30:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:18] 10SRE, 10Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez) p:05Triage→03Medium [09:55:27] (03PS1) 10Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 [09:56:55] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - 10https://gerrit.wikimedia.org/r/835108 (owner: 10Jbond) [09:56:57] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - 10https://gerrit.wikimedia.org/r/835157 (owner: 10Jbond) [09:57:11] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - 10https://gerrit.wikimedia.org/r/835108 [09:57:15] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - 10https://gerrit.wikimedia.org/r/835157 (owner: 10Jbond) [09:57:19] (03PS6) 10Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - 10https://gerrit.wikimedia.org/r/835157 [09:57:59] 10SRE, 10Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez) Apparently varnish supports the absolute-URI form for non CONNECT requests. This has been introduced a long time ago in https://gerrit.wikimedia.org/r/c/operations/puppet/+/275474. @BBlack do you ha... 
[09:58:08] (03CR) 10Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff) [10:02:43] (03PS9) 10Hashar: gerrit: decouple scap and daemon users [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) [10:03:26] (03CR) 10Hashar: [C: 03+1] "I have amended the commit message to fix a few typos and clarify the intent of this change, also attached it to T317412 "Automate Gerrit d" [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [10:03:41] (03PS5) 10Hashar: gerrit: change deployment user on devtools [puppet] - 10https://gerrit.wikimedia.org/r/832507 [10:03:50] !log rebalance ganeti/codfw row D after completed Bullseye update T311686 [10:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:54] (03PS3) 10Hashar: gerrit: make homedir variable [puppet] - 10https://gerrit.wikimedia.org/r/833379 [10:03:54] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:04:57] (03PS2) 10Hashar: gerrit: make daemon_user variable everywhere [puppet] - 10https://gerrit.wikimedia.org/r/833385 [10:05:16] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: use catagory for storage [cookbooks] - 10https://gerrit.wikimedia.org/r/835567 [10:06:29] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet [10:07:51] (03CR) 10Jbond: [C: 03+2] "paths on cumin already updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/835567 (owner: 10Jbond) [10:09:51] (03CR) 10Hashar: [C: 03+1] "Cherry picked on devtools and work as intended :)" [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [10:10:05] (03PS1) 10MVernon: cumin: move swift-be-canary [puppet] - 10https://gerrit.wikimedia.org/r/835568 (https://phabricator.wikimedia.org/T294550) [10:10:07] !log 
mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet [10:10:57] (03CR) 10Hashar: [C: 03+1] "Noop on gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [10:11:02] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: use catagory for storage [cookbooks] - 10https://gerrit.wikimedia.org/r/835567 (owner: 10Jbond) [10:11:37] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [10:11:55] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [10:13:08] (03PS1) 10Filippo Giunchedi: pontoon: fix bootstrap with new hiera location [puppet] - 10https://gerrit.wikimedia.org/r/835569 [10:13:21] (03CR) 10Filippo Giunchedi: [C: 03+1] cumin: move swift-be-canary [puppet] - 10https://gerrit.wikimedia.org/r/835568 (https://phabricator.wikimedia.org/T294550) (owner: 10MVernon) [10:13:44] (03CR) 10MVernon: [C: 03+2] cumin: move swift-be-canary [puppet] - 10https://gerrit.wikimedia.org/r/835568 (https://phabricator.wikimedia.org/T294550) (owner: 10MVernon) [10:14:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[2028-2039].codfw.wmnet [10:16:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet [10:17:53] wotcha, timeouts trying to SSH to bastion.wmcloud.org (or, "TTL expired in transit" apparently?) 
[10:18:28] (03PS1) 10Jbond: sre.hardware.firmeware-upgrade: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/835570 [10:18:40] (disregard, it wasn't working for ~10 minutes, starts working as I posted that ^) [10:18:45] (03PS1) 10Vgutierrez: varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - 10https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405) [10:22:12] (03PS2) 10Jbond: sre.hardware.firmeware-upgrade: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/835570 [10:23:57] (03CR) 10Jbond: [C: 03+2] sre.hardware.firmeware-upgrade: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/835570 (owner: 10Jbond) [10:24:28] (03CR) 10Clément Goubert: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/835569 (owner: 10Filippo Giunchedi) [10:26:25] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmeware-upgrade: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/835570 (owner: 10Jbond) [10:27:41] (03PS10) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [10:27:49] (03PS7) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [10:30:44] (03PS1) 10Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - 10https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868) [10:31:31] (03PS3) 10Jbond: sre.hardware.firmware-upgrade: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/835570 [10:35:06] (03CR) 10Jbond: [C: 03+2] sre.hardware.firmware-upgrade: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/835570 (owner: 10Jbond) [10:36:39] (03PS2) 10Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - 10https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868) [10:36:42] (03PS2) 10Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829551 
(https://phabricator.wikimedia.org/T271736) [10:38:32] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [10:38:40] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [10:39:32] (03PS3) 10Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) [10:40:22] (03CR) 10Isabelle Hurbain-Palatin: [C: 03+1] "I double-checked the name of the variables, that the variable are top-level in the config, and that this is applied all wikis in to the -l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [10:40:24] (03CR) 10CI reject: [V: 04-1] jobrunner: convert to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [10:40:50] (03PS3) 10Hashar: gerrit: use daemon_user variable everywhere [puppet] - 10https://gerrit.wikimedia.org/r/833385 [10:41:07] (03PS3) 10Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - 10https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868) [10:41:11] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/835575 [10:41:40] (03PS4) 10Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) [10:42:22] (03CR) 10Hashar: [C: 03+1] "I forgot to adjust the proxy/migration/migration_base profiles ;)" [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [10:42:43] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37365/console" [puppet] 
- 10https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [10:44:02] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/835575 [10:45:25] (03PS4) 10Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - 10https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868) [10:49:32] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/835575 (owner: 10Jbond) [10:50:45] (03PS2) 10Muehlenhoff: interface: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013) [10:52:18] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [10:53:01] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/835575 (owner: 10Jbond) [10:53:54] (03CR) 10Vgutierrez: [C: 03+2] varnish: Fix VCL tests broken by querysort [puppet] - 10https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868) (owner: 10Vgutierrez) [10:54:23] (03CR) 10Muehlenhoff: [C: 03+2] interface: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:54:49] vgutierrez: shall I merge your vcl patch along? 
[10:55:01] moritzm: go ahead please [10:55:15] ack, done [10:55:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2028-2039].codfw.wmnet [10:55:27] 10SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[2028-2039].codfw.wmnet` - ms-be2028.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found physical hos... [10:55:41] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be20[28-39].codfw.wmnet - https://phabricator.wikimedia.org/T318689 (10MatthewVernon) [10:56:10] 10SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (10MatthewVernon) [10:56:28] 10SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [10:57:08] (03Abandoned) 10Jbond: sre.hardware.upgrade-firmware: use catagory for storage [cookbooks] - 10https://gerrit.wikimedia.org/r/835567 (owner: 10Jbond) [10:57:40] !log mvernon@cumin1001 START - Cookbook sre.dns.netbox [10:58:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:58:52] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet [10:58:54] 10SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1001 for hosts: `ms-be[1028-1033,1035-1039].eqiad.wmnet` - ms-be1028.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found ph... 
[10:59:30] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[28-39].eqiad.wmnet - https://phabricator.wikimedia.org/T318691 (10MatthewVernon) [11:00:08] 10SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (10MatthewVernon) [11:00:28] 10SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [11:04:37] (03PS2) 10Vgutierrez: varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - 10https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405) [11:06:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [11:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:07:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [11:08:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834039 (owner: 10Jbond) [11:11:55] (03PS1) 10Jbond: sre.hardware.upfraede-firmware: use packagin.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [11:14:18] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [11:14:34] (03CR) 10Vgutierrez: "text tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [11:17:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:17:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:02] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 (owner: 10Jbond) [11:21:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:23:03] (03PS1) 10Ladsgroup: labs: Enable temp user creation in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835580 [11:24:19] (03CR) 10CI reject: [V: 04-1] labs: Enable temp user creation in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835580 (owner: 10Ladsgroup) [11:28:32] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [11:30:48] (03PS2) 10Ladsgroup: labs: Enable temp user creation in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835580 [11:32:15] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:34:47] (03CR) 10Ladsgroup: [C: 03+2] labs: Enable temp user creation in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835580 (owner: 10Ladsgroup) [11:35:37] (03Merged) 10jenkins-bot: labs: Enable temp user creation in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835580 (owner: 10Ladsgroup) [11:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:38:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:39:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:39:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:40:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:41:07] (03PS1) 10Jelto: gitlab: disable email notifications on replicas [puppet] - 10https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682) [11:43:38] (03CR) 10Nikerabbit: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit) [11:45:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:45:35] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37370/console" [puppet] - 10https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682) (owner: 10Jelto) [11:49:26] (03PS2) 10Jbond: 0.5.4: Prepare release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834039 [11:49:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [11:49:41] (03CR) 10Jbond: [V: 03+2 C: 03+2] 0.5.4: Prepare release [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834039 (owner: 10Jbond) [11:49:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] 0.5.4: Prepare release (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834039 (owner: 10Jbond) [11:50:13] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 
10Platform Team Workboards (Platform Engineering Reliability): Replace nutcracker with mcrouter - https://phabricator.wikimedia.org/T318695 (10hnowlan) [11:50:27] (03CR) 10Jbond: [C: 03+2] C:ssh::publish_fingerprints: drop RSA support [puppet] - 10https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [11:51:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:51:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:57:19] !log upload new wmf-laptop_0.5.4 package [11:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:04:35] (03PS1) 10Clément Goubert: pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583 [12:05:26] (03CR) 10CI reject: [V: 04-1] pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583 (owner: 10Clément Goubert) [12:10:13] (03PS1) 10Clément Goubert: C:memcached Restart memcached service on change [puppet] - 10https://gerrit.wikimedia.org/r/835585 [12:10:37] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:11:46] (03CR) 10Clément Goubert: "I am not sure about this, since we may want more control around memcached restarts for cache warming reasons. Opinions?" 
[puppet] - 10https://gerrit.wikimedia.org/r/835585 (owner: 10Clément Goubert) [12:13:12] (03CR) 10Hashar: [C: 03+1] "I found the reason. gerrit2002 has been populated using rsync which included the following directories:" [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [12:13:16] (03PS2) 10Clément Goubert: pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583 [12:15:34] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37371/console" [puppet] - 10https://gerrit.wikimedia.org/r/835585 (owner: 10Clément Goubert) [12:15:48] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:15:51] (03PS1) 10KartikMistry: testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835606 (https://phabricator.wikimedia.org/T314557) [12:17:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! LGTM, I'll send a change to have boostrap.sh add the SPDX header" [puppet] - 10https://gerrit.wikimedia.org/r/835583 (owner: 10Clément Goubert) [12:18:34] (03PS1) 10Filippo Giunchedi: pontoon: add SPDX header to rolemap on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835607 [12:18:54] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:18:58] (03CR) 10Clément Goubert: [C: 03+2] pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583 (owner: 10Clément Goubert) [12:19:24] (03CR) 10Clément Goubert: [C: 03+1] pontoon: add SPDX header to rolemap on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835607 (owner: 10Filippo Giunchedi) [12:20:45] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[12:21:06] 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) [12:21:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:22:53] (03PS2) 10Clément Goubert: C:memcached Restart memcached service on change [puppet] - 10https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697) [12:22:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:23:34] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add SPDX header to rolemap on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835607 (owner: 10Filippo Giunchedi) [12:23:36] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:23:55] claime: I merged your change too [12:26:03] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:26:59] godog: thanks! [12:28:50] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:31:03] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:33:31] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:36:14] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:41:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:42:32] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:52:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:58:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refresh automated tests to remove references to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/835612 (https://phabricator.wikimedia.org/T275864) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1300) [13:00:26] looks like there’s nothing to deploy :) [13:00:37] unless content transform team want to do mobileapps/wikifeeds things [13:02:59] I'll add a patch in a second :) [13:05:21] (CR) Abijeet Patro: [C: +1] Update Translate job names [deployment-charts] - https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: Nikerabbit) [13:10:17] So, I added a patch for the currently ongoing deploy window. Though, it could also be done in later slot. [13:10:29] deploy -> backport [13:10:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:17:53] (CR) Herron: [C: +1] "Seems fine -- I don't see much benefit/downside to either config, but since blocking this should cut down on security false positives LGTM" [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: Filippo Giunchedi) [13:18:12] is anyone else around to deploy the backport? I’m in a meeting [13:24:25] MichaelG_WMDE: I assume that change should be backported to wmf.2? [13:24:36] (normally the change linked in the deployment calendar is already a cherry-pick, ftr) [13:25:04] Lucas_WMDE: ah yes, will prepare the cherry pick right away [13:25:47] * taavi looks [13:26:58] MichaelG_WMDE: hey. happy to deploy once you have a cherry-pick [13:27:23] thanks, one second [13:30:24] (CR) Nikerabbit: "I'm trying to figure out who is capable and comfortable deploying this change." 
[deployment-charts] - https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: Nikerabbit) [13:30:31] the gerrit up is the best way to create one [13:31:07] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (Cmjohnson) [13:31:52] (PS1) Michael Große: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) [13:32:18] (CR) Lucas Werkmeister (WMDE): [C: +1] Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:32:25] ok, so I _think_ this is the right one now https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/835590 [13:32:51] though it might also be useful to have this on wmf.3? 
[13:32:52] oh wait, I just realized it’s Tuesday not Monday [13:32:55] so we’re post branch cut [13:32:59] yeah, probably wmf.3 too [13:33:11] yeah, you probably want both at this point [13:33:11] 👍 [13:33:50] (PS1) Michael Große: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) [13:34:30] (PS1) Cmjohnson: Adding site.pp entry for centrallog1002 [puppet] - https://gerrit.wikimedia.org/r/835619 (https://phabricator.wikimedia.org/T313858) [13:35:23] (PS2) Cmjohnson: Adding site.pp entry for centrallog1002 [puppet] - https://gerrit.wikimedia.org/r/835619 (https://phabricator.wikimedia.org/T313858) [13:35:51] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:35:53] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:36:02] ooh, `scap backport` in action [13:36:42] yeah, testing it with 2 patches at the same time for the first time [13:36:54] * taavi starts with filing a scap feature request [13:37:25] (CR) Cmjohnson: [C: +2] Adding site.pp entry for centrallog1002 [puppet] - https://gerrit.wikimedia.org/r/835619 (https://phabricator.wikimedia.org/T313858) (owner: Cmjohnson) [13:38:28] hmm 'http.client.RemoteDisconnected: Remote end closed connection without response' [13:38:38] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:38:44] (CR) TrainBranchBot: [C: 
+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:38:55] and a bug report [13:40:23] (CR) Ayounsi: customscripts: export 'mgmt' entries from hiera_export (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [13:41:12] (CR) Filippo Giunchedi: [C: +2] "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: Filippo Giunchedi) [13:42:24] SRE, serviceops: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (JMeybohm) a:akosiaris I think this is done, right? [13:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T314041)', diff saved to https://phabricator.wikimedia.org/P34950 and previous config saved to /var/cache/conftool/dbconfig/20220927-134528-ladsgroup.json [13:45:33] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:45:40] <_joe_> jouncebot: next [13:45:41] In 0 hour(s) and 14 minute(s): Maintenance script run (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1400) [13:46:11] (CR) Giuseppe Lavagetto: [V: +1 C: +2] jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [13:47:06] _joe_: also there's a backport window atm, and I'm waiting for some backports to merge [13:47:32] <_joe_> taavi: ah sorry I thought you were done [13:47:53] <_joe_> but it's ok, I plan on deploying the change just to one jobrunner for now [13:48:00] <_joe_> worst case scenario I'll depool it [13:49:04] yeah, these are js only backports so in theory shouldn't affect jobrunners at all 
[13:49:57] I can wait with the maintenance script run, it hopefully won’t take the full two hours [13:52:18] <_joe_> Lucas_WMDE: no need [13:52:22] ok [13:52:27] <_joe_> taavi: let's hope they don't :P [13:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T314041)', diff saved to https://phabricator.wikimedia.org/P34951 and previous config saved to /var/cache/conftool/dbconfig/20220927-135310-ladsgroup.json [13:53:15] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:53:21] * MichaelG_WMDE keeps looking at zuul and it should be *almost* done [13:53:35] (CR) Volans: [C: +1] "minor doc typo inline, LGTM otherwise, let's see what o11y says about the services to restart" [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff) [13:54:22] (Merged) jenkins-bot: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:54:36] wmf.2 I can test on www.wikidata.org, but wmf.3 only on test.wikidata.org, right? [13:54:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:03] I think so, yeah [13:55:13] correct [13:56:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:56:18] 👍 [13:56:31] * MichaelG_WMDE is ready when you are [13:57:30] sigh. 
got another ConnectionError with the gerrit polling [13:57:37] * taavi waits for the patch to merge before re-running it [13:58:06] (Merged) jenkins-bot: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:58:26] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:58:28] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: Michael Große) [13:58:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:14] !log taavi@deploy1002 Started scap: Backport for [[gerrit:835590|Track use of Searchbox footer on Wikidata (T306933)]], [[gerrit:835591|Track use of Searchbox footer on Wikidata (T306933)]] [13:59:18] T306933: Enable configurable scroll and "load more" behavior in TypeaheadSearch - https://phabricator.wikimedia.org/T306933 [13:59:33] (PS1) Filippo Giunchedi: Fix /metrics ACL remoteip header [puppet] - https://gerrit.wikimedia.org/r/835623 (https://phabricator.wikimedia.org/T309703) [13:59:45] !log taavi@deploy1002 taavi and migr: Backport for [[gerrit:835590|Track use of Searchbox footer on Wikidata (T306933)]], [[gerrit:835591|Track use of Searchbox footer on Wikidata (T306933)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:59:53] MichaelG_WMDE: please test [14:00:05] Jhs and Lucas_WMDE: My dear minions, it's time we take the moon! Just kidding. 
Time for Maintenance script run deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1400). [14:00:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:16] o/, waiting for backports to finish [14:00:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:00:30] taavi can I test both? [14:00:34] yes [14:00:35] * MichaelG_WMDE looks at both [14:00:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P34952 and previous config saved to /var/cache/conftool/dbconfig/20220927-140034-ladsgroup.json [14:00:39] thanks! [14:00:41] * MichaelG_WMDE tests [14:01:49] can confirm both working and I see no errors! [14:01:55] cool, syncing [14:03:37] I’ll start doing dry-runs of the maintenance script already, to determine the number of rows affected [14:03:41] shouldn’t have any effect [14:04:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:04:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:04:31] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (Sprint 02): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (EChetty) [14:05:24] (CR) Filippo Giunchedi: "Thank you for working on this! 
See inline, LGTM overall" [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff) [14:06:13] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:835590|Track use of Searchbox footer on Wikidata (T306933)]], [[gerrit:835591|Track use of Searchbox footer on Wikidata (T306933)]] (duration: 06m 59s) [14:06:17] T306933: Enable configurable scroll and "load more" behavior in TypeaheadSearch - https://phabricator.wikimedia.org/T306933 [14:06:39] MichaelG_WMDE: ok, should be live [14:06:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:07:28] Lucas_WMDE: all done [14:07:34] thanks! [14:07:36] and _joe_ ^ [14:07:54] <_joe_> taavi: thanks, I'll expand to the rest of the cluster [14:08:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P34953 and previous config saved to /var/cache/conftool/dbconfig/20220927-140817-ladsgroup.json [14:08:20] _joe_: I think I’ll run my maintenance script with PHP=php7.4, does that sound okay to you? [14:08:42] <_joe_> Lucas_WMDE: it should make no difference in terms of ICU [14:08:45] <_joe_> so yes, go on [14:08:46] ah ok [14:08:51] I thought there might be a difference [14:08:52] ok :) [14:08:52] <_joe_> We'll switch pretty soon btw [14:09:15] hype hype hype [14:09:18] <_joe_> I'm switching the jobrunners right now, we might as well switch mwmaint next [14:10:50] SRE, SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (Devnull) I do not currently have a sponsor, how would I get one? 
[14:11:45] (PS1) Giuseppe Lavagetto: role::mediawiki::maintenance: switch to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/835629 (https://phabricator.wikimedia.org/T271736) [14:11:52] !log BEGIN lucaswerkmeister-wmde@mwmaint1002:~$ PHP=php7.4 mwscript updateCollation.php incubatorwiki --force # T315552 [14:11:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:56] T315552: Run updateCollation.php on the Wikimedia Incubator - https://phabricator.wikimedia.org/T315552 [14:12:01] (PS1) Papaul: Add new logstash nodes to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/835630 (https://phabricator.wikimedia.org/T313848) [14:13:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:13:24] seems to be running quite a bit faster than the mw.o documentation suggested, yay [14:13:34] (100k rows done now) [14:13:39] taavi: Thank you! (sorry for delayed response, office network problems...) 
[14:13:43] (out of ~670k) [14:13:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:13:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:15:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P34954 and previous config saved to /var/cache/conftool/dbconfig/20220927-141541-ladsgroup.json [14:16:41] SRE, Traffic, Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (BCornwall) While we have https://gerrit.wikimedia.org/r/c/operations/software/latency-measurement/+/833848 available for review, I hear that there'd be pushback for not having a... [14:17:13] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (Papaul) Open→Declined There is a decommission task for this node @T318689 to declining this task [14:17:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:04] SRE, ops-codfw, DBA, Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (Papaul) @willy it will not be possible to submit a RMA for this host, I have some decommissioned servers onsite i can check and see if we can pull some memory. 
[14:22:38] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@66dfa44]: (no justification provided) [14:23:25] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@66dfa44]: (no justification provided) (duration: 00m 46s) [14:23:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P34955 and previous config saved to /var/cache/conftool/dbconfig/20220927-142324-ladsgroup.json [14:23:29] (CR) Papaul: [C: +2] Add new logstash nodes to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/835630 (https://phabricator.wikimedia.org/T313848) (owner: Papaul) [14:24:12] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [14:24:55] SRE, Data Engineering Planning, Data-Engineering-Operations, Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (EChetty) p:Medium→High [14:25:06] !log END lucaswerkmeister-wmde@mwmaint1002:~$ PHP=php7.4 mwscript updateCollation.php incubatorwiki --force # T315552, 710183 rows done [14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:10] T315552: Run updateCollation.php on the Wikimedia Incubator - https://phabricator.wikimedia.org/T315552 [14:26:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2036.codfw.wmnet with OS buster [14:26:56] SRE, ops-codfw, DC-Ops, observability, Patch-For-Review: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash2036.codfw.wmnet with OS buster [14:27:18] I think that means we’re done with the maintenance script run window :) [14:28:53] (PS2) Arturo Borrero Gonzalez: toolforge: automated-tests: remove references to Debian Stretch [puppet] - 
https://gerrit.wikimedia.org/r/835612 (https://phabricator.wikimedia.org/T275864) [14:30:08] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Joe) Do you happen to have any further detail on the response headers and body you get whenever you receive a 429 response? it would help us identify which layer is returni... [14:30:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T314041)', diff saved to https://phabricator.wikimedia.org/P34956 and previous config saved to /var/cache/conftool/dbconfig/20220927-143047-ladsgroup.json [14:30:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [14:30:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:31:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [14:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34957 and previous config saved to /var/cache/conftool/dbconfig/20220927-143109-ladsgroup.json [14:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:35:33] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logstash2036.codfw.wmnet with OS buster [14:35:38] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash2036.codfw.wmnet with OS buster 
executed with errors: - logstash2036 (**F... [14:35:49] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265005, @Joe wrote: > Do you happen to have any further detail on the response headers and body you get whenever you receive a 429 response?... [14:38:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T314041)', diff saved to https://phabricator.wikimedia.org/P34958 and previous config saved to /var/cache/conftool/dbconfig/20220927-143831-ladsgroup.json [14:38:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:38:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:38:46] (CR) Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [14:38:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:40:33] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) Actually, I have some left from intentionally hitting them while testing the bot yesterday. ` array(37) { ["url"]=> string(131) "https://en.wikipedia.o... 
[14:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:47] (CR) Volans: [C: +1] "LGTM, minor nits inline" [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond) [14:43:49] (PS1) DLynch: MobileWebUIActions sample rate to 1 on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835635 (https://phabricator.wikimedia.org/T302108) [14:45:02] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) The bot has two IPs it works from. # 185.15.56.22 # 185.15.56.29 [14:46:55] (CR) Volans: "post-merge nit" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: Jbond) [14:47:22] (CR) Filippo Giunchedi: [C: +2] Fix /metrics ACL remoteip header [puppet] - https://gerrit.wikimedia.org/r/835623 (https://phabricator.wikimedia.org/T309703) (owner: Filippo Giunchedi) [14:51:40] (CR) Volans: [C: +1] "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff) [14:51:58] (PS1) Muehlenhoff: spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) [14:52:57] (CR) CI reject: [V: -1] spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [14:54:46] (CR) BCornwall: [C: +2] lvs: Convert ::lvs::configuration to a profile [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall) [14:56:06] !log 
mforns@deploy1002 Started deploy [airflow-dags/analytics@25dda27]: (no justification provided) [14:56:17] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@25dda27]: (no justification provided) (duration: 00m 11s) [14:56:26] (PS1) JMeybohm: Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) [14:56:38] (CR) BCornwall: [C: +2] lvs: Convert ::lvs::configuration to a profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall) [14:56:53] (CR) Hashar: [C: -1] "+ Moritz for the aptrepo config." [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [14:58:03] (PS2) Muehlenhoff: spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) [14:58:08] (PS3) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [14:58:11] (PS11) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [14:58:59] (CR) Hashar: [C: +1] "I love how the version can be passed as an argument to the profile and the Docker version being in sync across all distributions. 
That is" [puppet] - https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [14:59:22] (CR) Hashar: [C: +1] P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [14:59:40] (CR) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [15:00:25] (PS2) JMeybohm: Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) [15:00:46] (PS12) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [15:01:51] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [15:03:04] SRE, Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (dancy) @Tgr Can you confirm that this is still a problem? 
[15:03:31] (CR) Filippo Giunchedi: New cookbook to roll-restart/reboot Thanos frontends (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff) [15:04:16] (CR) JMeybohm: [C: +2] Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm) [15:04:57] (CR) BCornwall: [C: +2] Prometheus: Remove ATS gauge periods [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [15:06:48] (Merged) jenkins-bot: Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm) [15:06:54] (PS4) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [15:06:56] (PS13) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [15:07:28] (CR) Giuseppe Lavagetto: [C: +2] role::mediawiki::maintenance: switch to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/835629 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [15:11:09] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) > This ticket is two-fold. The first is a request for SRE to provide logs regarding queries originating from IABot, easily identified from the UA. @Cyberpower67... [15:19:14] (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [15:20:08] (CR) Brennen Bearnes: [C: +1] "Seems reasonable." 
[puppet] - https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682) (owner: Jelto) [15:21:48] (PS5) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [15:21:52] (PS14) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [15:22:34] (PS6) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [15:23:48] (PS15) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [15:24:42] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) As a reference, this change in behavior has been triggered by https://gerrit.wikimedia.org/r/c/operations/puppet/+/677872 [15:25:14] (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [15:25:41] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265171, @Vgutierrez wrote: >> This ticket is two-fold. The first is a request for SRE to provide logs regarding queries originating from IABo... [15:26:20] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [15:27:37] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [15:28:33] (CR) Clément Goubert: [C: -1] "Not restarting on file change is on purpose to avoid cold cache. Putting on hold." 
[puppet] - https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697) (owner: Clément Goubert) [15:29:53] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) > Without specific logs, I can't really assess if these aggressive requests can be optimized. I would recommend generating those logs on the IABot side [15:29:55] (PS7) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 [15:30:05] (PS1) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) [15:30:56] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265227, @Vgutierrez wrote: >> Without specific logs, I can't really assess if these aggressive requests can be optimized. > I would recommend... [15:33:15] SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265199, @Vgutierrez wrote: > As a reference, this change in behavior has been triggered by https://gerrit.wikimedia.org/r/c/operations/puppet... [15:34:15] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (BBlack) Copying over from T317249#8262220 - This is the replacement mapping of nodes + disks: | cp nodes | Current | Replacement | Disks | text | 21-26, 33, 34 | 37-44 | Single NVME | upload |... [15:34:21] (CR) Btullis: [C: +1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/831111 (owner: 10Muehlenhoff) [15:36:46] (03PS4) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) [15:36:48] (03PS5) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [15:36:49] (03PS5) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [15:37:33] (03PS16) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [15:38:46] (03CR) 10Dduvall: "Thanks for the review, Antoine. Your explanation helped me understand the distributions file much more clearly. I believe I've fixed up ev" [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [15:40:04] (03PS1) 10BBlack: cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) [15:41:02] (03CR) 10CI reject: [V: 04-1] cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: 10BBlack) [15:41:33] (03CR) 10Hnowlan: [C: 04-1] Update the logic to run test coverage (035 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [15:41:41] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [15:45:21] (03PS1) 10DLynch: Enable DiscussionTools reply button visual enhancements on cswiki+huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835648 
(https://phabricator.wikimedia.org/T315626) [15:45:22] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:51:28] (03CR) 10Hashar: [C: 03+1] "Nice, I think that is good now but Moritz would know for sure :] When deploying may you update the Docker package for thirdparty/ci on b" [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [15:54:30] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) @Vgutierrez is there an explanation somewhere why the Cloud VPS IP range was removed from this list? Is it possible to add IABot IPs back on until we can ge... [15:54:56] (03CR) 10Arturo Borrero Gonzalez: ceph.bootstrap_and_add: fix _wait_for_osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [15:58:39] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Vgutierrez) >>! In T318065#8265344, @Cyberpower678 wrote: > @Vgutierrez is there an explanation somewhere why the Cloud VPS IP range was removed from this list? Is it poss... [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:22] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:01:26] (03CR) 10David Caro: [C: 03+1] "lgtm, just some naming nits" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [16:02:00] (03CR) 10David Caro: [C: 03+1] ceph.bootstrap_and_add: fix _wait_for_osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: 10FNegri) [16:03:04] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10BBlack) >>! In T318065#8265200, @Cyberpower678 wrote: > IABot workers run independently of each other. Each worker runs on a single wiki and minds it's own business. So t... [16:05:34] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Horsey - https://phabricator.wikimedia.org/T318729 (10MHorsey-WMF) [16:06:37] (03PS2) 10BBlack: cache node disk layout p11n for F4 config [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) [16:06:40] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) >>! In T318065#8265366, @BBlack wrote: >>>! In T318065#8265200, @Cyberpower678 wrote: >> IABot workers run independently of each other. Each worker runs on... 
[16:07:37] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Horsey - https://phabricator.wikimedia.org/T318729 (10MHorsey-WMF) [16:07:59] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:25] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:13:45] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10ayounsi) >>! In T318065#8265347, @Vgutierrez wrote: > that would be a question for @ayounsi / @cmooney from the netops team and/or @Andrew from WMCS Context is in T265864,... [16:17:49] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:52] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) >>! In T318065#8265422, @ayounsi wrote: >>>! In T318065#8265347, @Vgutierrez wrote: >> that would be a question for @ayounsi / @cmooney from the netops team... [16:18:13] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10BBlack) >>! In T318065#8265397, @Cyberpower678 wrote: >>>! In T318065#8265366, @BBlack wrote: >>>>! In T318065#8265200, @Cyberpower678 wrote: >>> IABot workers run independ... 
[16:21:30] (03Abandoned) 10BBlack: Add wikifunctions to MW canonical redirects [puppet] - 10https://gerrit.wikimedia.org/r/822455 (https://phabricator.wikimedia.org/T275904) (owner: 10BBlack) [16:22:21] (03PS2) 10BBlack: Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) [16:23:50] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) >>! In T318065#8265446, @BBlack wrote: >>>! In T318065#8265397, @Cyberpower678 wrote: >>>>! In T318065#8265366, @BBlack wrote: >>>>>! In T318065#8265200, @Cy... [16:29:14] (03PS17) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [16:30:29] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:27] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (10dcaro) [16:51:06] (03CR) 10Vgutierrez: [C: 03+1] cache node disk layout p11n for F4 config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: 10BBlack) [16:55:25] (03PS18) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [16:57:48] (03PS19) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [17:08:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet [17:09:43] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:15:47] 
(03PS1) 10Andrew Bogott: Make cloudnet100[56] into cloudnet nodes [puppet] - 10https://gerrit.wikimedia.org/r/835657 (https://phabricator.wikimedia.org/T316284) [17:19:03] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:34] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1003.eqiad.wmnet [17:19:44] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Jdforrester-WMF) (Tech lead confirmation, if it's needed.) [17:23:31] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835659 [17:23:34] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835660 [17:26:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet [17:28:15] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest[1001-1002].eqiad.wmnet [17:28:20] (03PS20) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [17:29:22] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest[1001-1002].eqiad.wmnet [17:31:43] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:31:46] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [17:38:22] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1003.eqiad.wmnet [17:38:50] !log andrew@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host cloudvirt-wdqs1002.eqiad.wmnet [17:39:21] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1001.eqiad.wmnet [17:41:39] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835659 (owner: 10PipelineBot) [17:41:58] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835660 (owner: 10PipelineBot) [17:42:55] (03PS8) 10Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [17:45:13] (03PS21) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [17:45:26] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835660 (owner: 10PipelineBot) [17:46:56] (03CR) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [17:47:30] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [17:47:51] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [17:48:15] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [17:48:43] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [17:48:47] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [17:49:01] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [17:49:17] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [17:50:14] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1002.eqiad.wmnet 
[17:50:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1001.eqiad.wmnet [17:52:31] (03PS22) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [17:55:16] (03PS8) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [17:56:11] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [17:57:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:57:38] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835667 [17:58:38] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [18:00:04] brennen and jnuche: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1800). 
[18:00:06] (03CR) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [18:01:28] o/ [18:02:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:02:19] !log 1.40.0-wmf.3 (T314192) no current blockers, promoting to group0 [18:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:23] T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192 [18:03:47] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835670 (https://phabricator.wikimedia.org/T314192) [18:03:49] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835670 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [18:05:08] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835670 (https://phabricator.wikimedia.org/T314192) (owner: 10TrainBranchBot) [18:08:59] (03PS23) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [18:09:29] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:33] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.3 refs T314192 [18:09:37] T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192 [18:09:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[18:12:51] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [18:14:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:14:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:14:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[28-39].eqiad.wmnet - https://phabricator.wikimedia.org/T318691 (10wiki_willy) a:03Jclark-ctr [18:15:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:17:39] (03PS24) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [18:19:55] (03PS9) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [18:21:06] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [18:22:49] (03CR) 10Muehlenhoff: aptrepo: add docker packages to thirdparty/ci for bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [18:23:26] (03PS10) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [18:23:28] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [18:26:47] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [18:27:11] (03PS25) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 
10https://gerrit.wikimedia.org/r/835168 [18:29:11] (03PS5) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) [18:29:13] (03PS6) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [18:29:15] (03PS6) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [18:29:17] (03PS26) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [18:30:14] (03PS11) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [18:30:48] (03CR) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [18:34:16] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [18:34:18] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [18:35:04] 10SRE-swift-storage, 10Commons, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-File-management, and 3 others: MediaWiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (10Krinkle) [18:35:59] (03PS12) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [18:36:22] (03PS1) 10Subramanya Sastry: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) [18:39:19] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [18:42:50] (03PS2) 10Ryan Kemper: admin: ryankemper update shell to zsh [puppet] - 10https://gerrit.wikimedia.org/r/834515 (owner: 10Jbond) [18:43:43] (03CR) 10Gehel: "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [19:05:34] (03PS1) 10Subramanya Sastry: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835594 (https://phabricator.wikimedia.org/T318727) [19:06:36] (03CR) 10Ryan Kemper: [C: 03+2] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:16:50] (03PS3) 10DDesouza: Deploy Research Incentive survey on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834042 (https://phabricator.wikimedia.org/T318328) [19:16:54] (03PS3) 10DDesouza: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) [19:19:55] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Jrbranaa) Manager Approval if needed. 
[19:34:03] (03PS1) 10Ryan Kemper: Revert "Mount labstore to wcqs/wdqs instance for dumps reload" [puppet] - 10https://gerrit.wikimedia.org/r/835595 [19:34:16] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "Mount labstore to wcqs/wdqs instance for dumps reload" [puppet] - 10https://gerrit.wikimedia.org/r/835595 (owner: 10Ryan Kemper) [19:38:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:43:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [19:43:15] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:43:52] (03PS1) 10Stang: romdwikimedia: Enable subpages in NS0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) [19:45:47] (03PS1) 10JHathaway: dup otrs dummy password to vrts for rename [labs/private] - 10https://gerrit.wikimedia.org/r/835682 [19:46:13] (03PS1) 10Ryan Kemper: Revert "Revert "Mount labstore to wcqs/wdqs instance for dumps reload"" [puppet] - 10https://gerrit.wikimedia.org/r/835596 [19:48:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye [19:48:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host 
centrallog1002.eqiad.wmnet with OS bullseye [19:48:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [19:49:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [19:49:04] (03PS2) 10Ryan Kemper: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) [19:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T314041)', diff saved to https://phabricator.wikimedia.org/P34966 and previous config saved to /var/cache/conftool/dbconfig/20220927-194908-ladsgroup.json [19:49:13] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:50:37] (03CR) 10Ryan Kemper: "@David - This commit is the same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/832543 but with the addition of https://gerrit.wi" [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: 10Ryan Kemper) [19:51:04] (03CR) 10JHathaway: [C: 03+2] dup otrs dummy password to vrts for rename [labs/private] - 10https://gerrit.wikimedia.org/r/835682 (owner: 10JHathaway) [19:51:06] (03CR) 10JHathaway: [V: 03+2 C: 03+2] dup otrs dummy password to vrts for rename [labs/private] - 10https://gerrit.wikimedia.org/r/835682 (owner: 10JHathaway) [19:51:40] (03CR) 10Ryan Kemper: [C: 03+2] "Thanks @Jbond!" [puppet] - 10https://gerrit.wikimedia.org/r/834515 (owner: 10Jbond) [19:59:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T2000) [20:00:05] kemayo, ryankemper, subbu, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] o/ [20:00:13] \o, around [20:00:22] 👋🏻 [20:00:25] o/ [20:00:45] hey all! :) [20:00:50] * TheresNoTime can deploy! [20:00:57] \o/ [20:01:08] (gimme a sec) [20:02:13] Kemayo: I'll start with your patches :) [20:02:41] Sounds good [20:02:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage [20:02:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835635 (https://phabricator.wikimedia.org/T302108) (owner: 10DLynch) [20:02:57] o/ [20:03:11] hi TheresNoTime, looks like you've it all in your hands :) [20:03:25] urbanecm: yup ^^ [20:03:47] (03Merged) 10jenkins-bot: MobileWebUIActions sample rate to 1 on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835635 (https://phabricator.wikimedia.org/T302108) (owner: 10DLynch) [20:04:16] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835635|MobileWebUIActions sample rate to 1 on testwiki (T302108)]] [20:04:20] T302108: Ensure logging is in place to compare MobileFrontend and DiscussionTools new topic and new comment completion rates - https://phabricator.wikimedia.org/T302108 [20:04:40] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:835635|MobileWebUIActions sample rate to 1 on testwiki (T302108)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:04:45] Kemayo: 835635 is live on 1002 ^ [20:05:19] TheresNoTime: Looks good there. 
[20:05:26] Syncing [20:06:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:07:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:05] (03PS2) 10Samtar: Enable DiscussionTools reply button visual enhancements on cswiki+huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626) (owner: 10DLynch) [20:08:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:03] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835635|MobileWebUIActions sample rate to 1 on testwiki (T302108)]] (duration: 05m 46s) [20:10:07] T302108: Ensure logging is in place to compare MobileFrontend and DiscussionTools new topic and new comment completion rates - https://phabricator.wikimedia.org/T302108 [20:10:11] Kemayo: that's sync'd if you want to check again, moving onto 835648 [20:10:55] Continues to look good off-debug. [20:11:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626) (owner: 10DLynch) [20:13:15] (CI feeling a bit slow this evening...) 
[20:13:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:59] (03Merged) 10jenkins-bot: Enable DiscussionTools reply button visual enhancements on cswiki+huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626) (owner: 10DLynch) [20:15:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:21] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835648|Enable DiscussionTools reply button visual enhancements on cswiki+huwiki (T315626)]] [20:15:24] T315626: [Config Change] Add Clear Affordances to beta feature at partner wikis (desktop) - https://phabricator.wikimedia.org/T315626 [20:15:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host centrallog1002.eqiad.wmnet with OS bullseye [20:15:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye completed: -... [20:15:45] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:835648|Enable DiscussionTools reply button visual enhancements on cswiki+huwiki (T315626)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:15:59] Kemayo: on mwdebug :) [20:16:06] TheresNoTime: It looks good there. 
[20:16:21] syncin' [20:16:39] (03PS2) 10Samtar: Disable MobileFrontend default editor a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: 10DLynch) [20:16:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Cmjohnson) [20:17:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Cmjohnson) 05Open→03Resolved [20:18:05] (03PS1) 10JHathaway: Fix config template for OTRS or VRTS aliases [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) [20:18:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Cmjohnson) @Joe which partman recipe do you need for these? [20:19:37] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [20:19:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [20:20:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:20] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835648|Enable DiscussionTools reply button visual enhancements on cswiki+huwiki (T315626)]] (duration: 04m 58s) [20:20:34] (03CR) 10CI reject: [V: 04-1] Fix config template for OTRS or VRTS aliases [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: 10JHathaway) [20:20:46] Kemayo: (same again while I set 835206 going) :D [20:21:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: 10DLynch) [20:21:10] TheresNoTime: Yup, good off-debug. [20:21:19] RECOVERY - Check systemd state on cloudbackup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:53] (03Merged) 10jenkins-bot: Disable MobileFrontend default editor a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: 10DLynch) [20:22:03] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:22:18] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835206|Disable MobileFrontend default editor a/b test (T302356)]] [20:22:21] T302356: Deploy config change to "turn off" mobile VE A/B test - https://phabricator.wikimedia.org/T302356 [20:22:31] (03PS4) 10Samtar: elastic: rebalance enwiki_content shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [20:23:16] (03PS2) 10JHathaway: Fix config template for OTRS or VRTS aliases [puppet] - 10https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) [20:23:55] Interesting... scap just err'd while doing 835206.. `'mwscript eval.php --wiki aawiki' generated unexpected output: Notice: Undefined variable: wmgMFDefaultEditor in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 2828` [20:24:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: 10DLynch) [20:24:35] just going to try it again.. 
[20:24:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:24:45] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835206|Disable MobileFrontend default editor a/b test (T302356)]] [20:25:37] Kemayo: ^ FYI.. going to try doing it manually [20:25:47] I can amend the patch -- I can see why it'd happen. [20:25:58] ah, yes please then :) [20:26:05] Ah, but already merged. New patch I guess, one second! [20:26:33] (03CR) 10Samtar: "Scap failure on deploy: `'mwscript eval.php --wiki aawiki' generated unexpected output: Notice: Undefined variable: wmgMFDefaultEditor in " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: 10DLynch) [20:27:40] (03PS1) 10DLynch: Add wmgMFDefaultEditor back in for future use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835689 [20:27:50] (03CR) 10CI reject: [V: 04-1] Add wmgMFDefaultEditor back in for future use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835689 (owner: 10DLynch) [20:28:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:37] Kemayo: I'm not entirely sure where in the scap process this failed (it's prior to deployment to medebug) so I'd like to do a revert of 835206 to get us back to a known state. You're doing an entirely new patch, correct? [20:28:47] *mwdebug [20:29:08] TheresNoTime: Sure, go for it. I can do the whole thing again in another backport window rather than delaying the others. 
[20:29:09] TheresNoTime: It failed before syncing out [20:29:41] (03PS2) 10DLynch: Add wmgMFDefaultEditor back in for future use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835689 [20:29:54] Kemayo: Okay, good idea, unless dancy you have a different suggestion I'm going to revert 835206 [20:30:06] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:30:09] TheresNoTime: I do have https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/835689 as a followup that should probably fix it. [20:30:35] ack, looking, could just merge that and go from there.. [20:30:43] It's too bad that such a change passed CI [20:31:05] dancy: second opinion on merging 835689 and proceeding? [20:31:23] Seems reasonable to merge. [20:31:32] ack, will do [20:31:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835689 (owner: 10DLynch) [20:32:04] dancy: Yeah, it's presumably because the spot that gives a warning is one that relies on the config's whole setting-lots-of-globals behavior, so it's relatively hard to test without actually running the file... which I assume we don't do in this repo. 
[20:32:33] (03Merged) 10jenkins-bot: Add wmgMFDefaultEditor back in for future use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835689 (owner: 10DLynch) [20:32:56] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835689|Add wmgMFDefaultEditor back in for future use]] [20:33:12] (that worked) [20:33:20] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:835689|Add wmgMFDefaultEditor back in for future use]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:33:28] (03CR) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [20:33:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:34] Kemayo: on mwdebug :) [20:33:56] (03PS5) 10Samtar: elastic: rebalance enwiki_content shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [20:34:15] ryankemper: it'll be your patch next fyi [20:34:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:34] TheresNoTime: Looks good there. [20:34:39] TheresNoTime: cool. 
I don't have checks to run on debug so you can sync it fully when ready to [20:34:46] Kemayo: syncin' [20:34:51] ryankemper: ack :) [20:35:28] (03PS1) 10Bking: k8s: Limit envoy metrics scraped from k8s [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) [20:35:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:38:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:59] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835689|Add wmgMFDefaultEditor back in for future use]] (duration: 06m 02s) [20:39:21] Kemayo: all sync'd :) [20:39:48] TheresNoTime: Looks good. Sorry for the need to scramble a bit there! [20:40:00] no worries! :D [20:40:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [20:40:52] (03Merged) 10jenkins-bot: elastic: rebalance enwiki_content shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [20:41:16] koi: I'm going to do your patch next just fyi :) [20:41:17] !log samtar@deploy1002 Started scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] [20:41:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: 10Bking) [20:41:21] T318270: Avoid overloading individual Elastic nodes with popular shards - https://phabricator.wikimedia.org/T318270 [20:41:41] !log samtar@deploy1002 samtar and ryankemper: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:41:54] TheresNoTime, I assume 
after that will be my patches? [20:42:10] (syncin' 833860) [20:42:22] thanks! [20:42:44] subbu: I was going to leave yours until last as I believe they can take a little while to merge in comparison to the config patches :) [20:43:13] sounds good. [20:43:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:43:45] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:44:11] depending on if koi is around when this one finishes syncing of course :) [20:44:47] TheresNoTime: I'm around. [20:44:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34967 and previous config saved to /var/cache/conftool/dbconfig/20220927-204446-ladsgroup.json [20:44:51] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:44:55] ^^ [20:45:15] (03PS2) 10Samtar: romdwikimedia: Enable subpages in NS0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) (owner: 10Stang) [20:45:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mc-wf1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:45:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mc-wf1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:46:31] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] (duration: 05m 14s) [20:46:34] T318270: Avoid overloading individual Elastic nodes with popular shards - https://phabricator.wikimedia.org/T318270 [20:46:37] ryankemper: all sync'd :) [20:47:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) 
(owner: 10Stang) [20:48:03] (03Merged) 10jenkins-bot: romdwikimedia: Enable subpages in NS0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) (owner: 10Stang) [20:48:27] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835681|romdwikimedia: Enable subpages in NS0 (T318491)]] [20:48:31] T318491: Enable subpages in NS_MAIN on romd.wikimedia.org - https://phabricator.wikimedia.org/T318491 [20:48:51] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:835681|romdwikimedia: Enable subpages in NS0 (T318491)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:48:56] koi: live on mwdebug [20:49:49] TheresNoTime: subpages in ns0 are correctly shown, so LGTM [20:49:55] syncing [20:50:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:50:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:51:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:52:36] subbu: thank you for waiting, and apologies for keeping you around until the last minute.. I'm going to start with 835593 once this finishes syncing [20:52:48] ok. [20:53:09] actually lets start with 835594 ... wmf.2 [20:53:15] sure :) [20:53:20] that lets me verify that the patch actually fixes the bug. [20:53:41] wmf.3 isn't on the right wikis yet where this bug kicks in. 
[20:53:56] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835681|romdwikimedia: Enable subpages in NS0 (T318491)]] (duration: 05m 29s) [20:53:58] ack :) and koi, all sync'd [20:54:00] T318491: Enable subpages in NS_MAIN on romd.wikimedia.org - https://phabricator.wikimedia.org/T318491 [20:54:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:12] thanks! [20:54:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/TextExtracts] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835594 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry) [20:56:05] subbu: I'm happy to keep the deployment window open until your patches are deployed, if you're happy to stick around? [20:56:12] yes. [20:56:22] thanks! 
:) [20:56:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:57:01] it's the least I can do :) 835594 is now merging, ~12 minutes [20:57:34] (03Merged) 10jenkins-bot: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835594 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry) [20:57:38] In case you want to speed things up in the future, you can +2 ahead of time while your other patches are syncing and still use scap backport to finish the job [20:57:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:57:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:58:01] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835594|Remove figures from text extracts (T318727)]] [20:58:05] T318727: Recent update caused image title to appear in text extracts - https://phabricator.wikimedia.org/T318727 [20:58:12] (that was a quick 12 minutes...) [20:58:24] jeena: oh good idea, thank you! [20:58:25] !log samtar@deploy1002 samtar and ssastry: Backport for [[gerrit:835594|Remove figures from text extracts (T318727)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:58:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:58:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:58:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:58:47] subbu: this is live on mwdebug1002, could you test? :) [20:58:49] np :) [20:58:55] on it. 
[20:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:48] !log extending UTC late backport window [20:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34968 and previous config saved to /var/cache/conftool/dbconfig/20220927-205953-ladsgroup.json [21:00:48] verified fixed. [21:00:52] okay to sync. [21:00:56] great, syncing [21:01:41] the other one to wmf.3 can be merged and synced as well .. it will just ride the train this week to those affected wikis. [21:02:10] (03CR) 10Samtar: [C: 03+2] "deploy" [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry) [21:02:24] (ack) [21:03:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:04:12] (03Merged) 10jenkins-bot: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry) [21:05:00] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835594|Remove figures from text extracts (T318727)]] (duration: 06m 58s) [21:05:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry) [21:05:42] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835593|Remove figures from text extracts 
(T318727)]] [21:06:06] !log samtar@deploy1002 samtar and ssastry: Backport for [[gerrit:835593|Remove figures from text extracts (T318727)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:06:18] subbu: did you want to test 835593 as well, or is there nothing you're able to test on wmf.3 wikis? [21:06:35] T318727: Recent update caused image title to appear in text extracts - https://phabricator.wikimedia.org/T318727 [21:06:37] no, nothing to test with that one. okay to sync. [21:06:42] ack [21:08:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:08:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:09:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:10:35] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835593|Remove figures from text extracts (T318727)]] (duration: 04m 53s) [21:10:52] subbu: all deployed :) thanks again for your patience! 
[21:10:59] \o/ ty [21:12:10] !log closing UTC late backport window [21:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:14:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:15:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34969 and previous config saved to /var/cache/conftool/dbconfig/20220927-211500-ladsgroup.json [21:15:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:19:29] (03PS1) 10Cmjohnson: adding mc-wf to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835701 (https://phabricator.wikimedia.org/T313963) [21:21:41] (03CR) 10Cmjohnson: [C: 03+2] adding mc-wf to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835701 (https://phabricator.wikimedia.org/T313963) (owner: 10Cmjohnson) [21:23:53] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:30:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34970 and previous config saved to /var/cache/conftool/dbconfig/20220927-213006-ladsgroup.json [21:30:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [21:30:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:30:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [21:30:29] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1119 (T314041)', diff saved to https://phabricator.wikimedia.org/P34971 and previous config saved to /var/cache/conftool/dbconfig/20220927-213028-ladsgroup.json [21:44:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mc-wf1001.eqiad.wmnet with OS bullseye [21:44:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye [21:47:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mc-wf1002.eqiad.wmnet with OS bullseye [21:47:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye [21:55:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [21:58:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage [21:58:33] (03PS1) 10Ebernhardson: dumpcirrussearch.sh: Replace gzip with lbzip2 [puppet] - 10https://gerrit.wikimedia.org/r/835705 [21:58:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [22:02:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage [22:03:12] (03CR) 10Ebernhardson: "I'm not sure if it would be appropriate to maintain both .gz and .bz2 files here (like wikidata dumps do). 
Not opposed, but not sure if i" [puppet] - 10https://gerrit.wikimedia.org/r/835705 (owner: 10Ebernhardson) [22:13:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1001.eqiad.wmnet with OS bullseye [22:13:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye completed: - mc-wf1001 (**PASS**... [22:16:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1002.eqiad.wmnet with OS bullseye [22:17:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye completed: - mc-wf1002 (**PASS**... [22:24:41] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:18:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Cmjohnson) [23:19:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Cmjohnson) 05Open→03Resolved @joe all yours, figured it to be the same partman recipe as memcache