[00:03:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:04:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34922 and previous config saved to /var/cache/conftool/dbconfig/20220927-000434-ladsgroup.json
[00:04:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[00:04:40] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[00:04:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[00:04:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[00:04:55] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1005.wikimedia.org
[00:05:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[00:05:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34923 and previous config saved to /var/cache/conftool/dbconfig/20220927-000525-ladsgroup.json
[00:07:24] <wikibugs>	 SRE, InternetArchiveBot: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) IABot is now handling 429 but I still would like access to the request logs for IABot.
[00:08:30] <wikibugs>	 SRE, InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (Cyberpower678) p:Triage→Medium
[00:10:37] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:13:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet
[00:13:55] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudnet1005.eqiad.wmnet
[00:15:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet
[00:15:08] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudnet1005.eqiad.wmnet
[00:16:10] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org
[00:23:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:24:47] <wikibugs>	 (PS1) Stang: votewiki: Change wgLanguageCode to zh for Sep 2022 admins election [mediawiki-config] - https://gerrit.wikimedia.org/r/835291 (https://phabricator.wikimedia.org/T318147)
[00:28:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:31:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1006.wikimedia.org
[00:32:00] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1005.wikimedia.org
[00:35:57] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:40:22] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.wikimedia.org
[00:42:46] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1006.wikimedia.org
[00:50:12] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1007.wikimedia.org
[00:53:49] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (Papaul)
[00:57:38] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (Papaul) Open→Resolved @MoritzMuehlenhoff All yours
[01:03:17] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:03:19] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul) a:Papaul
[01:08:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul)
[01:15:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T314041)', diff saved to https://phabricator.wikimedia.org/P34924 and previous config saved to /var/cache/conftool/dbconfig/20220927-011543-ladsgroup.json
[01:15:48] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[01:17:29] <icinga-wm>	 PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:03] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:30:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34925 and previous config saved to /var/cache/conftool/dbconfig/20220927-013050-ladsgroup.json
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34926 and previous config saved to /var/cache/conftool/dbconfig/20220927-014556-ladsgroup.json
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:05] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0200)
[02:01:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T314041)', diff saved to https://phabricator.wikimedia.org/P34927 and previous config saved to /var/cache/conftool/dbconfig/20220927-020103-ladsgroup.json
[02:01:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[02:01:07] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:01:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[02:01:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T314041)', diff saved to https://phabricator.wikimedia.org/P34928 and previous config saved to /var/cache/conftool/dbconfig/20220927-020124-ladsgroup.json
[02:04:33] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:04:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:05:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:06:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:07:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.3 [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835301 (https://phabricator.wikimedia.org/T314192)
[02:07:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.3 [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835301 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot)
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:25] <icinga-wm>	 RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:23] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.3 [core] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835301 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot)
[02:32:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:34:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:34:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:35:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0300)
[03:01:13] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/835302 (https://phabricator.wikimedia.org/T314192)
[03:01:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/835302 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot)
[03:01:33] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:57] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/835302 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot)
[03:02:25] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.3  refs T314192
[03:02:29] <stashbot>	 T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192
[03:06:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:07:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:07:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:07:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:22:51] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:38:26] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.3  refs T314192 (duration: 36m 01s)
[03:38:30] <stashbot>	 T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192
[03:40:31] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.1 (duration: 02m 03s)
[03:43:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:46:59] <icinga-wm>	 PROBLEM - Check systemd state on dbprov1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:51:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:57:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:10:07] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:37] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:21:03] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:24:05] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:41:11] <icinga-wm>	 RECOVERY - Check systemd state on dbprov1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:52:45] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:52:53] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:02:06] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (Marostegui) Thanks @papaul - I think these will be handled by @jcrespo :-)
[05:02:23] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::canary_api: remove php 7.2 [puppet] - https://gerrit.wikimedia.org/r/835506
[05:03:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:04:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37355/console" [puppet] - https://gerrit.wikimedia.org/r/835506 (owner: Giuseppe Lavagetto)
[05:06:55] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:07:30] <wikibugs>	 (PS1) Marostegui: control-mariadb-10.6-bullseye: Upgrade 10.6.10 [software] - https://gerrit.wikimedia.org/r/835508 (https://phabricator.wikimedia.org/T318128)
[05:08:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:08:54] <wikibugs>	 SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Thanks John. I am leaving the host ON, but mysql stopped, so you can proceed and power it off anytime you want to swap the new DIMM.
[05:12:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:23:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:28:33] <marostegui>	 !log Install 10.6.10 on db1124, db1125, pc1014, pc2014 T318128
[05:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:37] <stashbot>	 T318128: Compile and install  MariaDB 10.6.10 - https://phabricator.wikimedia.org/T318128
[05:28:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:32:23] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui)
[05:33:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:38:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:45:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:52:19] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:58:19] <wikibugs>	 SRE, ops-codfw, DBA, Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (Marostegui) a:jcrespo→Papaul Assigning to @Papaul per T318062#8247109
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0600).
[06:00:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:12:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:17:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:26:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:37:35] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:46:15] <wikibugs>	 (PS16) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[06:49:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34930 and previous config saved to /var/cache/conftool/dbconfig/20220927-064925-ladsgroup.json
[06:49:29] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:50:59] <wikibugs>	 (CR) Ayounsi: [C: +2] sre.network.peering: initial commit (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[06:52:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (MoritzMuehlenhoff) Thanks!
[06:54:34] <wikibugs>	 (Merged) jenkins-bot: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[06:57:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:58:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'show' for AS: 8220
[06:59:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'show' for AS: 8220
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34932 and previous config saved to /var/cache/conftool/dbconfig/20220927-070431-ladsgroup.json
[07:06:41] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:11:25] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:19:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34933 and previous config saved to /var/cache/conftool/dbconfig/20220927-071938-ladsgroup.json
[07:22:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:22:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:25:51] <wikibugs>	 (CR) JMeybohm: [C: +1] Add golang 1.18 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 (owner: Majavah)
[07:27:38] <wikibugs>	 (CR) JMeybohm: [C: +1] services_proxy: add a keepalive timeout for image-suggestion (1 comment) [puppet] - https://gerrit.wikimedia.org/r/835205 (https://phabricator.wikimedia.org/T313973) (owner: Giuseppe Lavagetto)
[07:30:08] <taavi>	 jayme: may I ask you to merge/build that go 1.18 image? that requires ops access which I don't have
[07:30:37] <jayme>	 taavi: oh, sorry. Sure! give me a minute
[07:31:17] <taavi>	 thanks!
[07:31:37] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Add golang 1.18 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 (owner: Majavah)
[07:34:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T314041)', diff saved to https://phabricator.wikimedia.org/P34934 and previous config saved to /var/cache/conftool/dbconfig/20220927-073441-ladsgroup.json
[07:34:46] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[07:34:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34935 and previous config saved to /var/cache/conftool/dbconfig/20220927-073451-ladsgroup.json
[07:34:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[07:35:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[07:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T314041)', diff saved to https://phabricator.wikimedia.org/P34936 and previous config saved to /var/cache/conftool/dbconfig/20220927-073523-ladsgroup.json
[07:36:33] <jayme>	 !log published image docker-registry.discovery.wmnet/golang1.18:1.18-1
[07:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:37] <jayme>	 taavi: ^
[07:39:32] <moritzm>	 !log uploaded expat 2.2.0-2+deb9u5+wmf1 to apt.wikimedia.org/stretch-wikimedia
[07:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:15] <wikibugs>	 (CR) Ayounsi: [C: -1] "Is there a way to know what the final config file is going to look like?" [puppet] - https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: Muehlenhoff)
[07:48:09] <moritzm>	 !log installing expat security updates on stretch/buster/bullseye
[07:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:35] <XioNoX>	 !log upgrade python3-pynetbox to 6.6.0 on cumin2002 - T310745
[07:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:39] <stashbot>	 T310745: Upgrade pynetbox - https://phabricator.wikimedia.org/T310745
[07:49:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P34937 and previous config saved to /var/cache/conftool/dbconfig/20220927-074948-ladsgroup.json
[07:52:55] <XioNoX>	 !log upgrade python3-pynetbox to 6.6.0 on cumin1001 - T310745
[07:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.thumbor rolling restart_daemons on A:thumbor-codfw
[07:56:50] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:57:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.thumbor (exit_code=0) rolling restart_daemons on A:thumbor-codfw
[07:58:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.thumbor rolling restart_daemons on A:thumbor-eqiad
[08:00:20] <wikibugs>	 (CR) Hashar: Ship WMF-specific systemd unit parts as systemd override (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: Muehlenhoff)
[08:00:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.thumbor (exit_code=0) rolling restart_daemons on A:thumbor-eqiad
[08:04:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P34938 and previous config saved to /var/cache/conftool/dbconfig/20220927-080454-ladsgroup.json
[08:05:36] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti2031 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/835553 (https://phabricator.wikimedia.org/T313857)
[08:08:11] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:10:37] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:13:24] <wikibugs>	 (Abandoned) Hashar: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/833844 (owner: PipelineBot)
[08:15:12] <moritzm>	 !log restarting apache/FPM on mw canaries to pick up Expat security updates
[08:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:20:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T314041)', diff saved to https://phabricator.wikimedia.org/P34941 and previous config saved to /var/cache/conftool/dbconfig/20220927-082001-ladsgroup.json
[08:20:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[08:20:06] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[08:20:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[08:20:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T314041)', diff saved to https://phabricator.wikimedia.org/P34942 and previous config saved to /var/cache/conftool/dbconfig/20220927-082023-ladsgroup.json
[08:20:46] <wikibugs>	 (CR) Clément Goubert: C:rsync::server:  convert to concat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: Jbond)
[08:23:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:25:28] <wikibugs>	 (PS4) Jbond: lvs: Convert ::lvs::configuration to a profile [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall)
[08:26:12] <wikibugs>	 (PS5) Jbond: lvs: Convert ::lvs::configuration to a profile [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall)
[08:27:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] hieradata: remove ms-be10[28-39] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/835106 (https://phabricator.wikimedia.org/T294550) (owner: MVernon)
[08:27:30] <wikibugs>	 (CR) Jbond: [C: +1] lvs: Convert ::lvs::configuration to a profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall)
[08:28:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[08:29:18] <wikibugs>	 SRE, Traffic, Upstream: ATS wrongly parses requests without a leading / - https://phabricator.wikimedia.org/T317660 (Vgutierrez)
[08:29:51] <wikibugs>	 (CR) Jbond: [V: +1 C: +1] "PCC SUCCESS (NOOP 4 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37356/console" [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall)
[08:30:04] <wikibugs>	 SRE, Traffic, Upstream: ATS wrongly parses requests without a leading / - https://phabricator.wikimedia.org/T317660 (Vgutierrez) Making the task public after cleaning IP addresses from the original request that helped detecting the issue and after checking with upstream that this isn't a security bug
[08:30:51] <wikibugs>	 (CR) Vgutierrez: [C: +2] Release 9.1.3-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/834045 (https://phabricator.wikimedia.org/T317660) (owner: Vgutierrez)
[08:33:53] <wikibugs>	 (CR) MVernon: [C: +2] hieradata: remove ms-be10[28-39] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/835106 (https://phabricator.wikimedia.org/T294550) (owner: MVernon)
[08:34:46] <wikibugs>	 (CR) Jbond: [C: +1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/835195 (owner: Muehlenhoff)
[08:36:43] <wikibugs>	 (PS9) Jbond: C:rsync::server:  convert to concat [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618)
[08:37:13] <wikibugs>	 (CR) Jbond: C:rsync::server:  convert to concat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: Jbond)
[08:37:31] <wikibugs>	 (PS9) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[08:47:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/835195 (owner: Muehlenhoff)
[08:52:02] <wikibugs>	 (CR) Vgutierrez: Unlink certificate renewal and OCSP handling (1 comment) [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: BCornwall)
[08:57:16] <wikibugs>	 (PS1) Filippo Giunchedi: grafana: block external access to /metrics [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703)
[08:57:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:59:37] <wikibugs>	 (CR) CI reject: [V: -1] grafana: block external access to /metrics [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: Filippo Giunchedi)
[09:00:07] <wikibugs>	 (PS2) Filippo Giunchedi: grafana: block external access to /metrics [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703)
[09:01:42] <wikibugs>	 (PS3) Filippo Giunchedi: grafana: block external access to /metrics [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703)
[09:02:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:03:35] <vgutierrez>	 hmmm is wikibugs down?
[09:03:40] <vgutierrez>	 or just lagged?
[09:05:18] <godog>	 not sure about phab but for gerrit it was pretty okay with my last update to https://gerrit.wikimedia.org/r/835559
[09:05:29] <godog>	 in terms of lag that is
[09:05:32] <wikibugs>	 (CR) Vgutierrez: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/834525 (https://phabricator.wikimedia.org/T317051) (owner: Vgutierrez)
[09:05:52] <vgutierrez>	 yeah.. that one was immediate
[09:06:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37357/console" [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[09:07:01] <wikibugs>	 (CR) Filippo Giunchedi: "Today we got yet another report of /metrics being publicly exposed. This patch will forbid access from the outside for Grafana." [puppet] - https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: Filippo Giunchedi)
[09:12:28] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED
[09:13:40] <logmsgbot>	 !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED
[09:14:36] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED
[09:15:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:20:57] <wikibugs>	 (CR) Jbond: [C: -1] C:rsync::server:  convert to concat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: Jbond)
[09:24:07] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1146 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:30:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:51:18] <wikibugs>	 SRE, Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (Vgutierrez) p:Triage→Medium
[09:55:27] <wikibugs>	 (PS1) Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends [cookbooks] - https://gerrit.wikimedia.org/r/835565
[09:56:55] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108 (owner: Jbond)
[09:56:57] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond)
[09:57:11] <wikibugs>	 (PS4) Jbond: sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108
[09:57:15] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond)
[09:57:19] <wikibugs>	 (PS6) Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157
[09:57:59] <wikibugs>	 SRE, Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (Vgutierrez) Apparently varnish supports the absolute-URI form for non CONNECT requests. This has been introduced a long time ago in https://gerrit.wikimedia.org/r/c/operations/puppet/+/275474. @BBlack do you ha...
[09:58:08] <wikibugs>	 (CR) Muehlenhoff: New cookbook to roll-restart/reboot Thanos frontends (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff)
[10:02:43] <wikibugs>	 (PS9) Hashar: gerrit: decouple scap and daemon users [puppet] - https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412)
[10:03:26] <wikibugs>	 (CR) Hashar: [C: +1] "I have amended the commit message to fix a few typos and clarify the intent of this change, also attached it to T317412 "Automate Gerrit d" [puppet] - https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[10:03:41] <wikibugs>	 (PS5) Hashar: gerrit: change deployment user on devtools [puppet] - https://gerrit.wikimedia.org/r/832507
[10:03:50] <moritzm>	 !log rebalance ganeti/codfw row D after completed Bullseye update T311686
[10:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:54] <wikibugs>	 (PS3) Hashar: gerrit: make homedir variable [puppet] - https://gerrit.wikimedia.org/r/833379
[10:03:54] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:04:57] <wikibugs>	 (PS2) Hashar: gerrit: make daemon_user variable everywhere [puppet] - https://gerrit.wikimedia.org/r/833385
[10:05:16] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: use catagory for storage [cookbooks] - https://gerrit.wikimedia.org/r/835567
[10:06:29] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet
[10:07:51] <wikibugs>	 (CR) Jbond: [C: +2] "paths on cumin already updated" [cookbooks] - https://gerrit.wikimedia.org/r/835567 (owner: Jbond)
[10:09:51] <wikibugs>	 (CR) Hashar: [C: +1] "Cherry picked on devtools and work as intended :)" [puppet] - https://gerrit.wikimedia.org/r/833379 (owner: Hashar)
[10:10:05] <wikibugs>	 (PS1) MVernon: cumin: move swift-be-canary [puppet] - https://gerrit.wikimedia.org/r/835568 (https://phabricator.wikimedia.org/T294550)
[10:10:07] <logmsgbot>	 !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet
[10:10:57] <wikibugs>	 (CR) Hashar: [C: +1] "Noop on gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/833385 (owner: Hashar)
[10:11:02] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: use catagory for storage [cookbooks] - https://gerrit.wikimedia.org/r/835567 (owner: Jbond)
[10:11:37] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[10:11:55] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet
[10:13:08] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: fix bootstrap with new hiera location [puppet] - https://gerrit.wikimedia.org/r/835569
[10:13:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] cumin: move swift-be-canary [puppet] - https://gerrit.wikimedia.org/r/835568 (https://phabricator.wikimedia.org/T294550) (owner: MVernon)
[10:13:44] <wikibugs>	 (CR) MVernon: [C: +2] cumin: move swift-be-canary [puppet] - https://gerrit.wikimedia.org/r/835568 (https://phabricator.wikimedia.org/T294550) (owner: MVernon)
[10:14:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[2028-2039].codfw.wmnet
[10:16:20] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet
[10:17:53] <TheresNoTime>	 wotcha, timeouts trying to SSH to bastion.wmcloud.org (or, "TTL expired in transit" apparently?)
[10:18:28] <wikibugs>	 (PS1) Jbond: sre.hardware.firmeware-upgrade: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/835570
[10:18:40] <TheresNoTime>	 (disregard, it wasn't working for ~10 minutes, starts working as I posted that ^)
[10:18:45] <wikibugs>	 (PS1) Vgutierrez: varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405)
[10:22:12] <wikibugs>	 (PS2) Jbond: sre.hardware.firmeware-upgrade: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/835570
[10:23:57] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.firmeware-upgrade: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/835570 (owner: Jbond)
[10:24:28] <wikibugs>	 (CR) Clément Goubert: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/835569 (owner: Filippo Giunchedi)
[10:26:25] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.firmeware-upgrade: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/835570 (owner: Jbond)
[10:27:41] <wikibugs>	 (PS10) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[10:27:49] <wikibugs>	 (PS7) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212
[10:30:44] <wikibugs>	 (PS1) Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868)
[10:31:31] <wikibugs>	 (PS3) Jbond: sre.hardware.firmware-upgrade: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/835570
[10:35:06] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.firmware-upgrade: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/835570 (owner: Jbond)
[10:36:39] <wikibugs>	 (PS2) Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868)
[10:36:42] <wikibugs>	 (PS2) Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736)
[10:38:32] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[10:38:40] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet
[10:39:32] <wikibugs>	 (PS3) Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736)
[10:40:22] <wikibugs>	 (CR) Isabelle Hurbain-Palatin: [C: +1] "I double-checked the name of the variables, that the variable are top-level in the config, and that this is applied all wikis in to the -l" [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey)
[10:40:24] <wikibugs>	 (CR) CI reject: [V: -1] jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[10:40:50] <wikibugs>	 (PS3) Hashar: gerrit: use daemon_user variable everywhere [puppet] - https://gerrit.wikimedia.org/r/833385
[10:41:07] <wikibugs>	 (PS3) Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868)
[10:41:11] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - https://gerrit.wikimedia.org/r/835575
[10:41:40] <wikibugs>	 (PS4) Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736)
[10:42:22] <wikibugs>	 (CR) Hashar: [C: +1] "I forgot to adjust the proxy/migration/migration_base profiles ;)" [puppet] - https://gerrit.wikimedia.org/r/833385 (owner: Hashar)
[10:42:43] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37365/console" [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[10:44:02] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - https://gerrit.wikimedia.org/r/835575
[10:45:25] <wikibugs>	 (PS4) Vgutierrez: varnish: Fix VCL tests broken by querysort [puppet] - https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868)
[10:49:32] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - https://gerrit.wikimedia.org/r/835575 (owner: Jbond)
[10:50:45] <wikibugs>	 (PS2) Muehlenhoff: interface: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013)
[10:52:18] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[10:53:01] <wikibugs>	 (Merged) jenkins-bot: sre.hardware.upgrade-firmware: correct passed parameter [cookbooks] - https://gerrit.wikimedia.org/r/835575 (owner: Jbond)
[10:53:54] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Fix VCL tests broken by querysort [puppet] - https://gerrit.wikimedia.org/r/835572 (https://phabricator.wikimedia.org/T314868) (owner: Vgutierrez)
[10:54:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] interface: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:54:49] <moritzm>	 vgutierrez: shall I merge your vcl patch along?
[10:55:01] <vgutierrez>	 moritzm: go ahead please
[10:55:15] <moritzm>	 ack, done
[10:55:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:55:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2028-2039].codfw.wmnet
[10:55:27] <wikibugs>	 SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[2028-2039].codfw.wmnet` - ms-be2028.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Found physical hos...
[10:55:41] <wikibugs>	 SRE-swift-storage, ops-codfw, DC-Ops, decommission-hardware: decommission ms-be20[28-39].codfw.wmnet - https://phabricator.wikimedia.org/T318689 (MatthewVernon)
[10:56:10] <wikibugs>	 SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (MatthewVernon)
[10:56:28] <wikibugs>	 SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (MatthewVernon) Open→Resolved a:MatthewVernon
[10:57:08] <wikibugs>	 (Abandoned) Jbond: sre.hardware.upgrade-firmware: use catagory for storage [cookbooks] - https://gerrit.wikimedia.org/r/835567 (owner: Jbond)
[10:57:40] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.dns.netbox
[10:58:51] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:58:52] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1028-1033,1035-1039].eqiad.wmnet
[10:58:54] <wikibugs>	 SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1001 for hosts: `ms-be[1028-1033,1035-1039].eqiad.wmnet` - ms-be1028.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Found ph...
[10:59:30] <wikibugs>	 SRE-swift-storage, ops-eqiad, DC-Ops, decommission-hardware: decommission ms-be10[28-39].eqiad.wmnet - https://phabricator.wikimedia.org/T318691 (MatthewVernon)
[11:00:08] <wikibugs>	 SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (MatthewVernon)
[11:00:28] <wikibugs>	 SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (MatthewVernon) Open→Resolved a:MatthewVernon
[11:04:37] <wikibugs>	 (PS2) Vgutierrez: varnish: Remove ECDHE-ECDSA-AES128-SHA sinkhole [puppet] - https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405)
[11:06:16] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:07:47] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:08:29] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834039 (owner: Jbond)
[11:11:55] <wikibugs>	 (PS1) Jbond: sre.hardware.upfraede-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[11:14:18] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[11:14:34] <wikibugs>	 (CR) Vgutierrez: "text tests are happy:" [puppet] - https://gerrit.wikimedia.org/r/835571 (https://phabricator.wikimedia.org/T258405) (owner: Vgutierrez)
[11:17:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:17:49] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:02] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579 (owner: Jbond)
[11:21:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:23:03] <wikibugs>	 (PS1) Ladsgroup: labs: Enable temp user creation in dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835580
[11:24:19] <wikibugs>	 (CR) CI reject: [V: -1] labs: Enable temp user creation in dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835580 (owner: Ladsgroup)
[11:28:32] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED
[11:30:48] <wikibugs>	 (PS2) Ladsgroup: labs: Enable temp user creation in dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835580
[11:32:15] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:34:47] <wikibugs>	 (CR) Ladsgroup: [C: +2] labs: Enable temp user creation in dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835580 (owner: Ladsgroup)
[11:35:37] <wikibugs>	 (Merged) jenkins-bot: labs: Enable temp user creation in dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835580 (owner: Ladsgroup)
[11:36:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:38:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:39:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:39:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:40:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:41:07] <wikibugs>	 (PS1) Jelto: gitlab: disable email notifications on replicas [puppet] - https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682)
[11:43:38] <wikibugs>	 (CR) Nikerabbit: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: Nikerabbit)
[11:45:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:45:35] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37370/console" [puppet] - https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682) (owner: Jelto)
[11:49:26] <wikibugs>	 (PS2) Jbond: 0.5.4: Prepare release [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834039
[11:49:36] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:49:41] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] 0.5.4: Prepare release [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834039 (owner: Jbond)
[11:49:54] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] 0.5.4: Prepare release (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834039 (owner: Jbond)
[11:50:13] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, Platform Team Workboards (Platform Engineering Reliability): Replace nutcracker with mcrouter - https://phabricator.wikimedia.org/T318695 (hnowlan)
[11:50:27] <wikibugs>	 (CR) Jbond: [C: +2] C:ssh::publish_fingerprints: drop RSA support [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:51:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:51:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:51:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:57:19] <jbond>	 !log upload new wmf-laptop_0.5.4 package
[11:57:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:04:35] <wikibugs>	 (03PS1) 10Clément Goubert: pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583
[12:05:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583 (owner: 10Clément Goubert)
[12:10:13] <wikibugs>	 (03PS1) 10Clément Goubert: C:memcached Restart memcached service on change [puppet] - 10https://gerrit.wikimedia.org/r/835585
[12:10:37] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:11:46] <wikibugs>	 (03CR) 10Clément Goubert: "I am not sure about this, since we may want more control around memcached restarts for cache warming reasons. Opinions?" [puppet] - 10https://gerrit.wikimedia.org/r/835585 (owner: 10Clément Goubert)
[12:13:12] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I found the reason. gerrit2002 has been populated using rsync which included the following directories:" [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar)
[12:13:16] <wikibugs>	 (03PS2) 10Clément Goubert: pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583
[12:15:34] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37371/console" [puppet] - 10https://gerrit.wikimedia.org/r/835585 (owner: 10Clément Goubert)
[12:15:48] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:15:51] <wikibugs>	 (03PS1) 10KartikMistry: testwiki: Enable Section Translation for Bambara and Goan Konkani Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835606 (https://phabricator.wikimedia.org/T314557)
[12:17:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! LGTM, I'll send a change to have boostrap.sh add the SPDX header" [puppet] - 10https://gerrit.wikimedia.org/r/835583 (owner: 10Clément Goubert)
[12:18:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: add SPDX header to rolemap on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835607
[12:18:54] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:18:58] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] pontoon: initialize new stack sops-appservers [puppet] - 10https://gerrit.wikimedia.org/r/835583 (owner: 10Clément Goubert)
[12:19:24] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] pontoon: add SPDX header to rolemap on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835607 (owner: 10Filippo Giunchedi)
[12:20:45] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[12:21:06] <wikibugs>	 ops-eqiad, DC-Ops, Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (BTullis)
[12:21:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:22:53] <wikibugs>	 (03PS2) 10Clément Goubert: C:memcached Restart memcached service on change [puppet] - 10https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697)
[12:22:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:23:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add SPDX header to rolemap on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/835607 (owner: 10Filippo Giunchedi)
[12:23:36] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:23:55] <godog>	 claime: I merged your change too
[12:26:03] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:26:59] <claime>	 godog: thanks!
[12:28:50] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:31:03] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:33:31] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:36:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[12:41:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:42:32] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[12:52:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:58:45] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: refresh automated tests to remove references to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/835612 (https://phabricator.wikimedia.org/T275864)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1300)
[13:00:26] <Lucas_WMDE>	 looks like there’s nothing to deploy :)
[13:00:37] <Lucas_WMDE>	 unless content transform team want to do mobileapps/wikifeeds things
[13:02:59] <MichaelG_WMDE>	 I'll add a patch in a second :)
[13:05:21] <wikibugs>	 (03CR) 10Abijeet Patro: [C: 03+1] Update Translate job names [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit)
[13:10:17] <MichaelG_WMDE>	 So, I added a patch for the currently ongoing deploy window. Though, it could also be done in later slot.
[13:10:29] <MichaelG_WMDE>	 deploy -> backport
[13:10:57] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:17:53] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Seems fine -- I don't see much benefit/downside to either config, but since blocking this should cut down on security false positives LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi)
[13:18:12] <Lucas_WMDE>	 is anyone else around to deploy the backport? I’m in a meeting
[13:24:25] <Lucas_WMDE>	 MichaelG_WMDE: I assume that change should be backported to wmf.2?
[13:24:36] <Lucas_WMDE>	 (normally the change linked in the deployment calendar is already a cherry-pick, ftr)
[13:25:04] <MichaelG_WMDE>	 Lucas_WMDE: ah yes, will prepare the cherry pick right away
[13:25:47] * taavi looks
[13:26:58] <taavi>	 MichaelG_WMDE: hey. happy to deploy once you have a cherry-pick
[13:27:23] <MichaelG_WMDE>	 thanks, one second
[13:30:24] <wikibugs>	 (03CR) 10Nikerabbit: "I'm trying to figure out who is capable and comfortable deploying this change." [deployment-charts] - 10https://gerrit.wikimedia.org/r/835589 (https://phabricator.wikimedia.org/T318484) (owner: 10Nikerabbit)
[13:30:31] <taavi>	 the gerrit UI is the best way to create one
[13:31:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Cmjohnson)
[13:31:52] <wikibugs>	 (03PS1) 10Michael Große: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933)
[13:32:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:32:25] <MichaelG_WMDE>	 ok, so I _think_ this is the right one now https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/835590
[13:32:51] <MichaelG_WMDE>	 though it might also be useful to have this on wmf.3?
[13:32:52] <Lucas_WMDE>	 oh wait, I just realized it’s Tuesday not Monday
[13:32:55] <Lucas_WMDE>	 so we’re post branch cut
[13:32:59] <Lucas_WMDE>	 yeah, probably wmf.3 too
[13:33:11] <taavi>	 yeah, you probably want both at this point
[13:33:11] <MichaelG_WMDE>	 👍
[13:33:50] <wikibugs>	 (03PS1) 10Michael Große: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933)
[13:34:30] <wikibugs>	 (03PS1) 10Cmjohnson: Adding site.pp entry for centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/835619 (https://phabricator.wikimedia.org/T313858)
[13:35:23] <wikibugs>	 (03PS2) 10Cmjohnson: Adding site.pp entry for centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/835619 (https://phabricator.wikimedia.org/T313858)
[13:35:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:35:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:36:02] <Lucas_WMDE>	 ooh, `scap backport` in action
[13:36:42] <taavi>	 yeah, testing it with 2 patches at the same time for the first time
[13:36:54] * taavi starts with filing a scap feature request
[13:37:25] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding site.pp entry for centrallog1002 [puppet] - 10https://gerrit.wikimedia.org/r/835619 (https://phabricator.wikimedia.org/T313858) (owner: 10Cmjohnson)
[13:38:28] <taavi>	 hmm 'http.client.RemoteDisconnected: Remote end closed connection without response'
[13:38:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:38:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:38:55] <taavi>	 and a bug report
[13:40:23] <wikibugs>	 (03CR) 10Ayounsi: customscripts: export 'mgmt' entries from hiera_export (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi)
[13:41:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/835559 (https://phabricator.wikimedia.org/T309703) (owner: 10Filippo Giunchedi)
[13:42:24] <wikibugs>	 10SRE, 10serviceops: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (10JMeybohm) a:03akosiaris I think this is done, right?
[13:45:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T314041)', diff saved to https://phabricator.wikimedia.org/P34950 and previous config saved to /var/cache/conftool/dbconfig/20220927-134528-ladsgroup.json
[13:45:33] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:45:40] <_joe_>	 jouncebot: next
[13:45:41] <jouncebot>	 In 0 hour(s) and 14 minute(s): Maintenance script run (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1400)
[13:46:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] jobrunner: convert to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[13:47:06] <taavi>	 _joe_: also there's a backport window atm, and I'm waiting for some backports to merge
[13:47:32] <_joe_>	 taavi: ah sorry I thought you were done
[13:47:53] <_joe_>	 but it's ok, I plan on deploying the change just to one jobrunner for now
[13:48:00] <_joe_>	 worst case scenario I'll depool it
[13:49:04] <taavi>	 yeah, these are js only backports so in theory shouldn't affect jobrunners at all
[13:49:57] <Lucas_WMDE>	 I can wait with the maintenance script run, it hopefully won’t take the full two hours
[13:52:18] <_joe_>	 Lucas_WMDE: no need
[13:52:22] <Lucas_WMDE>	 ok
[13:52:27] <_joe_>	 taavi: let's hope they don't :P
[13:53:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T314041)', diff saved to https://phabricator.wikimedia.org/P34951 and previous config saved to /var/cache/conftool/dbconfig/20220927-135310-ladsgroup.json
[13:53:15] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:53:21] * MichaelG_WMDE keeps looking at zuul and it should be *almost* done
[13:53:35] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "minor doc typo inline, LGTM otherwise, let's see what o11y says about the services to restart" [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff)
[13:54:22] <wikibugs>	 (03Merged) 10jenkins-bot: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:54:36] <MichaelG_WMDE>	 wmf.2 I can test on www.wikidata.org, but wmf.3 only on test.wikidata.org, right?
[13:54:37] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:55:03] <Lucas_WMDE>	 I think so, yeah
[13:55:13] <taavi>	 correct
[13:56:01] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:56:18] <MichaelG_WMDE>	 👍
[13:56:31] * MichaelG_WMDE is ready when you are
[13:57:30] <taavi>	 sigh. got another ConnectionError with the gerrit polling
[13:57:37] * taavi waits for the patch to merge before re-running it
[13:58:06] <wikibugs>	 (03Merged) 10jenkins-bot: Track use of Searchbox footer on Wikidata [extensions/Wikibase] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:58:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/835590 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:58:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835591 (https://phabricator.wikimedia.org/T306933) (owner: 10Michael Große)
[13:58:41] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:59:14] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:835590|Track use of Searchbox footer on Wikidata (T306933)]], [[gerrit:835591|Track use of Searchbox footer on Wikidata (T306933)]]
[13:59:18] <stashbot>	 T306933: Enable configurable scroll and "load more" behavior in TypeaheadSearch - https://phabricator.wikimedia.org/T306933
[13:59:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Fix /metrics ACL remoteip header [puppet] - 10https://gerrit.wikimedia.org/r/835623 (https://phabricator.wikimedia.org/T309703)
[13:59:45] <logmsgbot>	 !log taavi@deploy1002 taavi and migr: Backport for [[gerrit:835590|Track use of Searchbox footer on Wikidata (T306933)]], [[gerrit:835591|Track use of Searchbox footer on Wikidata (T306933)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:59:53] <taavi>	 MichaelG_WMDE: please test
[14:00:05] <jouncebot>	 Jhs and Lucas_WMDE: My dear minions, it's time we take the moon! Just kidding. Time for Maintenance script run deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1400).
[14:00:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:00:16] <Lucas_WMDE>	 o/, waiting for backports to finish
[14:00:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:00:30] <MichaelG_WMDE>	 taavi can I test both?
[14:00:34] <taavi>	 yes
[14:00:35] * MichaelG_WMDE looks at both
[14:00:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P34952 and previous config saved to /var/cache/conftool/dbconfig/20220927-140034-ladsgroup.json
[14:00:39] <MichaelG_WMDE>	 thanks!
[14:00:41] * MichaelG_WMDE tests
[14:01:49] <MichaelG_WMDE>	 can confirm both working and I see no errors!
[14:01:55] <taavi>	 cool, syncing
[14:03:37] <Lucas_WMDE>	 I’ll start doing dry-runs of the maintenance script already, to determine the number of rows affected
[14:03:41] <Lucas_WMDE>	 shouldn’t have any effect
[14:04:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:04:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:04:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Sprint 02): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10EChetty)
[14:05:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for working on this! See inline, LGTM overall" [cookbooks] - 10https://gerrit.wikimedia.org/r/835565 (owner: 10Muehlenhoff)
[14:06:13] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:835590|Track use of Searchbox footer on Wikidata (T306933)]], [[gerrit:835591|Track use of Searchbox footer on Wikidata (T306933)]] (duration: 06m 59s)
[14:06:17] <stashbot>	 T306933: Enable configurable scroll and "load more" behavior in TypeaheadSearch - https://phabricator.wikimedia.org/T306933
[14:06:39] <taavi>	 MichaelG_WMDE: ok, should be live
[14:06:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:07:28] <taavi>	 Lucas_WMDE: all done
[14:07:34] <Lucas_WMDE>	 thanks!
[14:07:36] <taavi>	 and _joe_ ^
[14:07:54] <_joe_>	 taavi: thanks, I'll expand to the rest of the cluster
[14:08:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:08:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P34953 and previous config saved to /var/cache/conftool/dbconfig/20220927-140817-ladsgroup.json
[14:08:20] <Lucas_WMDE>	 _joe_: I think I’ll run my maintenance script with PHP=php7.4, does that sound okay to you?
[14:08:42] <_joe_>	 Lucas_WMDE: it should make no difference in terms of ICU
[14:08:45] <_joe_>	 so yes, go on
[14:08:46] <Lucas_WMDE>	 ah ok
[14:08:51] <Lucas_WMDE>	 I thought there might be a difference
[14:08:52] <Lucas_WMDE>	 ok :)
[14:08:52] <_joe_>	 We'll switch pretty soon btw
[14:09:15] <Lucas_WMDE>	 hype hype hype
[14:09:18] <_joe_>	 I'm switching the jobrunners right now, we might as well switch mwmaint next
[14:10:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10Devnull) I do not currently have a sponsor, how would I get one?
[14:11:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::mediawiki::maintenance: switch to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/835629 (https://phabricator.wikimedia.org/T271736)
[14:11:52] <Lucas_WMDE>	 !log BEGIN lucaswerkmeister-wmde@mwmaint1002:~$ PHP=php7.4 mwscript updateCollation.php incubatorwiki --force # T315552
[14:11:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:56] <stashbot>	 T315552: Run updateCollation.php on the Wikimedia Incubator - https://phabricator.wikimedia.org/T315552
[14:12:01] <wikibugs>	 (03PS1) 10Papaul: Add new logstash nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/835630 (https://phabricator.wikimedia.org/T313848)
[14:13:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:13:24] <Lucas_WMDE>	 seems to be running quite a bit faster than the mw.o documentation suggested, yay
[14:13:34] <Lucas_WMDE>	 (100k rows done now)
[14:13:39] <MichaelG_WMDE>	 taavi: Thank you! (sorry for delayed response, office network problems...)
[14:13:43] <Lucas_WMDE>	 (out of ~670k)
[14:13:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:13:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:13:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:15:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P34954 and previous config saved to /var/cache/conftool/dbconfig/20220927-141541-ladsgroup.json
[14:16:41] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (10BCornwall) While we have https://gerrit.wikimedia.org/r/c/operations/software/latency-measurement/+/833848 available for review, I hear that there'd be pushback for not having a...
[14:17:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) 05Open→03Declined There is a decommission task for this node @T318689 to declining this task
[14:17:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:18:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:21:04] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence, and 2 others: db2098 crashed - https://phabricator.wikimedia.org/T318062 (10Papaul) @willy it will not be possible to submit a RMA for this host, I have some decommissioned servers onsite i can check and see if we can pull some memory.
[14:22:38] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@66dfa44]: (no justification provided)
[14:23:25] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@66dfa44]: (no justification provided) (duration: 00m 46s)
[14:23:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P34955 and previous config saved to /var/cache/conftool/dbconfig/20220927-142324-ladsgroup.json
[14:23:29] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add new logstash nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/835630 (https://phabricator.wikimedia.org/T313848) (owner: 10Papaul)
[14:24:12] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond)
[14:24:55] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10EChetty) p:05Medium→03High
[14:25:06] <Lucas_WMDE>	 !log END lucaswerkmeister-wmde@mwmaint1002:~$ PHP=php7.4 mwscript updateCollation.php incubatorwiki --force # T315552, 710183 rows done
[14:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:10] <stashbot>	 T315552: Run updateCollation.php on the Wikimedia Incubator - https://phabricator.wikimedia.org/T315552
[14:26:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2036.codfw.wmnet with OS buster
[14:26:56] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10Patch-For-Review: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host logstash2036.codfw.wmnet with OS buster
[14:27:18] <Lucas_WMDE>	 I think that means we’re done with the maintenance script run window :)
[14:28:53] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated-tests: remove references to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/835612 (https://phabricator.wikimedia.org/T275864)
[14:30:08] <wikibugs>	 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Joe) Do you happen to have any further detail on the response headers and body you get whenever you receive a 429 response? it would help us identify which layer is returni...
[14:30:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T314041)', diff saved to https://phabricator.wikimedia.org/P34956 and previous config saved to /var/cache/conftool/dbconfig/20220927-143047-ladsgroup.json
[14:30:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[14:30:52] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[14:31:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[14:31:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34957 and previous config saved to /var/cache/conftool/dbconfig/20220927-143109-ladsgroup.json
[14:31:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:35:33] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logstash2036.codfw.wmnet with OS buster
[14:35:38] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host logstash2036.codfw.wmnet with OS buster executed with errors: - logstash2036 (**F...
[14:35:49] <wikibugs>	 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) >>! In T318065#8265005, @Joe wrote: > Do you happen to have any further detail on the response headers and body you get whenever you receive a 429 response?...
[14:38:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T314041)', diff saved to https://phabricator.wikimedia.org/P34958 and previous config saved to /var/cache/conftool/dbconfig/20220927-143831-ladsgroup.json
[14:38:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:38:36] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[14:38:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi)
[14:38:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:40:33] <wikibugs>	 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) Actually, I have some left from intentionally hitting them while testing the bot yesterday.   ` array(37) {   ["url"]=>   string(131) "https://en.wikipedia.o...
[14:41:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:43:47] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, minor nits inline" [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[14:43:49] <wikibugs>	 (PS1) DLynch: MobileWebUIActions sample rate to 1 on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835635 (https://phabricator.wikimedia.org/T302108)
[14:45:02] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) The bot has two IPs it works from.     # 185.15.56.22   # 185.15.56.29
[14:46:55] <wikibugs>	 (CR) Volans: "post-merge nit" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[14:47:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Fix /metrics ACL remoteip header [puppet] - https://gerrit.wikimedia.org/r/835623 (https://phabricator.wikimedia.org/T309703) (owner: Filippo Giunchedi)
[14:51:40] <wikibugs>	 (CR) Volans: [C: +1] "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff)
[14:51:58] <wikibugs>	 (PS1) Muehlenhoff: spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013)
[14:52:57] <wikibugs>	 (CR) CI reject: [V: -1] spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:54:46] <wikibugs>	 (CR) BCornwall: [C: +2] lvs: Convert ::lvs::configuration to a profile [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall)
[14:56:06] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@25dda27]: (no justification provided)
[14:56:17] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@25dda27]: (no justification provided) (duration: 00m 11s)
[14:56:26] <wikibugs>	 (PS1) JMeybohm: Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251)
[14:56:38] <wikibugs>	 (CR) BCornwall: [C: +2] lvs: Convert ::lvs::configuration to a profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: BCornwall)
[14:56:53] <wikibugs>	 (CR) Hashar: [C: -1] "+ Moritz for the aptrepo config." [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[14:58:03] <wikibugs>	 (PS2) Muehlenhoff: spdx::convert: Fix two bugs in detecting contributors for roles/profiles [puppet] - https://gerrit.wikimedia.org/r/835636 (https://phabricator.wikimedia.org/T308013)
[14:58:08] <wikibugs>	 (PS3) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[14:58:11] <wikibugs>	 (PS11) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[14:58:59] <wikibugs>	 (CR) Hashar: [C: +1] "I love how the version can be passed as an argument to the profile and the Docker version being in sync across all distributions.  That is" [puppet] - https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[14:59:22] <wikibugs>	 (CR) Hashar: [C: +1] P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[14:59:40] <wikibugs>	 (CR) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[15:00:25] <wikibugs>	 (PS2) JMeybohm: Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251)
[15:00:46] <wikibugs>	 (PS12) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[15:01:51] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[15:03:04] <wikibugs>	 SRE, Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (dancy) @Tgr Can you confirm that this is still a problem?
[15:03:31] <wikibugs>	 (CR) Filippo Giunchedi: New cookbook to roll-restart/reboot Thanos frontends (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/835565 (owner: Muehlenhoff)
[15:04:16] <wikibugs>	 (CR) JMeybohm: [C: +2] Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[15:04:57] <wikibugs>	 (CR) BCornwall: [C: +2] Prometheus: Remove ATS gauge periods [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[15:06:48] <wikibugs>	 (Merged) jenkins-bot: Use p95 instead of p99 for KubernetesAPILatency alerts [alerts] - https://gerrit.wikimedia.org/r/835637 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[15:06:54] <wikibugs>	 (PS4) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[15:06:56] <wikibugs>	 (PS13) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[15:07:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] role::mediawiki::maintenance: switch to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/835629 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[15:11:09] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) > This ticket is two-fold. The first is a request for SRE to provide logs regarding queries originating from IABot, easily identified from the UA. @Cyberpower67...
[15:19:14] <wikibugs>	 (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[15:20:08] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] "Seems reasonable." [puppet] - https://gerrit.wikimedia.org/r/835581 (https://phabricator.wikimedia.org/T318682) (owner: Jelto)
[15:21:48] <wikibugs>	 (PS5) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[15:21:52] <wikibugs>	 (PS14) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[15:22:34] <wikibugs>	 (PS6) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[15:23:48] <wikibugs>	 (PS15) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[15:24:42] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) As a reference, this change in behavior has been triggered by https://gerrit.wikimedia.org/r/c/operations/puppet/+/677872
[15:25:14] <wikibugs>	 (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[15:25:41] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265171, @Vgutierrez wrote: >> This ticket is two-fold. The first is a request for SRE to provide logs regarding queries originating from IABo...
[15:26:20] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[15:27:37] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[15:28:33] <wikibugs>	 (CR) Clément Goubert: [C: -1] "Not restarting on file change is on purpose to avoid cold cache. Putting on hold." [puppet] - https://gerrit.wikimedia.org/r/835585 (https://phabricator.wikimedia.org/T318697) (owner: Clément Goubert)
[15:29:53] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) > Without specific logs, I can't really assess if these aggressive requests can be optimized. I would recommend generating those logs on the IABot side
[15:29:55] <wikibugs>	 (PS7) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[15:30:05] <wikibugs>	 (PS1) FNegri: ceph.bootstrap_and_add: fix _wait_for_osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723)
[15:30:56] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265227, @Vgutierrez wrote: >> Without specific logs, I can't really assess if these aggressive requests can be optimized. > I would recommend...
[15:33:15] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265199, @Vgutierrez wrote: > As a reference, this change in behavior has been triggered by https://gerrit.wikimedia.org/r/c/operations/puppet...
[15:34:15] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (BBlack) Copying over from T317249#8262220 - This is the replacement mapping of nodes + disks:  | cp nodes | Current | Replacement | Disks | text | 21-26, 33, 34 | 37-44 | Single NVME | upload |...
[15:34:21] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/831111 (owner: Muehlenhoff)
[15:36:46] <wikibugs>	 (PS4) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382)
[15:36:48] <wikibugs>	 (PS5) Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382)
[15:36:49] <wikibugs>	 (PS5) Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382)
[15:37:33] <wikibugs>	 (PS16) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[15:38:46] <wikibugs>	 (CR) Dduvall: "Thanks for the review, Antoine. Your explanation helped me understand the distributions file much more clearly. I believe I've fixed up ev" [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[15:40:04] <wikibugs>	 (PS1) BBlack: cache node disk layout p11n for F4 config [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244)
[15:41:02] <wikibugs>	 (CR) CI reject: [V: -1] cache node disk layout p11n for F4 config [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: BBlack)
[15:41:33] <wikibugs>	 (CR) Hnowlan: [C: -1] Update the logic to run test coverage (5 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: Vlad.shapik)
[15:41:41] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[15:45:21] <wikibugs>	 (PS1) DLynch: Enable DiscussionTools reply button visual enhancements on cswiki+huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626)
[15:45:22] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:51:28] <wikibugs>	 (CR) Hashar: [C: +1] "Nice, I think that is good now but Moritz would know for sure :]   When deploying may you update the Docker package for thirdparty/ci on b" [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[15:54:30] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) @Vgutierrez is there an explanation somewhere why the Cloud VPS IP range was removed from this list?  Is it possible to add IABot IPs back on until we can ge...
[15:54:56] <wikibugs>	 (CR) Arturo Borrero Gonzalez: ceph.bootstrap_and_add: fix _wait_for_osds (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[15:58:39] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Vgutierrez) >>! In T318065#8265344, @Cyberpower678 wrote: > @Vgutierrez is there an explanation somewhere why the Cloud VPS IP range was removed from this list?  Is it poss...
[16:00:05] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:22] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:01:26] <wikibugs>	 (CR) David Caro: [C: +1] "lgtm, just some naming nits" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[16:02:00] <wikibugs>	 (CR) David Caro: [C: +1] ceph.bootstrap_and_add: fix _wait_for_osds (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/835643 (https://phabricator.wikimedia.org/T318723) (owner: FNegri)
[16:03:04] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (BBlack) >>! In T318065#8265200, @Cyberpower678 wrote: > IABot workers run independently of each other.  Each worker runs on a single wiki and minds it's own business.  So t...
[16:05:34] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Horsey - https://phabricator.wikimedia.org/T318729 (MHorsey-WMF)
[16:06:37] <wikibugs>	 (PS2) BBlack: cache node disk layout p11n for F4 config [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244)
[16:06:40] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265366, @BBlack wrote: >>>! In T318065#8265200, @Cyberpower678 wrote: >> IABot workers run independently of each other.  Each worker runs on...
[16:07:37] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Horsey - https://phabricator.wikimedia.org/T318729 (MHorsey-WMF)
[16:07:59] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:08:25] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:13:45] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (ayounsi) >>! In T318065#8265347, @Vgutierrez wrote: > that would be a question for @ayounsi / @cmooney from the netops team and/or @Andrew from WMCS Context is in T265864,...
[16:17:49] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:17:52] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265422, @ayounsi wrote: >>>! In T318065#8265347, @Vgutierrez wrote: >> that would be a question for @ayounsi / @cmooney from the netops team...
[16:18:13] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (BBlack) >>! In T318065#8265397, @Cyberpower678 wrote: >>>! In T318065#8265366, @BBlack wrote: >>>>! In T318065#8265200, @Cyberpower678 wrote: >>> IABot workers run independ...
[16:21:30] <wikibugs>	 (Abandoned) BBlack: Add wikifunctions to MW canonical redirects [puppet] - https://gerrit.wikimedia.org/r/822455 (https://phabricator.wikimedia.org/T275904) (owner: BBlack)
[16:22:21] <wikibugs>	 (PS2) BBlack: Add wikifunctions to Varnish as a 302 [puppet] - https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904)
[16:23:50] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) >>! In T318065#8265446, @BBlack wrote: >>>! In T318065#8265397, @Cyberpower678 wrote: >>>>! In T318065#8265366, @BBlack wrote: >>>>>! In T318065#8265200, @Cy...
[16:29:14] <wikibugs>	 (PS17) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[16:30:29] <icinga-wm>	 PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:45:27] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Kanban): hw troubleshooting: network cards shutting down for lasbtore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T317651 (dcaro)
[16:51:06] <wikibugs>	 (CR) Vgutierrez: [C: +1] cache node disk layout p11n for F4 config (2 comments) [puppet] - https://gerrit.wikimedia.org/r/835646 (https://phabricator.wikimedia.org/T317244) (owner: BBlack)
[16:55:25] <wikibugs>	 (PS18) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[16:57:48] <wikibugs>	 (PS19) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[17:08:10] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet
[17:09:43] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:15:47] <wikibugs>	 (PS1) Andrew Bogott: Make cloudnet100[56] into cloudnet nodes [puppet] - https://gerrit.wikimedia.org/r/835657 (https://phabricator.wikimedia.org/T316284)
[17:19:03] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:19:34] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1003.eqiad.wmnet
[17:19:44] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (Jdforrester-WMF) (Tech lead confirmation, if it's needed.)
[17:23:31] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/835659
[17:23:34] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/835660
[17:26:59] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet
[17:28:15] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest[1001-1002].eqiad.wmnet
[17:28:20] <wikibugs>	 (PS20) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[17:29:22] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest[1001-1002].eqiad.wmnet
[17:31:43] <icinga-wm>	 RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:31:46] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[17:38:22] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1003.eqiad.wmnet
[17:38:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1002.eqiad.wmnet
[17:39:21] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1001.eqiad.wmnet
[17:41:39] <wikibugs>	 (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/835659 (owner: PipelineBot)
[17:41:58] <wikibugs>	 (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/835660 (owner: PipelineBot)
[17:42:55] <wikibugs>	 (PS8) Jbond: sre.hardware.upgrade-firmware: use packagin.version.Version [cookbooks] - https://gerrit.wikimedia.org/r/835579
[17:45:13] <wikibugs>	 (PS21) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[17:45:26] <wikibugs>	 (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/835660 (owner: PipelineBot)
[17:46:56] <wikibugs>	 (CR) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[17:47:30] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[17:47:51] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[17:48:15] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply
[17:48:43] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply
[17:48:47] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply
[17:49:01] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[17:49:17] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply
[17:50:14] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1002.eqiad.wmnet
[17:50:45] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1001.eqiad.wmnet
[17:52:31] <wikibugs>	 (PS22) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[17:55:16] <wikibugs>	 (PS8) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212
[17:56:11] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[17:57:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:57:38] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/835667
[17:58:38] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[18:00:04] <jouncebot>	 brennen and jnuche: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T1800).
[18:00:06] <wikibugs>	 (CR) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[18:01:28] <brennen>	 o/
[18:02:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:02:19] <brennen>	 !log 1.40.0-wmf.3 (T314192) no current blockers, promoting to group0
[18:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:23] <stashbot>	 T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192
[18:03:47] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/835670 (https://phabricator.wikimedia.org/T314192)
[18:03:49] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/835670 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot)
[18:05:08] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/835670 (https://phabricator.wikimedia.org/T314192) (owner: TrainBranchBot)
[18:08:59] <wikibugs>	 (PS23) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[18:09:29] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:33] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.3  refs T314192
[18:09:37] <stashbot>	 T314192: 1.40.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T314192
[18:09:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:12:51] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[18:14:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:14:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:14:06] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, decommission-hardware: decommission ms-be10[28-39].eqiad.wmnet - https://phabricator.wikimedia.org/T318691 (wiki_willy) a:Jclark-ctr
[18:15:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:17:39] <wikibugs>	 (PS24) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[18:19:55] <wikibugs>	 (PS9) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212
[18:21:06] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[18:22:49] <wikibugs>	 (CR) Muehlenhoff: aptrepo: add docker packages to thirdparty/ci for bullseye (2 comments) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[18:23:26] <wikibugs>	 (PS10) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212
[18:23:28] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[18:26:47] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[18:27:11] <wikibugs>	 (PS25) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[18:29:11] <wikibugs>	 (PS5) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382)
[18:29:13] <wikibugs>	 (PS6) Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382)
[18:29:15] <wikibugs>	 (PS6) Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382)
[18:29:17] <wikibugs>	 (PS26) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168
[18:30:14] <wikibugs>	 (PS11) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212
[18:30:48] <wikibugs>	 (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (2 comments) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[18:34:16] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond)
[18:34:18] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[18:35:04] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Core-HTTP-Cache, MediaWiki-File-management, and 3 others: MediaWiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (Krinkle)
[18:35:59] <wikibugs>	 (PS12) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212
[18:36:22] <wikibugs>	 (PS1) Subramanya Sastry: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727)
[18:39:19] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 (owner: Jbond)
[18:42:50] <wikibugs>	 (PS2) Ryan Kemper: admin: ryankemper update shell to zsh [puppet] - https://gerrit.wikimedia.org/r/834515 (owner: Jbond)
[18:43:43] <wikibugs>	 (CR) Gehel: "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: Ryan Kemper)
[19:05:34] <wikibugs>	 (PS1) Subramanya Sastry: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835594 (https://phabricator.wikimedia.org/T318727)
[19:06:36] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson)
[19:16:50] <wikibugs>	 (03PS3) 10DDesouza: Deploy Research Incentive survey on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834042 (https://phabricator.wikimedia.org/T318328)
[19:16:54] <wikibugs>	 (03PS3) 10DDesouza: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331)
[19:19:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Jrbranaa) Manager Approval if needed.
[19:34:03] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "Mount labstore to wcqs/wdqs instance for dumps reload" [puppet] - 10https://gerrit.wikimedia.org/r/835595
[19:34:16] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "Mount labstore to wcqs/wdqs instance for dumps reload" [puppet] - 10https://gerrit.wikimedia.org/r/835595 (owner: 10Ryan Kemper)
[19:38:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[19:43:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[19:43:15] <icinga-wm>	 PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:43:52] <wikibugs>	 (PS1) Stang: romdwikimedia: Enable subpages in NS0 [mediawiki-config] - https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491)
[19:45:47] <wikibugs>	 (PS1) JHathaway: dup otrs dummy password to vrts for rename [labs/private] - https://gerrit.wikimedia.org/r/835682
[19:46:13] <wikibugs>	 (PS1) Ryan Kemper: Revert "Revert "Mount labstore to wcqs/wdqs instance for dumps reload"" [puppet] - https://gerrit.wikimedia.org/r/835596
[19:48:16] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host centrallog1002.eqiad.wmnet with OS bullseye
[19:48:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye
[19:48:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[19:49:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[19:49:04] <wikibugs>	 (PS2) Ryan Kemper: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349)
[19:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T314041)', diff saved to https://phabricator.wikimedia.org/P34966 and previous config saved to /var/cache/conftool/dbconfig/20220927-194908-ladsgroup.json
[19:49:13] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:50:37] <wikibugs>	 (CR) Ryan Kemper: "@David - This commit is the same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/832543 but with the addition of https://gerrit.wi" [puppet] - https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: Ryan Kemper)
[19:51:04] <wikibugs>	 (CR) JHathaway: [C: +2] dup otrs dummy password to vrts for rename [labs/private] - https://gerrit.wikimedia.org/r/835682 (owner: JHathaway)
[19:51:06] <wikibugs>	 (CR) JHathaway: [V: +2 C: +2] dup otrs dummy password to vrts for rename [labs/private] - https://gerrit.wikimedia.org/r/835682 (owner: JHathaway)
[19:51:40] <wikibugs>	 (CR) Ryan Kemper: [C: +2] "Thanks @Jbond!" [puppet] - https://gerrit.wikimedia.org/r/834515 (owner: Jbond)
[19:59:14] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220927T2000)
[20:00:05] <jouncebot>	 kemayo, ryankemper, subbu, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <subbu>	 o/
[20:00:13] <ryankemper>	 \o, around
[20:00:22] <Kemayo>	 👋🏻
[20:00:25] <koi>	 o/
[20:00:45] <TheresNoTime>	 hey all! :)
[20:00:50] * TheresNoTime can deploy!
[20:00:57] <cjming>	 \o/
[20:01:08] <TheresNoTime>	 (gimme a sec)
[20:02:13] <TheresNoTime>	 Kemayo: I'll start with your patches :)
[20:02:41] <Kemayo>	 Sounds good
[20:02:41] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on centrallog1002.eqiad.wmnet with reason: host reimage
[20:02:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835635 (https://phabricator.wikimedia.org/T302108) (owner: DLynch)
[20:02:57] <urbanecm>	 o/
[20:03:11] <urbanecm>	 hi TheresNoTime, looks like you've got it all in your hands :)
[20:03:25] <TheresNoTime>	 urbanecm: yup ^^
[20:03:47] <wikibugs>	 (Merged) jenkins-bot: MobileWebUIActions sample rate to 1 on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835635 (https://phabricator.wikimedia.org/T302108) (owner: DLynch)
[20:04:16] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835635|MobileWebUIActions sample rate to 1 on testwiki (T302108)]]
[20:04:20] <stashbot>	 T302108: Ensure logging is in place to compare MobileFrontend and DiscussionTools new topic and new comment completion rates - https://phabricator.wikimedia.org/T302108
[20:04:40] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:835635|MobileWebUIActions sample rate to 1 on testwiki (T302108)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:04:45] <TheresNoTime>	 Kemayo: 835635 is live on 1002 ^
[20:05:19] <Kemayo>	 TheresNoTime: Looks good there.
[20:05:26] <TheresNoTime>	 Syncing
[20:06:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:07:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:07:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:08:05] <wikibugs>	 (PS2) Samtar: Enable DiscussionTools reply button visual enhancements on cswiki+huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626) (owner: DLynch)
[20:08:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:10:03] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835635|MobileWebUIActions sample rate to 1 on testwiki (T302108)]] (duration: 05m 46s)
[20:10:07] <stashbot>	 T302108: Ensure logging is in place to compare MobileFrontend and DiscussionTools new topic and new comment completion rates - https://phabricator.wikimedia.org/T302108
[20:10:11] <TheresNoTime>	 Kemayo: that's sync'd if you want to check again, moving onto 835648
[20:10:55] <Kemayo>	 Continues to look good off-debug.
[20:11:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626) (owner: DLynch)
[20:13:15] <TheresNoTime>	 (CI feeling a bit slow this evening...)
[20:13:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:14:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:14:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:14:59] <wikibugs>	 (Merged) jenkins-bot: Enable DiscussionTools reply button visual enhancements on cswiki+huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835648 (https://phabricator.wikimedia.org/T315626) (owner: DLynch)
[20:15:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:15:21] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835648|Enable DiscussionTools reply button visual enhancements on cswiki+huwiki (T315626)]]
[20:15:24] <stashbot>	 T315626: [Config Change] Add Clear Affordances to beta feature at partner wikis (desktop) - https://phabricator.wikimedia.org/T315626
[20:15:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host centrallog1002.eqiad.wmnet with OS bullseye
[20:15:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye completed: -...
[20:15:45] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:835648|Enable DiscussionTools reply button visual enhancements on cswiki+huwiki (T315626)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:15:59] <TheresNoTime>	 Kemayo: on mwdebug :)
[20:16:06] <Kemayo>	 TheresNoTime: It looks good there.
[20:16:21] <TheresNoTime>	 syncin'
[20:16:39] <wikibugs>	 (PS2) Samtar: Disable MobileFrontend default editor a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: DLynch)
[20:16:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (Cmjohnson)
[20:17:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (Cmjohnson) Open→Resolved
[20:18:05] <wikibugs>	 (PS1) JHathaway: Fix config template for OTRS or VRTS aliases [puppet] - https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749)
[20:18:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Cmjohnson) @Joe which partman recipe do you need for these?
[20:19:37] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: JHathaway)
[20:19:45] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: JHathaway)
[20:20:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:20:20] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835648|Enable DiscussionTools reply button visual enhancements on cswiki+huwiki (T315626)]] (duration: 04m 58s)
[20:20:34] <wikibugs>	 (CR) CI reject: [V: -1] Fix config template for OTRS or VRTS aliases [puppet] - https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749) (owner: JHathaway)
[20:20:46] <TheresNoTime>	 Kemayo: (same again while I set 835206 going) :D
[20:21:09] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: DLynch)
[20:21:10] <Kemayo>	 TheresNoTime: Yup, good off-debug.
[20:21:19] <icinga-wm>	 RECOVERY - Check systemd state on cloudbackup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:53] <wikibugs>	 (Merged) jenkins-bot: Disable MobileFrontend default editor a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: DLynch)
[20:22:03] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:22:18] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835206|Disable MobileFrontend default editor a/b test (T302356)]]
[20:22:21] <stashbot>	 T302356: Deploy config change to "turn off" mobile VE A/B test - https://phabricator.wikimedia.org/T302356
[20:22:31] <wikibugs>	 (PS4) Samtar: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: Ryan Kemper)
[20:23:16] <wikibugs>	 (PS2) JHathaway: Fix config template for OTRS or VRTS aliases [puppet] - https://gerrit.wikimedia.org/r/835687 (https://phabricator.wikimedia.org/T318749)
[20:23:55] <TheresNoTime>	 Interesting... scap just err'd while doing 835206.. `'mwscript eval.php --wiki aawiki' generated unexpected output: Notice: Undefined variable: wmgMFDefaultEditor in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 2828`
[20:24:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: DLynch)
[20:24:35] <TheresNoTime>	 just going to try it again..
[20:24:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:24:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:24:45] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835206|Disable MobileFrontend default editor a/b test (T302356)]]
[20:25:37] <TheresNoTime>	 Kemayo: ^ FYI.. going to try doing it manually 
[20:25:47] <Kemayo>	 I can amend the patch -- I can see why it'd happen.
[20:25:58] <TheresNoTime>	 ah, yes please then :)
[20:26:05] <Kemayo>	 Ah, but already merged. New patch I guess, one second!
[20:26:33] <wikibugs>	 (CR) Samtar: "Scap failure on deploy: `'mwscript eval.php --wiki aawiki' generated unexpected output: Notice: Undefined variable: wmgMFDefaultEditor in " [mediawiki-config] - https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) (owner: DLynch)
[20:27:40] <wikibugs>	 (PS1) DLynch: Add wmgMFDefaultEditor back in for future use [mediawiki-config] - https://gerrit.wikimedia.org/r/835689
[20:27:50] <wikibugs>	 (CR) CI reject: [V: -1] Add wmgMFDefaultEditor back in for future use [mediawiki-config] - https://gerrit.wikimedia.org/r/835689 (owner: DLynch)
[20:28:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:28:37] <TheresNoTime>	 Kemayo: I'm not entirely sure where in the scap process this failed (it's prior to deployment to mwdebug) so I'd like to do a revert of 835206 to get us back to a known state. You're doing an entirely new patch, correct?
[20:29:08] <Kemayo>	 TheresNoTime: Sure, go for it. I can do the whole thing again in another backport window rather than delaying the others.
[20:29:09] <dancy>	 TheresNoTime: It failed before syncing out
[20:29:41] <wikibugs>	 (PS2) DLynch: Add wmgMFDefaultEditor back in for future use [mediawiki-config] - https://gerrit.wikimedia.org/r/835689
[20:29:54] <TheresNoTime>	 Kemayo: Okay, good idea. Unless you have a different suggestion, dancy, I'm going to revert 835206
[20:30:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:30:09] <Kemayo>	 TheresNoTime: I do have https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/835689 as a followup that should probably fix it.
[20:30:35] <TheresNoTime>	 ack, looking, could just merge that and go from there..
[20:30:43] <dancy>	 It's too bad that such a change passed CI
[20:31:05] <TheresNoTime>	 dancy: second opinion on merging 835689 and proceeding?
[20:31:23] <dancy>	 Seems reasonable to merge.
[20:31:32] <TheresNoTime>	 ack, will do
[20:31:49] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835689 (owner: DLynch)
[20:32:04] <Kemayo>	 dancy: Yeah, it's presumably because the spot that gives a warning is one that relies on the config's whole setting-lots-of-globals behavior, so it's relatively hard to test without actually running the file... which I assume we don't do in this repo.
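Kemayo's point is that this warning only shows up when the config actually executes. In this window, the problem was caught because scap ran `mwscript eval.php --wiki aawiki` and treated the PHP notice it printed as unexpected output. Below is a minimal Python sketch of that kind of "run it and fail on any output" check. The function name `smoke_check` and the subprocess-based approach are illustrative assumptions, not scap's actual implementation.

```python
import subprocess
import sys
from typing import Sequence


def smoke_check(cmd: Sequence[str]) -> None:
    """Run cmd; raise RuntimeError if it exits non-zero or prints anything.

    Hypothetical sketch of a scap-style check. In this log the PHP
    "Undefined variable" notice was printed rather than stopping execution,
    so treating *any* output as a failure is what surfaces it.
    """
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    output = (result.stdout + result.stderr).strip()
    command = " ".join(cmd)
    if result.returncode != 0:
        raise RuntimeError(f"'{command}' exited with status {result.returncode}: {output}")
    if output:
        raise RuntimeError(f"'{command}' generated unexpected output: {output}")


if __name__ == "__main__":
    # Stand-ins for "run the config": one silent script, one that prints a notice.
    smoke_check([sys.executable, "-c", "pass"])  # returns None, no error
    try:
        smoke_check([sys.executable, "-c", "print('Notice: Undefined variable: x')"])
    except RuntimeError as err:
        print(err)
```

Because a check like this runs the code instead of inspecting it, it only sees problems on paths that actually execute. That matches Kemayo's observation that the undefined variable was hard to catch without running the file.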
[20:32:33] <wikibugs>	 (Merged) jenkins-bot: Add wmgMFDefaultEditor back in for future use [mediawiki-config] - https://gerrit.wikimedia.org/r/835689 (owner: DLynch)
[20:32:56] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835689|Add wmgMFDefaultEditor back in for future use]]
[20:33:12] <TheresNoTime>	 (that worked)
[20:33:20] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:835689|Add wmgMFDefaultEditor back in for future use]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:33:28] <wikibugs>	 (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall)
[20:33:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:33:34] <TheresNoTime>	 Kemayo: on mwdebug :)
[20:33:56] <wikibugs>	 (PS5) Samtar: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: Ryan Kemper)
[20:34:15] <TheresNoTime>	 ryankemper: it'll be your patch next fyi
[20:34:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:34:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:34:34] <Kemayo>	 TheresNoTime: Looks good there.
[20:34:39] <ryankemper>	 TheresNoTime: cool. I don't have checks to run on debug so you can sync it fully when ready to
[20:34:46] <TheresNoTime>	 Kemayo: syncin'
[20:34:51] <TheresNoTime>	 ryankemper: ack :)
[20:35:28] <wikibugs>	 (PS1) Bking: k8s: Limit envoy metrics scraped from k8s [puppet] - https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705)
[20:35:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:38:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:38:59] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835689|Add wmgMFDefaultEditor back in for future use]] (duration: 06m 02s)
[20:39:21] <TheresNoTime>	 Kemayo: all sync'd :)
[20:39:48] <Kemayo>	 TheresNoTime: Looks good. Sorry for the need to scramble a bit there!
[20:40:00] <TheresNoTime>	 no worries! :D
[20:40:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: Ryan Kemper)
[20:40:52] <wikibugs>	 (Merged) jenkins-bot: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) (owner: Ryan Kemper)
[20:41:16] <TheresNoTime>	 koi: I'm going to do your patch next just fyi :)
[20:41:17] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]]
[20:41:19] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/835691 (https://phabricator.wikimedia.org/T318705) (owner: Bking)
[20:41:21] <stashbot>	 T318270: Avoid overloading individual Elastic nodes with popular shards - https://phabricator.wikimedia.org/T318270
[20:41:41] <logmsgbot>	 !log samtar@deploy1002 samtar and ryankemper: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:41:54] <subbu>	 TheresNoTime, I assume after that will be my patches?
[20:42:10] <TheresNoTime>	 (syncin' 833860)
[20:42:22] <ryankemper>	 thanks!
[20:42:44] <TheresNoTime>	 subbu: I was going to leave yours until last as I believe they can take a little while to merge in comparison to the config patches :)
[20:43:13] <subbu>	 sounds good.
[20:43:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:43:45] <icinga-wm>	 RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:44:11] <TheresNoTime>	 depending on if koi is around when this one finishes syncing of course :)
[20:44:47] <koi>	 TheresNoTime: I'm around.
[20:44:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34967 and previous config saved to /var/cache/conftool/dbconfig/20220927-204446-ladsgroup.json
[20:44:51] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[20:44:55] <TheresNoTime>	 ^^
[20:45:15] <wikibugs>	 (PS2) Samtar: romdwikimedia: Enable subpages in NS0 [mediawiki-config] - https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) (owner: Stang)
[20:45:25] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mc-wf1001.mgmt.eqiad.wmnet with reboot policy FORCED
[20:45:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mc-wf1002.mgmt.eqiad.wmnet with reboot policy FORCED
[20:46:31] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] (duration: 05m 14s)
[20:46:34] <stashbot>	 T318270: Avoid overloading individual Elastic nodes with popular shards - https://phabricator.wikimedia.org/T318270
[20:46:37] <TheresNoTime>	 ryankemper: all sync'd :)
[20:47:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) (owner: Stang)
[20:48:03] <wikibugs>	 (Merged) jenkins-bot: romdwikimedia: Enable subpages in NS0 [mediawiki-config] - https://gerrit.wikimedia.org/r/835681 (https://phabricator.wikimedia.org/T318491) (owner: Stang)
[20:48:27] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835681|romdwikimedia: Enable subpages in NS0 (T318491)]]
[20:48:31] <stashbot>	 T318491: Enable subpages in NS_MAIN on romd.wikimedia.org - https://phabricator.wikimedia.org/T318491
[20:48:51] <logmsgbot>	 !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:835681|romdwikimedia: Enable subpages in NS0 (T318491)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:48:56] <TheresNoTime>	 koi: live on mwdebug
[20:49:49] <koi>	 TheresNoTime: subpages in ns0 are correctly shown, so LGTM
[20:49:55] <TheresNoTime>	 syncing
[20:50:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:50:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:51:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:52:36] <TheresNoTime>	 subbu: thank you for waiting, and apologies for keeping you around until the last minute.. I'm going to start with 835593 once this finishes syncing
[20:52:48] <subbu>	 ok.
[20:53:09] <subbu>	 actually let's start with 835594 ... wmf.2
[20:53:15] <TheresNoTime>	 sure :)
[20:53:20] <subbu>	 that lets me verify that the patch actually fixes the bug.
[20:53:41] <subbu>	 wmf.3 isn't on the right wikis yet where this bug kicks in.
[20:53:56] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835681|romdwikimedia: Enable subpages in NS0 (T318491)]] (duration: 05m 29s)
[20:53:58] <TheresNoTime>	 ack :) and koi, all sync'd
[20:54:00] <stashbot>	 T318491: Enable subpages in NS_MAIN on romd.wikimedia.org - https://phabricator.wikimedia.org/T318491
[20:54:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:54:12] <koi>	 thanks!
[20:54:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/TextExtracts] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835594 (https://phabricator.wikimedia.org/T318727) (owner: Subramanya Sastry)
[20:56:05] <TheresNoTime>	 subbu: I'm happy to keep the deployment window open until your patches are deployed, if you're happy to stick around?
[20:56:12] <subbu>	 yes.
[20:56:22] <subbu>	 thanks! :)
[20:56:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:57:01] <TheresNoTime>	 it's the least I can do :) 835594 is now merging, ~12 minutes
[20:57:34] <wikibugs>	 (Merged) jenkins-bot: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835594 (https://phabricator.wikimedia.org/T318727) (owner: Subramanya Sastry)
[20:57:38] <jeena>	 In case you want to speed things up in the future, you can +2 ahead of time while your other patches are syncing and still use scap backport to finish the job
[20:57:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:57:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:58:01] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835594|Remove figures from text extracts (T318727)]]
[20:58:05] <stashbot>	 T318727: Recent update caused image title to appear in text extracts - https://phabricator.wikimedia.org/T318727
[20:58:12] <TheresNoTime>	 (that was a quick 12 minutes...)
[20:58:24] <TheresNoTime>	 jeena: oh good idea, thank you!
[20:58:25] <logmsgbot>	 !log samtar@deploy1002 samtar and ssastry: Backport for [[gerrit:835594|Remove figures from text extracts (T318727)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:58:35] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf1002.mgmt.eqiad.wmnet with reboot policy FORCED
[20:58:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-wf1001.mgmt.eqiad.wmnet with reboot policy FORCED
[20:58:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:58:47] <TheresNoTime>	 subbu: this is live on mwdebug1002, could you test? :)
[20:58:49] <jeena>	 np :)
[20:58:55] <subbu>	 on it.
[20:59:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:59:48] <TheresNoTime>	 !log extending UTC late backport window
[20:59:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34968 and previous config saved to /var/cache/conftool/dbconfig/20220927-205953-ladsgroup.json
[21:00:48] <subbu>	 verified fixed.
[21:00:52] <subbu>	 okay to sync.
[21:00:56] <TheresNoTime>	 great, syncing
[21:01:41] <subbu>	 the other one to wmf.3 can be merged and synced as well .. it will just ride the train this week to those affected wikis.
[21:02:10] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploy" [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry)
[21:02:24] <TheresNoTime>	 (ack)
[21:03:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:04:12] <wikibugs>	 (03Merged) 10jenkins-bot: Remove figures from text extracts [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry)
[21:05:00] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835594|Remove figures from text extracts (T318727)]] (duration: 06m 58s)
[21:05:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/TextExtracts] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/835593 (https://phabricator.wikimedia.org/T318727) (owner: 10Subramanya Sastry)
[21:05:42] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:835593|Remove figures from text extracts (T318727)]]
[21:06:06] <logmsgbot>	 !log samtar@deploy1002 samtar and ssastry: Backport for [[gerrit:835593|Remove figures from text extracts (T318727)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:06:18] <TheresNoTime>	 subbu: did you want to test 835593 as well, or is there nothing you're able to test on wmf.3 wikis?
[21:06:35] <stashbot>	 T318727: Recent update caused image title to appear in text extracts - https://phabricator.wikimedia.org/T318727
[21:06:37] <subbu>	 no, nothing to test with that one. okay to sync.
[21:06:42] <TheresNoTime>	 ack
[21:08:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:08:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:09:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:10:35] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835593|Remove figures from text extracts (T318727)]] (duration: 04m 53s)
[21:10:52] <TheresNoTime>	 subbu: all deployed :) thanks again for your patience!
[21:10:59] <subbu>	 \o/ ty
[21:12:10] <TheresNoTime>	 !log closing UTC late backport window
[21:12:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:14:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:14:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:15:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34969 and previous config saved to /var/cache/conftool/dbconfig/20220927-211500-ladsgroup.json
[21:15:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:19:29] <wikibugs>	 (03PS1) 10Cmjohnson: adding mc-wf to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835701 (https://phabricator.wikimedia.org/T313963)
[21:21:41] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] adding mc-wf to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835701 (https://phabricator.wikimedia.org/T313963) (owner: 10Cmjohnson)
[21:23:53] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:30:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T314041)', diff saved to https://phabricator.wikimedia.org/P34970 and previous config saved to /var/cache/conftool/dbconfig/20220927-213006-ladsgroup.json
[21:30:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[21:30:11] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[21:30:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[21:30:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T314041)', diff saved to https://phabricator.wikimedia.org/P34971 and previous config saved to /var/cache/conftool/dbconfig/20220927-213028-ladsgroup.json
[21:44:14] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mc-wf1001.eqiad.wmnet with OS bullseye
[21:44:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye
[21:47:33] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mc-wf1002.eqiad.wmnet with OS bullseye
[21:47:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye
[21:55:14] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage
[21:58:31] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage
[21:58:33] <wikibugs>	 (03PS1) 10Ebernhardson: dumpcirrussearch.sh: Replace gzip with lbzip2 [puppet] - 10https://gerrit.wikimedia.org/r/835705
[21:58:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage
[22:02:09] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage
[22:03:12] <wikibugs>	 (03CR) 10Ebernhardson: "I'm not sure if it would be appropriate to maintain both .gz and .bz2 files here (like wikidata dumps do).  Not opposed, but not sure if i" [puppet] - 10https://gerrit.wikimedia.org/r/835705 (owner: 10Ebernhardson)
[22:13:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1001.eqiad.wmnet with OS bullseye
[22:13:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1001.eqiad.wmnet with OS bullseye completed: - mc-wf1001 (**PASS**...
[22:16:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1002.eqiad.wmnet with OS bullseye
[22:17:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mc-wf1002.eqiad.wmnet with OS bullseye completed: - mc-wf1002 (**PASS**...
[22:24:41] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:18:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Cmjohnson)
[23:19:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Cmjohnson) 05Open→03Resolved @joe all yours, figured it to be the same partman recipe as memcache