[00:16:15] <wikibugs>	 SRE, MW-on-K8s, MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (Tgr) >>! In T34220...
[00:18:54] <icinga-wm>	 RECOVERY - Druid overlord on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[00:28:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[00:28:36] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet
[00:28:52] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:28:58] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:29:31] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:38:05] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:38:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962229
[00:38:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962229 (owner: TrainBranchBot)
[00:46:32] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962229 (owner: TrainBranchBot)
[00:57:15] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:03:50] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T347919 (phaultfinder)
[01:06:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:06:32] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:15:13] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:15:34] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:18:15] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:21:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1025.eqiad.wmnet
[01:22:58] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[01:23:09] <andrewbogott>	 ugh, sorry cwhite. unlocked now, finally
[01:29:20] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[01:30:32] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add radosgw apis to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/962707 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott)
[01:32:46] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:33:38] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1025.eqiad.wmnet
[01:33:39] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:34:38] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[01:35:16] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1033.eqiad.wmnet
[01:36:52] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[01:48:10] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1033.eqiad.wmnet
[01:48:11] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1033.eqiad.wmnet
[01:49:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:53:13] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:54:10] <wikibugs>	 SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0200)
[02:07:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.29 [core] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/962230 (https://phabricator.wikimedia.org/T347080)
[02:07:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.29 [core] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/962230 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[02:07:48] <wikibugs>	 (PS5) Krinkle: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859)
[02:07:51] <wikibugs>	 (CR) Krinkle: [C: +2] noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) (owner: Krinkle)
[02:07:55] <wikibugs>	 (PS3) Krinkle: noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398)
[02:07:57] <wikibugs>	 (CR) Krinkle: [C: +2] noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398) (owner: Krinkle)
[02:09:10] <wikibugs>	 (CR) CI reject: [V: -1] noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398) (owner: Krinkle)
[02:09:12] <wikibugs>	 (Merged) jenkins-bot: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) (owner: Krinkle)
[02:09:14] <wikibugs>	 (Merged) jenkins-bot: noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398) (owner: Krinkle)
[02:10:24] <wikibugs>	 (PS5) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245
[02:14:32] <wikibugs>	 (PS1) Krinkle: noc: Fix undefined function str_starts_with on db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/962742
[02:14:41] <wikibugs>	 (CR) Krinkle: [C: +2] noc: Fix undefined function str_starts_with on db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/962742 (owner: Krinkle)
[02:15:29] <wikibugs>	 (Merged) jenkins-bot: noc: Fix undefined function str_starts_with on db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/962742 (owner: Krinkle)
[02:17:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:17:51] <logmsgbot>	 !log krinkle@deploy2002 Synchronized docroot/noc/: (no justification provided) (duration: 08m 03s)
[02:22:02] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:22:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.29 [core] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/962230 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[02:22:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (18) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:25:08] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:07] <wikibugs>	 (PS1) Andrew Bogott: radosgw: include a few missing pieces for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/962743 (https://phabricator.wikimedia.org/T276961)
[02:30:25] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Enable logging of caught Redis exceptions to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/962725 (https://phabricator.wikimedia.org/T347916) (owner: Krinkle)
[02:30:27] <wikibugs>	 (CR) Krinkle: [C: +2] Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle)
[02:31:10] <wikibugs>	 (Merged) jenkins-bot: Profiler: Enable logging of caught Redis exceptions to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/962725 (https://phabricator.wikimedia.org/T347916) (owner: Krinkle)
[02:31:13] <wikibugs>	 (Merged) jenkins-bot: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle)
[02:33:15] <wikibugs>	 (CR) Andrew Bogott: [C: +2] radosgw: include a few missing pieces for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/962743 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott)
[02:33:38] <logmsgbot>	 !log krinkle@deploy2002 Started scap: (no justification provided)
[02:34:06] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:38:47] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:41:12] <logmsgbot>	 !log krinkle@deploy2002 Finished scap: (no justification provided) (duration: 07m 34s)
[02:41:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:44:51] <wikibugs>	 (PS1) Andrew Bogott: Remove profile::cloudceph::client::rbd_glance [puppet] - https://gerrit.wikimedia.org/r/962744 (https://phabricator.wikimedia.org/T276961)
[02:46:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (16) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:49:00] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove profile::cloudceph::client::rbd_glance [puppet] - https://gerrit.wikimedia.org/r/962744 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott)
[03:00:06] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0300)
[03:01:20] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:53] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:42] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:21:38] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:22:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:28:12] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:40:14] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:41:12] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:46:10] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[03:46:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[03:46:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T343198)', diff saved to https://phabricator.wikimedia.org/P52802 and previous config saved to /var/cache/conftool/dbconfig/20231003-034640-arnaudb.json
[03:46:44] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:05:26] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:05:35] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:05:44] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:07:55] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:08:06] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:08:14] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:09:00] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:09:56] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:10:48] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:11:32] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[04:11:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:12:27] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[04:13:17] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[04:16:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:20:24] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[04:20:41] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[04:20:55] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[04:21:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:50:42] <wikibugs>	 (PS1) Ilias Sarantopoulos: httpbb(liftwing): remove deprecated servers from tests [puppet] - https://gerrit.wikimedia.org/r/962752
[05:46:28] <kart_>	 Is it OK to deploy MinT in a few minutes?
[05:49:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:08] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on druid1009.eqiad.wmnet with reason: Downtime as we set up the host to join the Druid and ZooKeeper cluster
[05:52:22] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on druid1009.eqiad.wmnet with reason: Downtime as we set up the host to join the Druid and ZooKeeper cluster
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0600)
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0600).
[06:17:42] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:58] <wikibugs>	 (CR) Kevin Bazira: [C: +1] "Thank you for cleaning these up!" [puppet] - https://gerrit.wikimedia.org/r/962752 (owner: Ilias Sarantopoulos)
[06:31:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:53] * kart_ will update MinT; seems OK to go ahead..
[06:40:21] <wikibugs>	 (CR) KartikMistry: [C: +2] Update MinT to 2023-09-28-043052-production [deployment-charts] - https://gerrit.wikimedia.org/r/961977 (https://phabricator.wikimedia.org/T343450) (owner: KartikMistry)
[06:41:18] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2023-09-28-043052-production [deployment-charts] - https://gerrit.wikimedia.org/r/961977 (https://phabricator.wikimedia.org/T343450) (owner: KartikMistry)
[06:42:34] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:42:41] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:42:59] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:45:02] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:32] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:46:19] <wikibugs>	 (PS1) Ayounsi: Add "Auto-Submitted" header to dbbackup scripts [puppet] - https://gerrit.wikimedia.org/r/962940 (https://phabricator.wikimedia.org/T347835)
[06:51:48] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:56:00] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:59:28] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0700).
[07:00:05] <jouncebot>	 Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:31] <Superpes>	 Hi :)
[07:02:39] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, Bengali-Sites, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) The important takeaway from this (as per our discussion) was this bit:  //Google doesn't guarantee that it will cr...
[07:03:18] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[07:03:45] <kart_>	 !log Updated MinT to 2023-09-28-043052-production (T343450, T341478)
[07:03:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:59] <stashbot>	 T343450: Enable MinT for closely-related languages based on community input - https://phabricator.wikimedia.org/T343450
[07:03:59] <stashbot>	 T341478: Port the markup transfer feature of cxserver to MinT - https://phabricator.wikimedia.org/T341478
[07:04:10] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:11:09] <Superpes>	 Need a priority deployment for the throttle exemption patch (Editathon just started) - if someone can deploy please let me know :) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962729
[07:12:00] <wikibugs>	 (PS6) Superpes15: [fiwiki] Add an editautoreviewprotected level protection [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069)
[07:13:58] <taavi>	 Superpes: hey, looking
[07:14:31] <Superpes>	 Hi taavi thanks ;)
[07:15:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) (owner: Superpes15)
[07:15:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) (owner: Superpes15)
[07:16:42] <wikibugs>	 (Merged) jenkins-bot: [enwiki] Throttle exemption for Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) (owner: Superpes15)
[07:16:45] <wikibugs>	 (Merged) jenkins-bot: [fiwiki] Add an editautoreviewprotected level protection [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) (owner: Superpes15)
[07:17:46] <taavi>	 uhhh
[07:21:09] <taavi>	 scap backport is having an issue, so I'm doing this the old-fashioned way
[07:21:33] <Superpes>	 Uhhhhhhhhh :O
[07:22:11] <Superpes>	 Is something down? I see also "Error while fetching results for wm-zuul-status: TypeError: Failed to fetch" :/
[07:23:51] <taavi>	 that's probably unrelated..
[07:24:53] <Superpes>	 Lol
[07:27:36] <logmsgbot>	 !log taavi@deploy2002 Started scap: T347874 and T347069
[07:27:41] <stashbot>	 T347069: Add new autoreview page protection level for Finnish Wikipedia - https://phabricator.wikimedia.org/T347069
[07:27:41] <stashbot>	 T347874: Mass account creation request throttling exception (3rd October 2023) - https://phabricator.wikimedia.org/T347874
[07:30:18] <taavi>	 ^ I'm fairly sure this will be taking a while
[07:31:33] <Superpes>	 Np :P I'm only doing an internship in a hospital but there are few people now lol
[07:39:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] thanos: remove thanos components from thanos::frontend role [puppet] - https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: Filippo Giunchedi)
[07:40:01] <logmsgbot>	 !log taavi@deploy2002 taavi: T347874 and T347069 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:40:07] <stashbot>	 T347069: Add new autoreview page protection level for Finnish Wikipedia - https://phabricator.wikimedia.org/T347069
[07:40:08] <stashbot>	 T347874: Mass account creation request throttling exception (3rd October 2023) - https://phabricator.wikimedia.org/T347874
[07:40:15] <taavi>	 Superpes: please test
[07:41:56] <Superpes>	 Looks fine taavi
[07:42:19] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1001.eqiad.wmnet with OS bullseye
[07:42:27] <wikibugs>	 SRE-swift-storage, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was s...
[07:42:39] <logmsgbot>	 !log taavi@deploy2002 taavi: Continuing with sync
[07:48:07] <wikibugs>	 (PS1) Majavah: hieradata: cloud: provision wmf-ca-certificates.crt [puppet] - https://gerrit.wikimedia.org/r/962992
[07:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:50:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:31] <wikibugs>	 Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi)
[07:52:12] <wikibugs>	 Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi) Seems to be caused by https://gerrit.wikimedia.org/...
[07:52:25] <wikibugs>	 (PS2) Majavah: hieradata: cloud: provision wmf-ca-certificates.crt [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934)
[07:53:26] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43821/console" [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: Majavah)
[07:54:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:55:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:55:53] <wikibugs>	 (CR) Elukey: [C: +2] httpbb(liftwing): remove deprecated servers from tests [puppet] - https://gerrit.wikimedia.org/r/962752 (owner: Ilias Sarantopoulos)
[07:56:58] <logmsgbot>	 !log taavi@deploy2002 Finished scap: T347874 and T347069 (duration: 29m 22s)
[07:57:03] <stashbot>	 T347069: Add new autoreview page protection level for Finnish Wikipedia - https://phabricator.wikimedia.org/T347069
[07:57:03] <stashbot>	 T347874: Mass account creation request throttling exception (3rd October 2023) - https://phabricator.wikimedia.org/T347874
[07:57:10] <taavi>	 Superpes: and it's finally live
[07:57:34] <Superpes>	 Oh wow taavi :O Thanks :3
[07:57:48] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:59:51] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: host reimage
[08:00:08] <Superpes>	 Just for confirmation... Have you run resetAuthenticationThrottle.php? taavi
[08:00:24] <taavi>	 oh, right, I totally forgot
[08:00:30] <taavi>	 looking
[08:00:33] <jinxer-wm>	 (ProbeDown) resolved: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:38] <Superpes>	 Lol https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold
[08:01:41] <taavi>	 !log taavi@mwmaint2002 ~ $ mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip=155.232.7.202 # T347874
[08:01:44] <taavi>	 now I have
[08:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:57] <Superpes>	 Thanks again for your time :3
[08:03:04] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: host reimage
[08:03:59] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:07:25] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-thanos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:51] <godog>	 expected ^ thanos-fe reimage in progress
[08:08:57] <wikibugs>	 (PS1) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998
[08:09:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:09:43] <wikibugs>	 (PS2) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883)
[08:09:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:11:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[08:12:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[08:12:35] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:13:02] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:14:23] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:15:35] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:17:00] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:17:38] <wikibugs>	 SRE, Observability-Metrics: Grafana dashboard doesn't load anything - https://phabricator.wikimedia.org/T347936 (Fabfur)
[08:17:50] <wikibugs>	 (PS3) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883)
[08:18:21] <wikibugs>	 (CR) CI reject: [V: -1] dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: Majavah)
[08:19:26] <wikibugs>	 (PS1) Cathal Mooney: Move 185.15.57.8/29 to netbox-controlled DNS records [dns] - https://gerrit.wikimedia.org/r/963000 (https://phabricator.wikimedia.org/T347687)
[08:19:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: switch to wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[08:19:44] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, optional alternative solution inline" [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[08:19:53] <wikibugs>	 (PS4) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883)
[08:20:21] <wikibugs>	 (PS6) Ilias Sarantopoulos: ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[08:20:36] <wikibugs>	 (PS5) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883)
[08:24:11] <wikibugs>	 (PS1) Fabfur: purged: switch to unix socket for varnish requests [puppet] - https://gerrit.wikimedia.org/r/963002 (https://phabricator.wikimedia.org/T347837)
[08:24:39] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Move 185.15.57.8/29 to netbox-controlled DNS records [dns] - https://gerrit.wikimedia.org/r/963000 (https://phabricator.wikimedia.org/T347687) (owner: Cathal Mooney)
[08:25:09] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-thanos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:25] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1001.eqiad.wmnet with OS bullseye
[08:26:30] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[08:27:40] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-staging2001.codfw.wmnet with reason: Check chassis internals for GPU hosting
[08:27:54] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-staging2001.codfw.wmnet with reason: Check chassis internals for GPU hosting
[08:30:23] <wikibugs>	 (Abandoned) Fabfur: purged: switch to unix socket for varnish requests [puppet] - https://gerrit.wikimedia.org/r/963002 (https://phabricator.wikimedia.org/T347837) (owner: Fabfur)
[08:30:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "The PCC output seems to suggest this will break in deployment-prep, I'm not sure why though." [puppet] - https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[08:30:38] <wikibugs>	 (PS1) Gerrit maintenance bot: Add fon to langlist helper [dns] - https://gerrit.wikimedia.org/r/962231 (https://phabricator.wikimedia.org/T347935)
[08:32:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:59] <wikibugs>	 (PS1) Fabfur: purged: use unix socket for varnish in ulsfo [puppet] - https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837)
[08:34:45] <wikibugs>	 SRE, Traffic, Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (Fabfur)
[08:35:16] <wikibugs>	 SRE, Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (Aklapper)
[08:38:52] <wikibugs>	 (PS2) Elukey: Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278)
[08:40:51] <wikibugs>	 (CR) David Caro: "LGTM, is there an easy way to test this?" [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: Majavah)
[08:40:59] <wikibugs>	 (CR) David Caro: [C: +1] dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: Majavah)
[08:41:09] <wikibugs>	 (CR) Elukey: [C: +2] Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[08:44:04] <wikibugs>	 (CR) Majavah: [C: +2] dynamicproxy: delete backends when deleting route (2 comments) [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: Majavah)
[08:45:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:38] <wikibugs>	 (PS2) Ladsgroup: Add fon to langlist helper [dns] - https://gerrit.wikimedia.org/r/962231 (https://phabricator.wikimedia.org/T347935) (owner: Gerrit maintenance bot)
[08:48:43] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Add fon to langlist helper [dns] - https://gerrit.wikimedia.org/r/962231 (https://phabricator.wikimedia.org/T347935) (owner: Gerrit maintenance bot)
[08:50:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:54:44] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: Majavah)
[08:55:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:55:56] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663)
[08:58:17] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) (owner: Ilias Sarantopoulos)
[08:58:19] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[09:00:10] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) (owner: Ilias Sarantopoulos)
[09:01:00] <wikibugs>	 (Merged) jenkins-bot: ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) (owner: Ilias Sarantopoulos)
[09:05:45] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:06:13] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:06:39] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:07:02] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: Majavah)
[09:07:18] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1002.eqiad.wmnet with OS bullseye
[09:07:24] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[09:09:13] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] hieradata: cloud: provision wmf-ca-certificates.crt [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: Majavah)
[09:09:28] <wikibugs>	 (CR) Klausman: "This change is ready for review." [dns] - https://gerrit.wikimedia.org/r/963007 (owner: Klausman)
[09:09:40] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43826/console" [puppet] - https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: Fabfur)
[09:09:59] <wikibugs>	 (PS2) Klausman: VIPs: add DNS entries for new recommendation-api-ng service [dns] - https://gerrit.wikimedia.org/r/963007
[09:13:41] <wikibugs>	 (CR) Elukey: VIPs: add DNS entries for new recommendation-api-ng service (3 comments) [dns] - https://gerrit.wikimedia.org/r/963007 (owner: Klausman)
[09:14:10] <wikibugs>	 Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi) Open→Resolved a:taavi
[09:14:43] <jbond>	 thanks taavi
[09:15:46] <Amir1_>	 jouncebot: nowandnext
[09:15:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 44 minute(s)
[09:15:46] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1000)
[09:15:49] <Amir1_>	 cool
[09:16:04] <wikibugs>	 (PS3) Klausman: VIPs: add DNS entries for new recommendation-api-ng service [dns] - https://gerrit.wikimedia.org/r/963007
[09:16:21] <wikibugs>	 (CR) Klausman: VIPs: add DNS entries for new recommendation-api-ng service (3 comments) [dns] - https://gerrit.wikimedia.org/r/963007 (owner: Klausman)
[09:16:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Ditto as the other patch, as-is will conflict but tested and LGTM" [puppet] - https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:16:58] <wikibugs>	 (PS7) Ilias Sarantopoulos: ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[09:18:12] <wikibugs>	 (CR) CI reject: [V: -1] ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:19:20] <wikibugs>	 (PS1) Ladsgroup: Init config of Fon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/963010 (https://phabricator.wikimedia.org/T347935)
[09:20:56] <wikibugs>	 (CR) Ladsgroup: [C: +2] Init config of Fon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/963010 (https://phabricator.wikimedia.org/T347935) (owner: Ladsgroup)
[09:21:43] <wikibugs>	 (Merged) jenkins-bot: Init config of Fon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/963010 (https://phabricator.wikimedia.org/T347935) (owner: Ladsgroup)
[09:22:56] <wikibugs>	 (CR) Klausman: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/963009 (https://phabricator.wikimedia.org/T347278) (owner: Klausman)
[09:23:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:23:23] <wikibugs>	 (CR) Elukey: [C: +1] conftool-data: Add entry for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/963009 (https://phabricator.wikimedia.org/T347278) (owner: Klausman)
[09:23:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Will conflict but LGTM" [puppet] - https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:23:56] <wikibugs>	 (CR) Klausman: [C: +2] conftool-data: Add entry for recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/963009 (https://phabricator.wikimedia.org/T347278) (owner: Klausman)
[09:24:01] <wikibugs>	 (CR) Gehel: [C: +1] "trivial enough" [deployment-charts] - https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) (owner: DCausse)
[09:24:37] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: host reimage
[09:26:02] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Will conflict but LGTM" [puppet] - https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:26:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:26:44] <claime>	 !log Draining kubernetes2010.codfw.wmnet for reboot to change BIOS setting
[09:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: Jbond)
[09:27:07] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: host reimage
[09:27:36] <wikibugs>	 (PS1) Ladsgroup: Add fonwiki to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/963012 (https://phabricator.wikimedia.org/T347935)
[09:27:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:27:51] <wikibugs>	 (CR) Klausman: ml-alerts: add alert for increased ORESFetchScoreJob (1 comment) [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:28:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] Add fonwiki to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/963012 (https://phabricator.wikimedia.org/T347935) (owner: Ladsgroup)
[09:28:40] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: BIOS setting change
[09:28:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: BIOS setting change
[09:29:09] <wikibugs>	 (CR) Klausman: ml-alerts: add alert for increased ORESFetchScoreJob (1 comment) [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:29:37] <wikibugs>	 (CR) Elukey: ml-alerts: add alert for increased ORESFetchScoreJob (1 comment) [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:29:46] <wikibugs>	 (Merged) jenkins-bot: Add fonwiki to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/963012 (https://phabricator.wikimedia.org/T347935) (owner: Ladsgroup)
[09:30:28] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Creating fonwiki (T347935)
[09:30:33] <stashbot>	 T347935: Create Wikipedia Fon - https://phabricator.wikimedia.org/T347935
[09:31:13] <icinga-wm>	 RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.46 ms
[09:32:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:34:23] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (Ladsgroup) a:Ladsgroup I think the name must be created under `glam-eu` instead. See https://meta.wikimedia.org/wiki/Mailing_lists/Standardization and on top glam-us and glam...
[09:35:07] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43827/console" [puppet] - https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: Fabfur)
[09:35:51] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:36:03] <claime>	 That's me ^
[09:36:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:37:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:19] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:57] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:38:03] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Creating fonwiki (T347935) (duration: 07m 34s)
[09:38:06] <stashbot>	 T347935: Create Wikipedia Fon - https://phabricator.wikimedia.org/T347935
[09:38:17] <wikibugs>	 (PS8) Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[09:38:31] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:38:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:39:23] <wikibugs>	 (CR) DCausse: [C: +2] rdf-streaming-updater: bump image version to flink-1.16.1-rdf-0.3.133 [deployment-charts] - https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) (owner: DCausse)
[09:39:41] <wikibugs>	 (CR) CI reject: [V: -1] ml-alerts: add alerts for ml services [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:40:08] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: bump image version to flink-1.16.1-rdf-0.3.133 [deployment-charts] - https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) (owner: DCausse)
[09:40:18] <wikibugs>	 SRE, Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (fgiunchedi) I checked the dashboard version history and I believe this was caused by the prometheus `global` deprecation from a while back. The easiest fix is to...
[09:41:25] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:42:19] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 173, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:42:23] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[09:42:45] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:42:53] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[09:42:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:43:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:43:59] <wikibugs>	 (CR) Jbond: puppet: Add new PuppetServer class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[09:44:04] <wikibugs>	 (PS10) Jbond: puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739
[09:45:04] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1002.eqiad.wmnet with OS bullseye
[09:45:09] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[09:46:39] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Clement_Goubert) Hi @Jhancock.wm just a heads up, I rebooted kubernetes2010 to change the CPU power management BIOS setting that was set to BIOS control instead of OS control, which meant we could...
[09:49:24] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2010.codfw.wmnet
[09:49:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2010.codfw.wmnet
[09:49:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:55] <claime>	 !log Uncordoned kubernetes2010.codfw.wmnet
[09:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:10] <wikibugs>	 10SRE, 10Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I've fixed the datasource and queries, so the dashboard now loads data again! The labels/legends might need so...
[09:54:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43829/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[09:56:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond)
[09:56:27] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) If that name is okay with you, let me know and I'll create the mailing list.
[09:59:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43830/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1000)
[10:00:10] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond)
[10:01:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:01:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:01:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:01:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:04:31] <wikibugs>	 (03PS14) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373)
[10:06:00] <wikibugs>	 (03PS8) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373)
[10:06:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:09:18] <wikibugs>	 (03PS10) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373)
[10:09:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:10:54] <wikibugs>	 (03PS15) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373)
[10:11:55] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43833/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[10:13:36] <wikibugs>	 (03PS3) 10Majavah: P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783
[10:14:56] <wikibugs>	 10SRE, 10Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (10Fabfur) Thank you very much!
[10:15:21] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1003.eqiad.wmnet with OS bullseye
[10:15:26] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[10:19:30] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.dns.netbox
[10:20:34] <wikibugs>	 (03PS11) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373)
[10:22:32] <wikibugs>	 (03PS1) 10Jbond: Revert "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/962221
[10:22:36] <wikibugs>	 (03PS1) 10Jbond: Revert "prometheus: switch to wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962222
[10:22:39] <wikibugs>	 (03PS1) 10Jbond: Revert "P:prometheus::ops: convert to using wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962223
[10:22:42] <wikibugs>	 (03PS1) 10Jbond: Revert "wmflib::get_clusters: create a puppet version of get_clu..." [puppet] - 10https://gerrit.wikimedia.org/r/962224
[10:23:35] <wikibugs>	 (03PS2) 10Stevemunene: druid: Bring druid1010.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042)
[10:23:37] <wikibugs>	 (03PS2) 10Stevemunene: druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042)
[10:23:41] <wikibugs>	 (03PS2) 10Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042)
[10:26:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/962221 (owner: 10Jbond)
[10:30:57] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:32:23] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.dns.netbox
[10:32:36] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: new_rings.tar.bz2 not found after host reimage - https://phabricator.wikimedia.org/T347964 (10fgiunchedi)
[10:32:43] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: host reimage
[10:34:51] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix katran-test.svc.eqiad.wmnet IP allocation - vgutierrez@cumin1001"
[10:35:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[10:35:59] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: host reimage
[10:36:03] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix katran-test.svc.eqiad.wmnet IP allocation - vgutierrez@cumin1001"
[10:36:03] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:36:21] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: install toolforge-builds-cli [puppet] - 10https://gerrit.wikimedia.org/r/963018 (https://phabricator.wikimedia.org/T334588)
[10:36:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/962221 (owner: 10Jbond)
[10:36:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "prometheus: switch to wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962222 (owner: 10Jbond)
[10:37:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "P:prometheus::ops: convert to using wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962223 (owner: 10Jbond)
[10:37:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "wmflib::get_clusters: create a puppet version of get_clu..." [puppet] - 10https://gerrit.wikimedia.org/r/962224 (owner: 10Jbond)
[10:37:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:38:40] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] P:toolforge: install toolforge-builds-cli [puppet] - 10https://gerrit.wikimedia.org/r/963018 (https://phabricator.wikimedia.org/T334588) (owner: 10Majavah)
[10:38:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:40:07] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:toolforge: install toolforge-builds-cli [puppet] - 10https://gerrit.wikimedia.org/r/963018 (https://phabricator.wikimedia.org/T334588) (owner: 10Majavah)
[10:40:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[10:40:50] <wikibugs>	 (03PS2) 10Fabfur: purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837)
[10:44:03] <wikibugs>	 (03PS1) 10Jbond: Revert "Revert "wmflib::get_clusters: create a puppet version of..." [puppet] - 10https://gerrit.wikimedia.org/r/962225
[10:44:09] <wikibugs>	 (03PS1) 10Jbond: Revert "Revert "P:prometheus::ops: convert to using wmflib::get_..." [puppet] - 10https://gerrit.wikimedia.org/r/963026
[10:44:14] <wikibugs>	 (03PS1) 10Jbond: Revert^2 "prometheus: switch to wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/963027
[10:44:17] <wikibugs>	 (03PS1) 10Jbond: Revert^2 "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/963028
[10:44:53] <wikibugs>	 (03PS2) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[10:45:13] <wikibugs>	 (03PS2) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[10:45:44] <wikibugs>	 (03PS2) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[10:46:07] <wikibugs>	 (03PS2) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[10:47:15] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43837/console" [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert)
[10:51:23] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43839/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[10:54:00] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1003.eqiad.wmnet with OS bullseye
[10:54:05] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[10:55:14] <wikibugs>	 (03PS4) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200
[10:57:39] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43840/console" [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert)
[11:00:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:04:29] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:06:01] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:28] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: new_rings.tar.bz2 not found after host reimage - https://phabricator.wikimedia.org/T347964 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Followup from IRC, this is expected when reimaging the ring...
[11:08:32] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi)
[11:08:47] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:09:56] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert)
[11:11:43] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[11:11:48] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[11:12:31] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:13:49] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:22:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:24:28] <wikibugs>	 (03PS3) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[11:24:30] <wikibugs>	 (03PS3) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[11:24:32] <wikibugs>	 (03PS3) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[11:24:34] <wikibugs>	 (03PS3) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[11:25:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:26:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:27:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:29:02] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001"
[11:29:50] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage
[11:29:52] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001"
[11:29:52] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:30:53] <wikibugs>	 (03PS4) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[11:30:55] <wikibugs>	 (03PS4) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[11:30:57] <wikibugs>	 (03PS4) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[11:30:59] <wikibugs>	 (03PS4) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[11:31:03] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:31:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[11:31:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:32:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:32:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[11:33:05] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage
[11:35:20] <Amir1_>	 jouncebot: nowandnext
[11:35:21] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 24 minute(s)
[11:35:21] <jouncebot>	 In 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1200)
[11:37:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[11:38:15] <wikibugs>	 (03PS5) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[11:39:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:44:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:54] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[11:45:16] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro)
[11:48:28] <wikibugs>	 (03PS2) 10KartikMistry: Update cxserver to 2023-09-28-043003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450)
[11:51:12] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[11:51:17] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[11:51:30] <wikibugs>	 (03PS6) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[11:52:16] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[11:54:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:54:21] <wikibugs>	 (03PS1) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029
[11:54:34] <fabfur>	 !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963004 (T347837). `purged` daemon will be restarted by puppet in ulsfo in the next 30m
[11:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:37] <stashbot>	 T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837
[11:54:42] <wikibugs>	 (03PS2) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029
[11:54:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (owner: 10FNegri)
[11:55:06] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[11:55:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (owner: 10FNegri)
[11:56:51] <wikibugs>	 (03PS3) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029
[11:57:46] <wikibugs>	 (03PS4) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (https://phabricator.wikimedia.org/T341285)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1200)
[12:00:56] <wikibugs>	 (03PS2) 10FNegri: Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" [puppet] - 10https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894)
[12:03:12] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10ayounsi)
[12:03:40] <wikibugs>	 (03PS1) 10Fabfur: purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837)
[12:05:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) Asked Juniper about their timeline on getting this setup.
[12:06:26] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JAllemandou)
[12:06:48] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43845/console" [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[12:16:18] <wikibugs>	 (03PS7) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[12:16:34] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10JAllemandou) Hi SRE folks, We'd need @SGupta-WMF to be a member of the analytics-admin group so that she can handle ops-week tasks such as deployment and other restarts. Many thanks
[12:17:20] <wikibugs>	 (03PS1) 10Btullis: Add Surbhi Gupta to the analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657)
[12:19:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:19:40] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10BTullis) Thanks @JAllemandou - I'm currently listed as an approver for this group, and I'm happy to approve the request :-)
[12:23:16] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Thanks Ben :)" [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) (owner: 10Btullis)
[12:23:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:23:59] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2001.codfw.wmnet with OS bullseye
[12:24:05] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[12:24:30] <wikibugs>	 (03PS2) 10Sharvaniharan: New donor experience stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124
[12:26:03] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Add Surbhi Gupta to the analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) (owner: 10Btullis)
[12:28:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:34:46] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10BTullis) I have also added `sg912` to the LDAP group `wmf` as set out in T335657#9186606
[12:34:52] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add Surbhi Gupta to the analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) (owner: 10Btullis)
[12:35:20] <wikibugs>	 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netbox, 10netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (10ayounsi) I think here the only/best option is to reduce the time delta between when a server is connected and when switch port is configured (line `Run the sr...
[12:36:59] <wikibugs>	 (03PS5) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[12:37:01] <wikibugs>	 (03PS5) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[12:37:03] <wikibugs>	 (03PS5) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[12:37:05] <wikibugs>	 (03PS8) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[12:37:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:38:40] <wikibugs>	 (03PS2) 10Anzx: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563)
[12:39:15] <TheresNoTime>	 Dreamy_Jazz: FYI, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962623/ (scheduled for next window) has already been deployed, that expected?
[12:40:00] <Dreamy_Jazz>	 Yeah. I was able to get it deployed before that window but didn't have a chance to remove it yet
[12:40:17] <wikibugs>	 (03PS3) 10Anzx: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563)
[12:40:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:40:53] <TheresNoTime>	 ack :-)
[12:41:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T343198)', diff saved to https://phabricator.wikimedia.org/P52803 and previous config saved to /var/cache/conftool/dbconfig/20231003-124141-arnaudb.json
[12:41:45] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[12:42:59] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: host reimage
[12:45:10] <wikibugs>	 (03PS6) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[12:45:12] <wikibugs>	 (03PS6) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[12:45:14] <wikibugs>	 (03PS6) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[12:45:16] <wikibugs>	 (03PS9) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[12:45:26] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: host reimage
[12:45:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:50:16] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1025.eqiad.wmnet with OS bullseye
[12:50:26] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1025.eqiad.wmnet with OS bullseye
[12:50:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926)
[12:53:34] <wikibugs>	 (03PS1) 10Anzx: add throttle rules for Ada Lovelace Day October 10, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719)
[12:56:44] <wikibugs>	 (03PS2) 10Anzx: add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719)
[12:56:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P52804 and previous config saved to /var/cache/conftool/dbconfig/20231003-125647-arnaudb.json
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1300)
[13:00:06] <jouncebot>	 sharvani__ and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] * TheresNoTime can deploy
[13:00:28] <sharvani__>	 Hi... here for deployment of my patch :-)
[13:00:38] <TheresNoTime>	 sharvani__: o/ will start with yours :-)
[13:00:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124 (owner: 10Sharvaniharan)
[13:00:48] <wikibugs>	 (03PS9) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[13:00:50] <aanzx>	 o/
[13:00:51] <sharvani__>	 Thank you!
[13:01:15] <TheresNoTime>	 aanzx: quick note, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962127/ is still marked WIP
[13:01:36] <wikibugs>	 (03Merged) 10jenkins-bot: New donor experience stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124 (owner: 10Sharvaniharan)
[13:01:52] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:960124|New donor experience stream for apps event schema]]
[13:01:57] <wikibugs>	 (03PS4) 10Samtar: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) (owner: 10Anzx)
[13:01:58] <aanzx>	 TheresNoTime: marked as active now
[13:02:05] <TheresNoTime>	 :-)
[13:02:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:03:08] <logmsgbot>	 !log samtar@deploy2002 sharvaniharan and samtar: Backport for [[gerrit:960124|New donor experience stream for apps event schema]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:03:10] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2001.codfw.wmnet with OS bullseye
[13:03:14] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[13:03:21] <TheresNoTime>	 sharvani__: can you test this change on mwdebug?
[13:03:30] <sharvani__>	 yes... testing now..
[13:03:47] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1025.eqiad.wmnet with reason: host reimage
[13:03:56] <sharvani__>	 Working perfectly! Ty! :)
[13:04:04] <logmsgbot>	 !log samtar@deploy2002 sharvaniharan and samtar: Continuing with sync
[13:04:50] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wmflib: clarify 'params' service::probe parameter [puppet] - 10https://gerrit.wikimedia.org/r/963049
[13:05:08] <wikibugs>	 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi)
[13:05:13] <wikibugs>	 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic, 10netops: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10ayounsi) 05Open→03Resolved a:03RobH I believe this is all done.
[13:07:00] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1025.eqiad.wmnet with reason: host reimage
[13:07:54] <TheresNoTime>	 (unrelated to current deploys) There's a lot of `PHP Warning: RedisException: Connection timed out` -spam in logstash, assuming nothing serious but logged just in case at T347987
[13:07:55] <stashbot>	 T347987: PHP Warning: RedisException: Connection timed out - https://phabricator.wikimedia.org/T347987
[13:08:38] <godog>	 TheresNoTime: I bet that's T347916 and I'm working to fix it in T347926
[13:08:38] <stashbot>	 T347926: Excimer UI profile lost when requested from mw-on-k8s - https://phabricator.wikimedia.org/T347926
[13:08:39] <stashbot>	 T347916: Investigate sharp increase in lost Arc Lamp samples (arclamp_client_error.exception) - https://phabricator.wikimedia.org/T347916
[13:09:04] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:09:19] <TheresNoTime>	 godog: ack, good luck :-) (feel free to close/merge the task if needed)
[13:09:48] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:10:18] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:960124|New donor experience stream for apps event schema]] (duration: 08m 26s)
[13:10:22] <TheresNoTime>	 sharvani__: live on prod :)
[13:10:42] <wikibugs>	 (03PS2) 10Filippo Giunchedi: services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926)
[13:10:43] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Majavah https://phabricator.wikimedia.org/T346948 - The acknowledgement expires at: 2023-11-04 13:10:27. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:10:43] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Majavah https://phabricator.wikimedia.org/T346948 - The acknowledgement expires at: 2023-11-04 13:10:27. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:10:48] <sharvani__>	 Thank you for deploying @TheresNoTime :-)
[13:10:49] <TheresNoTime>	 aanzx: I'm going to do your 962127 and 963025 together if that's okay
[13:10:55] <wikibugs>	 (03PS3) 10Samtar: add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) (owner: 10Anzx)
[13:10:59] <aanzx>	 ok, np
[13:11:08] <TheresNoTime>	 sharvani__: you're welcome!
[13:11:13] <godog>	 TheresNoTime: ack, yeah will link the related tasks and close it, thank you
[13:11:51] <wikibugs>	 (03PS1) 10Sg912: Update mediawiki_history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/963050
[13:11:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P52805 and previous config saved to /var/cache/conftool/dbconfig/20231003-131154-arnaudb.json
[13:12:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) (owner: 10Anzx)
[13:12:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) (owner: 10Anzx)
[13:12:26] <wikibugs>	 (03PS2) 10Sg912: Updated mediawiki_history snapshot as part of Ops week [puppet] - 10https://gerrit.wikimedia.org/r/963050
[13:12:56] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1002.eqiad.wmnet
[13:13:10] <wikibugs>	 (03Merged) 10jenkins-bot: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) (owner: 10Anzx)
[13:13:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] wmflib: clarify 'params' service::probe parameter [puppet] - 10https://gerrit.wikimedia.org/r/963049 (owner: 10Filippo Giunchedi)
[13:13:15] <wikibugs>	 (03Merged) 10jenkins-bot: add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) (owner: 10Anzx)
[13:13:26] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:962127|arwiki: add importsources (T347563)]], [[gerrit:963025|add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T347719)]]
[13:13:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman)
[13:13:38] <stashbot>	 T347563: Add English Wikipedia to import sources of Arabic Wikipedia - https://phabricator.wikimedia.org/T347563
[13:13:38] <stashbot>	 T347719: Lift IP caps for Ada Lovelace Day (Oct10, 2023) - https://phabricator.wikimedia.org/T347719
[13:14:45] <logmsgbot>	 !log samtar@deploy2002 anzx and samtar: Backport for [[gerrit:962127|arwiki: add importsources (T347563)]], [[gerrit:963025|add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T347719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:14:48] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Thanks Surbhi" [puppet] - 10https://gerrit.wikimedia.org/r/963050 (owner: 10Sg912)
[13:14:49] <aanzx>	 TheresNoTime: testing 
[13:14:51] <TheresNoTime>	 aanzx: ready for testing on mwdebug
[13:14:52] <TheresNoTime>	 ack
[13:16:08] <aanzx>	 TheresNoTime: looks good 
[13:16:15] <TheresNoTime>	 syncing
[13:16:17] <logmsgbot>	 !log samtar@deploy2002 anzx and samtar: Continuing with sync
[13:17:14] <wikibugs>	 (03PS10) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[13:17:18] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T347919 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact
[13:18:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:18:53] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Updated mediawiki_history snapshot as part of Ops week [puppet] - 10https://gerrit.wikimedia.org/r/963050 (owner: 10Sg912)
[13:19:31] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[13:19:35] <wikibugs>	 (03PS1) 10Ottomata: mw-page-content-change-enrich - bump to 1.27.0, and set codfw replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963051 (https://phabricator.wikimedia.org/T347676)
[13:20:38] <wikibugs>	 (03PS4) 10Klausman: VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007
[13:21:15] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman)
[13:21:45] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001"
[13:22:30] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:962127|arwiki: add importsources (T347563)]], [[gerrit:963025|add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T347719)]] (duration: 09m 03s)
[13:22:33] <TheresNoTime>	 aanzx: live in prod :)
[13:22:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:22:36] <aanzx>	 TheresNoTime: thank you 
[13:22:42] <stashbot>	 T347563: Add English Wikipedia to import sources of Arabic Wikipedia - https://phabricator.wikimedia.org/T347563
[13:22:42] <stashbot>	 T347719: Lift IP caps for Ada Lovelace Day (Oct10, 2023) - https://phabricator.wikimedia.org/T347719
[13:22:54] * TheresNoTime will be around for another ~15m if there's any other patches
[13:23:20] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001"
[13:23:20] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:23:20] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1002.eqiad.wmnet
[13:23:39] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1003.eqiad.wmnet
[13:23:59] <wikibugs>	 (03PS1) 10MVernon: aptrepo: install zip on aptrepo servers [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491)
[13:24:51] <wikibugs>	 (03PS7) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[13:24:53] <wikibugs>	 (03PS7) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[13:24:55] <wikibugs>	 (03PS7) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[13:24:57] <wikibugs>	 (03PS10) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[13:25:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:26:27] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) (owner: 10MVernon)
[13:26:42] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/963013 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman)
[13:27:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T343198)', diff saved to https://phabricator.wikimedia.org/P52806 and previous config saved to /var/cache/conftool/dbconfig/20231003-132700-arnaudb.json
[13:27:02] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - bump to 1.27.0, and set codfw replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963051 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata)
[13:27:03] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[13:27:06] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[13:27:27] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[13:27:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T343198)', diff saved to https://phabricator.wikimedia.org/P52807 and previous config saved to /var/cache/conftool/dbconfig/20231003-132733-arnaudb.json
[13:27:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:27:37] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2002.codfw.wmnet with OS bullseye
[13:27:45] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[13:30:26] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[13:30:28] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:30:41] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:31:07] <wikibugs>	 (03PS8) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[13:31:09] <wikibugs>	 (03PS8) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[13:31:11] <wikibugs>	 (03PS8) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[13:31:13] <wikibugs>	 (03PS11) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[13:32:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:32:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:33:01] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001"
[13:33:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:57] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1026.eqiad.wmnet
[13:34:02] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1026.eqiad.wmnet
[13:34:10] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001"
[13:34:10] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:34:11] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1003.eqiad.wmnet
[13:34:20] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1025.eqiad.wmnet with OS bullseye
[13:34:25] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:34:31] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1025.eqiad.wmnet with OS bullseye completed: - restbase1025 (...
[13:34:37] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:35:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] augeas_core: update augeas_core [puppet] - 10https://gerrit.wikimedia.org/r/962618 (owner: 10Jbond)
[13:36:53] <wikibugs>	 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Add Antoine_Quhen to the deployment group  - https://phabricator.wikimedia.org/T347296 (10Ottomata)
[13:37:05] <wikibugs>	 (03PS1) 10Klausman: Revert "conftool-data: Add entry for recommendation-api-ng" [puppet] - 10https://gerrit.wikimedia.org/r/963033
[13:37:17] <wikibugs>	 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ottomata) Updated description and tagged #sre-access-requests
[13:37:30] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: take a savepoint manually [deployment-charts] - 10https://gerrit.wikimedia.org/r/963057 (https://phabricator.wikimedia.org/T342149)
[13:37:36] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961897 (owner: 10Majavah)
[13:37:39] <wikibugs>	 (03PS9) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[13:37:41] <wikibugs>	 (03PS9) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[13:37:43] <wikibugs>	 (03PS9) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[13:37:45] <wikibugs>	 (03PS12) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[13:38:16] <ottomata>	 !log mw-page-content-change-enrich codfw - bump to 1.27.0 and set replicas to 12 while processing backlog - T347676
[13:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:20] <stashbot>	 T347676: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676
[13:38:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:38:22] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:38:31] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:39:02] <wikibugs>	 (03PS1) 10Klausman: Revert "VIPs: add DNS entries for new recommendation-api-ng service" [dns] - 10https://gerrit.wikimedia.org/r/963034
[13:39:16] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] AVA: Make score.php not fail with Fatal Error after libphutil removal [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/933907 (https://phabricator.wikimedia.org/T340633) (owner: 10Aklapper)
[13:39:43] <wikibugs>	 (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] AVA: Make score.php not fail with Fatal Error after libphutil removal [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/933907 (https://phabricator.wikimedia.org/T340633) (owner: 10Aklapper)
[13:39:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert "conftool-data: Add entry for recommendation-api-ng" [puppet] - 10https://gerrit.wikimedia.org/r/963033 (owner: 10Klausman)
[13:40:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert "VIPs: add DNS entries for new recommendation-api-ng service" [dns] - 10https://gerrit.wikimedia.org/r/963034 (owner: 10Klausman)
[13:40:22] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Revert "conftool-data: Add entry for recommendation-api-ng" [puppet] - 10https://gerrit.wikimedia.org/r/963033 (owner: 10Klausman)
[13:40:30] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Revert "VIPs: add DNS entries for new recommendation-api-ng service" [dns] - 10https://gerrit.wikimedia.org/r/963034 (owner: 10Klausman)
[13:41:05] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: take a savepoint manually [deployment-charts] - 10https://gerrit.wikimedia.org/r/963057 (https://phabricator.wikimedia.org/T342149) (owner: 10DCausse)
[13:41:39] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1001.eqiad.wmnet
[13:41:53] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: take a savepoint manually [deployment-charts] - 10https://gerrit.wikimedia.org/r/963057 (https://phabricator.wikimedia.org/T342149) (owner: 10DCausse)
[13:42:01] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1025.eqiad.wmnet
[13:42:02] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1025.eqiad.wmnet
[13:42:33] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[13:43:28] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:43:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[13:43:35] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:43:40] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1033.eqiad.wmnet with OS bullseye
[13:43:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[13:43:56] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1033.eqiad.wmnet with OS bullseye
[13:44:09] <wikibugs>	 (03CR) 10Fabfur: purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:44:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43851/console" [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:44:22] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[13:44:45] <joal>	 btullis: thanks so much for the unlocking for Surbhi - she still has issues ssh-ing deployment, but that'll be for tomorrow :)
[13:44:46] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi)
[13:46:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:34] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: host reimage
[13:46:34] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[13:46:48] <wikibugs>	 (03PS10) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[13:46:50] <wikibugs>	 (03PS10) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[13:46:52] <wikibugs>	 (03PS10) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[13:46:54] <wikibugs>	 (03PS13) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[13:47:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:47:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[13:48:08] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[13:48:13] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Revert allocation of LVS VIPs for recommendation-api-ng - klausman@cumin1001"
[13:48:56] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:49:11] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: host reimage
[13:49:40] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:49:41] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1001.eqiad.wmnet
[13:49:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-master1004
[13:50:36] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Revert allocation of LVS VIPs for recommendation-api-ng - klausman@cumin1001"
[13:50:37] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:50:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-master1004
[13:51:03] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/963059 (owner: 10Klausman)
[13:51:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:51:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43852/console" [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[13:52:32] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:52:46] <wikibugs>	 (03CR) 10Ottomata: k8s config: Provide kafka and zookeeper hostnames (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson)
[13:52:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[13:54:51] <wikibugs>	 (03PS1) 10Jbond: P:dumps::distribution::ferm: pass array directly do ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165)
[13:56:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43853/console" [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:57:08] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage
[13:57:48] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-master1003
[13:58:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for <Sara Campos> - https://phabricator.wikimedia.org/T348001 (10SCampos-WMF)
[13:58:14] <wikibugs>	 (03PS11) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373)
[13:58:17] <wikibugs>	 (03PS11) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373)
[13:58:19] <wikibugs>	 (03PS11) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373)
[13:58:21] <wikibugs>	 (03PS14) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373)
[13:58:57] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-master1003
[13:59:36] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[13:59:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[13:59:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[14:01:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for <Sara Campos> - https://phabricator.wikimedia.org/T348001 (10RhinosF1) @SCampos-WMF: Can you please link your wikitech account to your phabricator account?  I suspect 'wmf' will be the correct group for you.
[14:01:36] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1)
[14:01:37] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage
[14:01:51] <fabfur>	 !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963020 (T347837). `purged` daemon will be restarted by puppet in codfw in the next 30m
[14:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:04] <stashbot>	 T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837
[14:02:12] <godog>	 jouncebot: now and next
[14:02:13] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 57 minute(s)
[14:02:20] <wikibugs>	 (03PS5) 10Herron: thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995)
[14:03:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1)
[14:03:30] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 9.346 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:04:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1) I've updated the description to wmf for you @SCampos-WMF as I see you have an @wikimedia.org email and that access allows matomo.  ldap pulled with https://ldap.toolforge.org/user/scampos
[14:04:22] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:04:24] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[14:04:32] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:05:04] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 9 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43854/console" [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[14:06:20] <wikibugs>	 (03CR) 10Ottomata: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson)
[14:07:09] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:07:54] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2002.codfw.wmnet with OS bullseye
[14:07:59] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[14:08:30] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:09:20] <wikibugs>	 (03CR) 10Herron: [C: 03+1] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:11:41] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:16:29] <wikibugs>	 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[14:18:10] <wikibugs>	 (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069
[14:19:48] <wikibugs>	 (03CR) 10Volans: "couple of nits, lgtm otherwise" [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi)
[14:19:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi)
[14:21:02] <wikibugs>	 (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069
[14:21:06] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:24] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.6.4 (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi)
[14:23:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi)
[14:25:28] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi)
[14:26:16] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:31:20] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1033.eqiad.wmnet with OS bullseye
[14:31:31] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1033.eqiad.wmnet with OS bullseye completed: - restbase1033 (...
[14:33:24] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] P:dumps::distribution::ferm: pass array directly do ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[14:33:44] <wikibugs>	 (03PS1) 10Ayounsi: Release v0.6.4 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/963079
[14:34:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi)
[14:35:11] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/963079 (owner: 10Ayounsi)
[14:35:22] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:35:37] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:35:38] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:36:03] <wikibugs>	 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[14:36:06] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:36:07] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:36:24] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2022 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 0.366 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:36:24] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:36:25] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:36:42] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:36:44] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[14:36:44] <wikibugs>	 (03PS1) 10Ottomata: mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T266798)
[14:36:54] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[14:36:55] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[14:37:12] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[14:37:14] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[14:37:23] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[14:37:24] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[14:37:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:37:36] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[14:37:37] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[14:37:41] <godog>	 ye olde wall of SAL
[14:37:47] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[14:37:48] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[14:37:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003']
[14:38:00] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[14:38:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:38:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003']
[14:38:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:38:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003']
[14:38:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:38:42] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:38:47] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:39:09] <wikibugs>	 (03PS1) 10Fabfur: purged: use unix socket for varnish in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837)
[14:39:10] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:39:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:39:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003']
[14:39:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:39:32] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003']
[14:40:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:42:05] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43855/console" [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[14:42:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:42:33] <logmsgbot>	 !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['an-master1003']
[14:42:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:43:16] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:43:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003']
[14:43:49] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003']
[14:43:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:44:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:44:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:45:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:45:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:45:49] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:45:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Release v0.6.4 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/963079 (owner: 10Ayounsi)
[14:46:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:46:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:46:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:46:36] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:46:52] <wikibugs>	 (03PS2) 10Ottomata: mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676)
[14:46:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1003']
[14:47:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:47:38] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:47:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:48:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - ayounsi@cumin1001
[14:48:47] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:49:41] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004']
[14:49:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004']
[14:50:11] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - ayounsi@cumin1001
[14:53:13] <wikibugs>	 (03CR) 10Anzx: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[14:53:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[14:55:22] <wikibugs>	 (03PS3) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939)
[14:55:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1004']
[14:56:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[14:56:58] <wikibugs>	 (03PS4) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939)
[14:58:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:00:25] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10Aklapper) >>! In T348001#9220485, @RhinosF1 wrote: > @SCampos-WMF: Can you please link your wikitech account to your phabricator account?  That would welcome hints how to do that, especially if you...
[15:03:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:05:26] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: Phabricator deploys
[15:05:40] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: Phabricator deploys
[15:05:59] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@6f19600]: test deploy to phab2002 for T348007
[15:06:07] <stashbot>	 T348007: Deploy Phabricator/Phorge 2023-10-03 - https://phabricator.wikimedia.org/T348007
[15:06:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "elukey@stat1004:~$ curl "https://recommendation-api-ng.discovery.wmnet:31443/api/spec" -i --http1.1 --resolve recommendation-api-ng.discov" [dns] - 10https://gerrit.wikimedia.org/r/963059 (owner: 10Klausman)
[15:06:31] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@6f19600]: test deploy to phab2002 for T348007 (duration: 00m 32s)
[15:06:53] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@6f19600]: deploy to phab1004 for T348007
[15:07:01] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] mwnet: Add CNAMES for recommendation-api-ng running on ml-k8s [dns] - 10https://gerrit.wikimedia.org/r/963059 (owner: 10Klausman)
[15:07:05] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (10ayounsi) 05Open→03Resolved Homer 0.6.4 released.
[15:07:37] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@6f19600]: deploy to phab1004 for T348007 (duration: 00m 44s)
[15:07:59] <wikibugs>	 (03Abandoned) 10Klausman: hiera/services: add service for recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/963013 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman)
[15:08:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:09:39] <wikibugs>	 (03PS2) 10Jdrewniak: Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson)
[15:10:08] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1033.eqiad.wmnet
[15:10:09] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1033.eqiad.wmnet
[15:10:43] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[15:10:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1026.eqiad.wmnet
[15:11:18] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1026.eqiad.wmnet
[15:12:32] <wikibugs>	 (03PS1) 10DLynch: Enable Edit Check on initial partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908)
[15:13:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:13:58] <wikibugs>	 (03CR) 10Ryan Kemper: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper)
[15:15:48] <wikibugs>	 (03PS1) 10Hnowlan: helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415)
[15:17:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) cp1112 - D 2. U 1. CableID 20220171 port 21  cp1113 - D 4. U 29 CableID 230304500241 port 6 cp1114 - D 4. U 38 CableID 230304500243 port 8 cp1115 - D 7. U 20 CableID 2303045...
[15:17:26] <wikibugs>	 (03PS3) 10Jclark-ctr: add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291)
[15:17:28] <wikibugs>	 (03PS1) 10Jclark-ctr: add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291)
[15:18:13] <wikibugs>	 (03CR) 10TChin: [C: 03+1] mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata)
[15:19:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[15:20:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed with error...
[15:20:49] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata)
[15:21:49] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata)
[15:22:24] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:22:53] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[15:23:06] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF)
[15:23:22] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1026.eqiad.wmnet
[15:23:24] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1026.eqiad.wmnet
[15:23:27] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[15:23:33] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:24:06] <wikibugs>	 (03PS1) 10Cwhite: logstash: increase timeout for curator delete actions [puppet] - 10https://gerrit.wikimedia.org/r/962236 (https://phabricator.wikimedia.org/T347976)
[15:24:08] <ottomata>	 !log mw-page-content-change-enrich - backfill is done, set replicas to 2 in eqiad and codfw
[15:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:33] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1026.eqiad.wmnet with OS bullseye
[15:24:42] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1026.eqiad.wmnet with OS bullseye
[15:24:58] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) @Ladsgroup: I completely agree! Thank you for letting us know about the standardization and it makes total sense to be similar to the glam-us one. I already talk...
[15:26:41] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:26:48] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:27:26] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[15:27:30] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[15:32:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:33:06] <wikibugs>	 (03CR) 10Herron: [C: 03+2] thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[15:34:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:35:05] <wikibugs>	 (03PS2) 10Papaul: add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[15:35:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10RobH)
[15:36:27] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[15:37:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1026.eqiad.wmnet with reason: host reimage
[15:37:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[15:37:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed with error...
[15:39:00] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[15:40:39] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1026.eqiad.wmnet with reason: host reimage
[15:42:05] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[15:44:30] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:47] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:53] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[15:46:39] <wikibugs>	 (03PS3) 10Jdrewniak: Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson)
[15:46:58] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[15:47:49] <wikibugs>	 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Jhancock.wm)
[15:49:30] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[15:49:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[15:49:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[15:49:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[15:51:52] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[15:57:00] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 04-1] "The wordmark file ("Wikipedya") uses the incorrect W, and has some kerning issues at the Y/A border. Please make a version of it that is b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[15:57:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[15:57:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[15:57:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[15:59:08] <wikibugs>	 (03PS1) 10DLynch: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963041
[15:59:21] <wikibugs>	 (03PS1) 10DLynch: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963042
[16:00:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:00:04] <jouncebot>	 jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1600).
[16:00:05] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:21] <wikibugs>	 (CR) Jon Harald Søby: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: Anzx)
[16:00:59] <dancy>	 o/
[16:01:11] <jbond>	 dancy: looking now
[16:01:43] <dancy>	 It's just https://gerrit.wikimedia.org/r/c/operations/puppet/+/961893  . The other one got merged last week.
[16:01:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] logspam-watch: Add refreshing indicator [puppet] - 10https://gerrit.wikimedia.org/r/961893 (owner: 10Ahmon Dancy)
[16:01:53] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[16:02:09] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[16:03:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[16:03:11] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[16:03:29] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (Ladsgroup) Open→Resolved Done. Just note that I created it as a public mailing list but if you want it private, you can change the settings in https://lists.wikimedia.org...
[16:03:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[16:04:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[16:04:47] <jbond>	 dancy: merged and deployed to mwlog
[16:05:05] <jinxer-wm>	 (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:40] <dancy>	 Thanks! It's working properly.
[16:06:04] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[16:06:04] <jbond>	 great :)
[16:06:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[16:07:16] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[16:07:24] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1026.eqiad.wmnet with OS bullseye
[16:08:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[16:09:00] <wikibugs>	 (03PS2) 10Ryan Kemper: airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:09:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[16:10:00] <wikibugs>	 (03CR) 10Ryan Kemper: "Pushed a patch that attempts to fix this CI error from https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/7200" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:11:56] <wikibugs>	 (03CR) 10Ottomata: "Related: https://phabricator.wikimedia.org/T336901" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson)
[16:16:25] <wikibugs>	 (03CR) 10Ahmon Dancy: "The changes look OK to me." [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert)
[16:19:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[16:19:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[16:20:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[16:20:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[16:23:57] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1027.eqiad.wmnet
[16:24:26] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1027.eqiad.wmnet
[16:27:01] <wikibugs>	 (CR) Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: Anzx)
[16:27:27] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[16:30:42] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10Ottomata)
[16:33:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "go go go go" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri)
[16:36:05] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1027.eqiad.wmnet
[16:36:07] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1027.eqiad.wmnet
[16:37:18] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1027.eqiad.wmnet with OS bullseye
[16:37:29] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1027.eqiad.wmnet with OS bullseye
[16:38:30] <wikibugs>	 (CR) Jon Harald Søby: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: Anzx)
[16:39:12] <wikibugs>	 (03PS5) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939)
[16:41:00] <wikibugs>	 (03PS2) 10Anzx: fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939)
[16:43:40] <wikibugs>	 (03PS4) 10Majavah: dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897
[16:44:59] <wikibugs>	 (03PS1) 10Hnowlan: edit-analytics, editor-analytics: correct networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963104 (https://phabricator.wikimedia.org/T336415)
[16:45:58] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: simplify parallelism and use newer kafka APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963105 (https://phabricator.wikimedia.org/T326914)
[16:47:16] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 03+1] fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[16:49:14] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897 (owner: 10Majavah)
[16:50:04] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1027.eqiad.wmnet with reason: host reimage
[16:51:21] <wikibugs>	 (CR) Anzx: fonwiki: add logos (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: Anzx)
[16:51:43] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 03+1] fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[16:52:34] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1027.eqiad.wmnet with reason: host reimage
[16:54:08] <wikibugs>	 (03PS1) 10Hnowlan: wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391)
[16:54:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:54:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10SCampos-WMF) Thank you for sharing this, it was very useful :D ! @RhinosF1 I was able to link my wikitech account to my phabricator account!
[16:56:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002"
[16:57:15] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: increase timeout for curator delete actions [puppet] - 10https://gerrit.wikimedia.org/r/962236 (https://phabricator.wikimedia.org/T347976) (owner: 10Cwhite)
[16:57:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002"
[16:57:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:58:38] <wikibugs>	 (03CR) 10Bking: [C: 03+1] rdf-streaming-updater: simplify parallelism and use newer kafka APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963105 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse)
[16:58:42] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: simplify parallelism and use newer kafka APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963105 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse)
[16:59:48] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[16:59:57] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:00:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] edit-analytics, editor-analytics: correct networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963104 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1700)
[17:00:55] <wikibugs>	 (03Merged) 10jenkins-bot: edit-analytics, editor-analytics: correct networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963104 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[17:02:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[17:04:16] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[17:04:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:05:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:08:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002"
[17:09:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002"
[17:09:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:09:20] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:09:57] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[17:10:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[17:10:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "Oops, my bad." [puppet] - 10https://gerrit.wikimedia.org/r/962048 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans)
[17:11:50] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:13:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1) That should be good for the SRE for the clinic this week to handle then :)
[17:15:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:48] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1027.eqiad.wmnet with OS bullseye
[17:17:59] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1027.eqiad.wmnet with OS bullseye completed: - restbase1027 (...
[17:21:31] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[17:24:16] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963110 (https://phabricator.wikimedia.org/T347080)
[17:24:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963110 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[17:25:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10SCampos-WMF) Great, thank you for the guidance!
[17:27:13] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963110 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[17:27:36] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.29  refs T347080
[17:27:40] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[17:28:25] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1027.eqiad.wmnet
[17:28:25] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1027.eqiad.wmnet
[17:33:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[17:33:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[17:33:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[17:33:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[17:33:55] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[17:34:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:34:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[17:34:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[17:34:58] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[17:35:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:37:34] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[17:38:01] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install_server: utilize reuse recipe for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/962048 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans)
[17:49:32] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[17:58:00] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) Thank you so much, @Ladsgroup! We really appreciate this and have started to share it with folks already. 🙌
[17:58:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Eevans)
[17:59:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:00:05] <jouncebot>	 jeena and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1800). Please do the needful.
[18:04:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:11:00] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.29  refs T347080 (duration: 43m 24s)
[18:11:04] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:12:11] <wikibugs>	 (03Abandoned) 10Varnent: [foundationwiki] Grant translation admin rights to 'sysop' and 'global-sysop' groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent)
[18:13:17] <logmsgbot>	 !log jhuneidi@deploy2002 Pruned MediaWiki: 1.41.0-wmf.27 (duration: 02m 14s)
[18:15:01] <wikibugs>	 (03PS1) 10Krinkle: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407)
[18:16:51] <thcipriani>	 jouncebot: now
[18:16:51] <jouncebot>	 For the next 1 hour(s) and 43 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1800)
[18:16:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh)
[18:17:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh)
[18:17:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh)
[18:17:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) p:05Triage→03Medium
[18:17:49] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963125 (https://phabricator.wikimedia.org/T347080)
[18:17:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963125 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[18:18:28] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963125 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[18:20:27] <wikibugs>	 (03Abandoned) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson)
[18:21:35] <wikibugs>	 (03Abandoned) 10Ebernhardson: flink-app: Provide kafka hosts as properties file [deployment-charts] - 10https://gerrit.wikimedia.org/r/959066 (owner: 10Ebernhardson)
[18:21:46] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) Hi @Ladsgroup. I'm sorry for reopening the ticket again but someone just flagged to me that "EU" can be problematic because it could mean only countries within t...
[18:21:59] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) 05Resolved→03Open
[18:23:53] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) Rename is not that easily possible. I can delete the mailing list and create it again and mass subscribe previous members. That means all settings changes will be gone...
[18:25:02] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) That's fine and no problem at all from our side, @Ladsgroup! That would help us a lot actually. Thank you so much!
[18:25:16] <wikibugs>	 (03PS6) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901)
[18:25:20] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Papaul) 05Open→03Resolved `  papaul@fasw-c-codfw# show |compare [edit interfaces interface-range disabled]      member "ge-[0-1]/0/16" { ... } +    member "ge-[0-1]/0/17";...
[18:25:22] <wikibugs>	 (03CR) 10Ebernhardson: "To keep things moving I've narrowed down the scope of this patch, removing the functionality to source zookeeper host/port based on a clus" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson)
[18:25:46] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.29  refs T347080
[18:25:50] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:26:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:27:02] <wikibugs>	 (03CR) 10Jdrewniak: "This change is ready for review." [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[18:30:55] <wikibugs>	 (03PS1) 10Ebernhardson: flink-app chart: Add zookeeper to egress_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/963130
[18:31:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:31:34] <wikibugs>	 (03PS7) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901)
[18:31:36] <wikibugs>	 (03CR) 10Ebernhardson: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson)
[18:31:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) We can and probably should have a backup static route for each of `ns[01]` but it can be to a single host instead of al...
[18:37:49] <wikibugs>	 (03PS8) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901)
[18:38:05] <wikibugs>	 (03CR) 10Ebernhardson: "With the scope reduced, i think the main question remaining here is if these opinionated paths are the ones we want to use going forward. " [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson)
[18:48:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[18:48:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[18:48:47] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:52:11] <wikibugs>	 (03CR) 10Bking: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking)
[18:52:23] <wikibugs>	 (03CR) 10Bking: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking)
[18:52:38] <wikibugs>	 (03PS9) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking)
[18:53:21] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking)
[18:55:35] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking)
[19:02:26] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728)
[19:02:37] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) 05Open→03Resolved {{done}}
[19:05:46] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728)
[19:09:02] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728) (owner: 10Ryan Kemper)
[19:09:08] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728) (owner: 10Ryan Kemper)
[19:10:44] <DannyS712>	 Hi. jelto do you have a few minutes?
[19:15:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[19:15:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[19:15:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[19:15:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[19:15:51] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[19:15:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[19:15:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[19:15:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[19:16:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[19:16:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[19:16:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[19:16:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[19:23:11] <wikibugs>	 (03CR) 10C. Scott Ananian: [C: 03+1] Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey)
[19:31:18] <wikibugs>	 (03CR) 10Ryan Kemper: "Despite the CirrusSearch patch referenced in my last comment, we're not seeing any metrics for MediaWiki.CirrusSearch.eqiad.backend_failur" [puppet] - 10https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: 10Ebernhardson)
[19:38:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[19:38:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[19:38:50] <wikibugs>	 (03PS3) 10Jdrewniak: Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208)
[19:38:58] <wikibugs>	 (03PS1) 10Jdrewniak: Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137
[19:39:06] <wikibugs>	 (03PS1) 10Jdrewniak: [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208)
[19:41:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[19:41:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[19:52:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10BBlack) Looks about right to me!
[19:57:15] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] purged: use unix socket for varnish in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T2000).
[20:00:05] <jouncebot>	 jdrewniak and sbailey: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:29] <sbailey>	 I am here with cscott
[20:00:55] * jan_drewniak o/
[20:04:01] * jan_drewniak sbailey: if the regular deployers don't show up, I can do the deploy
[20:04:32] <sbailey>	 ok
[20:05:47] <jan_drewniak>	 sbailey: I can do yours first since it's just a config change
[20:06:33] <sbailey>	 ok, :-)
[20:08:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey)
[20:09:19] <wikibugs>	 (03PS8) 10Jdrewniak: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey)
[20:09:35] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey)
[20:10:04] <cscott>	 hello, all.
[20:11:16] <wikibugs>	 (03Merged) 10jenkins-bot: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey)
[20:11:40] <cscott>	 https://www.mediawiki.org/wiki/Help:Extension:ParserMigration shows (inter alia) how to test that this is working correctly on labs
[20:11:49] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:944978|Re-enable Extension:ParserMigration on labs (T333179)]]
[20:12:03] <stashbot>	 T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179
[20:16:04] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[20:16:26] <cscott>	 https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version should eventually show ParserMigration as well
[20:16:57] <fabfur>	 !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963081 (T347837). `purged` daemon will be restarted by puppet in eqsin in the next 30m
[20:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:01] <stashbot>	 T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837
[20:17:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[20:18:28] <cscott>	 https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version#mw-version-ext-specialpage-ParserMigration
[20:20:11] <wikibugs>	 (03PS1) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147
[20:23:30] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43857/console" [puppet] - 10https://gerrit.wikimedia.org/r/963147 (owner: 10Fabfur)
[20:23:46] <cscott>	 jan_drewniak: has the ParserMigration config been synced or are we still waiting for it?
[20:24:52] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) @Ladsgroup: Thank you so much!
[20:25:48] <jan_drewniak>	 cscott: still waiting...
[20:29:52] <sbailey>	 Tested, looking good, thanks
[20:34:06] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak and sbailey: Backport for [[gerrit:944978|Re-enable Extension:ParserMigration on labs (T333179)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:34:14] <stashbot>	 T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179
[20:35:01] <jan_drewniak>	 sbailey, cscott: ok finally, it's on mwdebug
[20:35:01] <sbailey>	 We both tested it, looking good :-). Thanks Jan
[20:35:06] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak and sbailey: Continuing with sync
[20:43:26] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: increase timeout for curator delete actions [puppet] - 10https://gerrit.wikimedia.org/r/962236 (https://phabricator.wikimedia.org/T347976) (owner: 10Cwhite)
[20:44:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:46:32] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson)
[20:47:50] <wikibugs>	 (03Merged) 10jenkins-bot: Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson)
[20:49:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (9) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:49:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[20:49:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[20:50:42] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:944978|Re-enable Extension:ParserMigration on labs (T333179)]] (duration: 38m 52s)
[20:50:45] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[20:50:45] <stashbot>	 T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179
[20:50:53] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[20:50:58] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[20:51:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[20:51:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[20:52:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[20:56:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[20:56:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[21:03:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:03:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[21:03:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:07:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:07:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[21:07:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:08:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:08:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[21:08:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:09:09] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] "recheck" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:13:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:13:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[21:13:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:23:14] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:962684|Promote several Wikipedias to Vector 2022 as default skin (T347321)]]
[21:23:18] <stashbot>	 T347321: Deploy Vector 2022 as the default on next set of wikis - https://phabricator.wikimedia.org/T347321
[21:24:37] <logmsgbot>	 !log jdrewniak@deploy2002 jdlrobson and jdrewniak: Backport for [[gerrit:962684|Promote several Wikipedias to Vector 2022 as default skin (T347321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:25:21] <wikibugs>	 (03PS1) 10EoghanGaffney: [gitlab/switchover] Change profile::gitlab::service_name for switchover [puppet] - 10https://gerrit.wikimedia.org/r/963160 (https://phabricator.wikimedia.org/T345531)
[21:26:20] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:26:23] <logmsgbot>	 !log jdrewniak@deploy2002 jdlrobson and jdrewniak: Continuing with sync
[21:27:11] <wikibugs>	 (03Merged) 10jenkins-bot: Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:27:19] <wikibugs>	 (03Merged) 10jenkins-bot: Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak)
[21:28:52] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] "recheck" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:29:04] <wikibugs>	 (03PS1) 10EoghanGaffney: [gitlab/switchover] Update DNS for gitlab/gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531)
[21:32:40] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:962684|Promote several Wikipedias to Vector 2022 as default skin (T347321)]] (duration: 09m 26s)
[21:32:45] <stashbot>	 T347321: Deploy Vector 2022 as the default on next set of wikis - https://phabricator.wikimedia.org/T347321
[21:33:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:43:05] <wikibugs>	 (03Merged) 10jenkins-bot: [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak)
[21:43:36] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:963043|Web typography prototype survey (T347208)]], [[gerrit:963137|Correct a recently-added message]], [[gerrit:963138|[Prototype] Change i18n message (T347208)]]
[21:43:40] <stashbot>	 T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208
[21:49:18] <wikibugs>	 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netbox, 10netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (10wiki_willy) ++ @Papaul , who's going to dig around a bit and provide some feedback
[22:01:56] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:963043|Web typography prototype survey (T347208)]], [[gerrit:963137|Correct a recently-added message]], [[gerrit:963138|[Prototype] Change i18n message (T347208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:02:09] <stashbot>	 T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208
[22:11:09] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[22:16:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:21:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (13) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:22:44] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:963043|Web typography prototype survey (T347208)]], [[gerrit:963137|Correct a recently-added message]], [[gerrit:963138|[Prototype] Change i18n message (T347208)]] (duration: 39m 08s)
[22:22:48] <stashbot>	 T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208
[22:23:24] <icinga-wm>	 PROBLEM - Disk space on testreduce1001 is CRITICAL: DISK CRITICAL - free space: /srv/data 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops
[22:25:40] <wikibugs>	 (PS1) Volans: setup.py: upper limit for types-requests [cookbooks] - https://gerrit.wikimedia.org/r/963188
[22:25:42] <wikibugs>	 (PS1) Volans: sre.hosts.reimage: fix confctl repool command [cookbooks] - https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954)
[22:32:33] <wikibugs>	 (CR) Volans: [C: +2] "Self-merging to unblock CI on the other CRs. Happy to adapt if there is any post-merge comment." [cookbooks] - https://gerrit.wikimedia.org/r/963188 (owner: Volans)
[22:35:06] <wikibugs>	 (Merged) jenkins-bot: setup.py: upper limit for types-requests [cookbooks] - https://gerrit.wikimedia.org/r/963188 (owner: Volans)
[22:36:51] <wikibugs>	 (PS15) Volans: wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[22:36:58] <wikibugs>	 (PS2) Volans: [gitlab/failover] Increase alert downtime duration [cookbooks] - https://gerrit.wikimedia.org/r/962636 (owner: EoghanGaffney)
[22:48:47] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:58:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T343198)', diff saved to https://phabricator.wikimedia.org/P52808 and previous config saved to /var/cache/conftool/dbconfig/20231003-225803-arnaudb.json
[22:58:07] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:12:54] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.027e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:13:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P52809 and previous config saved to /var/cache/conftool/dbconfig/20231003-231309-arnaudb.json
[23:28:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P52810 and previous config saved to /var/cache/conftool/dbconfig/20231003-232815-arnaudb.json
[23:43:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T343198)', diff saved to https://phabricator.wikimedia.org/P52811 and previous config saved to /var/cache/conftool/dbconfig/20231003-234322-arnaudb.json
[23:43:24] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[23:43:26] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:43:37] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[23:43:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T343198)', diff saved to https://phabricator.wikimedia.org/P52812 and previous config saved to /var/cache/conftool/dbconfig/20231003-234343-arnaudb.json
[23:49:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:50:45] <wikibugs>	 SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman) Hi all! I've made updates to the codebase to better comply with @Eevans' feedback, resulting in a greatly simplified int...
[23:54:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency