[00:02:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T328817)', diff saved to https://phabricator.wikimedia.org/P71392 and previous config saved to /var/cache/conftool/dbconfig/20241129-000211-ladsgroup.json [00:02:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:02:17] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [00:02:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [00:02:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T328817)', diff saved to https://phabricator.wikimedia.org/P71393 and previous config saved to /var/cache/conftool/dbconfig/20241129-000234-ladsgroup.json [00:05:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T328817)', diff saved to https://phabricator.wikimedia.org/P71394 and previous config saved to /var/cache/conftool/dbconfig/20241129-000533-ladsgroup.json [00:07:25] FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[00:20:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P71395 and previous config saved to /var/cache/conftool/dbconfig/20241129-002040-ladsgroup.json [00:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:35:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P71396 and previous config saved to /var/cache/conftool/dbconfig/20241129-003547-ladsgroup.json [00:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1099086 [00:38:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1099086 (owner: TrainBranchBot) [00:50:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T328817)', diff saved to https://phabricator.wikimedia.org/P71397 and previous config saved to /var/cache/conftool/dbconfig/20241129-005054-ladsgroup.json [00:50:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [00:50:59] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [00:51:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [00:51:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T328817)', diff saved to https://phabricator.wikimedia.org/P71398 and previous config saved to
/var/cache/conftool/dbconfig/20241129-005117-ladsgroup.json [00:53:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T328817)', diff saved to https://phabricator.wikimedia.org/P71399 and previous config saved to /var/cache/conftool/dbconfig/20241129-005328-ladsgroup.json [00:57:52] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1099086 (owner: TrainBranchBot) [00:59:26] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:08:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1099088 [01:08:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1099088 (owner: TrainBranchBot) [01:08:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P71400 and previous config saved to /var/cache/conftool/dbconfig/20241129-010835-ladsgroup.json [01:11:56] PROBLEM - MariaDB Replica SQL: s3 on db1223 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: arwiktionary.
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:56] PROBLEM - MariaDB Replica Lag: s3 on db1223 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:23:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P71401 and previous config saved to /var/cache/conftool/dbconfig/20241129-012343-ladsgroup.json [01:26:50] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1099088 (owner: TrainBranchBot) [01:38:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T328817)', diff saved to https://phabricator.wikimedia.org/P71402 and previous config saved to /var/cache/conftool/dbconfig/20241129-013850-ladsgroup.json [01:38:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [01:38:55] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [01:39:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [01:39:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T328817)', diff saved to https://phabricator.wikimedia.org/P71403 and previous config saved to /var/cache/conftool/dbconfig/20241129-013912-ladsgroup.json [01:41:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T328817)', diff saved to
https://phabricator.wikimedia.org/P71404 and previous config saved to /var/cache/conftool/dbconfig/20241129-014124-ladsgroup.json [01:56:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P71405 and previous config saved to /var/cache/conftool/dbconfig/20241129-015631-ladsgroup.json [02:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P71406 and previous config saved to /var/cache/conftool/dbconfig/20241129-021138-ladsgroup.json [02:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:26:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T328817)', diff saved to https://phabricator.wikimedia.org/P71407 and previous config saved to /var/cache/conftool/dbconfig/20241129-022645-ladsgroup.json [02:26:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [02:26:50] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [02:27:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [02:28:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [02:28:16] !log 
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [02:28:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T328817)', diff saved to https://phabricator.wikimedia.org/P71408 and previous config saved to /var/cache/conftool/dbconfig/20241129-022822-ladsgroup.json [02:31:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T328817)', diff saved to https://phabricator.wikimedia.org/P71409 and previous config saved to /var/cache/conftool/dbconfig/20241129-023118-ladsgroup.json [02:35:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:36:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:39:47] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098915 (owner: Tim Starling) [02:39:48] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098916 (owner: Tim Starling) [02:39:48] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling) [02:39:48] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099065 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:39:49] (CR) TrainBranchBot: [C:+2] "Approved by
tstarling@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099066 (owner: Tim Starling) [02:39:50] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099064 (owner: Tim Starling) [02:39:54] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099061 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:40:00] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099059 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:40:06] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099060 (owner: Tim Starling) [02:40:37] (Merged) jenkins-bot: addWiki.php tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1098915 (owner: Tim Starling) [02:40:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:40:40] (Merged) jenkins-bot: Run dumpInterwiki.php locally with no changes [mediawiki-config] - https://gerrit.wikimedia.org/r/1098916 (owner: Tim Starling) [02:40:42] (Merged) jenkins-bot: Prepare id.wikivoyage.org for installation [mediawiki-config] - https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling) [02:42:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:42:33] (Merged) jenkins-bot: Convert addWiki.php to a wrapper around core installPreConfigured.php [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099061 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:42:47] (Merged) jenkins-bot: addWiki: Add UpdateSearchIndexConfig [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099064 (owner: Tim Starling) [02:42:47] (Merged) jenkins-bot: dumpInterwiki: read from preinstall.dblist [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099065 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:46:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P71410 and previous config saved to /var/cache/conftool/dbconfig/20241129-024625-ladsgroup.json [02:50:02] (CR) Tim Starling: "Another CI failure due to "Could not resolve host: gerrit.wikimedia.org". I'll just force submit it when the other jobs pass.
The failing " [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099059 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:57:39] (Merged) jenkins-bot: installer: Fix failure to install blobs table [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099060 (owner: Tim Starling) [02:58:01] (CR) CI reject: [V:-1] Fix various installPreConfigured bugs [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099059 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [02:58:01] (CR) CI reject: [V:-1] addWiki: Move DB_ADMIN to core [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099066 (owner: Tim Starling) [02:59:58] (CR) Tim Starling: [V:+2] Fix various installPreConfigured bugs [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099059 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling) [03:00:27] (CR) Tim Starling: [C:+2] addWiki: Move DB_ADMIN to core [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099066 (owner: Tim Starling) [03:00:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:01:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P71411 and previous config saved to /var/cache/conftool/dbconfig/20241129-030133-ladsgroup.json [03:03:03] (Merged) jenkins-bot: addWiki: Move DB_ADMIN to core [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1099066 (owner: Tim Starling) [03:04:42] !log tstarling@deploy2002 scap failed: '1 dbs from /srv/mediawiki-staging/wikiversions.json are missing from /srv/mediawiki-staging/dblists/all.dblist: idwikivoyage' (scap version: 4.129.0) (duration: 00m 00s) [03:09:26] FIRING:
[6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T328817)', diff saved to https://phabricator.wikimedia.org/P71412 and previous config saved to /var/cache/conftool/dbconfig/20241129-031642-ladsgroup.json [03:16:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2224.codfw.wmnet with reason: Maintenance [03:16:48] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [03:16:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2224.codfw.wmnet with reason: Maintenance [03:17:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T328817)', diff saved to https://phabricator.wikimedia.org/P71413 and previous config saved to /var/cache/conftool/dbconfig/20241129-031705-ladsgroup.json [03:20:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T328817)', diff saved to https://phabricator.wikimedia.org/P71414 and previous config saved to /var/cache/conftool/dbconfig/20241129-032002-ladsgroup.json [03:21:14] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9070 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [03:35:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P71415 and previous config saved to 
/var/cache/conftool/dbconfig/20241129-033509-ladsgroup.json [03:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:50:09] (PS1) Tim Starling: Revert "Prepare id.wikivoyage.org for installation" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099094 [03:50:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P71416 and previous config saved to /var/cache/conftool/dbconfig/20241129-035016-ladsgroup.json [03:53:37] (CR) Tim Starling: [C:+2] Revert "Prepare id.wikivoyage.org for installation" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099094 (owner: Tim Starling) [03:54:20] (Merged) jenkins-bot: Revert "Prepare id.wikivoyage.org for installation" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099094 (owner: Tim Starling) [04:01:44] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1098915|addWiki.php tweaks]], [[gerrit:1098916|Run dumpInterwiki.php locally with no changes]], [[gerrit:1098917|Prepare id.wikivoyage.org for installation (T380726 T352113)]], [[gerrit:1099065|dumpInterwiki: read from preinstall.dblist (T352113)]], [[gerrit:1099066|addWiki: Move DB_ADMIN to core]], [[gerrit:1099064|addWiki: Add UpdateSearchIndexConf [04:01:44] ig]], [[gerrit:1099061|Convert addWiki.php to a wrapper around core installPreConfigured.php (T352113)]], [[gerrit:1099059|Fix various installPreConfigured bugs (T352113)]], [[gerrit:1099060|installer: Fix failure to install blobs table]], [[gerrit:1099094|Revert "Prepare id.wikivoyage.org for installation"]] [04:01:50] T380726: Create Wikivoyage
Indonesian - https://phabricator.wikimedia.org/T380726 [04:01:50] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113 [04:05:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T328817)', diff saved to https://phabricator.wikimedia.org/P71417 and previous config saved to /var/cache/conftool/dbconfig/20241129-040523-ladsgroup.json [04:05:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [04:05:28] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [04:05:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [04:05:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2229 (T328817)', diff saved to https://phabricator.wikimedia.org/P71418 and previous config saved to /var/cache/conftool/dbconfig/20241129-040547-ladsgroup.json [04:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T328817)', diff saved to https://phabricator.wikimedia.org/P71419 and previous config saved to /var/cache/conftool/dbconfig/20241129-040846-ladsgroup.json [04:12:18] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1098915|addWiki.php tweaks]], [[gerrit:1098916|Run dumpInterwiki.php locally with no changes]], [[gerrit:1098917|Prepare id.wikivoyage.org for installation (T380726 T352113)]], [[gerrit:1099065|dumpInterwiki: read from preinstall.dblist (T352113)]], [[gerrit:1099066|addWiki: Move DB_ADMIN to 
core]], [[gerrit:1099064|addWiki: Add UpdateSearchIndexConfig]], [[gerrit [04:12:18] :1099061|Convert addWiki.php to a wrapper around core installPreConfigured.php (T352113)]], [[gerrit:1099059|Fix various installPreConfigured bugs (T352113)]], [[gerrit:1099060|installer: Fix failure to install blobs table]], [[gerrit:1099094|Revert "Prepare id.wikivoyage.org for installation"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [04:12:23] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726 [04:12:24] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113 [04:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:33] !log tstarling@deploy2002 tstarling: Continuing with sync [04:17:26] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:17:44] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:20:17] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098915|addWiki.php tweaks]], [[gerrit:1098916|Run dumpInterwiki.php locally with no changes]], [[gerrit:1098917|Prepare id.wikivoyage.org for installation (T380726 T352113)]], [[gerrit:1099065|dumpInterwiki: read from preinstall.dblist (T352113)]], [[gerrit:1099066|addWiki: Move DB_ADMIN to core]], [[gerrit:1099064|addWiki: Add UpdateSearchIndexCon [04:20:17] fig]], 
[[gerrit:1099061|Convert addWiki.php to a wrapper around core installPreConfigured.php (T352113)]], [[gerrit:1099059|Fix various installPreConfigured bugs (T352113)]], [[gerrit:1099060|installer: Fix failure to install blobs table]], [[gerrit:1099094|Revert "Prepare id.wikivoyage.org for installation"]] (duration: 18m 32s) [04:20:22] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726 [04:20:23] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113 [04:20:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:20:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:22:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:23:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P71420 and previous config saved to /var/cache/conftool/dbconfig/20241129-042355-ladsgroup.json [04:39:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P71421 and previous config saved to /var/cache/conftool/dbconfig/20241129-043902-ladsgroup.json [04:42:48] (PS3) Tim Starling: Create id.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098918
(https://phabricator.wikimedia.org/T380726) [04:54:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T328817)', diff saved to https://phabricator.wikimedia.org/P71422 and previous config saved to /var/cache/conftool/dbconfig/20241129-045409-ladsgroup.json [04:54:14] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [04:59:27] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:14] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.064e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [05:37:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:47:25] FIRING: [6x] SystemdUnitFailed: 
confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:45] !log taavi@cumin1002 dbctl commit (dc=all): 'depool db1223, replication broken', diff saved to https://phabricator.wikimedia.org/P71423 and previous config saved to /var/cache/conftool/dbconfig/20241129-055245-taavi.json [05:58:56] RECOVERY - MariaDB Replica SQL: s3 on db1223 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:05:10] (PS1) Marostegui: common.yaml: Add arbcom_zhwiki [puppet] - https://gerrit.wikimedia.org/r/1099102 (https://phabricator.wikimedia.org/T381086) [06:06:07] (CR) Marostegui: [C:+2] common.yaml: Add arbcom_zhwiki [puppet] - https://gerrit.wikimedia.org/r/1099102 (https://phabricator.wikimedia.org/T381086) (owner: Marostegui) [06:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:56] RECOVERY - MariaDB Replica Lag: s3 on db1223 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:26:55] !log marostegui@cumin2002 START - Cookbook sre.mysql.pool db1223 quickly with 2 steps - Fixed corruption [06:27:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:27:26] !log marostegui@cumin2002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1223 quickly with 2
steps - Fixed corruption [06:28:34] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repool', diff saved to https://phabricator.wikimedia.org/P71424 and previous config saved to /var/cache/conftool/dbconfig/20241129-062833-marostegui.json [06:32:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:40:32] PROBLEM - LDAP -writable server- on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [06:48:02] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Repooling after corruption', diff saved to https://phabricator.wikimedia.org/P71425 and previous config saved to /var/cache/conftool/dbconfig/20241129-064801-root.json [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241129T0700) [07:03:34] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Repooling after corruption', diff saved to https://phabricator.wikimedia.org/P71426 and previous config saved to /var/cache/conftool/dbconfig/20241129-070333-root.json [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:06:53] !log aqu@deploy2002 
Started deploy [airflow-dags/analytics@656d6df]: Generate canary events faster in Airflow [07:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:10:08] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@656d6df]: Generate canary events faster in Airflow (duration: 03m 15s) [07:11:56] (03CR) 10Muehlenhoff: [C:03+2] ganeti1018: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098981 (owner: 10Muehlenhoff) [07:12:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:15:14] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 5975 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:19:06] !log marostegui@cumin2002 dbctl 
commit (dc=all): 'db1223 (re)pooling @ 100%: Repooling after corruption', diff saved to https://phabricator.wikimedia.org/P71427 and previous config saved to /var/cache/conftool/dbconfig/20241129-071905-root.json [07:35:31] RECOVERY - LDAP -writable server- on serpens is OK: LDAP OK - 0.097 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [07:37:25] FIRING: [4x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:42:25] FIRING: [4x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:33] RECOVERY - Disk space on serpens is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=serpens&var-datasource=codfw+prometheus/ops [07:49:05] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update cloudcephmon secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1098988 (https://phabricator.wikimedia.org/T364870) (owner: 10Muehlenhoff) [07:49:36] (03CR) 10Muehlenhoff: [C:03+2] Add component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1098984 (owner: 10Muehlenhoff) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241129T0800) [08:11:12] (03CR) 10Brouberol: [C:03+1] Move hourly gobblin event start-time later [puppet] - 10https://gerrit.wikimedia.org/r/1099010 (https://phabricator.wikimedia.org/T376144) (owner: 10Joal) [08:11:25] (03Abandoned) 10Physikerwelt: Fix: handling of nullary macros [extensions/Math] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1097840 (https://phabricator.wikimedia.org/T380184) (owner: 10Physikerwelt) [08:16:07] !log imported mapbox-geometry_2.0.3-1~wmf12u1 to component/maps T216826 [08:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:11] T216826: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 [08:16:43] (03CR) 10Majavah: [C:03+1] cloudweb/codfw1dev: Use firewall::service for firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/1098952 (owner: 10Muehlenhoff) [08:17:19] (03CR) 10Majavah: [C:03+2] P:toolforge: mail: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095189 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [08:21:34] (03PS1) 10Muehlenhoff: Add build hook for component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1099151 (https://phabricator.wikimedia.org/T216826) [08:22:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:24:00] (03CR) 10Muehlenhoff: [C:03+2] Add build hook for component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1099151 (https://phabricator.wikimedia.org/T216826) (owner: 10Muehlenhoff) [08:26:10] (03PS1) 10KartikMistry: Update recommendation-api to 2024-11-28-163815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099152 
(https://phabricator.wikimedia.org/T380838) [08:30:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:51:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:54:09] !log imported mapbox-polylabel 2.0.1-1~wmf12u1 to component/maps T216826 [08:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:13] T216826: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 [08:59:27] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:35] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [09:03:40] (03CR) 10Brouberol: [C:04-1] Enable pod-scoped "external services" network policies for airflow (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [09:05:53] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [09:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:36] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [09:21:32] !log 
elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [09:29:09] (03PS13) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) [09:33:31] (03PS1) 10Alexandros Kosiaris: mwdebug: Enable retries [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) [09:34:03] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) (owner: 10Alexandros Kosiaris) [09:34:07] (03CR) 10CI reject: [V:04-1] mwdebug: Enable retries [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) (owner: 10Alexandros Kosiaris) [09:35:00] (03Abandoned) 10Alexandros Kosiaris: mediawiki: set idle timeout for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/587734 (owner: 10Giuseppe Lavagetto) [09:35:53] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10367177 (10MoritzMuehlenhoff) The underlying failing check is defined in the headers, but not otherwise used in... 
[09:36:35] (03PS2) 10Alexandros Kosiaris: mwdebug: Enable retries [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) [09:37:12] (03PS3) 10Alexandros Kosiaris: mwdebug: Enable retries [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) [09:37:41] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) (owner: 10Alexandros Kosiaris) [09:39:34] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10367186 (10MoritzMuehlenhoff) But with perccli the battery is reported to be fine (command is /opt/MegaRAID/percc... [09:40:51] 06SRE, 10SRE-tools, 10Data-Platform-SRE (2024.11.09 - 2024.11.29), 03Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10367189 (10Gehel) 05Resolved→03Declined [09:43:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [09:45:11] (03PS2) 10Muehlenhoff: Add ferm macro/nftables set for loadbalancer nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098936 [09:46:31] (03CR) 10Muehlenhoff: Add ferm macro/nftables set for loadbalancer nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [09:46:39] (03PS14) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [09:48:11] (03CR) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [09:50:03] (03CR) 10Brouberol: [C:03+1] Enable pod-scoped 
"external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [09:55:48] (03CR) 10Alexandros Kosiaris: [C:03+2] mwdebug: Enable retries [puppet] - 10https://gerrit.wikimedia.org/r/1099159 (https://phabricator.wikimedia.org/T380598) (owner: 10Alexandros Kosiaris) [09:57:38] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [10:02:29] (03PS16) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [10:04:44] (03PS15) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [10:05:19] (03PS16) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [10:10:15] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [10:13:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [10:32:22] (03CR) 10Stevemunene: [C:03+2] Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [10:33:50] (03Merged) 10jenkins-bot: Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [10:35:30] (03CR) 10Vgutierrez: [C:04-1] benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:36:00] !log elukey@cumin1002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2083.codfw.wmnet with OS bullseye [10:36:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:41:18] (03PS17) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [10:41:46] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:42:14] (03PS17) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [10:45:47] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2084.codfw.wmnet with OS bullseye [10:57:08] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2084.codfw.wmnet with reason: host reimage [11:00:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance [11:00:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance [11:01:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2084.codfw.wmnet with reason: host reimage [11:04:25] (03CR) 10Vgutierrez: [C:03+1] benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:04:44] (03CR) 10Vgutierrez: [C:03+1] hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:08:27] (03PS1) 
10Stevemunene: Bump the chart version to pick up new changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099164 (https://phabricator.wikimedia.org/T377926) [11:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:47] (03CR) 10Brouberol: [C:03+1] Bump the chart version to pick up new changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099164 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [11:11:23] (03CR) 10Stevemunene: [C:03+2] Bump the chart version to pick up new changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099164 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [11:12:55] (03Merged) 10jenkins-bot: Bump the chart version to pick up new changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099164 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [11:15:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:15:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:15:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T376905)', diff saved to https://phabricator.wikimedia.org/P71428 and previous config saved to /var/cache/conftool/dbconfig/20241129-111554-ladsgroup.json [11:18:47] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:19:14] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:24:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T376905)', diff saved to 
https://phabricator.wikimedia.org/P71429 and previous config saved to /var/cache/conftool/dbconfig/20241129-112447-ladsgroup.json [11:27:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2084.codfw.wmnet with OS bullseye [11:29:05] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2085.codfw.wmnet with OS bullseye [11:31:48] !log Started MediaModeration scanning scripts to scan all wikis [11:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:08] (03CR) 10Ladsgroup: [C:03+1] Create id.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) (owner: 10Tim Starling) [11:39:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10367413 (10elukey) Reimaged 208[2-5] too (2084 was left unconfigured for some reason, I have probably missed it, good that I rechecked :D). 
[11:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:39:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P71430 and previous config saved to /var/cache/conftool/dbconfig/20241129-113954-ladsgroup.json [11:40:43] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [11:42:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:53] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [11:44:57] !log imported mapnik_4.0.3+ds2~wmf12u1 to component/maps T216826 [11:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:02] T216826: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 [11:46:20] (03CR) 10Muehlenhoff: Add the mapnik image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [11:50:58] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[11:55:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P71431 and previous config saved to /var/cache/conftool/dbconfig/20241129-115501-ladsgroup.json [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241129T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241129T1200). [12:04:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2085.codfw.wmnet with OS bullseye [12:06:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [12:06:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:00] (03PS1) 10Ilias Sarantopoulos: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 [12:10:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T376905)', diff saved to https://phabricator.wikimedia.org/P71432 and previous config saved to /var/cache/conftool/dbconfig/20241129-121010-ladsgroup.json [12:11:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:25] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:28] (03CR) 10Klausman: [C:03+2] Update recommendation-api to 2024-11-28-163815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099152 (https://phabricator.wikimedia.org/T380838) (owner: 10KartikMistry) [12:14:27] (03Merged) 10jenkins-bot: Update recommendation-api to 2024-11-28-163815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099152 (https://phabricator.wikimedia.org/T380838) (owner: 10KartikMistry) [12:17:04] 06SRE, 10decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157 (10MoritzMuehlenhoff) 03NEW [12:18:21] (03PS2) 10Ilias Sarantopoulos: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 [12:18:30] (03PS3) 10Ilias Sarantopoulos: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 [12:18:45] 06SRE, 10decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10367574 (10MoritzMuehlenhoff) [12:20:04] (03CR) 10Klausman: ml-services: update recapi liveness prob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 (owner: 10Ilias Sarantopoulos) [12:20:41] (03CR) 10Klausman: [C:03+1] Revert "ml-services: recapi increase readiness prob in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098911 (owner: 10Ilias Sarantopoulos) [12:20:48] (03CR) 10Klausman: [C:03+1] Revert "ml-services: increase readiness prob" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098912 (owner: 10Ilias Sarantopoulos) [12:21:05] !log jmm@cumin2002 START - Cookbook 
sre.hosts.decommission for hosts ganeti1015.eqiad.wmnet [12:22:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:19] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10367583 (10MoritzMuehlenhoff) [12:23:27] (03PS2) 10Ilias Sarantopoulos: Revert "ml-services: increase readiness prob" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098912 [12:23:49] (03PS2) 10Ilias Sarantopoulos: Revert "ml-services: recapi increase readiness prob in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098911 [12:24:49] (03CR) 10KartikMistry: [C:03+2] Revert "ml-services: increase readiness prob" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098912 (owner: 10Ilias Sarantopoulos) [12:24:59] (03CR) 10KartikMistry: [C:03+2] Revert "ml-services: recapi increase readiness prob in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098911 (owner: 10Ilias Sarantopoulos) [12:25:55] (03Merged) 10jenkins-bot: Revert "ml-services: increase readiness prob" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098912 (owner: 10Ilias Sarantopoulos) [12:25:58] (03Merged) 10jenkins-bot: Revert "ml-services: recapi increase readiness prob in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098911 (owner: 10Ilias Sarantopoulos) [12:27:01] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:27:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:27:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet 
with reason: Maintenance [12:27:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T376905)', diff saved to https://phabricator.wikimedia.org/P71433 and previous config saved to /var/cache/conftool/dbconfig/20241129-122735-ladsgroup.json [12:28:54] (03PS4) 10Ilias Sarantopoulos: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 [12:29:01] (03CR) 10CI reject: [V:04-1] ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 (owner: 10Ilias Sarantopoulos) [12:30:00] (03PS5) 10Ilias Sarantopoulos: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 [12:31:01] (03CR) 10CI reject: [V:04-1] ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 (owner: 10Ilias Sarantopoulos) [12:31:06] (03PS6) 10Ilias Sarantopoulos: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 [12:31:34] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:32:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:32:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:32:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1015.eqiad.wmnet [12:32:14] 06SRE, 10decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10367599 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: 
`ganeti1015.eqiad.wmnet` - ganeti1015.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertm... [12:35:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T376905)', diff saved to https://phabricator.wikimedia.org/P71434 and previous config saved to /var/cache/conftool/dbconfig/20241129-123549-ladsgroup.json [12:38:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:11] (03CR) 10Klausman: [C:03+2] ml-services: update recapi liveness prob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 (owner: 10Ilias Sarantopoulos) [12:39:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.408 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:40:10] (03Merged) 10jenkins-bot: ml-services: update recapi liveness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099171 (owner: 10Ilias Sarantopoulos) [12:42:09] !log klausman@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:50:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P71436 and previous config saved to /var/cache/conftool/dbconfig/20241129-125057-ladsgroup.json [12:53:30] (03PS1) 10KartikMistry: Fix API_CONCURRENCY_LIMIT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099176 [12:56:02] (03CR) 10Klausman: [C:03+2] Fix API_CONCURRENCY_LIMIT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099176 (owner: 10KartikMistry) [12:57:04] (03Merged) 10jenkins-bot: Fix API_CONCURRENCY_LIMIT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099176 (owner: 10KartikMistry) [12:57:43] !log klausman@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:59:27] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:10] cleaned up that ^ [13:06:02] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1021.eqiad.wmnet [13:06:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P71437 and previous config saved to /var/cache/conftool/dbconfig/20241129-130604-ladsgroup.json [13:09:27] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [13:14:10] (03CR) 10Elukey: [C:04-1] Add the mapnik image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [13:14:15] (03Abandoned) 10Elukey: Add the mapnik image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [13:17:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: ganeti1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:21:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T376905)', diff saved to https://phabricator.wikimedia.org/P71438 and previous config saved to /var/cache/conftool/dbconfig/20241129-132111-ladsgroup.json [13:21:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:21:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:21:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T376905)', diff saved to https://phabricator.wikimedia.org/P71439 and previous config saved to /var/cache/conftool/dbconfig/20241129-132136-ladsgroup.json [13:22:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:22:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1021.eqiad.wmnet [13:26:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10367696 (10Gehel) [13:28:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T376905)', diff saved to https://phabricator.wikimedia.org/P71440 and previous config saved to /var/cache/conftool/dbconfig/20241129-132848-ladsgroup.json [13:29:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[13:29:47] (03PS1) 10Muehlenhoff: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099185 (https://phabricator.wikimedia.org/T381157) [13:30:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:35:34] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Alert in need of triage: SmartNotHealthy (instance stat1011:9100) - https://phabricator.wikimedia.org/T380835#10367751 (10Gehel) [13:38:31] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10367781 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1021.eqiad.wmnet` - ganeti1021.eqiad.wmnet (**PASS**) - Downtimed... [13:40:35] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10367817 (10MoritzMuehlenhoff) [13:41:07] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10367833 (10Ladsgroup) >>! In T353891#10365419, @Krd wrote: > Please unbreak now. This seems to be a different issue. The cases reported in this... 
[13:41:13] (03CR) 10Muehlenhoff: [C:03+2] Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1099185 (https://phabricator.wikimedia.org/T381157) (owner: 10Muehlenhoff) [13:43:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10367857 (10MoritzMuehlenhoff) [13:43:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P71441 and previous config saved to /var/cache/conftool/dbconfig/20241129-134355-ladsgroup.json [13:44:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10367786 (10Gehel) [13:45:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10367859 (10MoritzMuehlenhoff) [13:52:54] (03PS1) 10Muehlenhoff: Deprecate system::role for remaining Data Engineering roles [puppet] - 10https://gerrit.wikimedia.org/r/1099190 [13:59:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P71442 and previous config saved to /var/cache/conftool/dbconfig/20241129-135902-ladsgroup.json [14:05:19] (03PS1) 10Muehlenhoff: grafana: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1099192 [14:07:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099192 (owner: 10Muehlenhoff) [14:09:13] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10368115 (10Ladsgroup) Since this happened yesterday and has happened in the past too. Maybe we should just throw a bit of hardware at it? Specially maybe some vertical expansion. 15... 
[14:09:28] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10368116 (10Arnoldokoth) a:05eoghan→03Arnoldokoth [14:11:28] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10368120 (10eoghan) a:05eoghan→03None [14:14:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T376905)', diff saved to https://phabricator.wikimedia.org/P71443 and previous config saved to /var/cache/conftool/dbconfig/20241129-141409-ladsgroup.json [14:14:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:14:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:19:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [14:19:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [14:19:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T376905)', diff saved to https://phabricator.wikimedia.org/P71444 and previous config saved to /var/cache/conftool/dbconfig/20241129-141931-ladsgroup.json [14:23:07] (03PS1) 10Muehlenhoff: an-web: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1099195 [14:23:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099195 (owner: 10Muehlenhoff) [14:25:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T376905)', diff saved to https://phabricator.wikimedia.org/P71445 and previous config saved to 
/var/cache/conftool/dbconfig/20241129-142540-ladsgroup.json [14:30:41] (03PS1) 10Brouberol: postgresql-airflow-wmde: add helmfiles and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099196 (https://phabricator.wikimedia.org/T380613) [14:30:42] (03PS1) 10Brouberol: airflow-wmde: point to the cloudnative-pg cluster in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099197 (https://phabricator.wikimedia.org/T380613) [14:31:57] (03PS6) 10Tiziano Fogli: blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) [14:31:57] (03CR) 10Tiziano Fogli: "@dcaro@wikimedia.org This is my refactor proposal for moving the ICMP checks from Icinga to the cloudgw role and iterating over data cente" [puppet] - 10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:36:46] (03CR) 10Alexandros Kosiaris: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1094493 (owner: 10Majavah) [14:36:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:40:08] (03CR) 10Effie Mouzeli: [C:03+1] "please kill it" [puppet] - 10https://gerrit.wikimedia.org/r/1094493 (owner: 10Majavah) [14:40:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P71446 and previous config saved to /var/cache/conftool/dbconfig/20241129-144047-ladsgroup.json [14:42:21] (03CR) 10Alexandros Kosiaris: [C:04-1] "Actually, on looking again, maybe" [puppet] - 10https://gerrit.wikimedia.org/r/1094493 (owner: 10Majavah) [14:44:36] (03CR) 10Alexandros Kosiaris: [C:03+1] "Disregard, it's in the chain of commits, I2be06eda485be4070da32cf56e36f7b8022682a7." 
[puppet] - 10https://gerrit.wikimedia.org/r/1094493 (owner: 10Majavah) [14:45:53] (03Abandoned) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [14:55:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P71447 and previous config saved to /var/cache/conftool/dbconfig/20241129-145554-ladsgroup.json [15:00:45] (03PS1) 10Muehlenhoff: Blacklist erofs [puppet] - 10https://gerrit.wikimedia.org/r/1099204 [15:01:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:00] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:24] 06SRE: Upload slow - https://phabricator.wikimedia.org/T372217#10368268 (10Yann) I checked again today, and from a fast DSL connection, uploading to Internet Archive is about 10 faster than to Wikimedia Commons (10 MB/s vs. 1 MB/s). This concerns big videos, which I upload with Rillke chunked upload tool: https:... [15:06:42] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:08:32] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:37] 10SRE-swift-storage, 06Commons, 10UploadWizard: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10368278 (10Yann) Yes, I uploaded it OK today. [15:09:52] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [15:10:18] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet [15:11:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T376905)', diff saved to https://phabricator.wikimedia.org/P71448 and previous config saved to /var/cache/conftool/dbconfig/20241129-151101-ladsgroup.json [15:11:42] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:34] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:14:44] 06SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137#10368287 (10Ladsgroup) FWIW, I ran a check on all containers of commons and their ACLs and none were a black swan. [15:16:09] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet [15:16:50] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet [15:18:07] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173 (10jijiki) 03NEW [15:18:10] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: Decommission mc-gp200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T381174 (10jijiki) 03NEW [15:18:20] 10SRE-swift-storage, 06Commons, 10UploadWizard: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10368288 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [15:18:55] (03PS1) 10Effie Mouzeli: site.pp: decomm mc-gp100[1-3], mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1099212 (https://phabricator.wikimedia.org/T381174) [15:21:06] (03PS1) 10Máté Szabó: Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) [15:21:58] (03CR) 10CI reject: [V:04-1] Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 
(https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [15:22:47] (03PS2) 10Máté Szabó: Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) [15:28:02] (03CR) 10Kosta Harlan: Prep pilot wiki config for IRS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [15:29:12] (03CR) 10Máté Szabó: [C:04-2] Prep pilot wiki config for IRS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [15:29:47] (03CR) 10Vgutierrez: Add ferm macro/nftables set for loadbalancer nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [15:31:58] (03PS3) 10Muehlenhoff: Add ferm macro/nftables set for loadbalancer nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098936 [15:32:00] (03PS1) 10Alexandros Kosiaris: rest-gateway: Comment about forwash slashes in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099214 (https://phabricator.wikimedia.org/T379097) [15:33:05] (03CR) 10Muehlenhoff: Add ferm macro/nftables set for loadbalancer nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [15:34:30] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2005.codfw.wmnet [15:34:39] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1005.eqiad.wmnet [15:37:04] (03CR) 10Clément Goubert: [C:03+1] site.pp: decomm mc-gp100[1-3], mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1099212 (https://phabricator.wikimedia.org/T381174) (owner: 10Effie Mouzeli) [15:37:17] 06SRE, 06Infrastructure-Foundations, 10netops: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175 (10cmooney) 03NEW p:05Triage→03High [15:38:14] (03CR) 10Alexandros 
Kosiaris: [C:03+2] rest-gateway: order mw-api-int paths strictly (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087528 (https://phabricator.wikimedia.org/T379097) (owner: 10Hnowlan) [15:39:36] (03PS2) 10Alexandros Kosiaris: rest-gateway: Comment about forwash slashes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099214 (https://phabricator.wikimedia.org/T379097) [15:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:40:42] 06SRE, 06Infrastructure-Foundations, 10netops: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10368381 (10cmooney) [15:40:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1005.eqiad.wmnet [15:41:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2005.codfw.wmnet [15:44:57] (03CR) 10Amire80: [C:03+1] Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [15:47:59] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10368384 (10MoritzMuehlenhoff) [15:50:20] 06SRE, 06Infrastructure-Foundations: Reboots of Bookworm systems which use 6.1.115 - https://phabricator.wikimedia.org/T380731#10368385 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done! 
Thanks everyone for the quick turnaround [15:56:26] 06SRE, 06Infrastructure-Foundations, 10netops: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10368397 (10cmooney) [16:01:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:12] (03PS1) 10Vgutierrez: liberica: liberica got renamed to libericad [puppet] - 10https://gerrit.wikimedia.org/r/1099216 [16:11:11] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1099192 (owner: 10Muehlenhoff) [16:12:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:43] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1006.eqiad.wmnet [16:15:54] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2006.codfw.wmnet [16:21:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1006.eqiad.wmnet [16:22:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:22:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2006.codfw.wmnet [16:41:35] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10368480 (10MatthewVernon) That's an interesting graph, but not what you see if you look at that node during the 
incident I linked to - e.g. https://grafana.wikimedia.org/goto/mNw6Ge7... [16:44:15] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10368495 (10MatthewVernon) [err, which is not to say we shouldn't be looking at further frontend capacity] [16:45:42] (03PS1) 10JMeybohm: Remove tlsproxy global_cert_name used for noc.d.w [puppet] - 10https://gerrit.wikimedia.org/r/1099217 (https://phabricator.wikimedia.org/T341859) [16:45:53] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1099217 (https://phabricator.wikimedia.org/T341859) (owner: 10JMeybohm) [16:46:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:46:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10368509 (10MSantos) As the Product Manager responsible for the MediaWiki Release process, I approve this request. 
[16:46:45] 06SRE, 10SRE-Access-Requests, 06MediaWiki-Engineering, 13Patch-For-Review: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123#10368510 (10MSantos) [16:47:41] (03CR) 10Clément Goubert: [C:03+1] Remove tlsproxy global_cert_name used for noc.d.w [puppet] - 10https://gerrit.wikimedia.org/r/1099217 (https://phabricator.wikimedia.org/T341859) (owner: 10JMeybohm) [16:47:57] (03CR) 10JMeybohm: [C:03+2] Remove tlsproxy global_cert_name used for noc.d.w [puppet] - 10https://gerrit.wikimedia.org/r/1099217 (https://phabricator.wikimedia.org/T341859) (owner: 10JMeybohm) [16:54:16] (03PS1) 10Máté Szabó: Remove unused IRS config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099221 (https://phabricator.wikimedia.org/T381178) [16:55:46] !log puppet ca destroy mwmaint.discovery.wmnet - T341859 [16:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:51] T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 [17:01:48] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:52] (03PS1) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [17:19:14] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10368561 (10cmooney) The above patch determines what devices need to peer with CRs based on vlan membership (and the vlan nami... [17:21:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:31:37] 06SRE, 06Infrastructure-Foundations, 10netops: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#10368575 (10cmooney) [17:33:54] 06SRE, 06Infrastructure-Foundations, 10netops: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#10368581 (10cmooney) Just a note that with the changes made under T381175 we are now creating the list of devices CRs need to peer with based on vlan membership.... [17:35:50] (03PS2) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [17:37:02] (03CR) 10CI reject: [V:04-1] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [17:43:35] (03CR) 10Cathal Mooney: "hmm don't really understand the CI error here. 
code works just fine, error is:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [17:52:41] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10368609 (10cmooney) [17:53:02] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10368610 (10cmooney) [17:54:01] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10368612 (10cmooney) [18:10:28] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [18:11:16] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:04] PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [18:12:22] PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [18:13:36] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:06] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.42 ms [18:23:08] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.36 ms [18:23:08] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.08 ms [18:23:50] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.87 ms [18:25:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:26:42] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.37 ms [18:46:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [19:01:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [19:06:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:29:50] (03PS3) 10Cathal Mooney: Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) [19:29:55] (03CR) 10Pppery: ACMEChiefConfig: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092931 (owner: 10Ncmonitor) [19:31:08] (03CR) 10CI reject: [V:04-1] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) 
[19:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:12:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:22:07] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:03:13] (03PS1) 10Daimona Eaytoy: Drop $wgWikimediaCampaignEventsEnableCommunityList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099233 (https://phabricator.wikimedia.org/T380075) [21:03:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099233 (https://phabricator.wikimedia.org/T380075) (owner: 10Daimona Eaytoy) [21:09:27] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:17:22] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:17:22] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:01:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:06:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:56:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:06:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:34:12] FIRING: [13x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:37:07] FIRING: [13x] 
ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:41:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:46:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status